claudetools/tmp/r5-gemini.txt
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

[INFO] ROUND 5 DIAGNOSIS
# Verdict: NPM state/buffer exhaustion (H1) is the primary suspect. The "Three-Site" evidence effectively refutes Site-Gateway (H3) as a root cause.
## 1. Hypothesis Re-ranking & Shared Attribute Analysis
The fact that three different ISPs and physical sites (Site A, B, C) exhibit the same failure signature—specifically the loss of server->agent Text frames while pings persist—makes a "network path" issue (H3) statistically improbable.
**Re-ranked Hypotheses:**
1. **[SUSPECT] H1: NPM/Nginx Proxy State.** A specific interaction between Nginx's `proxy_buffering` and the WebSocket tunnel, possibly aggravated by the `http2` listener setting or by stale connection state on the Docker bridge.
2. **[SUSPECT] H2: Agent/Host Environment.** Shared attribute: **Windows Server OS and/or NIC Offloading.**
   * *Reasoning:* "PST-SERVER" and "PST-SERVER2" are explicit. If GTS-PEDRO-H is also a Server OS, the shared factor is likely the **Windows TCP Stack** (specifically Receive Window Auto-Tuning or Chimney Offload) reacting to NPM's MTU/buffer handling.
3. **[REFUTED] H3: Site Gateway.** Too many variables (different ISPs/UDRs) for a synchronized failure.
**Shared Attribute Candidate:** Long-lived connection age. If these 3 agents have the highest uptime/connection duration, they may be the only ones that have hit a specific Nginx `proxy_read_timeout` or `keepalive` edge case where the socket is "half-closed" in the proxy's eyes but pings (which are small/control frames) still sneak through.
## 2. HTTP/2 and Proxy Buffering Interaction
`http2 on` combined with `proxy_buffering on` (the default) is a combination worth ruling out for WebSockets behind Nginx.
* **The Problem:** Nginx *should* bypass response buffering once a connection is upgraded. The hypothesis is that, with `proxy_buffering on`, it still holds back "large" frames when it perceives the client (agent) as slow.
* **The Failure:** If the agent's session is carried over `http2`, the WS stream would be multiplexed, and Nginx might wait to fill a buffer before flushing a large Text frame (command), whereas tiny Ping/Pong frames bypass this or fit in the first chunk. Caveat: the WebSocket Upgrade handshake is an HTTP/1.1 mechanism, so this only applies if the agent actually negotiates H2; confirm the negotiated protocol before relying on it.
* **Verdict:** If confirmed, this would explain why "Text frames leave origin but never reach agent."
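The size difference between control and data frames can be made concrete with a minimal sketch of server-to-client frame encoding per RFC 6455. The 4 KiB command payload is an assumed figure for illustration, not a value measured from the RMM traffic, and `encode_server_frame` is a hypothetical helper name.

```python
import struct

OPCODE_TEXT = 0x1  # data frame
OPCODE_PING = 0x9  # control frame


def encode_server_frame(opcode, payload=b""):
    """Encode one unfragmented, unmasked (server-to-client) RFC 6455 frame."""
    if opcode >= 0x8 and len(payload) > 125:
        raise ValueError("control frames carry at most 125 payload bytes")
    first = 0x80 | opcode  # FIN bit set, no extension bits
    n = len(payload)
    if n <= 125:
        header = struct.pack("!BB", first, n)
    elif n <= 0xFFFF:
        header = struct.pack("!BBH", first, 126, n)  # 16-bit extended length
    else:
        header = struct.pack("!BBQ", first, 127, n)  # 64-bit extended length
    return header + payload


ping = encode_server_frame(OPCODE_PING)                    # 2 bytes on the wire
command = encode_server_frame(OPCODE_TEXT, b"x" * 4096)    # 4100 bytes on the wire
```

An empty Ping fits in 2 bytes, while the assumed command is over 4 KB, so any buffer-fill or window-related stall would be expected to hit Text frames first.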
## 3. Recommended Sequence (Rounds 6-7)
**Step 1: Configuration Hardening (Information Gained: High | Risk: Low)**
Modify the NPM Proxy Host config for the RMM entries to include:
```nginx
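# Pass upstream data through as it arrives instead of buffering it.
proxy_buffering off;
# Keep idle tunnels open for up to an hour between upstream reads/writes.
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;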
```
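*Rationale:* Forcing `proxy_buffering off` is a common hardening step for WebSocket proxying. It prevents Nginx from sitting on downstream data. A reload is non-destructive, but established WebSocket sessions stay on the old worker processes until they reconnect, so the change only takes effect for new connections.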
**Step 2: Socket State Inspection (Information Gained: High | Risk: None)**
Run `ss -tpni` on the NPM host (.20) and filter for the IP of PST-SERVER (`98.190.129.150`).
*Look for:* a non-zero `unacked` count (reported in segments) or a collapsed `rcv_space`. If `unacked` stays high on the NPM->Agent side, the proxy is trying to send but the agent is not acknowledging, for example because its TCP receive window is closed.
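The inspection can be scripted. Below is a minimal sketch that pulls `unacked` and `rcv_space` out of `ss -tni` output for one peer; `parse_ss` and `read_ss` are hypothetical helper names, and the parser assumes the usual layout where each socket line is followed by an indented detail line. If NPM runs in a bridged Docker container, its sockets may only be visible from inside the container's network namespace.

```python
import re
import subprocess

METRIC = re.compile(r"\b(unacked|rcv_space):(\d+)")


def parse_ss(text, peer_ip):
    """Return one dict per socket whose address line mentions peer_ip."""
    peer = re.compile(rf"(?<![\d.]){re.escape(peer_ip)}:\d+")
    sockets = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Socket line: state, queues, local and peer address.
            current = None
            if peer.search(line):
                # unacked is treated as 0 when the field is absent.
                current = {"socket": " ".join(line.split()),
                           "unacked": 0, "rcv_space": None}
                sockets.append(current)
        elif current is not None:
            # Indented detail line belonging to the socket above.
            for key, value in METRIC.findall(line):
                current[key] = int(value)
    return sockets


def read_ss():
    """Run ss locally and return its raw text output (requires iproute2)."""
    return subprocess.run(["ss", "-tni"], capture_output=True,
                          text=True, check=True).stdout
```

For example, `parse_ss(read_ss(), "98.190.129.150")` returns the PST-SERVER sockets; sampling it a few times while a command is pending shows whether `unacked` keeps growing.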
**Step 3: Protocol Downgrade (Information Gained: Medium | Risk: Low)**
Disable `http2` in the NPM UI for this host.
*Rationale:* Removes the H2-to-H1.1 translation layer from the path, so HTTP/2 handling can be ruled in or out as the place where frames get stuck.
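One way to confirm the toggle took effect is to check which protocol the TLS listener selects via ALPN. This is a minimal sketch using Python's standard `ssl` module; `build_context` and `negotiated_alpn` are hypothetical helper names, and the hostname in the note below is a placeholder.

```python
import socket
import ssl


def build_context(offered=("h2", "http/1.1")):
    """TLS client context that offers the given ALPN protocols."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(list(offered))
    return ctx


def negotiated_alpn(host, port=443, offered=("h2", "http/1.1")):
    """Return the ALPN protocol the server selects, or None if it selects none."""
    ctx = build_context(offered)
    with socket.create_connection((host, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling `negotiated_alpn("rmm.example.com")` before and after the change should show the result move from `h2` to `http/1.1` if the setting was applied. Offering only `("http/1.1",)` approximates a client that never requests H2.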
## 4. The NPM Restart Probe
**Restarting the NPM container is a high-value one-shot probe.**
* **Result A (Recovery):** If agents recover, it points to **State Corruption** in the NPM/Docker bridge (e.g., conntrack table exhaustion or a hung Nginx worker process buffer).
* **Result B (No Recovery):** If they reconnect and *still* fail commands, it points to a **Protocol/Configuration Incompatibility** (e.g., the way NPM 2.x handles WS frames is incompatible with the Windows Server TCP stack or the Agent's WS implementation).
**Recommendation:** Perform Step 1 (Config tweak) *first*. If that fails, Restart. Config tweaks survive restarts; restarts alone only clear symptoms.