[INFO] ROUND 5 DIAGNOSIS

# Verdict: NPM state/buffer exhaustion (H1) is the primary suspect.

The "Three-Site" evidence effectively refutes Site-Gateway (H3) as a root cause.

## 1. Hypothesis Re-ranking & Shared Attribute Analysis

Three different ISPs and physical sites (Site A, B, C) exhibit the same failure signature: server->agent Text frames are lost while pings persist. That makes a "network path" issue (H3) statistically improbable.

**Re-ranked Hypotheses:**

1. **[SUSPECT] H1: NPM/Nginx Proxy State.** A specific interaction between Nginx's `proxy_buffering` and the WebSocket tunnel, possibly exacerbated by `http2` protocol-switching overhead or "orphan" buffers in the Docker bridge.
2. **[SUSPECT] H2: Agent/Host Environment.** Shared attribute: **Windows Server OS and/or NIC Offloading.**
   * *Reasoning:* "PST-SERVER" and "PST-SERVER2" are explicitly Server hosts. If GTS-PEDRO-H is also a Server OS, the shared factor is likely the **Windows TCP Stack** (specifically Receive Window Auto-Tuning, or Chimney Offload on older releases that still support it) reacting to NPM's MTU/buffer handling.
3. **[REFUTED] H3: Site Gateway.** Too many independent variables (different ISPs/UDRs) for a synchronized failure.

**Shared Attribute Candidate:** Long-lived connection age. If these 3 agents have the highest uptime/connection duration, they may be the only ones that have hit a specific Nginx `proxy_read_timeout` or `keepalive` edge case where the socket is "half-closed" in the proxy's eyes but pings (which are small control frames) still get through.

## 2. HTTP/2 and Proxy Buffering Interaction

`http2 on` combined with `proxy_buffering on` (the default) is a plausible trouble spot for WebSockets behind Nginx, though the mechanism below is a hypothesis and not yet confirmed.

* **The Problem:** Nginx should bypass response buffering once a connection is upgraded; after the `101 Switching Protocols` response it relays bytes in both directions. The hypothesis is that with `proxy_buffering on` it may still hold back "large" frames if it perceives the client (agent) as slow.
* **The Failure:** If NPM is using `http2` toward the agent, it is multiplexing the WS stream. If the Text frame (command) is large or follows a specific pattern, Nginx might be waiting to fill a buffer before flushing the H2 stream, whereas tiny Ping/Pong frames bypass this or fit in the first chunk. Note that WebSocket upgrades normally run over HTTP/1.1 even when `http2` is enabled on the listener, so first check which protocol the agent connection actually negotiated.
* **Verdict:** If confirmed, this would explain why "Text frames leave origin but never reach agent."

## 3. Recommended Sequence (Rounds 6-7)

**Step 1: Configuration Hardening (Information Gained: High | Risk: Low)**

Modify the NPM Proxy Host config for the RMM entries to include:

```nginx
proxy_buffering off;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
```

*Rationale:* `proxy_buffering off` is a common, low-risk setting in WebSocket proxy configs. It rules out Nginx sitting on downstream data. A reload is non-destructive, but connections that are already established stay on the old worker processes with the old config, so the change only takes effect once an agent reconnects.

**Step 2: Socket State Inspection (Information Gained: High | Risk: None)**

Run `ss -tpni` on the NPM host (.20) and filter for the IP of PST-SERVER (`98.190.129.150`). If NPM runs in a Docker bridge network, its sockets live in the container's network namespace, so run the command there and not only on the host.

*Look for:* a non-zero `Send-Q` that does not drain and an `unacked` count (reported in segments, not bytes) that stays high across repeated samples. If both stay high on the NPM->Agent side, the proxy is trying to send but the agent is not acknowledging, for example because its TCP receive window is closed. Adding `-o` shows timer state; a `persist` timer would indicate zero-window probing. `rcv_space` describes the NPM side's own receive buffer, so it is relevant only for the agent->NPM direction.

**Step 3: Protocol Downgrade (Information Gained: Medium | Risk: Low)**

Disable `http2` in the NPM UI for this host.

*Rationale:* Eliminates the H2-to-H1.1 translation layer as a variable, which is a plausible place for "frame-stuck" behavior.

## 4. The NPM Restart Probe

**Restarting the NPM container is a high-value one-shot probe.**

* **Result A (Recovery):** If agents recover, it points to **State Corruption** in the NPM/Docker bridge (e.g., conntrack table exhaustion or a hung Nginx worker process buffer).
* **Result B (No Recovery):** If they reconnect and *still* fail commands, it points to a **Protocol/Configuration Incompatibility** (e.g., the way NPM 2.x handles WS frames does not work with the Windows Server TCP stack or the Agent's WS implementation).

**Recommendation:** Perform Step 1 (Config tweak) *first*.
If that fails, restart the NPM container. Config tweaks survive restarts; a restart alone only clears symptoms.
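
The Step 2 socket inspection can be scripted so that repeated samples are easy to compare. The following is a minimal sketch, not a definitive tool. It assumes `ss -tni` prints one socket line followed by an indented detail line containing `unacked:` and `rcv_space:` fields. The function names, the `SAMPLE` text, and the container address `172.18.0.2` are hypothetical illustrations, not captured output.

```python
import re
import subprocess

# Detail-line fields of interest; the names match what `ss -i` prints.
FIELD_RE = re.compile(r"\b(unacked|rcv_space):(\d+)")


def parse_ss(output, peer_ip):
    """Return one dict per TCP socket in `ss -tni` output whose peer is peer_ip."""
    sockets = []
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Socket line: State Recv-Q Send-Q Local:Port Peer:Port [Process]
            current = None
            cols = line.split()
            if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
                host = cols[4].rsplit(":", 1)[0].strip("[]")
                if host.startswith("::ffff:"):  # IPv4-mapped IPv6 peer
                    host = host[len("::ffff:"):]
                if host == peer_ip:
                    current = {
                        "state": cols[0],
                        "recv_q": int(cols[1]),
                        "send_q": int(cols[2]),
                        "peer": cols[4],
                    }
                    sockets.append(current)
        if current is not None:
            # Detail fields belong to the most recent matching socket line.
            for key, value in FIELD_RE.findall(line):
                current[key] = int(value)
    return sockets


def needs_attention(sock):
    """Flag a socket with queued or unacknowledged outbound data.

    A single snapshot cannot tell in-flight data from a stall; treat this as a
    hint and compare several samples taken a few seconds apart.
    """
    return sock["send_q"] > 0 or sock.get("unacked", 0) > 0


def inspect_live(peer_ip):
    """Run `ss -tni` in the current network namespace and parse the result."""
    out = subprocess.run(
        ["ss", "-tni"], capture_output=True, text=True, check=True
    ).stdout
    return parse_ss(out, peer_ip)


# Illustrative sample only (hypothetical values, not real captured output).
SAMPLE = (
    "State  Recv-Q Send-Q Local Address:Port  Peer Address:Port\n"
    "ESTAB  0      4312   172.18.0.2:443      98.190.129.150:51234\n"
    "\t cubic wscale:8,7 rto:408 rtt:38.1/4.2 mss:1448 cwnd:10 unacked:3 rcv_space:14600\n"
    "ESTAB  0      0      172.18.0.2:443      203.0.113.7:40022\n"
    "\t cubic wscale:7,7 rto:204 rtt:1.2/0.5 mss:1448 cwnd:10 rcv_space:14600\n"
)
```

On the NPM host this could be called as `inspect_live("98.190.129.150")` every few seconds while a command is pending. If NPM runs in a Docker bridge network, the script has to run inside the container's network namespace to see the proxy's sockets.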