# GuruRMM command-delivery diagnosis — ROUND 2 (topology-corrected) The revelation that `rmm-api` is **grey-clouded** (direct to IP) and passes through **two nginx layers** (NPM + Origin) significantly shifts the focus. We can now discard Cloudflare-side buffering or WAF interference as the primary cause. ### 1. REVISED ranked hypotheses | Hypothesis | Status | Reasoning | | :--- | :--- | :--- | | **H1: Nginx-to-Nginx Buffer Clog (Backpressure)** | **UP** [NEW TOP] | With two nginx layers in series, the "write-side" of the chain is more complex. If the Origin nginx has `proxy_buffering on` (default) and a small buffer, but NPM is slow to consume from the LAN, or the Agent is slow to consume from the WAN, the Origin nginx can "clog." The Rust server sees a successful write to the local socket, but the frame sits in an nginx buffer that never flushes to the next hop. | | **H2: Application-Layer Deadlock (Agent Rx Loop)** | **STABLE** | Still highly plausible. If the Agent's message-processing task (Tokio) hangs or panics during the handling of a specific message type (e.g., `ConfigUpdate` or a specific `Command` structure), the WebSocket read-loop might stay alive (responding to Pings at the library level) while the application logic stops processing `Text` frames. | | **H3: Path-Specific TCP/TLS Policing (UniFi/UDR)** | **UP** | Since there is no CDN normalizing the traffic, the endpoint gateway (UniFi UDR) is the first "intelligent" hop. If "Deep Packet Inspection" (DPI) or "IDS/IPS" is enabled on the UDR, it may be flagging the specific entropy/pattern of your encrypted WebSocket Text frames while ignoring the predictable patterns of Ping/Pong control frames. | | **H4: Protocol Desync (Nginx `proxy_http_version`)** | **DOWN** | The Origin config explicitly sets `proxy_http_version 1.1`, which is correct for WebSockets. Unless NPM is downgrading the connection to 1.0 (highly unlikely for a WS-enabled proxy), this is less likely than a buffering issue. | | **H5: MTU / Fragmentation** | **DROPPED** | As established in Round 1, the failure of ~80B commands (`hostname`) effectively refutes MTU as the primary culprit. | --- ### 2. The Verdict on the Architecture The **double-nginx hop** is a major red flag for WebSocket stability. WebSockets are long-lived and stateful; each proxy layer adds a potential point of "silent failure" where the TCP connection remains `ESTABLISHED` but the application-level framing is stuck. **Critical Observation:** The fact that one command succeeded *immediately* after a reconnect, followed by total silence, strongly suggests a **buffer saturation** or **one-time logic crash**. If it were a protocol mismatch, the first command would likely have failed too. --- ### 3. The SINGLE Most Diagnostic Test **Action:** Run a dual-ended packet capture on the **Origin Host (.30)** simultaneously on the **Incoming LAN interface** (traffic from NPM .20) and the **Loopback interface** (traffic to Rust server :3001). ```bash # Terminal 1: Capture from NPM to Origin (Inbound to .30) tcpdump -i eth0 -n -A src 172.16.3.20 and port 80 # Terminal 2: Capture from Origin to Rust (Inbound to :3001) tcpdump -i lo -n -A port 3001 ``` **Why this is the "Golden" test:** By comparing the two captures during a command dispatch, you identify exactly where the "Text" frame disappears in the chain: * **Outcome A: Command appears in BOTH captures.** * **Meaning:** The entire server-side infrastructure is working. The command was handed to the OS to be sent over the WAN. * **Verdict:** The issue is **H2 (Agent Deadlock)** or **H3 (ISP/Gateway interference)**. * **Outcome B: Command appears in Loopback (:3001) but NOT on eth0.** * **Meaning:** The Rust server sent it, the Origin nginx received it, but the Origin nginx failed to forward it to NPM. * **Verdict:** **H1 (Nginx Buffering)**. You likely need `proxy_buffering off;` on the Origin nginx. * **Outcome C: Command appears in NEITHER capture.** * **Meaning:** The Rust server's `send` call is lying or the frame is stuck in the Rust server's internal `tokio-tungstenite` sink. * **Verdict:** Internal Rust server-side concurrency/async issue. **Recommendation:** While running the capture, dispatch a **Ping** as well. If the Ping passes through both captures but the Command (Text) does not, you have confirmed **Outcome B** and isolated it to an Nginx `Text` frame handling/buffering issue.