Files
claudetools/tmp/r2-gemini.txt
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

51 lines
4.4 KiB
Plaintext

# GuruRMM command-delivery diagnosis — ROUND 2 (topology-corrected)
The revelation that `rmm-api` is **grey-clouded** (direct to IP) and passes through **two nginx layers** (NPM + Origin) significantly shifts the focus. We can now discard Cloudflare-side buffering or WAF interference as the primary cause.
### 1. REVISED ranked hypotheses
| Hypothesis | Status | Reasoning |
| :--- | :--- | :--- |
| **H1: Nginx-to-Nginx Buffer Clog (Backpressure)** | **UP** [NEW TOP] | With two nginx layers in series, the "write-side" of the chain is more complex. If the Origin nginx has `proxy_buffering on` (default) and a small buffer, but NPM is slow to consume from the LAN, or the Agent is slow to consume from the WAN, the Origin nginx can "clog." The Rust server sees a successful write to the local socket, but the frame sits in an nginx buffer that never flushes to the next hop. |
| **H2: Application-Layer Deadlock (Agent Rx Loop)** | **STABLE** | Still highly plausible. If the Agent's message-processing task (Tokio) hangs or panics during the handling of a specific message type (e.g., `ConfigUpdate` or a specific `Command` structure), the WebSocket read-loop might stay alive (responding to Pings at the library level) while the application logic stops processing `Text` frames. |
| **H3: Path-Specific TCP/TLS Policing (UniFi/UDR)** | **UP** | Since there is no CDN normalizing the traffic, the endpoint gateway (UniFi UDR) is the first "intelligent" hop. If "Deep Packet Inspection" (DPI) or "IDS/IPS" is enabled on the UDR, it may be flagging the specific entropy/pattern of your encrypted WebSocket Text frames while ignoring the predictable patterns of Ping/Pong control frames. |
| **H4: Protocol Desync (Nginx `proxy_http_version`)** | **DOWN** | The Origin config explicitly sets `proxy_http_version 1.1`, which is correct for WebSockets. Unless NPM is downgrading the connection to 1.0 (highly unlikely for a WS-enabled proxy), this is less likely than a buffering issue. |
| **H5: MTU / Fragmentation** | **DROPPED** | As established in Round 1, the failure of ~80B commands (`hostname`) effectively refutes MTU as the primary culprit. |
---
### 2. The Verdict on the Architecture
The **double-nginx hop** is a major red flag for WebSocket stability. WebSockets are long-lived and stateful; each proxy layer adds a potential point of "silent failure" where the TCP connection remains `ESTABLISHED` but the application-level framing is stuck.
**Critical Observation:** The fact that one command succeeded *immediately* after a reconnect, followed by total silence, strongly suggests a **buffer saturation** or **one-time logic crash**. If it were a protocol mismatch, the first command would likely have failed too.
---
### 3. The SINGLE Most Diagnostic Test
**Action:** Run a dual-ended packet capture on the **Origin Host (.30)** simultaneously on the **Incoming LAN interface** (traffic from NPM .20) and the **Loopback interface** (traffic to Rust server :3001).
```bash
# Terminal 1: Capture from NPM to Origin (Inbound to .30)
tcpdump -i eth0 -n -A src 172.16.3.20 and port 80
# Terminal 2: Capture from Origin to Rust (Inbound to :3001)
tcpdump -i lo -n -A port 3001
```
**Why this is the "Golden" test:**
By comparing the two captures during a command dispatch, you identify exactly where the "Text" frame disappears in the chain:
* **Outcome A: Command appears in BOTH captures.**
* **Meaning:** The entire server-side infrastructure is working. The command was handed to the OS to be sent over the WAN.
* **Verdict:** The issue is **H2 (Agent Deadlock)** or **H3 (ISP/Gateway interference)**.
* **Outcome B: Command appears in Loopback (:3001) but NOT on eth0.**
* **Meaning:** The Rust server sent it, the Origin nginx received it, but the Origin nginx failed to forward it to NPM.
* **Verdict:** **H1 (Nginx Buffering)**. You likely need `proxy_buffering off;` on the Origin nginx.
* **Outcome C: Command appears in NEITHER capture.**
* **Meaning:** The Rust server's `send` call is lying or the frame is stuck in the Rust server's internal `tokio-tungstenite` sink.
* **Verdict:** Internal Rust server-side concurrency/async issue.
**Recommendation:** While running the capture, dispatch a **Ping** as well. If the Ping passes through both captures but the Command (Text) does not, you have confirmed **Outcome B** and isolated it to an Nginx `Text` frame handling/buffering issue.