# GuruRMM — agent command delivery failure: diagnostic dossier (neutral)
The evidence suggests a highly selective delivery failure where **outbound (Server -> Agent) WebSocket Text frames** are being dropped, mangled, or ignored, while **WebSocket Control frames (Ping)** and **inbound (Agent -> Server) traffic** remain functional.
### 1. Candidate Root-Cause Hypotheses
| Hypothesis | Reasoning / Evidence FOR | Reasoning / Evidence AGAINST |
| :--- | :--- | :--- |
| **H1: MTU / Fragmentation (Large Frame Blackhole)** | Common in site-to-site tunnels (UniFi UDR). Large commands (4 KB) + overhead might exceed path MTU, causing fragmentation. If the path drops fragments, the Text frame never arrives. | **Refuted by T2/T3/T9:** Tiny `hostname` commands (~80 B) also failed. MTU issues usually spare small packets. |
| **H2: Application-Layer Deadlock (Agent Rx Loop)** | If the Agent's `handle_server_message` or the internal `tx` (to the executor) blocks or panics without crashing the read loop, the agent stops processing Text frames but continues to receive Pings (keeping the 90s deadline alive). | **Weakened by T7/T8:** A server-forced eviction and fresh reconnect (fresh Agent state) did not restore functionality. |
| **H3: Intermediate Proxy / Firewall (Text-Frame Filtering)** | Cloudflare or a local "Security Gateway" (UniFi) may be inspecting WebSocket payloads. If the Command JSON pattern matches a signature or if the `Upgrade` handshake is being "half-policed," Text frames might be dropped. | The fleet baseline (40 agents working) suggests it's not a global Cloudflare issue, but could be specific to "enhanced" security profiles at certain sites. |
| **H4: Protocol Desync (ConfigUpdate Serialization)** | On connect, the Server sends a `ConfigUpdate` (multi-KB). If this frame is malformed or if `tokio-tungstenite` on the server-side has a buffer issue with specific frame sequences, the Agent's parser might be stuck/errored for that specific stream. | **Supported by T8:** T8 confirms that even after a reconnect, delivery fails. This points to a failure during or immediately after the handshake/ConfigUpdate sequence. |
| **H5: Nginx/Proxy Buffer "Clogging"** | `proxy_read_timeout` is long, but `proxy_buffers` or internal memory limits on the origin/nginx could be hit for specific connections if the socket becomes "write-blocked" from the origin to Nginx. | **Against:** Nginx logs would typically show 499s or buffer overflows; `send_to` on the server returns `success`, meaning it's reaching the OS/loopback buffer. |
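The argument against H1 is simple arithmetic, sketched below. This is a rough illustration only: it assumes IPv4 and TCP headers with no options and ignores TLS record overhead, the function names are hypothetical, and the 1400-byte MTU is a made-up tunnel value rather than a measurement from the affected path. Frame header sizes follow RFC 6455 for unmasked server-to-client frames. Under these assumptions an ~80 B command fits in a single small segment at any plausible path MTU, so an MTU black hole would not explain its loss, while a 4 KB command spans several segments.

```rust
/// WebSocket header size for an unmasked (server -> agent) frame, per RFC 6455.
fn ws_header_len(payload_len: u64) -> u64 {
    match payload_len {
        0..=125 => 2,      // length fits in the 7-bit length field
        126..=65_535 => 4, // 2-byte base header + 16-bit extended length
        _ => 10,           // 2-byte base header + 64-bit extended length
    }
}

/// Number of TCP segments needed to carry one frame.
/// Assumption: IPv4 (20 B) + TCP (20 B) headers, no options, no TLS overhead.
fn segments_needed(payload_len: u64, mtu: u64) -> u64 {
    let mss = mtu - 40;
    let frame_len = ws_header_len(payload_len) + payload_len;
    (frame_len + mss - 1) / mss // ceiling division
}

fn main() {
    // 1400 is a hypothetical reduced tunnel MTU, not a measured value.
    for mtu in [1500u64, 1400u64] {
        println!(
            "mtu={}: ~80 B hostname command -> {} segment(s), 4 KB command -> {} segment(s)",
            mtu,
            segments_needed(80, mtu),
            segments_needed(4096, mtu)
        );
    }
}
```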
---
### 2. The Verdict on the Evidence
* **Inference Check:** The claim that "the agent was receiving SOME inbound frames" (T4) is an **unverified inference**. The 90s timeout not firing does not distinguish "the agent received a Ping" from "the agent's internal timer is stalled or incorrect."
* **Contradiction:** T8 is the most troubling observation. If a fresh connection (after eviction) doesn't fix it, the issue is either **environmental (Network/Path)** or **intrinsic to the Agent's 0.6.63 startup logic** (e.g., it crashes its own Text handler immediately upon receiving the first `ConfigUpdate` or `Command`).
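One way to turn the T4 inference into an observation is for the agent to record which kind of frame last reset its deadline. The following is a minimal sketch using only the standard library. `FrameKind`, `RxStats`, and the existence of a hook in the agent's read loop where each received frame is visible are assumptions for illustration, not the agent's actual code.

```rust
use std::time::{Duration, Instant};

// Hypothetical classification of frames seen by the agent's read loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FrameKind {
    Text,
    Binary,
    Ping,
    Pong,
    Close,
}

#[derive(Debug, Default)]
struct RxStats {
    data: u64,    // Text + Binary frames received
    control: u64, // Ping + Pong + Close frames received
    last_data: Option<Instant>,
    last_any: Option<Instant>,
}

impl RxStats {
    /// Record one received frame and when it arrived.
    fn record(&mut self, kind: FrameKind, now: Instant) {
        self.last_any = Some(now);
        match kind {
            FrameKind::Text | FrameKind::Binary => {
                self.data += 1;
                self.last_data = Some(now);
            }
            FrameKind::Ping | FrameKind::Pong | FrameKind::Close => self.control += 1,
        }
    }

    /// True when some frame arrived within `window` but no data frame did,
    /// i.e. the deadline is being kept alive by control frames alone.
    fn data_starved(&self, now: Instant, window: Duration) -> bool {
        let any_recent = self.last_any.map_or(false, |t| now.duration_since(t) <= window);
        let data_recent = self.last_data.map_or(false, |t| now.duration_since(t) <= window);
        any_recent && !data_recent
    }
}

fn main() {
    let start = Instant::now();
    let mut stats = RxStats::default();
    // Simulate two Pings arriving with no Text frame in between.
    stats.record(FrameKind::Ping, start);
    stats.record(FrameKind::Ping, start + Duration::from_secs(30));
    let now = start + Duration::from_secs(60);
    let starved = stats.data_starved(now, Duration::from_secs(90));
    println!("data={} control={} data_starved={}", stats.data, stats.control, starved);
}
```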
---
### 3. The SINGLE Most Diagnostic Test
**Action:** Dispatch a **WebSocket Ping from the Server** to a stuck agent and **explicitly log the receipt of the corresponding Pong**.

**Reasoning:**
* **If Pong is received:** We confirm the path is transparent to WebSocket *Control* frames in both directions. This isolates the failure to **Text-frame serialization, size, or content inspection** (H3/H4).
* **If Pong is NOT received:** The agent is "ghosting" the server entirely. This means the 90s timeout in the agent (T4) is the actual failure point: the agent *should* have reconnected but didn't. This points to an **Agent-side Rx loop deadlock** (H2).

**Outcome-based path:**
* **Pong Received:** Inspect Nginx/Cloudflare for "Web Application Firewall" (WAF) drops or "WebSocket inspection" settings.
* **Pong NOT Received:** The agent is deadlocked. Investigate the `tokio` task management in the Agent's 0.6.63 read loop.
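The test depends on matching each Pong to the Ping that triggered it. RFC 6455 requires a Pong to echo the application data of the Ping it answers, so a unique payload per Ping is enough to correlate them. The following is a minimal, library-independent sketch of that bookkeeping. `PingTracker` and its methods are hypothetical names, and wiring it into the server's send path and read loop is left as an assumption about how the server is structured.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks Pings sent to one agent so each Pong can be matched and logged.
struct PingTracker {
    next_id: u64,
    outstanding: HashMap<Vec<u8>, Instant>,
}

impl PingTracker {
    fn new() -> Self {
        Self { next_id: 0, outstanding: HashMap::new() }
    }

    /// Returns a unique 8-byte payload for the Ping frame and remembers the send time.
    fn ping_sent(&mut self, now: Instant) -> Vec<u8> {
        let payload = self.next_id.to_be_bytes().to_vec();
        self.next_id += 1;
        self.outstanding.insert(payload.clone(), now);
        payload
    }

    /// Returns the round-trip time if the Pong payload matches an outstanding Ping.
    fn pong_received(&mut self, payload: &[u8], now: Instant) -> Option<Duration> {
        self.outstanding.remove(payload).map(|sent| now.duration_since(sent))
    }

    /// Number of Pings still unanswered after `timeout`.
    fn overdue(&self, now: Instant, timeout: Duration) -> usize {
        self.outstanding
            .values()
            .filter(|sent| now.duration_since(**sent) > timeout)
            .count()
    }
}

fn main() {
    let t0 = Instant::now();
    let mut tracker = PingTracker::new();
    let answered = tracker.ping_sent(t0);
    let _unanswered = tracker.ping_sent(t0);

    // Simulated Pong arriving 40 ms after the first Ping.
    match tracker.pong_received(&answered, t0 + Duration::from_millis(40)) {
        Some(rtt) => println!("pong matched, rtt={:?}", rtt),
        None => println!("pong with unknown payload"),
    }
    let late = tracker.overdue(t0 + Duration::from_secs(10), Duration::from_secs(5));
    println!("overdue pings: {}", late);
}
```

Logging both the matched round-trip time and the overdue count gives an explicit record of either outcome, instead of inferring it from the absence of a timeout.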