# GuruRMM agent command delivery failure: diagnostic dossier (neutral)

The evidence suggests a highly selective delivery failure where **outbound (Server -> Agent) WebSocket Text frames** are being dropped, mangled, or ignored, while **WebSocket Control frames (Ping)** and **inbound (Agent -> Server) traffic** remain functional.

### 1. Candidate Root-Cause Hypotheses

| Hypothesis | Reasoning / Evidence FOR | Reasoning / Evidence AGAINST |
| :--- | :--- | :--- |
| **H1: MTU / Fragmentation (Large Frame Blackhole)** | Common in site-to-site tunnels (UniFi UDR). Large commands (4KB) + overhead might exceed path MTU, causing fragmentation. If the path drops fragments, the Text frame never arrives. | **Refuted by T2/T3/T9:** Tiny `hostname` commands (~80B) also failed. MTU issues usually spare small packets. |
| **H2: Application-Layer Deadlock (Agent Rx Loop)** | If the Agent's `handle_server_message` or the internal `tx` (to the executor) blocks or panics without crashing the read loop, the agent stops processing Text frames but continues to receive Pings (keeping the 90s deadline alive). | **Weakened by T7/T8:** A server-forced eviction and fresh reconnect (fresh Agent state) did not restore functionality. |
| **H3: Intermediate Proxy / Firewall (Text-Frame Filtering)** | Cloudflare or a local "Security Gateway" (UniFi) may be inspecting WebSocket payloads. If the Command JSON pattern matches a signature or if the `Upgrade` handshake is being "half-policed," Text frames might be dropped. | The fleet baseline (40 agents working) suggests it's not a global Cloudflare issue, but could be specific to "enhanced" security profiles at certain sites. |
| **H4: Protocol Desync (ConfigUpdate Serialization)** | On connect, the Server sends a `ConfigUpdate` (multi-KB). If this frame is malformed or if `tokio-tungstenite` on the server-side has a buffer issue with specific frame sequences, the Agent's parser might be stuck/errored for that specific stream. | **Supported by T8:** T8 confirms that even after a reconnect, delivery fails. This points to a failure during or immediately after the handshake/ConfigUpdate sequence. |
| **H5: Nginx/Proxy Buffer "Clogging"** | `proxy_read_timeout` is long, but `proxy_buffers` or internal memory limits on the origin/nginx could be hit for specific connections if the socket becomes "write-blocked" from the origin to Nginx. | **Against:** Nginx logs would typically show 499s or buffer overflows; `send_to` on the server returns `success`, meaning it's reaching the OS/loopback buffer. |

---

### 2. The Verdict on the Evidence

* **Inference Check:** The claim that "the agent was receiving SOME inbound frames" (T4) is an **unverified inference**. While the 90s timeout didn't trigger, we cannot distinguish between "the agent received a Ping" and "the agent's internal timer is somehow stalled/incorrect."
* **Contradiction:** T8 is the most troubling. If a fresh connection (after eviction) doesn't fix it, the issue is either **environmental (Network/Path)** or **intrinsic to the Agent's 0.6.63 startup logic** (e.g., it crashes its own Text handler immediately upon receiving the first `ConfigUpdate` or `Command`).

---

### 3. The SINGLE Most Diagnostic Test

**Action:** Dispatch a **WebSocket Ping from the Server** to a stuck agent and **explicitly log the receipt of the corresponding Pong**.

**Reasoning:**

* **If Pong is received:** We confirm the path is fully transparent to WebSocket *Control* frames. This isolates the failure to **Text-frame serialization, size, or content inspection** (H3/H4).
* **If Pong is NOT received:** The agent is "ghosting" the server entirely. This means the 90s timeout in the agent (T4) is the actual failure point: the agent *should* have reconnected but didn't. This points to an **Agent-side Rx loop deadlock** (H2).

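
The Ping/Pong test relies on the wire-level difference between Control and Text frames. The sketch below shows how that difference appears in the first bytes of a frame, following the opcode and length layout of RFC 6455. It is a minimal illustration using only the standard library, not code from the GuruRMM agent or from `tokio-tungstenite`. The names `FrameKind`, `FrameHeader`, and `parse_header` are hypothetical.

```rust
/// Frame kinds keyed by the 4-bit opcode from RFC 6455, section 5.2.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum FrameKind {
    Continuation,
    Text,
    Binary,
    Close,
    Ping,
    Pong,
    Reserved(u8),
}

impl FrameKind {
    /// Decode the low nibble of the first header byte.
    pub fn from_opcode(byte0: u8) -> FrameKind {
        match byte0 & 0x0F {
            0x0 => FrameKind::Continuation,
            0x1 => FrameKind::Text,
            0x2 => FrameKind::Binary,
            0x8 => FrameKind::Close,
            0x9 => FrameKind::Ping,
            0xA => FrameKind::Pong,
            other => FrameKind::Reserved(other),
        }
    }

    /// Control frames are the opcodes with bit 0x8 set.
    pub fn is_control(self) -> bool {
        match self {
            FrameKind::Close | FrameKind::Ping | FrameKind::Pong => true,
            FrameKind::Reserved(op) => op & 0x8 != 0,
            _ => false,
        }
    }
}

/// Hypothetical view of a parsed frame header (payload not included).
#[derive(Debug, PartialEq)]
pub struct FrameHeader {
    pub fin: bool,
    pub kind: FrameKind,
    pub masked: bool,
    pub payload_len: u64,
    pub header_len: usize,
}

impl FrameHeader {
    /// Control frames must be unfragmented and carry at most 125 bytes.
    pub fn obeys_control_rules(&self) -> bool {
        !self.kind.is_control() || (self.fin && self.payload_len <= 125)
    }
}

/// Parse a frame header from the start of `buf`.
/// Returns `None` when `buf` is too short to hold the full header.
pub fn parse_header(buf: &[u8]) -> Option<FrameHeader> {
    if buf.len() < 2 {
        return None;
    }
    let fin = buf[0] & 0x80 != 0;
    let kind = FrameKind::from_opcode(buf[0]);
    let masked = buf[1] & 0x80 != 0;
    // A 7-bit length of 126 or 127 selects a 16-bit or 64-bit extended length.
    let (payload_len, mut header_len) = match buf[1] & 0x7F {
        126 => {
            let b = buf.get(2..4)?;
            (u16::from_be_bytes([b[0], b[1]]) as u64, 4)
        }
        127 => {
            let b = buf.get(2..10)?;
            let mut raw = [0u8; 8];
            raw.copy_from_slice(b);
            (u64::from_be_bytes(raw), 10)
        }
        n => (n as u64, 2),
    };
    if masked {
        header_len += 4; // the masking key follows the length field
    }
    if buf.len() < header_len {
        return None;
    }
    Some(FrameHeader { fin, kind, masked, payload_len, header_len })
}
```

Under these assumptions, an unmasked server Ping with no payload is the two bytes `0x89 0x00`, while an ~80B `hostname` command would start with `0x81 0x50`. Any middlebox or parser that branches on the opcode nibble can therefore treat the two classes differently, which is the separation the Ping/Pong test is meant to expose.
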
**Outcome-based path:**

* **Pong Received:** Inspect Nginx/Cloudflare for "Web Application Firewall" (WAF) drops or "WebSocket inspection" settings.
* **Pong NOT Received:** The agent is deadlocked. Investigate the `tokio` task management in the Agent's 0.6.63 read loop.
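
The outcome-based path can be encoded so that the server-side probe logs an unambiguous result. The following sketch assumes the server records when the Ping was sent and when a matching Pong (if any) arrived, as offsets from a common monotonic start. `ProbeOutcome`, `evaluate`, `suspects`, and the timeout parameter are hypothetical names and values, not part of the existing server.

```rust
use std::time::Duration;

/// Result of one server-initiated Ping probe against a stuck agent.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ProbeOutcome {
    /// A Pong arrived within the timeout.
    PongReceived { rtt: Duration },
    /// The timeout elapsed without a usable Pong.
    PongMissing,
    /// Still inside the timeout window, no Pong yet.
    Pending,
}

/// Classify a probe. All times are offsets from the same monotonic start.
/// `timeout` is an assumed, operator-chosen value.
pub fn evaluate(
    sent_at: Duration,
    pong_at: Option<Duration>,
    now: Duration,
    timeout: Duration,
) -> ProbeOutcome {
    match pong_at {
        // A Pong that arrived in time counts, however late we evaluate it.
        Some(t) if t >= sent_at && t - sent_at <= timeout => {
            ProbeOutcome::PongReceived { rtt: t - sent_at }
        }
        // No timely Pong and the window has closed.
        _ if now.saturating_sub(sent_at) > timeout => ProbeOutcome::PongMissing,
        _ => ProbeOutcome::Pending,
    }
}

/// Hypotheses the dossier associates with each outcome.
pub fn suspects(outcome: ProbeOutcome) -> &'static [&'static str] {
    match outcome {
        // Control frames pass: look at Text-frame handling (H3/H4).
        ProbeOutcome::PongReceived { .. } => &["H3", "H4"],
        // Nothing comes back: agent-side Rx loop deadlock (H2).
        ProbeOutcome::PongMissing => &["H2"],
        // No conclusion yet.
        ProbeOutcome::Pending => &[],
    }
}
```

Logging the `ProbeOutcome` alongside the agent ID and connection ID would make the result of the test attributable to a specific socket, which matters because T8 shows the failure persisting across reconnects.
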