sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
This commit is contained in:
2026-06-12 05:58:05 -07:00
parent d1e02293c5
commit fd99ee327c
47 changed files with 1825 additions and 1 deletions

34
tmp/r1-gemini.txt Normal file
View File

@@ -0,0 +1,34 @@
# GuruRMM — agent command delivery failure: diagnostic dossier (neutral)
The evidence suggests a highly selective delivery failure where **outbound (Server -> Agent) WebSocket Text frames** are being dropped, mangled, or ignored, while **WebSocket Control frames (Ping)** and **inbound (Agent -> Server) traffic** remain functional.
### 1. Candidate Root-Cause Hypotheses
| Hypothesis | Reasoning / Evidence FOR | Reasoning / Evidence AGAINST |
| :--- | :--- | :--- |
| **H1: MTU / Fragmentation (Large Frame Blackhole)** | Common in site-to-site tunnels (UniFi UDR). Large commands (4KB) + overhead might exceed path MTU, causing fragmentation. If the path drops fragments, the Text frame never arrives. | **Refuted by T2/T3/T9:** Tiny `hostname` commands (~80B) also failed. MTU issues usually spare small packets. |
| **H2: Application-Layer Deadlock (Agent Rx Loop)** | If the Agent's `handle_server_message` or the internal `tx` (to the executor) blocks or panics without crashing the read loop, the agent stops processing Text frames but continues to receive Pings (keeping the 90s deadline alive). | **Weakened by T7/T8:** A server-forced eviction and fresh reconnect (fresh Agent state) did not restore functionality. |
| **H3: Intermediate Proxy / Firewall (Text-Frame Filtering)** | Cloudflare or a local "Security Gateway" (UniFi) may be inspecting WebSocket payloads. If the Command JSON pattern matches a signature or if the `Upgrade` handshake is being "half-policed," Text frames might be dropped. | The fleet baseline (40 agents working) suggests it's not a global Cloudflare issue, but could be specific to "enhanced" security profiles at certain sites. |
| **H4: Protocol Desync (ConfigUpdate Serialization)** | On connect, the Server sends a `ConfigUpdate` (multi-KB). If this frame is malformed or if `tokio-tungstenite` on the server-side has a buffer issue with specific frame sequences, the Agent's parser might be stuck/errored for that specific stream. | **Supported by T8:** T8 confirms that even after a reconnect, delivery fails. This points to a failure during or immediately after the handshake/ConfigUpdate sequence. |
| **H5: Nginx/Proxy Buffer "Clogging"** | `proxy_read_timeout` is long, but `proxy_buffers` or internal memory limits on the origin/nginx could be hit for specific connections if the socket becomes "write-blocked" from the origin to Nginx. | **Against:** Nginx logs would typically show 499s or buffer overflows; `send_to` on the server returns `success`, meaning it's reaching the OS/loopback buffer. |
---
### 2. The Verdict on the Evidence
* **Inference Check:** The claim that "the agent was receiving SOME inbound frames" (T4) is an **unverified inference**. While the 90s timeout didn't trigger, we cannot distinguish between "the agent received a Ping" and "the agent's internal timer is somehow stalled/incorrect."
* **Contradiction:** T8 is the most troubling. If a fresh connection (after eviction) doesn't fix it, the issue is either **environmental (Network/Path)** or **intrinsic to the Agent's 0.6.63 startup logic** (e.g., it crashes its own Text handler immediately upon receiving the first `ConfigUpdate` or `Command`).
---
### 3. The SINGLE Most Diagnostic Test
**Action:** Dispatch a **WebSocket Ping from the Server** to a stuck agent and **explicitly log the receipt of the corresponding Pong**.
**Reasoning:**
* **If Pong is received:** We confirm the path is fully transparent to WebSocket *Control* frames. This isolates the failure to **Text-frame serialization, size, or content inspection** (H3/H4).
* **If Pong is NOT received:** The agent is "ghosting" the server entirely. This means the 90s timeout in the agent (T4) is the actual failure point — the agent *should* have reconnected but didn't. This points to an **Agent-side Rx loop deadlock** (H2).
**Outcome-based path:**
* **Pong Received:** Inspect Nginx/Cloudflare for "Web Application Firewall" (WAF) drops or "WebSocket inspection" settings.
* **Pong NOT Received:** The agent is dead-locked. Investigate the `tokio` task management in the Agent's 0.6.63 read loop.