claudetools/tmp/rmm-diag-dossier.md

# GuruRMM — agent command delivery failure: diagnostic dossier (neutral)

You are one of two independent AI models asked to help diagnose a distributed-systems
issue. Please reason from the raw facts below. Do NOT assume any particular root cause is
already established — several plausible mechanisms exist and we want them all on the table.

At the end, give:
1. Your ranked list of candidate root-cause hypotheses (with the reasoning/evidence for and against each).
2. The SINGLE most diagnostic test or piece of data you want next (something concrete we can run),
   and what each possible outcome would tell you.

Be skeptical and specific. Call out where the evidence is contradictory or where a stated
"fact" might actually be an unverified inference.

---

## 1. System architecture (factual)

- **GuruRMM**: a remote monitoring/management system. A central server manages ~200 agents.
- **Agent**: Rust binary (current fleet version 0.6.63). Runs as a service on Windows/macOS/Linux
  endpoints. Connects OUTBOUND to the server over a WebSocket (WSS).
- **Ingress path** for the agent WebSocket:
  `agent  ->  (endpoint's LAN/NAT/router)  ->  Internet  ->  Cloudflare edge (rmm.azcomputerguru.com, 104.26.x / 2606:4700)  ->  origin host 172.16.3.30  ->  nginx (TLS :443)  ->  proxy_pass http://127.0.0.1:3001 (the Rust server)`
  - nginx `/ws` location config: `proxy_read_timeout 86400;` plus the standard `Upgrade`/`Connection` headers for WebSocket.
  - The origin server (172.16.3.30) has NO direct TCP connection from any endpoint's WAN IP; all agent
    sockets terminate at nginx, which presents them to the Rust server as loopback (127.0.0.1) connections.
- **Server**: Rust (axum + tokio + tokio-tungstenite). Listens 0.0.0.0:3001. One async task per agent connection.

## 2. The agent<->server protocol (factual, from source)

Server -> agent messages (all JSON over WS Text frames) include: AuthAck, ConfigUpdate, Command,
Update, CancelCommand, plus WS-level Ping frames.

Agent -> server messages (JSON over WS Text frames) include: Auth, Heartbeat, Metrics, CommandResult,
CommandAck, UpdateResult, plus WS-level Pong frames (in response to server Ping).

Connection lifecycle:
- Agent connects, sends Auth. Server replies AuthAck, then sends a ConfigUpdate (policy; a multi-hundred-byte
  to few-KB JSON Text frame), then re-offers any pending commands.
- Server runs a per-connection send task that: (a) forwards queued server->agent messages to the socket,
  and (b) sends a WS **Ping** frame every **30 seconds** (NAT keepalive).
- Agent read loop: on EVERY inbound frame it resets a deadline. If it receives **no inbound frame for 90s**,
  it declares the connection dead and reconnects (jittered). It responds to server WS Ping with WS Pong.
- Agent sends a Heartbeat (Text) and Metrics on their own cadences (Heartbeat roughly every ~2 min based on
  server logs; Metrics more often). The server updates the agent's `last_seen` on inbound messages.

Command delivery + durability (shipped earlier today, "comms-durability Phase 1"):
- Command lifecycle in DB: `pending` -> `running` (on dispatch; `delivery_attempts` increments each dispatch)
  -> server stamps `acked_at` when the agent sends **CommandAck** -> `completed`/`failed` on result.
- Agent (0.6.63) sends **CommandAck on RECEIPT of a command, BEFORE executing it** (first thing in its
  command handler), then dedups (won't re-run a command id it's already running/done).
- Server reaper runs every 60s: (1) returns un-acked `running` commands older than 60s to `pending`
  (re-delivery, never fail) for ACK-capable agents; (2) re-offers pending commands on every agent Heartbeat
  AND on (re)connect; (3) fails a command only after 10 delivery attempts (undeliverable) or a real post-ACK
  execution timeout.
- `send_to(agent)` = look up the agent's sender channel in an in-memory map and `tx.send(msg)`; returns true
  if the channel accepted the message.

## 3. The symptom

For a SUBSET of agents (confirmed: PST-SERVER, PST-SERVER2 — same physical site, behind a UniFi UDR
"Cloudflare-Ultra"-class gateway; and GTS-PEDRO-H — a DIFFERENT site), server->agent COMMANDS are not
executed: the command sits in `running`, `acked_at` never gets set, `delivery_attempts` climbs, and it
eventually fails. Meanwhile the SAME agents keep heartbeating and the server keeps showing them online.

Most of the fleet is UNAFFECTED: in the last 6 hours 397 commands completed across 40 distinct agents,
most recent completion timestamped AFTER the symptom window.

## 4. Per-agent observations (raw)

### PST-SERVER (agent id 87293069..., v0.6.63)
- `agent_events` shows it went `offline` at 00:22:00 then `online` at 00:22:03, and then NO further
  online/offline events for the next 30+ minutes (a single continuous connection from the server's view).
- Throughout that window the server's `last_seen` for it stayed current to the second (i.e. the server
  kept receiving inbound messages — heartbeats/metrics — from the agent).
- Commands dispatched in that window:
  - 22:23:40 (before the 00:22 reconnect, on a prior connection): `hostname; whoami` -> **completed**,
    `acked=true`, delivery_attempts=1, ack latency 16 ms. (This was minutes after the agent updated to 0.6.63.)
  - 00:29:17: `powershell.exe -NonInteractive -EncodedCommand <~4 KB base64>` -> never acked, attempts climbed
    to 10, failed.
  - 00:38, 00:47 etc.: same large command shape -> never acked.
  - A **tiny** `hostname` command (~80-byte Text frame) dispatched in the same period -> also never acked.
- Server logs during the stuck window, repeating every cycle:
  `Reaper: re-delivered 1 un-acked command(s)` then `Re-offering 1 pending command(s) to agent 87293069...`.
  The re-offer's `send_to` returns success (no "failed to enqueue" warning).

### PST-SERVER2 (agent id 5d2d7ba0..., v0.6.63, same site)
- Has NEVER successfully acked a command (no `acked_at` ever recorded). Commands fail as "undeliverable".
- `last_seen` is current (server receives its inbound heartbeats/metrics).

### GTS-PEDRO-H (agent id 2f1499f8..., v0.6.63, different site)
- 2026-06-12 01:43: `hostname` (tiny) -> **failed**, acked=false, attempts=1 (failed at the 60s command timeout).
- Last prior successful command: 2026-06-06 (6 days earlier; those completed but with acked=false — note those
  predate the CommandAck feature so acked=false then is expected). No commands attempted between 06-06 and 06-12.

### Fleet baseline
- 397 commands `completed` in the last 6h across 40 distinct agents; most recent completion 01:44:43Z.

## 5. Tests performed and their RAW results

T1. Repeatedly dispatched the large (~4 KB) recon command to PST-SERVER. Result: never acked; attempts 1->10;
    failed with stderr "Command undeliverable — agent unreachable" or "Command timed out".
T2. Dispatched a tiny `hostname` (~80 B Text frame) to PST-SERVER while it was in the stuck state.
    Result: never acked; attempts climbed; did not execute.
T3. Cancelled ALL pending/running commands for PST-SERVER (to clear any queue/head-of-line buildup), waited 35s,
    then dispatched a single tiny `hostname`. Result: still never acked.
T4. Verified PST-SERVER's `agent_events`: exactly one online (00:22:03) over 30+ minutes — i.e. the agent did
    NOT hit its own 90s "no inbound" reconnect during the stuck window (implication: it was receiving SOME
    inbound frames, at least the 30s server Pings, within every 90s window).
T5. Verified the server keeps receiving the agent's inbound: `last_seen` current; the per-Heartbeat
    "Re-offering pending command" log fires, which is triggered by an inbound Heartbeat from the agent.
T6. Inspected ingress: nginx fronts the server; Cloudflare fronts nginx; `proxy_read_timeout 86400` on `/ws`.
    No direct WAN socket from the endpoint on the origin (all agent sockets are loopback-from-nginx there).
T7. Deployed a SERVER change: a background task that force-closes (evicts) a connection once one of its
    un-acked commands has been re-offered >= 3 times, to force the agent to reconnect onto a new connection.
    Verified from logs that eviction fired and the agent reconnected ~12s later:
      `Reaper: evicted non-delivering connection for agent 87293069 (forcing reconnect)`
      `... connection evicted ... closing to force a fresh reconnect`
      `ws: interrupted 1 orphaned commands for agent 87293069`  (reconnect handshake)
      `Sent ConfigUpdate to agent 87293069`                      (reconnect handshake)
      `Agent 87293069 reconnected during grace period`
    So eviction -> reconnect works mechanically.
T8. On the FRESH post-reconnect connection, dispatched a tiny `hostname`. Result: STILL never acked
    (attempts climbed 1->2, no ack). (NOTE: "Sent ConfigUpdate" in T7 only means the server WROTE the frame to
    its socket; we have NO independent confirmation the agent RECEIVED the ConfigUpdate.)
T9. Control test: dispatched a tiny `hostname` to GTS-PEDRO-H (non-PST, 0.6.63). Result: failed, never acked.
T10. Fleet check: 40 distinct agents have completed commands in the last 6h; delivery is NOT universally broken.

## 6. Relevant code excerpts (factual)

Server per-connection send task (one per agent):
```rust
let mut send_task = tokio::spawn(async move {
    let mut ping_interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        tokio::select! {
            msg_opt = rx.recv() => { match msg_opt {
                Some(msg) => { if sender.send(Message::Text(json)).await.is_err() { break; } }
                None => break, } }
            _ = ping_interval.tick() => { if sender.send(Message::Ping(vec![])).await.is_err() { break; } }
        }
    }
});
```

Server receive loop (same connection): handles agent Text via handler; on `Message::Ping` replies an Ack;
`Message::Pong` and other frames fall through to `_ => {}` (ignored).

Agent read loop:
```rust
Some(msg_result) = read.next() => {
    last_incoming = Instant::now();                  // reset on EVERY inbound frame
    match msg_result {
        Ok(Message::Text(text)) => { handle_server_message(&text, ...).await }  // Command handled here -> sends CommandAck first
        Ok(Message::Ping(data)) => { write.send(Message::Pong(data)).await }
        Ok(Message::Pong(_))    => { /* debug */ }
        Ok(Message::Close(_))   => break Ok(()),
        Err(e)                  => break Err(e),
    }
}
_ = sleep_until(last_incoming + Duration::from_secs(90)) => {   // no inbound for 90s
    break Err(anyhow!("Connection timeout: no server activity"));
}
```

Agent command handler (executor), first action on receiving a Command:
```rust
// runs as soon as a ServerMessage::Command is parsed:
tx.send(AgentMessage::CommandAck { command_id }).await   // ACK on receipt, BEFORE executing
// then dedup (cached/in-flight) and execute
```

## 7. What we explicitly do NOT yet know
- Whether the agent at PST-SERVER is actually RECEIVING the server's Command Text frames (we only know it is
  NOT sending CommandAck, and is NOT executing them).
- Whether the agent is receiving the server's 30s Ping frames (inferred ONLY from "it didn't reconnect in 90s").
- Whether the server's ConfigUpdate frame on reconnect actually reached the agent.
- We have NO packet capture and NO agent-side logs (we cannot currently reach these endpoints to pull logs,
  because the only remote-control channel IS the thing that's broken).

Please diagnose.