182 lines
11 KiB
Markdown
182 lines
11 KiB
Markdown
# GuruRMM — agent command delivery failure: diagnostic dossier (neutral)
|
|
|
|
You are one of two independent AI models asked to help diagnose a distributed-systems
|
|
issue. Please reason from the raw facts below. Do NOT assume any particular root cause is
|
|
already established — several plausible mechanisms exist and we want them all on the table.
|
|
|
|
At the end, give:
|
|
1. Your ranked list of candidate root-cause hypotheses (with the reasoning/evidence for and against each).
|
|
2. The SINGLE most diagnostic test or piece of data you want next (something concrete we can run),
|
|
and what each possible outcome would tell you.
|
|
|
|
Be skeptical and specific. Call out where the evidence is contradictory or where a stated
|
|
"fact" might actually be an unverified inference.
|
|
|
|
---
|
|
|
|
## 1. System architecture (factual)
|
|
|
|
- **GuruRMM**: a remote monitoring/management system. A central server manages ~200 agents.
|
|
- **Agent**: Rust binary (current fleet version 0.6.63). Runs as a service on Windows/macOS/Linux
|
|
endpoints. Connects OUTBOUND to the server over a WebSocket (WSS).
|
|
- **Ingress path** for the agent WebSocket:
|
|
`agent -> (endpoint's LAN/NAT/router) -> Internet -> Cloudflare edge (rmm.azcomputerguru.com, 104.26.x / 2606:4700) -> origin host 172.16.3.30 -> nginx (TLS :443) -> proxy_pass http://127.0.0.1:3001 (the Rust server)`
|
|
- nginx `/ws` location config: `proxy_read_timeout 86400;` plus the standard `Upgrade`/`Connection` headers for WebSocket.
|
|
- The origin server (172.16.3.30) has NO direct TCP connection from any endpoint's WAN IP; all agent
|
|
sockets terminate at nginx, which presents them to the Rust server as loopback (127.0.0.1) connections.
|
|
- **Server**: Rust (axum + tokio + tokio-tungstenite). Listens 0.0.0.0:3001. One async task per agent connection.
|
|
|
|
## 2. The agent<->server protocol (factual, from source)
|
|
|
|
Server -> agent messages (all JSON over WS Text frames) include: AuthAck, ConfigUpdate, Command,
|
|
Update, CancelCommand, plus WS-level Ping frames.
|
|
|
|
Agent -> server messages (JSON over WS Text frames) include: Auth, Heartbeat, Metrics, CommandResult,
|
|
CommandAck, UpdateResult, plus WS-level Pong frames (in response to server Ping).
|
|
|
|
Connection lifecycle:
|
|
- Agent connects, sends Auth. Server replies AuthAck, then sends a ConfigUpdate (policy; a multi-hundred-byte
|
|
to few-KB JSON Text frame), then re-offers any pending commands.
|
|
- Server runs a per-connection send task that: (a) forwards queued server->agent messages to the socket,
|
|
and (b) sends a WS **Ping** frame every **30 seconds** (NAT keepalive).
|
|
- Agent read loop: on EVERY inbound frame it resets a deadline. If it receives **no inbound frame for 90s**,
|
|
it declares the connection dead and reconnects (jittered). It responds to server WS Ping with WS Pong.
|
|
- Agent sends a Heartbeat (Text) and Metrics on their own cadences (Heartbeat roughly every ~2 min based on
|
|
server logs; Metrics more often). The server updates the agent's `last_seen` on inbound messages.
|
|
|
|
Command delivery + durability (shipped earlier today, "comms-durability Phase 1"):
|
|
- Command lifecycle in DB: `pending` -> `running` (on dispatch; `delivery_attempts` increments each dispatch)
|
|
-> server stamps `acked_at` when the agent sends **CommandAck** -> `completed`/`failed` on result.
|
|
- Agent (0.6.63) sends **CommandAck on RECEIPT of a command, BEFORE executing it** (first thing in its
|
|
command handler), then dedups (won't re-run a command id it's already running/done).
|
|
- Server reaper runs every 60s: (1) returns un-acked `running` commands older than 60s to `pending`
|
|
(re-delivery, never fail) for ACK-capable agents; (2) re-offers pending commands on every agent Heartbeat
|
|
AND on (re)connect; (3) fails a command only after 10 delivery attempts (undeliverable) or a real post-ACK
|
|
execution timeout.
|
|
- `send_to(agent)` = look up the agent's sender channel in an in-memory map and `tx.send(msg)`; returns true
|
|
if the channel accepted the message.
|
|
|
|
## 3. The symptom
|
|
|
|
For a SUBSET of agents (confirmed: PST-SERVER, PST-SERVER2 — same physical site, behind a UniFi UDR
|
|
"Cloudflare-Ultra"-class gateway; and GTS-PEDRO-H — a DIFFERENT site), server->agent COMMANDS are not
|
|
executed: the command sits in `running`, `acked_at` never gets set, `delivery_attempts` climbs, and it
|
|
eventually fails. Meanwhile the SAME agents keep heartbeating and the server keeps showing them online.
|
|
|
|
Most of the fleet is UNAFFECTED: in the last 6 hours 397 commands completed across 40 distinct agents,
|
|
most recent completion timestamped AFTER the symptom window.
|
|
|
|
## 4. Per-agent observations (raw)
|
|
|
|
### PST-SERVER (agent id 87293069..., v0.6.63)
|
|
- `agent_events` shows it went `offline` at 00:22:00 then `online` at 00:22:03, and then NO further
|
|
online/offline events for the next 30+ minutes (a single continuous connection from the server's view).
|
|
- Throughout that window the server's `last_seen` for it stayed current to the second (i.e. the server
|
|
kept receiving inbound messages — heartbeats/metrics — from the agent).
|
|
- Commands dispatched in that window:
|
|
- 22:23:40 (before the 00:22 reconnect, on a prior connection): `hostname; whoami` -> **completed**,
|
|
`acked=true`, delivery_attempts=1, ack latency 16 ms. (This was minutes after the agent updated to 0.6.63.)
|
|
- 00:29:17: `powershell.exe -NonInteractive -EncodedCommand <~4 KB base64>` -> never acked, attempts climbed
|
|
to 10, failed.
|
|
- 00:38, 00:47 etc.: same large command shape -> never acked.
|
|
- A **tiny** `hostname` command (~80-byte Text frame) dispatched in the same period -> also never acked.
|
|
- Server logs during the stuck window, repeating every cycle:
|
|
`Reaper: re-delivered 1 un-acked command(s)` then `Re-offering 1 pending command(s) to agent 87293069...`.
|
|
The re-offer's `send_to` returns success (no "failed to enqueue" warning).
|
|
|
|
### PST-SERVER2 (agent id 5d2d7ba0..., v0.6.63, same site)
|
|
- Has NEVER successfully acked a command (no `acked_at` ever recorded). Commands fail as "undeliverable".
|
|
- `last_seen` is current (server receives its inbound heartbeats/metrics).
|
|
|
|
### GTS-PEDRO-H (agent id 2f1499f8..., v0.6.63, different site)
|
|
- 2026-06-12 01:43: `hostname` (tiny) -> **failed**, acked=false, attempts=1 (failed at the 60s command timeout).
|
|
- Last prior successful command: 2026-06-06 (6 days earlier; those completed but with acked=false — note those
|
|
predate the CommandAck feature so acked=false then is expected). No commands attempted between 06-06 and 06-12.
|
|
|
|
### Fleet baseline
|
|
- 397 commands `completed` in the last 6h across 40 distinct agents; most recent completion 01:44:43Z.
|
|
|
|
## 5. Tests performed and their RAW results
|
|
|
|
T1. Repeatedly dispatched the large (~4 KB) recon command to PST-SERVER. Result: never acked; attempts 1->10;
|
|
failed with stderr "Command undeliverable — agent unreachable" or "Command timed out".
|
|
T2. Dispatched a tiny `hostname` (~80 B Text frame) to PST-SERVER while it was in the stuck state.
|
|
Result: never acked; attempts climbed; did not execute.
|
|
T3. Cancelled ALL pending/running commands for PST-SERVER (to clear any queue/head-of-line buildup), waited 35s,
|
|
then dispatched a single tiny `hostname`. Result: still never acked.
|
|
T4. Verified PST-SERVER's `agent_events`: exactly one online (00:22:03) over 30+ minutes — i.e. the agent did
|
|
NOT hit its own 90s "no inbound" reconnect during the stuck window (implication: it was receiving SOME
|
|
inbound frames, at least the 30s server Pings, within every 90s window).
|
|
T5. Verified the server keeps receiving the agent's inbound: `last_seen` current; the per-Heartbeat
|
|
"Re-offering pending command" log fires, which is triggered by an inbound Heartbeat from the agent.
|
|
T6. Inspected ingress: nginx fronts the server; Cloudflare fronts nginx; `proxy_read_timeout 86400` on `/ws`.
|
|
No direct WAN socket from the endpoint on the origin (all agent sockets are loopback-from-nginx there).
|
|
T7. Deployed a SERVER change: a background task that force-closes (evicts) a connection once one of its
|
|
un-acked commands has been re-offered >= 3 times, to force the agent to reconnect onto a new connection.
|
|
Verified from logs that eviction fired and the agent reconnected ~12s later:
|
|
`Reaper: evicted non-delivering connection for agent 87293069 (forcing reconnect)`
|
|
`... connection evicted ... closing to force a fresh reconnect`
|
|
`ws: interrupted 1 orphaned commands for agent 87293069` (reconnect handshake)
|
|
`Sent ConfigUpdate to agent 87293069` (reconnect handshake)
|
|
`Agent 87293069 reconnected during grace period`
|
|
So eviction -> reconnect works mechanically.
|
|
T8. On the FRESH post-reconnect connection, dispatched a tiny `hostname`. Result: STILL never acked
|
|
(attempts climbed 1->2, no ack). (NOTE: "Sent ConfigUpdate" in T7 only means the server WROTE the frame to
|
|
its socket; we have NO independent confirmation the agent RECEIVED the ConfigUpdate.)
|
|
T9. Control test: dispatched a tiny `hostname` to GTS-PEDRO-H (non-PST, 0.6.63). Result: failed, never acked.
|
|
T10. Fleet check: 40 distinct agents have completed commands in the last 6h; delivery is NOT universally broken.
|
|
|
|
## 6. Relevant code excerpts (factual)
|
|
|
|
Server per-connection send task (one per agent):
|
|
```rust
|
|
let mut send_task = tokio::spawn(async move {
|
|
let mut ping_interval = tokio::time::interval(Duration::from_secs(30));
|
|
loop {
|
|
tokio::select! {
|
|
msg_opt = rx.recv() => { match msg_opt {
|
|
Some(msg) => { if sender.send(Message::Text(json)).await.is_err() { break; } }
|
|
None => break, } }
|
|
_ = ping_interval.tick() => { if sender.send(Message::Ping(vec![])).await.is_err() { break; } }
|
|
}
|
|
}
|
|
});
|
|
```
|
|
|
|
Server receive loop (same connection): handles agent Text via handler; on `Message::Ping` replies an Ack;
|
|
`Message::Pong` and other frames fall through to `_ => {}` (ignored).
|
|
|
|
Agent read loop:
|
|
```rust
|
|
Some(msg_result) = read.next() => {
|
|
last_incoming = Instant::now(); // reset on EVERY inbound frame
|
|
match msg_result {
|
|
Ok(Message::Text(text)) => { handle_server_message(&text, ...).await } // Command handled here -> sends CommandAck first
|
|
Ok(Message::Ping(data)) => { write.send(Message::Pong(data)).await }
|
|
Ok(Message::Pong(_)) => { /* debug */ }
|
|
Ok(Message::Close(_)) => break Ok(()),
|
|
Err(e) => break Err(e),
|
|
}
|
|
}
|
|
_ = sleep_until(last_incoming + Duration::from_secs(90)) => { // no inbound for 90s
|
|
break Err(anyhow!("Connection timeout: no server activity"));
|
|
}
|
|
```
|
|
|
|
Agent command handler (executor), first action on receiving a Command:
|
|
```rust
|
|
// runs as soon as a ServerMessage::Command is parsed:
|
|
tx.send(AgentMessage::CommandAck { command_id }).await // ACK on receipt, BEFORE executing
|
|
// then dedup (cached/in-flight) and execute
|
|
```
|
|
|
|
## 7. What we explicitly do NOT yet know
|
|
- Whether the agent at PST-SERVER is actually RECEIVING the server's Command Text frames (we only know it is
|
|
NOT sending CommandAck, and is NOT executing them).
|
|
- Whether the agent is receiving the server's 30s Ping frames (inferred ONLY from "it didn't reconnect in 90s").
|
|
- Whether the server's ConfigUpdate frame on reconnect actually reached the agent.
|
|
- We have NO packet capture and NO agent-side logs (we cannot currently reach these endpoints to pull logs,
|
|
because the only remote-control channel IS the thing that's broken).
|
|
|
|
Please diagnose.
|