claudetools/tmp/rmm-diag-round7.md
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00


# GuruRMM command-delivery diagnosis — ROUND 7 (SNAT test result)
Quorum continuing. The operator's SNAT/return-path hypothesis was tested directly. Result below — please
give your final verdict and the recommended action plan.
## Recap of the converging picture
- agent -> site gateway -> internet -> pfSense (multi-WAN, IP-alias 72.194.62.10 on igc0/"wan") -> DNAT to
NPM (.20) -> origin nginx (.30) -> Rust :3001. Affected: PST-SERVER/PST-SERVER2/GTS-PEDRO-H (3 ISPs), v0.6.63.
- Packet capture proved: command Text frame leaves Rust, leaves origin nginx, is forwarded by NPM to the agent,
and is TCP-ACKed by the agent with NO retransmits — i.e. the bytes reach the agent's kernel — yet the agent
never emits a CommandAck or executes, while still answering the server's pings on the same connection.
- It worked for exactly ONE command at 22:23, immediately after the agent process restarted into 0.6.63
(this morning's "comms-durability slice B" added CommandAck-on-receipt + a dedup/results cache to the agent's
command path). A WS reconnect does NOT fix it (server restart + a forced eviction both gave fresh WS
connections, commands still failed). Only the agent PROCESS restart (the update) ever yielded a success.
- Agent code note: the agent's command handler is inline in the read loop (handle_server_message().await),
sending CommandAck as its FIRST line before spawning the run task; a parse error at
serde_json::from_str::<ServerMessage>() is logged-and-ignored, so the read loop survives and keeps answering
pings. So "no CommandAck at all" implies the failure is at/just-after parse/dispatch, not a dead long-lived task.
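The read-loop shape in the agent code note can be sketched as follows. This is a minimal std-only model, not the agent's actual code: the `kind|id|payload` wire format, the `parse` function (standing in for `serde_json::from_str::<ServerMessage>()`), the enum variants, and `read_loop` are all assumptions made for illustration. It shows only that a logged-and-ignored parse failure is consistent with the observed symptom (pings answered, no CommandAck); it does not establish that a parse failure is the actual cause.

```rust
// Hypothetical model of the agent read loop; all names are illustrative.
#[derive(Debug, PartialEq)]
enum ServerMessage {
    Ping,
    Command { id: u32, line: String },
}

#[derive(Debug, PartialEq)]
enum AgentReply {
    Pong,
    CommandAck(u32),
}

// Stand-in for serde_json::from_str::<ServerMessage>(), using a toy
// "kind|id|payload" text format instead of JSON (assumption).
fn parse(frame: &str) -> Result<ServerMessage, String> {
    let mut parts = frame.splitn(3, '|');
    match parts.next() {
        Some("ping") => Ok(ServerMessage::Ping),
        Some("command") => {
            let id = parts
                .next()
                .ok_or("missing id")?
                .parse::<u32>()
                .map_err(|e| e.to_string())?;
            let line = parts.next().ok_or("missing payload")?.to_string();
            Ok(ServerMessage::Command { id, line })
        }
        other => Err(format!("unknown message kind: {:?}", other)),
    }
}

// Inline handler: ack first, then hand off the work. A parse error is
// logged and ignored, so the loop keeps running and keeps answering pings.
fn read_loop(frames: &[&str]) -> (Vec<AgentReply>, Vec<String>) {
    let mut replies = Vec::new();
    let mut log = Vec::new();
    for frame in frames {
        match parse(frame) {
            Ok(ServerMessage::Ping) => replies.push(AgentReply::Pong),
            Ok(ServerMessage::Command { id, line }) => {
                // CommandAck is the first action, before the run task starts.
                replies.push(AgentReply::CommandAck(id));
                log.push(format!("spawn run task for {}: {}", id, line));
            }
            Err(e) => log.push(format!("parse error ignored: {}", e)),
        }
    }
    (replies, log)
}

fn main() {
    // The middle frame fails to parse; both pings are still answered.
    let (replies, log) = read_loop(&["ping", "command|x7|hostname", "ping"]);
    println!("replies: {:?}", replies);
    println!("log: {:?}", log);
}
```

In this model the only agent-side trace of the lost command is the log line, which is why agent logs around the parse step are the most direct evidence to look for.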
## THE SNAT TEST (operator's hypothesis, now tested directly)
- Added a pfSense outbound-NAT (SNAT) rule pinning NPM source: `nat on igc0 inet from 172.16.3.20 to any ->
72.194.62.10`. Verified active in `pfctl -sn`. Fleet stayed healthy (149 agents online).
- Killed PST-SERVER's pfSense states (forcing a brand-new TCP/WS connection under the new SNAT rule). PST
reconnected (last_seen fresh).
- Dispatched a fresh `hostname` command on that new connection.
- RESULT: still no CommandAck from the agent (delivery_attempts climbing, no execution); identical symptom.
- Caveat we want you to weigh: pfSense handles replies on an INBOUND DNAT connection via reply-to/state, which
typically BYPASSES outbound NAT. So this SNAT rule may only affect NPM-INITIATED outbound connections, not the
server->agent reply path; i.e. the test may not have actually changed the reply source IP. However, the
earlier capture already showed the agent TCP-ACKing the server->agent frames (so the reply source was already
acceptable to the agent).
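The caveat can be illustrated with a deliberately simplified model of a stateful NAT firewall. This is a sketch under stated assumptions, not pf's real data structures or rule evaluation: the `Firewall` struct, its methods, and the remote addresses (`203.0.113.5`, `198.51.100.9`, both documentation ranges) are hypothetical. It models one behaviour only: a packet that matches an existing inbound state is reverse-translated from that state, and the outbound NAT rule is consulted only for connections NPM initiates itself.

```rust
use std::collections::HashSet;

// Simplified, hypothetical model; not pf's actual implementation.
struct Firewall {
    public_ip: &'static str, // IP alias on the WAN interface
    npm_ip: &'static str,    // internal DNAT target
    // (remote ip, remote port) pairs that have an inbound DNAT state.
    states: HashSet<(&'static str, u16)>,
    // How many times the outbound NAT (SNAT) rule was actually evaluated.
    snat_rule_hits: u32,
}

impl Firewall {
    fn new() -> Self {
        Firewall {
            public_ip: "72.194.62.10",
            npm_ip: "172.16.3.20",
            states: HashSet::new(),
            snat_rule_hits: 0,
        }
    }

    // An agent connects inbound to the public IP. DNAT creates a state and
    // returns the internal destination the packet is forwarded to.
    fn inbound_connect(&mut self, remote: &'static str, port: u16) -> &'static str {
        self.states.insert((remote, port));
        self.npm_ip
    }

    // NPM sends a packet to `remote`; returns the source IP the remote sees.
    fn outbound_from_npm(&mut self, remote: &'static str, port: u16) -> &'static str {
        if self.states.contains(&(remote, port)) {
            // Reply on an existing inbound state: reverse translation comes
            // from the state, and the SNAT rule is never evaluated.
            return self.public_ip;
        }
        // New NPM-initiated connection: the SNAT rule applies.
        self.snat_rule_hits += 1;
        self.public_ip
    }
}

fn main() {
    let mut fw = Firewall::new();
    fw.inbound_connect("203.0.113.5", 51000);
    let reply_src = fw.outbound_from_npm("203.0.113.5", 51000);
    println!("reply source: {}, SNAT hits: {}", reply_src, fw.snat_rule_hits);
    let new_src = fw.outbound_from_npm("198.51.100.9", 443);
    println!("new-conn source: {}, SNAT hits: {}", new_src, fw.snat_rule_hits);
}
```

Under this model the reply to the agent carries the same source IP with or without the new rule, which is the sense in which the SNAT test may not have exercised the reply path at all.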
## Questions (each model, concise)
1. Does the SNAT test result (no change), together with the capture showing commands reaching the agent and
being TCP-ACKed by it, fully close the network/SNAT/proxy hypothesis class? Or is there a residual network angle
the SNAT test did NOT actually exercise (given reply-to bypasses outbound NAT)?
2. Final verdict on the most probable root cause and its location, with confidence.
3. Concrete recommended action plan, ordered: (a) immediate recovery of the 3 agents, (b) how to pin the exact
0.6.63 agent bug given we cannot reach the agents (can we infer it from the slice-B code shape: CommandAck +
dedup cache + per-command spawned task, inline in the read loop?), (c) whether to keep or roll back the SNAT
rule, (d) anything else.
4. Is there any remaining LOW-RISK test that would add real information before we commit to "fix the agent"?