claudetools/tmp/rmm-diag-round7.md
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 7 (SNAT test result)

Quorum continuing. The operator's SNAT/return-path hypothesis was tested directly. Result below — please give your final verdict and the recommended action plan.

Recap of the converging picture

  • agent -> site gateway -> internet -> pfSense (multi-WAN, IP-alias 72.194.62.10 on igc0/"wan") -> DNAT to NPM (.20) -> origin nginx (.30) -> Rust :3001. Affected: PST-SERVER/PST-SERVER2/GTS-PEDRO-H (3 ISPs), v0.6.63.
  • Packet capture proved: command Text frame leaves Rust, leaves origin nginx, is forwarded by NPM to the agent, and is TCP-ACKed by the agent with NO retransmits — i.e. the bytes reach the agent's kernel — yet the agent never emits a CommandAck or executes, while still answering the server's pings on the same connection.
  • It worked for exactly ONE command at 22:23, immediately after the agent process restarted into 0.6.63 (this morning's "comms-durability slice B" added CommandAck-on-receipt + a dedup/results cache to the agent's command path). A WS reconnect does NOT fix it (server restart + a forced eviction both gave fresh WS connections, commands still failed). Only the agent PROCESS restart (the update) ever yielded a success.
  • Agent code note: the agent's command handler runs inline in the read loop (handle_server_message().await) and sends CommandAck as its FIRST line, before spawning the run task. A parse error at serde_json::from_str is logged and ignored, so the read loop survives and keeps answering pings. So "no CommandAck at all" implies the failure is at or just after parse/dispatch, not in a dead long-lived task.
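
To make the code-shape argument above concrete, here is a minimal synchronous sketch of a read loop with that shape. It is not the agent's code: `Agent`, `Outbound`, `ServerMessage`, `parse`, the frame format, and the placement of the dedup check after the ack are all assumptions made for illustration, and the real handler is async and uses serde_json. It only shows why a frame that fails to parse yields no CommandAck while pings on the same connection are still answered.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the agent's real types; all names are assumptions.
#[derive(Debug, PartialEq)]
enum Outbound {
    Pong,
    CommandAck(String), // carries the command id
}

enum ServerMessage {
    Ping,
    Command { id: String },
}

// Stand-in for the serde_json parse step: anything unrecognised is an error.
fn parse(frame: &str) -> Result<ServerMessage, String> {
    if frame == "ping" {
        Ok(ServerMessage::Ping)
    } else if let Some(id) = frame.strip_prefix("command:") {
        Ok(ServerMessage::Command { id: id.to_string() })
    } else {
        Err(format!("unrecognised frame: {}", frame))
    }
}

struct Agent {
    seen: HashSet<String>, // dedup cache (assumed shape)
    spawned: Vec<String>,  // ids handed off to run tasks
}

impl Agent {
    fn new() -> Self {
        Agent { seen: HashSet::new(), spawned: Vec::new() }
    }

    // One iteration of the read loop: returns what the agent would send back.
    fn on_frame(&mut self, frame: &str) -> Option<Outbound> {
        match parse(frame) {
            // Logged and ignored: the loop keeps running, but nothing is acked.
            Err(e) => {
                eprintln!("ignoring frame: {}", e);
                None
            }
            Ok(ServerMessage::Ping) => Some(Outbound::Pong),
            Ok(ServerMessage::Command { id }) => {
                // Ack first, then dedup and "spawn" (ordering of dedup is assumed).
                let ack = Outbound::CommandAck(id.clone());
                if self.seen.insert(id.clone()) {
                    self.spawned.push(id);
                }
                Some(ack)
            }
        }
    }
}
```

In this model, a frame that parses as a command is always acked, a frame that fails to parse produces nothing, and a later ping still gets a pong. That matches the observed symptom but does not by itself prove the parse step is where 0.6.63 fails.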

THE SNAT TEST (operator's hypothesis, now tested directly)

  • Added a pfSense outbound-NAT (SNAT) rule pinning NPM source: nat on igc0 inet from 172.16.3.20 to any -> 72.194.62.10. Verified active in pfctl -sn. Fleet stayed healthy (149 agents online).
  • Killed PST-SERVER's pfSense states (forcing a brand-new TCP/WS connection under the new SNAT rule). PST reconnected (last_seen fresh).
  • Dispatched a fresh hostname command on that new connection.
  • RESULT: still never ACKed (delivery_attempts climbing, no execution) — identical symptom.
  • Caveat we want you to weigh: pfSense replies to an INBOUND DNAT connection use reply-to/state, which typically BYPASSES outbound NAT. So this SNAT rule may only affect NPM-INITIATED outbound, not the server->agent reply path — i.e. the test may not have actually changed the reply source IP. However, the earlier capture already showed the agent ACKing the server->agent frames (so the reply source was already acceptable to the agent).
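
The caveat can be illustrated with a toy model of stateful translation. This is a sketch under stated assumptions, not pf itself: `Firewall`, `Flow`, `accept_inbound`, `add_snat_rule`, `egress_source`, and the `default_wan` fallback address are hypothetical. The only behaviour it encodes is that a packet matching an existing state is translated from that state, and outbound NAT rules are consulted only for connections with no state.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Toy model only; names and structure are illustrative assumptions.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Flow {
    client: Ipv4Addr,   // remote peer (the agent side)
    internal: Ipv4Addr, // internal host the connection was redirected to
}

struct Firewall {
    // State created when an inbound DNAT connection is accepted:
    // flow -> public address the client originally dialled.
    states: HashMap<Flow, Ipv4Addr>,
    // Outbound NAT rules: internal source -> translated source.
    snat_rules: HashMap<Ipv4Addr, Ipv4Addr>,
    // Source used when no outbound rule matches (assumed placeholder).
    default_wan: Ipv4Addr,
}

impl Firewall {
    fn new(default_wan: Ipv4Addr) -> Self {
        Firewall { states: HashMap::new(), snat_rules: HashMap::new(), default_wan }
    }

    // Inbound connection from `client` to `dialled`, redirected to `internal`.
    fn accept_inbound(&mut self, client: Ipv4Addr, dialled: Ipv4Addr, internal: Ipv4Addr) {
        self.states.insert(Flow { client, internal }, dialled);
    }

    fn add_snat_rule(&mut self, internal: Ipv4Addr, translated: Ipv4Addr) {
        self.snat_rules.insert(internal, translated);
    }

    // Source address the remote peer sees for a packet from `internal` to `client`.
    fn egress_source(&self, internal: Ipv4Addr, client: Ipv4Addr) -> Ipv4Addr {
        // An existing state wins; outbound NAT rules are never consulted.
        if let Some(public) = self.states.get(&Flow { client, internal }) {
            return *public;
        }
        // No state: this is a connection initiated by the internal host.
        self.snat_rules.get(&internal).copied().unwrap_or(self.default_wan)
    }
}
```

Under this model, adding the rule for 172.16.3.20 changes the source of NPM-initiated connections but leaves the reply source on an agent-initiated connection at the address the agent dialled, both before and after the rule. If pfSense behaves this way here, the test left the server->agent path unchanged.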

Questions (each model, concise)

  1. Does the SNAT test result (no change) — together with the capture showing commands reaching + being ACKed by the agent — fully close the network/SNAT/proxy hypothesis class? Or is there a residual network angle the SNAT test did NOT actually exercise (given reply-to bypasses outbound NAT)?
  2. Final verdict on the most probable root cause and its location, with confidence.
  3. Concrete recommended action plan, ordered: (a) immediate recovery of the 3 agents, (b) how to pin the exact 0.6.63 agent bug given we cannot reach the agents (can we infer it from the slice-B code shape: CommandAck + dedup cache + per-command spawned task, inline in the read loop?), (c) whether to keep or roll back the SNAT rule, (d) anything else.
  4. Is there any remaining LOW-RISK test that would add real information before we commit to "fix the agent"?