Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 7 (SNAT test result)

Quorum continuing. The operator's SNAT/return-path hypothesis was tested directly. Result below — please give your final verdict and the recommended action plan.

Recap of the converging picture

agent -> site gateway -> internet -> pfSense (multi-WAN, IP-alias 72.194.62.10 on igc0/"wan") -> DNAT to NPM (.20) -> origin nginx (.30) -> Rust :3001. Affected: PST-SERVER/PST-SERVER2/GTS-PEDRO-H (3 ISPs), v0.6.63.
Packet capture proved: command Text frame leaves Rust, leaves origin nginx, is forwarded by NPM to the agent, and is TCP-ACKed by the agent with NO retransmits — i.e. the bytes reach the agent's kernel — yet the agent never emits a CommandAck or executes, while still answering the server's pings on the same connection.
It worked for exactly ONE command at 22:23, immediately after the agent process restarted into 0.6.63 (this morning's "comms-durability slice B" added CommandAck-on-receipt + a dedup/results cache to the agent's command path). A WS reconnect does NOT fix it (server restart + a forced eviction both gave fresh WS connections, commands still failed). Only the agent PROCESS restart (the update) ever yielded a success.
Agent code note: the agent's command handler is inline in the read loop (handle_server_message().await), sending CommandAck as its FIRST line before spawning the run task; a parse error at serde_json::from_str is logged-and-ignored, so the read loop survives and keeps answering pings. So "no CommandAck at all" implies the failure is at/just-after parse/dispatch, not a dead long-lived task.
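
To make the inference in the agent code note concrete, here is a minimal std-only sketch of that control-flow shape. It is not the agent's code: the message enums, the toy "cmd:<id>:<body>" wire format, and the Agent struct are all hypothetical stand-ins (parse() stands in for the serde_json call, and pushing onto executed stands in for spawning the run task). It only shows why a frame that fails at parse would produce no CommandAck while pings on the same connection keep being answered.

```rust
/// Messages from the server (hypothetical names, not the real protocol).
#[derive(Debug, PartialEq)]
enum ServerMessage {
    Ping,
    Command { id: u64, body: String },
}

/// Messages the agent sends back (hypothetical names).
#[derive(Debug, PartialEq)]
enum AgentMessage {
    Pong,
    CommandAck { id: u64 },
}

/// Stand-in for the JSON parse step. Toy format: "ping" or "cmd:<id>:<body>".
fn parse(frame: &str) -> Result<ServerMessage, String> {
    if frame == "ping" {
        return Ok(ServerMessage::Ping);
    }
    let mut parts = frame.splitn(3, ':');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("cmd"), Some(id), Some(body)) => {
            let id = id.parse::<u64>().map_err(|e| e.to_string())?;
            Ok(ServerMessage::Command { id, body: body.to_string() })
        }
        _ => Err(format!("unrecognized frame: {}", frame)),
    }
}

struct Agent {
    /// Everything the agent would write back to the socket, in order.
    outbox: Vec<AgentMessage>,
    /// Commands handed to the (simulated) run task.
    executed: Vec<(u64, String)>,
}

impl Agent {
    fn new() -> Self {
        Agent { outbox: Vec::new(), executed: Vec::new() }
    }

    /// One iteration of the read loop, with the handler inline.
    fn handle_frame(&mut self, frame: &str) {
        let msg = match parse(frame) {
            Ok(msg) => msg,
            Err(err) => {
                // Logged and ignored: the loop survives, but nothing is acked.
                eprintln!("ignoring unparseable frame: {}", err);
                return;
            }
        };
        match msg {
            ServerMessage::Ping => self.outbox.push(AgentMessage::Pong),
            ServerMessage::Command { id, body } => {
                // Ack is the first action, before any execution starts.
                self.outbox.push(AgentMessage::CommandAck { id });
                // Stand-in for spawning the per-command run task.
                self.executed.push((id, body));
            }
        }
    }
}
```

Feeding this model "ping", then a command frame that fails to parse, then "ping" again leaves two Pong entries in the outbox and no CommandAck, which matches the observed symptom. A well-formed command frame produces the ack before anything is recorded as executed.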

THE SNAT TEST (operator's hypothesis, now tested directly)

Added a pfSense outbound-NAT (SNAT) rule pinning NPM source: nat on igc0 inet from 172.16.3.20 to any -> 72.194.62.10. Verified active in pfctl -sn. Fleet stayed healthy (149 agents online).
Killed PST-SERVER's pfSense states (forcing a brand-new TCP/WS connection under the new SNAT rule). PST reconnected (last_seen fresh).
Dispatched a fresh hostname command on that new connection.
RESULT: still never ACKed (delivery_attempts climbing, no execution) — identical symptom.
Caveat we want you to weigh: pfSense replies to an INBOUND DNAT connection use reply-to/state, which typically BYPASSES outbound NAT. So this SNAT rule may only affect NPM-INITIATED outbound, not the server->agent reply path — i.e. the test may not have actually changed the reply source IP. However, the earlier capture already showed the agent ACKing the server->agent frames (so the reply source was already acceptable to the agent).
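
The caveat can be illustrated with a toy model of a stateful NAT device. This is a simplified sketch, not pfSense internals: the Firewall struct, its fields, the agent address 203.0.113.7 and the fallback interface address 192.0.2.1 are assumptions for illustration only. The one behavior it models is that packets belonging to an existing inbound state are translated by that state, and outbound NAT rules are consulted only for connections the internal host initiates.

```rust
use std::collections::HashMap;

type Addr = &'static str;

/// Toy stateful NAT device. Field names are illustrative only.
struct Firewall {
    /// Inbound DNAT rules: public address -> internal host.
    dnat: HashMap<Addr, Addr>,
    /// Outbound NAT rules: internal source -> translated public source.
    outbound_nat: HashMap<Addr, Addr>,
    /// Source used when no outbound rule matches (hypothetical interface address).
    default_wan: Addr,
    /// Established states: (internal host, remote peer) -> public address the peer dialed.
    states: HashMap<(Addr, Addr), Addr>,
}

impl Firewall {
    fn new(default_wan: Addr) -> Self {
        Firewall {
            dnat: HashMap::new(),
            outbound_nat: HashMap::new(),
            default_wan,
            states: HashMap::new(),
        }
    }

    /// A remote peer connects to a public address; DNAT picks the internal
    /// host and a state is recorded for the connection.
    fn inbound(&mut self, peer: Addr, public: Addr) -> Option<Addr> {
        let internal = *self.dnat.get(public)?;
        self.states.insert((internal, peer), public);
        Some(internal)
    }

    /// Source address the remote peer sees on a packet from `internal`.
    fn egress_source(&self, internal: Addr, peer: Addr) -> Addr {
        // Reply on an existing inbound state: reverse the DNAT, skip outbound NAT.
        if let Some(public) = self.states.get(&(internal, peer)) {
            return *public;
        }
        // New connection initiated from inside: outbound NAT rules apply.
        self.outbound_nat.get(internal).copied().unwrap_or(self.default_wan)
    }
}
```

In this model, with a DNAT entry from 72.194.62.10 to 172.16.3.20 and an agent that connected inbound, the reply source seen by the agent is 72.194.62.10 both before and after an outbound rule for 172.16.3.20 is added. Only NPM-initiated traffic to other peers changes source. If real pfSense behaves this way for the agent connections, the SNAT test would not have altered the server->agent path at all.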

Questions (each model, concise)

Does the SNAT test result (no change) — together with the capture showing commands reaching + being ACKed by the agent — fully close the network/SNAT/proxy hypothesis class? Or is there a residual network angle the SNAT test did NOT actually exercise (given reply-to bypasses outbound NAT)?
Final verdict on the most probable root cause and its location, with confidence.
Concrete recommended action plan, ordered: (a) immediate recovery of the 3 agents, (b) how to pin the exact 0.6.63 agent bug given we cannot reach the agents (can we infer it from the slice-B code shape: CommandAck + dedup cache + per-command spawned task, inline in the read loop?), (c) whether to keep or roll back the SNAT rule, (d) anything else.
Is there any remaining LOW-RISK test that would add real information before we commit to "fix the agent"?
