# GuruRMM command-delivery diagnosis: ROUND 7 (SNAT test result)

Quorum continuing. The operator's SNAT/return-path hypothesis was tested directly; the result is below. Please
give your final verdict and the recommended action plan.

## Recap of the converging picture
- Path: agent -> site gateway -> internet -> pfSense (multi-WAN, IP alias 72.194.62.10 on igc0/"wan") -> DNAT to
NPM (.20) -> origin nginx (.30) -> Rust :3001. Affected: PST-SERVER/PST-SERVER2/GTS-PEDRO-H (3 ISPs), v0.6.63.
- Packet capture proved: the command Text frame leaves Rust, leaves origin nginx, is forwarded by NPM to the agent,
and is TCP-ACKed by the agent with NO retransmits (i.e. the bytes reach the agent's kernel), yet the agent
never emits a CommandAck or executes, while still answering the server's pings on the same connection.
- It worked for exactly ONE command at 22:23, immediately after the agent process restarted into 0.6.63
(this morning's "comms-durability slice B" added CommandAck-on-receipt + a dedup/results cache to the agent's
command path). A WS reconnect does NOT fix it (a server restart and a forced eviction both gave fresh WS
connections, and commands still failed). Only the agent PROCESS restart (the update) ever yielded a success.
- Agent code note: the agent's command handler is inline in the read loop (`handle_server_message().await`),
sending CommandAck as its FIRST line before spawning the run task; a parse error at
`serde_json::from_str::<ServerMessage>()` is logged-and-ignored, so the read loop survives and keeps answering
pings. So "no CommandAck at all" implies the failure is at/just-after parse/dispatch, not a dead long-lived task.

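
The handler shape in the agent code note can be modelled in a few lines. This is a hedged sketch, not the agent's source: `AgentOutput`, `read_loop`, the frame strings, and the toy `parse` function (a stand-in for `serde_json::from_str::<ServerMessage>()`) are all assumptions made for illustration. It covers only the parse-failure case, showing why such a frame produces no CommandAck while pings on the same connection keep being answered.

```rust
// Toy model of the read-loop shape described in the recap.
// Every name here is hypothetical; nothing is copied from the agent.

#[derive(Debug, PartialEq)]
enum ServerMessage {
    Ping,
    Command { id: u32 },
}

#[derive(Debug, PartialEq)]
enum AgentOutput {
    Pong,
    CommandAck(u32),
    Spawned(u32),
}

// Stand-in parser (assumption): accepts "ping" or "command:<id>".
fn parse(frame: &str) -> Result<ServerMessage, String> {
    if frame == "ping" {
        return Ok(ServerMessage::Ping);
    }
    match frame.strip_prefix("command:") {
        Some(id) => id
            .parse::<u32>()
            .map(|id| ServerMessage::Command { id })
            .map_err(|e| format!("bad command id: {e}")),
        None => Err(format!("unrecognised frame: {frame}")),
    }
}

// Inline handler: the ack is emitted first, then the run task is handed off.
fn handle_server_message(msg: ServerMessage, out: &mut Vec<AgentOutput>) {
    match msg {
        ServerMessage::Ping => out.push(AgentOutput::Pong),
        ServerMessage::Command { id } => {
            out.push(AgentOutput::CommandAck(id));
            out.push(AgentOutput::Spawned(id));
        }
    }
}

// Read loop: a parse error is logged and ignored, so the loop keeps running.
fn read_loop(frames: &[&str]) -> Vec<AgentOutput> {
    let mut out = Vec::new();
    for frame in frames {
        match parse(frame) {
            Ok(msg) => handle_server_message(msg, &mut out),
            Err(e) => eprintln!("ignoring unparseable frame: {e}"),
        }
    }
    out
}
```

Feeding this model the frames `"ping"`, `"command:7"`, `"{garbled"`, `"ping"` yields a pong, an ack and a spawn for command 7, nothing at all for the garbled frame, and a second pong. That matches the observed symptom only if the real failure sits at or just after parse/dispatch.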
## THE SNAT TEST (operator's hypothesis, now tested directly)
- Added a pfSense outbound-NAT (SNAT) rule pinning the NPM source address:
`nat on igc0 inet from 172.16.3.20 to any -> 72.194.62.10`. Verified active in `pfctl -sn`. Fleet stayed healthy
(149 agents online).
- Killed PST-SERVER's pfSense states (forcing a brand-new TCP/WS connection under the new SNAT rule). PST-SERVER
reconnected (last_seen fresh).
- Dispatched a fresh `hostname` command on that new connection.
- RESULT: still never ACKed (delivery_attempts climbing, no execution). Identical symptom.
- Caveat we want you to weigh: pfSense replies on an INBOUND DNAT connection use reply-to/state, which
typically BYPASSES outbound NAT. So this SNAT rule may only affect NPM-INITIATED outbound traffic, not the
server->agent reply path, i.e. the test may not have actually changed the reply source IP. However, the
earlier capture already showed the agent ACKing the server->agent frames (so the reply source was already
acceptable to the agent).

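
The caveat can be made concrete with a toy model of stateful translation. This is a sketch, not pf's implementation: the `Firewall` and `Endpoint` types are hypothetical, the ports are assumed, and the model captures only one behaviour, namely that a packet matching an existing inbound state is translated from that state and never reaches the outbound NAT rules.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Toy model only. IP addresses for NPM and the alias come from the notes;
// ports and everything else are assumptions for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Endpoint {
    ip: Ipv4Addr,
    port: u16,
}

#[derive(Default)]
struct Firewall {
    // Inbound DNAT states: (internal endpoint, remote peer) -> external
    // endpoint the peer originally connected to.
    states: HashMap<(Endpoint, Endpoint), Endpoint>,
    // Outbound NAT rules: internal source IP -> translated source IP.
    outbound_nat: HashMap<Ipv4Addr, Ipv4Addr>,
}

impl Firewall {
    // Record the state created when a peer connects in through the DNAT.
    fn accept_inbound(&mut self, peer: Endpoint, external: Endpoint, internal: Endpoint) {
        self.states.insert((internal, peer), external);
    }

    // Source IP seen on the WAN for a packet going from `src` to `dst`.
    fn egress_source(&self, src: Endpoint, dst: Endpoint) -> Ipv4Addr {
        if let Some(external) = self.states.get(&(src, dst)) {
            // Reply on an existing inbound state: the DNAT is reversed and
            // the outbound NAT rules are not consulted.
            return external.ip;
        }
        // New outbound connection: apply an outbound NAT rule if one matches.
        // With no rule, this model leaves the source untranslated; a real
        // firewall would apply whatever other rule matches.
        *self.outbound_nat.get(&src.ip).unwrap_or(&src.ip)
    }
}
```

In this model, a reply from 172.16.3.20 to an agent that connected through 72.194.62.10 leaves with source 72.194.62.10 both before and after the SNAT rule is added; only connections that NPM itself initiates change. If pfSense behaves this way for the affected flows, the SNAT test left the server->agent path untouched.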
## Questions (each model, concise)
1. Does the SNAT test result (no change), together with the capture showing commands reaching and being ACKed
by the agent, fully close the network/SNAT/proxy hypothesis class? Or is there a residual network angle the
SNAT test did NOT actually exercise (given reply-to bypasses outbound NAT)?
2. Final verdict on the most probable root cause and its location, with confidence.
3. Concrete recommended action plan, ordered: (a) immediate recovery of the 3 agents, (b) how to pin the exact
0.6.63 agent bug given we cannot reach the agents (can we infer it from the slice-B code shape: CommandAck +
dedup cache + per-command spawned task, inline in the read loop?), (c) whether to keep or roll back the SNAT
rule, (d) anything else.
4. Is there any remaining LOW-RISK test that would add real information before we commit to "fix the agent"?