claudetools/tmp/r7-grok.txt
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

**1. Does the SNAT test + capture fully close the network/SNAT/proxy class?**
Yes for practical purposes. The packet capture already showed the Text command frame delivered to the agent's kernel (no retransmits, TCP ACKed by the agent). The same connection continues to answer server pings (bidirectional app-level traffic). The SNAT rule produced zero change in behavior on a freshly negotiated connection.
The reply-to/state bypass caveat is real for classic pfSense inbound DNAT return traffic, but it does not leave a plausible residual network explanation here: the agent is already successfully returning TCP ACKs and pong/ping responses on that exact 5-tuple. If the return source IP or path were unacceptable to the agent or an intermediate, those ACKs and ongoing pings would not work. Proxy/NPM truncation, MTU, or SNAT asymmetry also fail to explain "commands are received and ACKed at TCP but produce zero CommandAck while pings on the identical WS connection are answered."
Network/SNAT/proxy hypothesis class is closed.
**2. Final verdict on root cause and location**
Most probable root cause (confidence ~80%): a stateful bug in the v0.6.63 agent (comms-durability slice B) inside `handle_server_message()`, in the command arm, specifically the new dedup/results cache + "CommandAck immediately, then spawn" logic.
Signature that only appears after process start: exactly one command succeeds (Ack sent + task spawned), then all subsequent commands on that process (including after WS reconnects) produce no Ack at all, while the read loop continues to handle pings. This is the textbook symptom of per-process in-memory state (cache, a moved channel/sender, a "first command done" guard, a poisoned or never-recreated oneshot, or a dedup key that is not the per-request ID) being clean only at process startup. The first post-restart command "poisons" it for everything after.
Parse failure is also possible but less likely given the ping path stays alive; a silent early return or `if dedup.contains(...) { continue; }` before the Ack line is a stronger fit. The inline read-loop placement plus "Ack first, then spawn" + cache addition makes this the exact place the durability changes could have introduced a one-command-then-dead dispatch path.
Network, NPM, pfSense, origin, and server dispatch are all exonerated by the capture + ping survival + SNAT result.
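To make the suspected failure shape concrete, here is a minimal toy model of it. Every name in it (`ServerMessage`, `AgentReply`, `ToyAgent`, the `command_seen` flag) is hypothetical and invented for illustration; it is not the 0.6.63 agent's code, only a sketch of how a per-process guard placed before the Ack could yield "one Ack per process lifetime, pings always answered."

```rust
/// Messages the server sends on the WebSocket (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum ServerMessage {
    Ping,
    Command { id: String, payload: String },
}

/// Replies the agent sends back (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum AgentReply {
    Pong,
    CommandAck { id: String },
}

/// Toy agent whose state lives for the whole process, not per connection,
/// so a WS reconnect does not reset it.
#[derive(Default)]
pub struct ToyAgent {
    command_seen: bool,
}

impl ToyAgent {
    /// Returns the reply the agent would send, or None for a silent drop.
    pub fn handle_server_message(&mut self, msg: &ServerMessage) -> Option<AgentReply> {
        match msg {
            // The ping arm never touches the guard, so it keeps working.
            ServerMessage::Ping => Some(AgentReply::Pong),
            ServerMessage::Command { id, .. } => {
                // ILLUSTRATIVE BUG: a guard that becomes permanently true
                // after the first command and returns before the Ack.
                if self.command_seen {
                    return None;
                }
                self.command_seen = true;
                Some(AgentReply::CommandAck { id: id.clone() })
            }
        }
    }
}
```

Feeding this model a command, a ping, and a second distinct command yields one `CommandAck`, one `Pong`, and then silence on the command path, which matches the observed signature. It does not say which of the candidate mechanisms is the real one.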
**3. Concrete recommended action plan**
a) Immediate recovery of the 3 agents (PST-SERVER/PST-SERVER2/GTS-PEDRO-H):
- Do not use the command channel. Use out-of-band access (console/iLO/iDRAC, local RDP/WinRM, scheduled task, or any pre-existing management path).
- Preferred: roll back those three hosts to the last known-good agent version (pre-slice-B / pre-0.6.63). A plain process restart will only buy one working command per the observed pattern, which is useless for sustained recovery.
- If the agent update/installer path is independent of the WS command channel (the fact that the 0.6.63 update itself succeeded suggests it may be), push the rollback package to just those three. Verify with multiple distinct commands after rollback.
- Note the hosts as "on old comms code until fixed build"; do not leave them on the broken version.
b) Pinning the exact 0.6.63 agent bug (no live access to the broken agents):
- Review the exact slice-B diff for the dedup/results cache, the Command arm, cache key construction, any shared state initialized at connect or first-command time, channel moves into the spawn, and guards around the Ack send.
- Highest-yield hypotheses to check first:
  - The dedup key is derived from the command payload/action/verb instead of the unique request ID (so repeating "hostname" or any same-shape command registers as a duplicate).
  - The cache is never pruned, or there is a global "seen" flag or an inverted `contains` check.
  - A response channel, JoinHandle, or oneshot is consumed on the first command and not recreated.
  - The match/if that sends the Ack sits after a check that becomes permanently true after first use.
- Build a minimal local repro (tokio-tungstenite server loop + the 0.6.63 agent binary or source). Send: Ping, Command (any), Ping, Command (same), Command (different). Observe whether exactly one Command ever gets Acked per process lifetime.
- Add (or review) unit tests for handle_server_message covering first command, second distinct command, duplicate-ID retry, and reconnect-within-process.
- In the next build (even the debug one), make received Text frames, pre/post-parse kind, cache hit/miss, and Ack decision noisy at debug level.
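As a rough illustration of the dedup-key hypothesis and the repro sequence above, the sketch below replays "Ping, Command, Ping, Command (same), Command (different)" against a toy dedup cache keyed two ways. All names (`DedupKey`, `Frame`, `Dedup`, `repro_sequence`, `count_new_commands`) and the `req-N` IDs are assumptions made for the example; the real command envelope and cache are not shown in this review.

```rust
use std::collections::HashSet;

/// Which field the toy dedup cache uses as its key.
#[derive(Clone, Copy, Debug)]
pub enum DedupKey {
    RequestId,
    Payload,
}

/// Simplified server frame (hypothetical; not the real envelope).
#[derive(Debug)]
pub enum Frame {
    Ping,
    Command { id: &'static str, payload: &'static str },
}

pub struct Dedup {
    key: DedupKey,
    seen: HashSet<String>,
}

impl Dedup {
    pub fn new(key: DedupKey) -> Self {
        Dedup { key, seen: HashSet::new() }
    }

    /// Returns true when the command is treated as new (it would be Acked
    /// and spawned), false when it is dropped as a duplicate.
    pub fn is_new(&mut self, id: &str, payload: &str) -> bool {
        let k = match self.key {
            DedupKey::RequestId => id,
            DedupKey::Payload => payload,
        };
        // HashSet::insert returns false if the key was already present.
        self.seen.insert(k.to_string())
    }
}

/// The sequence from the repro step: Ping, Command, Ping,
/// Command (same payload, new ID), Command (different payload).
pub fn repro_sequence() -> Vec<Frame> {
    vec![
        Frame::Ping,
        Frame::Command { id: "req-1", payload: "hostname" },
        Frame::Ping,
        Frame::Command { id: "req-2", payload: "hostname" },
        Frame::Command { id: "req-3", payload: "whoami" },
    ]
}

/// Replays the frames and counts how many commands are treated as new.
pub fn count_new_commands(key: DedupKey, frames: &[Frame]) -> usize {
    let mut dedup = Dedup::new(key);
    let mut accepted = 0;
    for frame in frames {
        if let Frame::Command { id, payload } = frame {
            if dedup.is_new(id, payload) {
                accepted += 1;
            }
        }
    }
    accepted
}
```

Keyed by request ID, all three commands in the sequence are accepted; keyed by payload, the repeated "hostname" is dropped. Payload keying alone would not explain a distinct command also going unacknowledged, so if the "different" command fails in the real repro too, the global-flag, inverted-check, and consumed-channel hypotheses become the stronger candidates.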
c) Keep or roll back the SNAT rule?
Roll it back. It was a diagnostic, produced no behavior change, and (per the reply-to caveat) probably never exercised the relevant return path anyway. Remove the pfSense rule, confirm with `pfctl -sn` that it is gone, and move on. There is no value in keeping it.
d) Anything else:
- Treat the command ingestion path (especially the new durability pieces) as the sole focus until the one-command-then-stuck pattern is reproduced and fixed.
- When you have a candidate fix, deploy it via the update path if possible (the restart-on-update gives a clean process) and immediately test multiple commands + reconnects on a lab host or one of the recovered hosts.
- Add defensive logging/metrics in the agent for command receive/Ack/dedup/execute counts so the next time this class of bug appears it is visible without packet captures.
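One possible shape for those counters, as a sketch only: the struct and method names (`CommandCounters`, `unaccounted`, `summary`) are hypothetical, and where the agent would increment or report them is left open.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counters for the command path (hypothetical names).
/// Atomics let the read loop and spawned tasks bump them without a lock.
#[derive(Default)]
pub struct CommandCounters {
    pub received: AtomicU64,
    pub acked: AtomicU64,
    pub dedup_hits: AtomicU64,
    pub executed: AtomicU64,
}

impl CommandCounters {
    /// Commands that were received but neither Acked nor counted as a
    /// dedup hit. A value that keeps growing points at a silent drop.
    pub fn unaccounted(&self) -> u64 {
        let received = self.received.load(Ordering::Relaxed);
        let acked = self.acked.load(Ordering::Relaxed);
        let dedup = self.dedup_hits.load(Ordering::Relaxed);
        received.saturating_sub(acked.saturating_add(dedup))
    }

    /// One-line summary suitable for a periodic debug log entry.
    pub fn summary(&self) -> String {
        format!(
            "received={} acked={} dedup_hits={} executed={}",
            self.received.load(Ordering::Relaxed),
            self.acked.load(Ordering::Relaxed),
            self.dedup_hits.load(Ordering::Relaxed),
            self.executed.load(Ordering::Relaxed),
        )
    }
}
```

With counters like these, the pattern seen here would show up as `received` climbing while `acked` stays at 1, without needing a packet capture.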
**4. Any remaining low-risk test before committing to "fix the agent"?**
Yes, a few cheap ones that add signal without touching the production broken agents or the SNAT state:
- On any lab, unaffected, or safely restartable agent (or a local build), do a clean process restart, then issue two different commands in quick succession (hostname, then whoami/ipconfig or equivalent). If only the very first command after process start ever succeeds, this is near-conclusive for the cache/dispatch state bug and justifies going straight to the code fix.
- Compare server-side logs and the exact JSON for the 22:23 successful command vs later failing ones (same ID format? same structure? any difference in how the command envelope was serialized?).
- If you can get a short capture on a test machine right after restart + first command, watch the agent->server direction for whether a CommandAck Text frame is actually emitted (distinguishes "never sent by agent" vs "sent but lost").
- Quick server-side sanity: confirm the command retry/delivery logic is not itself suppressing re-sends for an ID it thinks it already saw.
If the "two distinct commands after restart" test shows the strict one-success-per-process-lifetime pattern, stop testing the network side and fix the agent. The data already points there with high confidence.