# GuruRMM command-delivery diagnosis — ROUND 7 (SNAT test result)
[VERDICT] The root cause is an application-level logic failure or state corruption within the v0.6.63 agent.
Confidence: 95%.
[REASONING]
1. [NETWORK CLOSED] The packet capture confirms that Text frames reach the agent host, and the kernel's TCP ACKs confirm the bytes were delivered to the agent's TCP stack. WS pings (application-level) continue on the same connection, which shows that the WebSocket framing and the agent's main read loop are still functional.
2. [STATE SMOKING GUN] Only a process restart (clearing memory/state) allowed a single success, while a WS reconnect (clearing network/proxy state) did not. The fault therefore follows the agent's in-process state rather than the connection, which is the definitive indicator of an internal agent bug.
3. [PATH ANALYSIS] Since pings work, the code is reaching the `handle_server_message()` dispatch. "No CommandAck" implies the function is exiting (due to a parse error) or blocking (due to a lock/logic error) before the ACK is sent.
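
The early-exit path described in point 3 can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the `ServerMessage` and `AgentReply` enums, the toy `parse_frame` function (a stand-in for the agent's real deserialization step, assumed here to be `serde_json`), and `handle_text_frame` are illustrative names, not the agent's actual code. It also assumes pings pass through the same dispatch as commands, as point 3 states. The sketch only shows how a logged-and-ignored parse error can leave ping handling intact while no `CommandAck` is ever produced.

```rust
// Hypothetical sketch: names and the toy frame format are illustrative only.

#[derive(Debug, PartialEq)]
enum ServerMessage {
    Ping,
    Command { id: String },
}

#[derive(Debug, PartialEq)]
enum AgentReply {
    Pong,
    CommandAck { id: String },
}

/// Stand-in for the real deserialization step (assumed to be serde_json in the agent).
/// Accepts "ping" or "command:<id>"; anything else is a parse error.
fn parse_frame(raw: &str) -> Result<ServerMessage, String> {
    if raw == "ping" {
        return Ok(ServerMessage::Ping);
    }
    match raw.strip_prefix("command:") {
        Some(id) if !id.is_empty() => Ok(ServerMessage::Command { id: id.to_string() }),
        _ => Err(format!("unrecognized frame: {:?}", raw)),
    }
}

/// Mimics a handler that logs parse errors and carries on.
/// Returns the reply that would be written back to the server, if any.
fn handle_text_frame(raw: &str) -> Option<AgentReply> {
    let msg = match parse_frame(raw) {
        Ok(msg) => msg,
        Err(err) => {
            // Logged and ignored: the read loop survives, but nothing is sent back.
            eprintln!("parse error: {}", err);
            return None;
        }
    };
    match msg {
        ServerMessage::Ping => Some(AgentReply::Pong),
        ServerMessage::Command { id } => Some(AgentReply::CommandAck { id }),
    }
}
```

In this toy model, a frame that fails to parse yields `None`, so from the server's side the connection looks healthy (pings are answered) while commands appear to vanish.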
[ANSWERS]
1. [SNAT TEST] The SNAT test result, combined with the capture, fully closes the network hypothesis. While pfSense `reply-to` might have bypassed the SNAT rule for existing inbound states, the agent's TCP ACKs prove the return path was already valid. The failure is not in the bytes moving; it is in the agent's processing of those bytes.
2. [FINAL VERDICT] The bug is in the v0.6.63 agent's "Slice B" implementation of `handle_server_message()`.
Most Probable: A `serde_json` deserialization failure caused by a schema mismatch or a malformed `ServerMessage` payload that only triggers under specific conditions (possibly related to the new dedup/results cache fields). Because the error is logged-and-ignored, the loop survives for pings but "swallows" every command.
3. [ACTION PLAN]
(a) [RECOVERY] Force-restart the `gururmm-agent` service on the 3 affected machines using an out-of-band tool (e.g., ScreenConnect, local shell, or a separate "emergency" updater if it exists).
(b) [AGENT FIX] Audit `handle_server_message` in the agent code. Move the `CommandAck` to the absolute first line after a successful parse. If the parse itself is failing, inspect the raw JSON being sent by the server for hidden characters or missing fields required by the new v0.6.63 struct.
(c) [INFRA] Roll back the pfSense SNAT rule. The test confirmed it is unnecessary: it adds "magic" to the NAT configuration and did not solve the issue.
(d) [DEDUP AUDIT] Ensure the dedup cache logic does not return *before* sending a `CommandAck`. Even if a command is a duplicate, the agent MUST ACK to stop the server from retrying.
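
Items (b) and (d) can be sketched together. This is a minimal illustration under stated assumptions, not the agent's implementation: the `Agent` struct, its `seen` set (standing in for the dedup cache), the `Outbound` enum, and the `run` helper are all hypothetical names. The point it shows is ordering: the `CommandAck` is queued as the first step after a successful parse, and the duplicate check happens only afterwards.

```rust
use std::collections::HashSet;

// Hypothetical sketch: types and names are illustrative, not the agent's real code.

#[derive(Debug, PartialEq)]
enum Outbound {
    CommandAck { id: String },
    CommandResult { id: String, output: String },
}

struct Agent {
    /// Stand-in for the dedup cache: ids of commands already executed.
    seen: HashSet<String>,
}

impl Agent {
    fn new() -> Self {
        Agent { seen: HashSet::new() }
    }

    /// Handles an already-parsed command and returns the messages to send back.
    fn handle_command(&mut self, id: &str, payload: &str) -> Vec<Outbound> {
        // ACK first, unconditionally, so the server stops retrying.
        let mut out = vec![Outbound::CommandAck { id: id.to_string() }];

        // HashSet::insert returns false when the id was already present.
        if !self.seen.insert(id.to_string()) {
            // Duplicate: acknowledged above, but not executed a second time.
            return out;
        }

        out.push(Outbound::CommandResult {
            id: id.to_string(),
            output: run(payload),
        });
        out
    }
}

/// Placeholder for real command execution.
fn run(payload: &str) -> String {
    format!("ran: {}", payload)
}
```

With this ordering, any early return added later (dedup hit, validation failure, lock contention) can no longer suppress the ACK, because the ACK is already in the outbound list before any of that logic runs.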
4. [LOW-RISK TEST] Send a command with a minimal, primitive payload (e.g., `id`) to the affected agents. If it still fails, the `serde_json` schema itself is likely the culprit. If it succeeds, the issue is payload-size or content-specific. Check server logs for any "broken pipe" or "write timeout" errors specifically when sending the larger `Command` Text frames.