Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 6 (decisive capture + reconcile a contradiction)

Quorum continuing. A clean packet capture has produced a result that seems to CONTRADICT the leading network/proxy hypotheses. Please reconcile it and pick the decisive next test.

Path recap (no Cloudflare for agents)

agent -> site gateway -> internet -> pfSense (multi-WAN, public IP blocks 72.194.62.x AND 70.175.28.x; DNAT wan:443 -> NPM 172.16.3.20:18443) -> NPM (Nginx Proxy Manager, terminates TLS) -> origin nginx .30:80 -> Rust server :3001.

Affected: PST-SERVER (98.190.129.150), PST-SERVER2 (64.139.88.249), GTS-PEDRO-H (68.230.27.220): 3 different ISPs/sites, all v0.6.63. ~40 other 0.6.63 agents work fine.

DECISIVE NEW CAPTURE (on the NPM/Jupiter host, of the NPM<->PST-agent leg, during 3 marked commands)

Dispatched 3 tiny commands (CLEANA/B/C) to PST-SERVER at known times.
NPM->agent direction showed exactly THREE 214-byte TLS frames (= the 3 commands; a hostname command is a 193-byte WS Text frame + ~21B TLS overhead) interleaved with 31-byte frames (the server's WS Pings).
The TCP sequence numbers advanced cleanly across the 214B frames with NO retransmissions (the apparent 3x duplication is just tcpdump -i any capturing each packet on veth+bridge+host).
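Interpretation: NPM DID forward all three command frames, and the far end of the TCP connection ACKed them (no retransmits). That is, the command BYTES reached whatever terminates TCP on the agent side, which is the agent's own kernel/TCP stack unless something at the agent's site terminates TCP on its behalf.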
Yet: the agent never sent a CommandAck and never executed any of them (server-side acked_at stays null).
Earlier captures already proved the frame leaves Rust and leaves origin nginx. And the agent on this same connection continues to answer the server's pings (it does not hit its 90s no-inbound reconnect) and sends heartbeats/metrics that DO reach the server (last_seen stays fresh; ~1346B agent->server frames were seen).
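The size arithmetic behind the 214-byte and 31-byte records can be sanity-checked with a short sketch. It assumes server-to-client WebSocket frames are unmasked (RFC 6455 framing) and treats the per-record TLS overhead as a parameter, because the real value depends on the negotiated TLS version and cipher suite. The 21-byte default is this document's own estimate, not a measured value, and the function names are illustrative.

```python
def ws_frame_len(payload_len: int, masked: bool = False) -> int:
    """Total WebSocket frame size for a payload, per RFC 6455 framing."""
    if payload_len <= 125:
        header = 2            # FIN/opcode byte + 7-bit length byte
    elif payload_len <= 0xFFFF:
        header = 4            # + 16-bit extended length
    else:
        header = 10           # + 64-bit extended length
    if masked:
        header += 4           # client-to-server frames carry a masking key
    return header + payload_len


def tls_record_len(plaintext_len: int, overhead: int = 21) -> int:
    """On-wire record size; `overhead` is an ASSUMED per-record cost."""
    return plaintext_len + overhead


# A 193-byte unmasked Text frame corresponds to a 189-byte payload.
command_frame = ws_frame_len(189)
command_record = tls_record_len(command_frame)

# Under the same overhead assumption, a 31-byte record leaves 10 bytes,
# i.e. a Ping with an 8-byte payload.
ping_record = tls_record_len(ws_frame_len(8))
```

If the overhead assumption holds, `command_record` comes out at 214 and `ping_record` at 31, matching the capture. A different cipher suite would shift both numbers by the same amount, so the 183-byte gap between the two record sizes is the more robust fingerprint.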

What this seems to rule out

NPM swallowing the frame (it forwarded all 3).
WAN black-hole / wrong-source SNAT on the server->agent direction (a segment arriving from the wrong source IP would not match the agent's connection, so it would go un-ACKed and the server side would retransmit; we saw clean ACKs and no retransmits).
The frame never reaching the agent.
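The "no retransmissions" reading depends on telling capture duplicates (the same packet seen on veth, bridge, and host by `tcpdump -i any`) apart from real TCP retransmits. A minimal sketch of that classification follows, assuming the segments for one direction of one flow have already been extracted as `(timestamp, seq, payload_len)` tuples with pure ACKs filtered out. The 1 ms window is an assumed threshold, and the sample data is made up for illustration, not taken from the capture.

```python
def classify_segments(segments, dup_window=0.001):
    """Split data segments into unique, capture duplicates, and retransmits.

    A repeat of the same (seq, len) within `dup_window` seconds of the
    previous sighting is treated as the same packet seen on another
    interface; a later repeat is treated as a TCP retransmission.
    """
    last_seen = {}
    unique, capture_dups, retransmits = [], [], []
    for ts, seq, length in sorted(segments):
        key = (seq, length)
        if key not in last_seen:
            unique.append((ts, seq, length))
        elif ts - last_seen[key] <= dup_window:
            capture_dups.append((ts, seq, length))
        else:
            retransmits.append((ts, seq, length))
        last_seen[key] = ts
    return unique, capture_dups, retransmits


# Hypothetical sample: one frame seen 3x within microseconds, then a
# frame that is genuinely resent 300 ms later (also seen twice).
sample = [
    (10.000000, 1000, 214), (10.000020, 1000, 214), (10.000045, 1000, 214),
    (12.000000, 1214, 31), (12.000015, 1214, 31),
    (12.300000, 1214, 31), (12.300020, 1214, 31),
]
unique, capture_dups, retransmits = classify_segments(sample)
```

On a clean capture like the one described, `retransmits` would be empty and `unique` would advance contiguously in sequence number. Real retransmits can also be re-segmented, so an exact `(seq, len)` match is a simplification.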

pfSense facts (operator suspected an SNAT/return-path issue)

Outbound NAT mode = hybrid; multi-WAN (WAN + FIBER).
DNAT: wan:443 -> 172.16.3.20:18443 (NPM). (There is also a near-duplicate "Emby on Fiber" wan:443->.20:18443.)
Manual outbound-NAT (SNAT) exists for src 172.16.3.10 -> 72.194.62.5, but there is NO explicit SNAT pinning NPM (172.16.3.20) -> 72.194.62.10; NPM replies rely on pf state/reply-to + automatic outbound NAT.
BUT the capture shows the agent ACKing the server->agent command frames, which argues the reply path is currently delivering with an acceptable source IP. So the SNAT config gap, while real, does not appear to be dropping these particular server->agent frames.
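Why clean ACKs argue against a wrong-source reply path can be shown with a toy model of connection matching. This is a sketch, not pf or kernel behavior in detail: it assumes the agent's connection is identified by its 4-tuple, that 72.194.62.10 is the address the agent connected to, and it uses a made-up client port and a documentation-range address (203.0.113.7) as a stand-in for "some other WAN address".

```python
def agent_side_response(conn_table, src_ip, src_port, dst_ip, dst_port):
    """Return 'ACK' only if the segment matches an established connection."""
    key = (src_ip, src_port, dst_ip, dst_port)
    # A segment that matches no connection is dropped (or answered with a
    # reset); either way the sender never gets an ACK for that data.
    return "ACK" if key in conn_table else "NO_ACK"


AGENT_IP, AGENT_PORT = "98.190.129.150", 51000   # port is hypothetical
EXPECTED_SRC = "72.194.62.10"                     # address the agent dialed (assumed)
OTHER_WAN_SRC = "203.0.113.7"                     # placeholder for a wrong SNAT source

# The connection as the agent side knows it: (remote, rport, local, lport).
established = {(EXPECTED_SRC, 443, AGENT_IP, AGENT_PORT)}

good = agent_side_response(established, EXPECTED_SRC, 443, AGENT_IP, AGENT_PORT)
bad = agent_side_response(established, OTHER_WAN_SRC, 443, AGENT_IP, AGENT_PORT)
```

Here `good` is `"ACK"` and `bad` is `"NO_ACK"`. A missing ACK is what would force the retransmissions that the capture did not show.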

Agent self-logs

The agent uploads logs to the server (agent->server works), but the last uploaded batch for PST-SERVER is from 22:21, the moment it updated to 0.6.63. At that point it successfully applied a ConfigUpdate and sent inventory, and at 22:23 ONE command succeeded and was acked. There have been no fresh logs since, because log upload is server-triggered and therefore depends on the broken server->agent command channel. So we cannot see the agent's current-state logs remotely.

The contradiction to resolve

The command bytes provably reach the agent's TCP and are ACKed, the agent processes the server's PING control frames on the SAME connection (so its read loop is alive), yet it does not parse/ack/execute the command TEXT frames. It worked once at 22:23 right after the 0.6.63 update, then stopped.
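One way all of these observations could hold at once is a read loop that stays healthy while the task it hands commands to has died. The sketch below is a toy model only: it assumes a structure (Pings answered inline, Text frames queued to a separate worker, any inbound frame resetting the 90s idle timer) that has NOT been confirmed against the agent's code. The `POISON` command, the timestamps, and the ping spacing are invented for illustration.

```python
import asyncio

IDLE_TIMEOUT_S = 90  # the agent's no-inbound reconnect threshold


async def dispatch_worker(commands: asyncio.Queue, acked: list) -> None:
    """Hypothetical command task: dies on the first command it cannot handle."""
    while True:
        cmd = await commands.get()
        if cmd == "POISON":
            raise RuntimeError("simulated failure in the dispatch path")
        acked.append(cmd)  # stands in for execute + CommandAck


async def read_loop(frames, commands: asyncio.Queue, state: dict) -> None:
    """Hypothetical read loop: answers Pings inline, queues Text frames."""
    for ts, kind, payload in frames:
        state["last_inbound"] = ts           # ANY frame resets the idle timer
        if kind == "ping":
            state["pongs"].append(payload)   # handled without the worker
        else:
            commands.put_nowait(payload)     # handed off; nobody may be listening
        await asyncio.sleep(0)               # give the worker a chance to run


async def simulate() -> dict:
    commands: asyncio.Queue = asyncio.Queue()
    state = {"last_inbound": 0, "pongs": []}
    acked: list = []
    worker = asyncio.create_task(dispatch_worker(commands, acked))
    frames = [
        (0, "text", "FIRST"), (5, "text", "POISON"),
        (30, "ping", "p1"), (40, "text", "CLEANA"),
        (60, "ping", "p2"), (70, "text", "CLEANB"),
        (90, "ping", "p3"), (100, "text", "CLEANC"),
        (120, "ping", "p4"),
    ]
    await read_loop(frames, commands, state)
    await asyncio.wait([worker], timeout=1)
    now = 125
    return {
        "pongs": state["pongs"],
        "acked": acked,
        "stuck_commands": commands.qsize(),
        "worker_failed": worker.done() and worker.exception() is not None,
        "would_reconnect": now - state["last_inbound"] >= IDLE_TIMEOUT_S,
    }


result = asyncio.run(simulate())
```

In this model every Ping is answered, only the first command is ever acked, the three later commands sit in the queue, and the idle timer never fires because Pings keep resetting it. That reproduces the symptom pattern, but it does not show that the agent is built this way; it only shows that "read loop alive" and "commands never processed" are not mutually exclusive.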

Questions (each model, concise)

Given the command bytes are TCP-ACKed by the agent but never produce a CommandAck, and the same connection still processes pings: what agent-side mechanisms can make a WS client ACK TCP + handle Ping control frames but NOT deliver Text data frames to the application? (e.g., TLS record vs WS-frame reassembly state; a stuck read between the TLS layer and the WS dispatch; a per-connection WS read buffer that desyncs after a certain point; recv-buffer filling because the app stopped reading while the kernel keeps ACKing; a panic/dropped task in the command dispatch path; etc.) Rank them.
Is "kernel ACKs while the application stopped reading" plausible here, and how would the pings still get answered if the app stopped reading? (Reconcile carefully.)
Could anything OTHER than the agent still explain TCP-ACKed-but-not-processed (e.g., a middlebox at the agent's own site doing TCP proxying/termination that ACKs then mangles; TLS-inspection at the agent's UDR; GRO/LRO/TSO offload on the server-grade NICs corrupting reassembly only for these hosts)? Weigh these.
The single most decisive next test. We CAN: capture at any point in OUR infra (pfSense/NPM/origin), read the server DB, read the agent's LAST uploaded logs, change pfSense/NPM/server config, force the agent to reconnect (server-side eviction), restart NPM. We CANNOT push config to the broken agents or get on-site quickly. What single action best discriminates "agent app" vs "agent-site middlebox/offload" vs "still NPM/pf"?
Does the "worked at 22:23 right after update, never since" timing point anywhere specific?
