claudetools/tmp/rmm-diag-round6.md
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

# GuruRMM command-delivery diagnosis — ROUND 6 (decisive capture + reconcile a contradiction)
Quorum continuing. A clean packet capture has produced a result that seems to CONTRADICT the leading
network/proxy hypotheses. Please reconcile it and pick the decisive next test.
## Path recap (no Cloudflare for agents)
agent -> site gateway -> internet -> pfSense (multi-WAN, public IP blocks 72.194.62.x AND 70.175.28.x;
DNAT wan:443 -> NPM 172.16.3.20:18443) -> NPM (Nginx Proxy Manager, terminates TLS) -> origin nginx .30:80
-> Rust server :3001.

Affected: PST-SERVER (98.190.129.150), PST-SERVER2 (64.139.88.249), GTS-PEDRO-H (68.230.27.220). These sit at
3 different ISPs/sites and all run v0.6.63. ~40 other v0.6.63 agents work fine.
## DECISIVE NEW CAPTURE (on the NPM/Jupiter host, of the NPM<->PST-agent leg, during 3 marked commands)
- Dispatched 3 tiny commands (CLEANA/B/C) to PST-SERVER at known times.
- NPM->agent direction showed exactly THREE 214-byte TLS frames (= the 3 commands; a hostname command is a
193-byte WS Text frame + ~21B TLS overhead) interleaved with 31-byte frames (the server's WS Pings).
- The TCP sequence numbers advanced cleanly across the 214B frames with NO retransmissions (the apparent 3x
duplication is just `tcpdump -i any` capturing each packet on veth+bridge+host).
- Interpretation: NPM DID forward all three command frames to the agent, and the agent's TCP ACKed them
(no retransmits). That is, the command BYTES reached the agent's kernel/TCP stack, or at least whatever
terminates TCP on the agent's side of the path.
- Yet: the agent never sent a CommandAck and never executed any of them (server-side acked_at stays null).
- Earlier captures already proved the frame leaves Rust and leaves origin nginx. And the agent on this same
connection continues to answer the server's pings (it does not hit its 90s no-inbound reconnect) and sends
heartbeats/metrics that DO reach the server (last_seen stays fresh; ~1346B agent->server frames were seen).
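
The frame-size reasoning above can be sanity-checked with a small Python sketch. It is illustrative only: the byte counts come from the capture notes, while `classify_record`, `summarize`, and the sample list are hypothetical names and data, not output from the real capture. It assumes the payload lengths have already been deduplicated (one entry per TCP segment).

```python
from collections import Counter

# Sizes taken from the capture notes above.
WS_TEXT_FRAME_LEN = 193   # hostname command as a WS Text frame
TLS_OVERHEAD = 21         # approximate per-record TLS overhead
COMMAND_RECORD_LEN = WS_TEXT_FRAME_LEN + TLS_OVERHEAD  # 193 + 21 = 214
PING_RECORD_LEN = 31      # server WS Ping as seen on the wire


def classify_record(payload_len):
    """Label one NPM->agent TCP payload by its length (hypothetical helper)."""
    if payload_len == COMMAND_RECORD_LEN:
        return "command"
    if payload_len == PING_RECORD_LEN:
        return "ping"
    return "other"


def summarize(payload_lens):
    """Count how many payloads fall into each class."""
    return Counter(classify_record(n) for n in payload_lens)


# Hypothetical sample: deduplicated NPM->agent payload lengths.
sample = [31, 214, 31, 31, 214, 31, 214, 31]
counts = summarize(sample)
print(counts["command"], "command-sized records,", counts["ping"], "ping-sized records")
```

Matching on length alone is a heuristic: any other message that happens to encrypt to 214 bytes would be counted as a command, so the dispatch timestamps still need to line up.
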
## What this seems to rule out
- NPM swallowing the frame (it forwarded all 3).
- WAN black-hole / wrong-source SNAT on the server->agent direction (a wrong source IP would cause the agent
NOT to ACK -> retransmissions; we saw clean ACKs, no retransmits).
- The frame never reaching the agent.
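
The "clean sequence numbers, no retransmits, 3x duplication is a capture artifact" argument can be sketched as follows. Assumptions, none of which come from the capture tooling: packets are already parsed into `(timestamp_s, tcp_seq, payload_len)` tuples for ONE direction of ONE connection; copies of a segment seen on veth/bridge/host arrive within `dup_window` seconds of each other, whereas a real retransmission arrives much later; sequence-number wraparound is ignored. The 1 ms window, the function names, and all sample values are hypothetical.

```python
def split_duplicates(packets, dup_window=0.001):
    """Separate per-interface copies from real retransmissions.

    Returns (unique_segments, retransmissions). A repeat of the same
    (seq, len) inside dup_window is treated as the same packet captured on
    another interface; a repeat after dup_window counts as a retransmission.
    """
    last_counted = {}            # (seq, len) -> time of last counted transmission
    unique, retransmits = [], []
    for ts, seq, length in sorted(packets):
        if length == 0:
            continue             # pure ACKs carry no data
        key = (seq, length)
        if key not in last_counted:
            unique.append((ts, seq, length))
            last_counted[key] = ts
        elif ts - last_counted[key] > dup_window:
            retransmits.append((ts, seq, length))
            last_counted[key] = ts
        # else: interface copy of an already counted transmission; ignore
    return unique, retransmits


def is_contiguous(segments):
    """True if each segment starts exactly where the previous one ended."""
    return all(prev[1] + prev[2] == cur[1]
               for prev, cur in zip(segments, segments[1:]))


# Hypothetical sample: ping, command, ping, each captured three times.
sample = [
    (0.000000, 1000, 31), (0.000010, 1000, 31), (0.000020, 1000, 31),
    (1.000000, 1031, 214), (1.000010, 1031, 214), (1.000020, 1031, 214),
    (2.000000, 1245, 31), (2.000010, 1245, 31), (2.000020, 1245, 31),
]
unique, retransmits = split_duplicates(sample)
print(len(unique), "unique segments,", len(retransmits), "retransmissions,",
      "contiguous" if is_contiguous(unique) else "gaps present")
```

A clean result only shows that something ACKed the data back to NPM. It does not show which device on the agent's side sent those ACKs.
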
## pfSense facts (operator suspected an SNAT/return-path issue)
- Outbound NAT mode = hybrid; multi-WAN (WAN + FIBER).
- DNAT: wan:443 -> 172.16.3.20:18443 (NPM). (There is also a near-duplicate "Emby on Fiber" wan:443->.20:18443.)
- Manual outbound-NAT (SNAT) exists for src 172.16.3.10 -> 72.194.62.5, but there is NO explicit SNAT pinning
NPM (172.16.3.20) -> 72.194.62.10; NPM replies rely on pf state/reply-to + automatic outbound NAT.
- BUT the capture shows the agent ACKing the server->agent command frames, which argues the reply path is
currently delivering with an acceptable source IP. So the SNAT config gap, while real, does not appear to be
dropping these particular server->agent frames.
## Agent self-logs
- The agent uploads logs to the server (agent->server works), but the last uploaded batch for PST-SERVER is
from 22:21, the moment it updated to 0.6.63. At that point it successfully applied a ConfigUpdate and sent
inventory, and at 22:23 ONE command succeeded+acked. No fresh logs since: log upload is server-triggered, so
it depends on the same broken server->agent channel. So we cannot see the agent's current-state logs remotely.
## The contradiction to resolve
The command bytes provably reach the agent's TCP and are ACKed, the agent processes the server's PING control
frames on the SAME connection (so its read loop is alive), yet it does not parse/ack/execute the command TEXT
frames. It worked once at 22:23 right after the 0.6.63 update, then stopped.
## Questions (each model, concise)
1. Given the command bytes are TCP-ACKed by the agent but never produce a CommandAck, and the same connection
still processes pings: what agent-side mechanisms can make a WS client ACK TCP + handle Ping control frames
but NOT deliver Text data frames to the application? (e.g., TLS record vs WS-frame reassembly state; a stuck
read between the TLS layer and the WS dispatch; a per-connection WS read buffer that desyncs after a certain
point; recv-buffer filling because the app stopped reading while the kernel keeps ACKing; a panic/dropped
task in the command dispatch path; etc.) Rank them.
2. Is "kernel ACKs while the application stopped reading" plausible here, and how would the pings still get
answered if the app stopped reading? (Reconcile carefully.)
3. Could anything OTHER than the agent still explain TCP-ACKed-but-not-processed (e.g., a middlebox at the
agent's own site doing TCP proxying/termination that ACKs then mangles; TLS-inspection at the agent's UDR;
GRO/LRO/TSO offload on the server-grade NICs corrupting reassembly only for these hosts)? Weigh these.
4. The single most decisive next test. We CAN: capture at any point in OUR infra (pfSense/NPM/origin), read the
server DB, read the agent's LAST uploaded logs, change pfSense/NPM/server config, force the agent to
reconnect (server-side eviction), restart NPM. We CANNOT push config to the broken agents or get on-site
quickly. What single action best discriminates "agent app" vs "agent-site middlebox/offload" vs "still NPM/pf"?
5. Does the "worked at 22:23 right after update, never since" timing point anywhere specific?
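
To make questions 1 and 2 concrete, here is a minimal Python `asyncio` simulation of ONE candidate from the list in question 1 (a dead task in the command dispatch path). It is NOT the agent's code, which is not available here, and it does not claim this is the actual cause. Every name (`read_loop`, `command_worker`, `simulate`) and the "dies on the second command" fault are hypothetical. It only shows how a read loop can keep draining the socket and answering Pings while Text frames are accepted but never acked.

```python
import asyncio


async def read_loop(frames, commands, pongs):
    """Simulated WS read loop: answers Pings inline, queues Text frames."""
    for kind, payload in frames:
        if kind == "ping":
            pongs.append(payload)          # Pong is produced inside the read loop
        elif kind == "text":
            commands.put_nowait(payload)   # handed off; only the worker can ack it
        await asyncio.sleep(0)             # yield so the worker gets a turn


async def command_worker(commands, acks):
    """Simulated dispatch task: acks one command, then dies (hypothetical fault)."""
    while True:
        cmd = await commands.get()
        if acks:
            raise RuntimeError("dispatch task died")
        acks.append(cmd)


async def simulate():
    commands = asyncio.Queue()             # unbounded, so the read loop never blocks
    pongs, acks = [], []
    worker = asyncio.create_task(command_worker(commands, acks))
    frames = [
        ("text", "FIRST"), ("ping", b"1"),
        ("text", "CLEANA"), ("ping", b"2"),
        ("text", "CLEANB"), ("ping", b"3"),
        ("text", "CLEANC"),
    ]
    await read_loop(frames, commands, pongs)
    await asyncio.wait([worker], timeout=1)
    worker_dead = worker.done()
    if worker_dead:
        worker.exception()                 # retrieve it so asyncio does not warn
    return {"pongs": len(pongs), "acks": acks,
            "unprocessed": commands.qsize(), "worker_dead": worker_dead}


result = asyncio.run(simulate())
print(result)
```

In this toy model the kernel-level picture would match the capture: every frame is read (so TCP keeps ACKing and the receive buffer never fills), every Ping gets a Pong, one early command is acked, and later commands sit in a queue that nothing consumes. The opposite case, where the application stops reading the socket entirely, would also stop Pongs, which does not match what was observed.
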