claudetools/tmp/rmm-diag-round6.md

# GuruRMM command-delivery diagnosis — ROUND 6 (decisive capture + reconcile a contradiction)

Quorum continuing. A clean packet capture has produced a result that seems to CONTRADICT the leading
network/proxy hypotheses. Please reconcile it and pick the decisive next test.

## Path recap (no Cloudflare for agents)
agent -> site gateway -> internet -> pfSense (multi-WAN, public IP blocks 72.194.62.x AND 70.175.28.x;
DNAT wan:443 -> NPM 172.16.3.20:18443) -> NPM (Nginx Proxy Manager, terminates TLS) -> origin nginx .30:80
-> Rust server :3001. Affected: PST-SERVER (98.190.129.150), PST-SERVER2 (64.139.88.249),
GTS-PEDRO-H (68.230.27.220) — 3 different ISPs/sites, all v0.6.63. ~40 other 0.6.63 agents work fine.

## DECISIVE NEW CAPTURE (on the NPM/Jupiter host, of the NPM<->PST-agent leg, during 3 marked commands)
- Dispatched 3 tiny commands (CLEANA/B/C) to PST-SERVER at known times.
- NPM->agent direction showed exactly THREE 214-byte TLS records (= the 3 commands: a hostname command is a
  193-byte WS Text frame + ~21B TLS overhead), interleaved with 31-byte records (the server's WS Pings).
- The TCP sequence numbers advanced cleanly across the 214B frames with NO retransmissions (the apparent 3x
  duplication is just `tcpdump -i any` capturing each packet on veth+bridge+host).
- Interpretation: NPM DID forward all three command frames toward the agent, and they were TCP-ACKed with no
  retransmit, i.e. the command BYTES reached the agent's kernel/TCP stack (or whatever terminates TCP on the
  agent's side of the path; see question 3).
- Yet: the agent never sent a CommandAck and never executed any of them (server-side acked_at stays null).
- Earlier captures already proved the frame leaves Rust and leaves origin nginx. And the agent on this same
  connection continues to answer the server's pings (it does not hit its 90s no-inbound reconnect) and sends
  heartbeats/metrics that DO reach the server (last_seen stays fresh; ~1346B agent->server frames were seen).
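
The size arithmetic in the capture notes can be checked with a short sketch. It assumes one WebSocket frame per TLS record and takes the "~21B" overhead figure from the notes as given (the real value depends on the TLS version and cipher suite, which are not stated here). The constant and function names are illustrative, and the sample length list is made up to mirror the described capture; it is not real capture output.

```python
# Sizes taken from the capture notes. TLS_OVERHEAD is the notes' "~21B"
# estimate and is an assumption, not a measured constant.
TLS_OVERHEAD = 21
COMMAND_RECORD = 214   # observed size attributed to a command
PING_RECORD = 31       # observed size attributed to a server WS Ping


def ws_frame_len(payload_len, masked=False):
    """Total size of one WebSocket frame (RFC 6455 framing rules)."""
    if payload_len < 126:
        header = 2            # base header only
    elif payload_len < 65536:
        header = 4            # base header + 16-bit extended length
    else:
        header = 10           # base header + 64-bit extended length
    if masked:
        header += 4           # client-to-server frames carry a masking key
    return header + payload_len


def classify(lengths):
    """Bucket observed record sizes into command / ping / other."""
    counts = {"command": 0, "ping": 0, "other": 0}
    for n in lengths:
        if n == COMMAND_RECORD:
            counts["command"] += 1
        elif n == PING_RECORD:
            counts["ping"] += 1
        else:
            counts["other"] += 1
    return counts


# A 193-byte unmasked frame implies a 189-byte payload, because payloads of
# 126 bytes or more need the 4-byte header.
print(ws_frame_len(189) + TLS_OVERHEAD)              # 214
print(classify([31, 214, 31, 214, 31, 31, 214]))     # illustrative input
```

Any record landing in the "other" bucket in the server->agent direction would be worth a closer look, since it would not match either known frame type.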

## What this seems to rule out
- NPM swallowing the frame (it forwarded all 3).
- WAN black-hole / wrong-source SNAT on the server->agent direction (a wrong source IP would cause the agent
  NOT to ACK -> retransmissions; we saw clean ACKs, no retransmits).
- The frame never reaching the agent's TCP endpoint (it was ACKed).
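
The "no retransmissions" claim rests on telling real retransmits apart from the extra copies that `tcpdump -i any` records on each interface. A minimal sketch of that check follows, assuming the capture has already been reduced to `(iface, seq, payload_len)` tuples for one direction of one flow. The tuple layout, interface names, and sequence numbers are hypothetical, and the sketch ignores sequence-number wraparound.

```python
def find_retransmissions(segments):
    """Return data segments that re-send bytes already seen on the SAME interface.

    segments: iterable of (iface, seq, payload_len) in capture order.
    Copies of one packet seen on different interfaces are not retransmits,
    so the highest sequence number seen is tracked per interface.
    """
    highest_end = {}      # iface -> highest (seq + len) seen so far
    retransmits = []
    for iface, seq, length in segments:
        if length == 0:
            continue      # pure ACKs carry no data
        end = seq + length
        prev = highest_end.get(iface)
        if prev is not None and seq < prev:
            retransmits.append((iface, seq, length))
        if prev is None or end > prev:
            highest_end[iface] = end
    return retransmits


# Hypothetical clean capture: each packet appears once per interface.
clean = []
for seq, length in [(1000, 214), (1214, 31), (1245, 214)]:
    for iface in ("veth", "bridge", "host"):
        clean.append((iface, seq, length))

print(find_retransmissions(clean))                          # []
print(find_retransmissions(clean + [("veth", 1245, 214)]))  # flags the re-send
```

Deduplicating on `(seq, len)` alone would hide a genuine retransmit, which is why the interface is part of the key here.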

## pfSense facts (operator suspected an SNAT/return-path issue)
- Outbound NAT mode = hybrid; multi-WAN (WAN + FIBER).
- DNAT: wan:443 -> 172.16.3.20:18443 (NPM). (There is also a near-duplicate "Emby on Fiber" wan:443->.20:18443.)
- Manual outbound-NAT (SNAT) exists for src 172.16.3.10 -> 72.194.62.5, but there is NO explicit SNAT pinning
  NPM (172.16.3.20) -> 72.194.62.10; NPM replies rely on pf state/reply-to + automatic outbound NAT.
- BUT the capture shows the agent ACKing the server->agent command frames, which argues the reply path is
  currently delivering with an acceptable source IP. So the SNAT config gap, while real, does not appear to be
  dropping these particular server->agent frames.

## Agent self-logs
- The agent uploads logs to the server (agent->server works), but the last uploaded batch for PST-SERVER is
  from 22:21, the moment it updated to 0.6.63. At that point it successfully applied a ConfigUpdate and sent
  inventory, and at 22:23 ONE command succeeded and was acked. No fresh logs since: log upload is
  server-triggered, so it depends on the same broken server->agent channel. So we cannot see the agent's
  current-state logs remotely.

## The contradiction to resolve
The command bytes provably reach the agent's TCP and are ACKed, the agent processes the server's PING control
frames on the SAME connection (so its read loop is alive), yet it does not parse/ack/execute the command TEXT
frames. It worked once at 22:23 right after the 0.6.63 update, then stopped.

## Questions (each model, concise)
1. Given the command bytes are TCP-ACKed by the agent but never produce a CommandAck, and the same connection
   still processes pings: what agent-side mechanisms can make a WS client ACK TCP + handle Ping control frames
   but NOT deliver Text data frames to the application? (e.g., TLS record vs WS-frame reassembly state; a stuck
   read between the TLS layer and the WS dispatch; a per-connection WS read buffer that desyncs after a certain
   point; recv-buffer filling because the app stopped reading while the kernel keeps ACKing; a panic/dropped
   task in the command dispatch path; etc.) Rank them.
2. Is "kernel ACKs while the application stopped reading" plausible here, and how would the pings still get
   answered if the app stopped reading? (Reconcile carefully.)
3. Could anything OTHER than the agent still explain TCP-ACKed-but-not-processed (e.g., a middlebox at the
   agent's own site doing TCP proxying/termination that ACKs then mangles; TLS-inspection at the agent's UDR;
   GRO/LRO/TSO offload on the server-grade NICs corrupting reassembly only for these hosts)? Weigh these.
4. The single most decisive next test. We CAN: capture at any point in OUR infra (pfSense/NPM/origin), read the
   server DB, read the agent's LAST uploaded logs, change pfSense/NPM/server config, force the agent to
   reconnect (server-side eviction), restart NPM. We CANNOT push config to the broken agents or get on-site
   quickly. What single action best discriminates "agent app" vs "agent-site middlebox/offload" vs "still NPM/pf"?
5. Does the "worked at 22:23 right after update, never since" timing point anywhere specific?
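
To make one candidate from question 1 concrete (a dead task in the command dispatch path), here is a toy asyncio model. It is not the agent's code: the read loop, queue, frame tuples, and the "poison" command are all hypothetical. It only shows that a read loop can keep answering Pings while Text frames pile up behind a worker that died after one successful command, which would also fit the "worked once at 22:23, never since" timing in question 5.

```python
import asyncio


async def dispatcher(queue, acks):
    """Hypothetical command worker: acks each command, dies on a bad one."""
    while True:
        cmd = await queue.get()
        if cmd == "poison":
            raise RuntimeError("dispatch task died")
        acks.append(cmd)


async def read_loop(frames, queue, pongs):
    """Hypothetical read loop: answers pings inline, queues text frames."""
    for kind, payload in frames:
        if kind == "ping":
            pongs.append(payload)        # answered without the worker
        elif kind == "text":
            queue.put_nowait(payload)    # unbounded queue: never fails
        await asyncio.sleep(0)           # yield so the worker can run


async def simulate():
    queue = asyncio.Queue()
    acks, pongs = [], []
    worker = asyncio.create_task(dispatcher(queue, acks))
    frames = [
        ("text", "first"), ("ping", "p1"),
        ("text", "poison"), ("ping", "p2"),
        ("text", "CLEANA"), ("ping", "p3"),
        ("text", "CLEANB"),
    ]
    await read_loop(frames, queue, pongs)
    await asyncio.sleep(0)
    died = worker.done()
    if died:
        worker.exception()               # retrieve it so asyncio does not warn
    else:
        worker.cancel()
    return acks, pongs, died, queue.qsize()


acks, pongs, died, pending = asyncio.run(simulate())
print(acks, pongs, died, pending)
```

In this model one command is acked, every ping is answered, and the later commands stay queued with no error visible to the read loop. If the real agent is structured anything like this, a forced reconnect that does not restart the worker would not restore command delivery, while an agent process restart would.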