# GuruRMM command-delivery diagnosis — ROUND 6 (decisive capture + reconcile a contradiction)

Quorum continuing. A clean packet capture has produced a result that seems to CONTRADICT the leading
network/proxy hypotheses. Please reconcile it and pick the decisive next test.

## Path recap (no Cloudflare for agents)

agent -> site gateway -> internet -> pfSense (multi-WAN, public IP blocks 72.194.62.x AND 70.175.28.x;
DNAT wan:443 -> NPM 172.16.3.20:18443) -> NPM (Nginx Proxy Manager, terminates TLS) -> origin nginx .30:80
-> Rust server :3001.

Affected: PST-SERVER (98.190.129.150), PST-SERVER2 (64.139.88.249), GTS-PEDRO-H (68.230.27.220). These are
3 different ISPs/sites, all v0.6.63. ~40 other 0.6.63 agents work fine.

## DECISIVE NEW CAPTURE (on the NPM/Jupiter host, of the NPM<->PST-agent leg, during 3 marked commands)

- Dispatched 3 tiny commands (CLEANA/B/C) to PST-SERVER at known times.
- The NPM->agent direction showed exactly THREE 214-byte TLS frames (= the 3 commands; a hostname command is a
  193-byte WS Text frame + ~21B TLS overhead) interleaved with 31-byte frames (the server's WS Pings).
- The TCP sequence numbers advanced cleanly across the 214B frames with NO retransmissions (the apparent 3x
  duplication is just `tcpdump -i any` capturing each packet on veth+bridge+host).
- Interpretation: NPM DID forward all three command frames to the agent, and the agent's TCP ACKed them
  (no retransmit), i.e. the command BYTES reached the agent's kernel/TCP stack.
- Yet the agent never sent a CommandAck and never executed any of them (server-side acked_at stays null).
- Earlier captures already proved the frame leaves Rust and leaves origin nginx. The agent on this same
  connection also continues to answer the server's pings (it does not hit its 90s no-inbound reconnect) and
  sends heartbeats/metrics that DO reach the server (last_seen stays fresh; ~1346B agent->server frames were seen).
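
The size and retransmission reasoning in this capture can be checked mechanically. The following is a minimal Python sketch, not the tooling actually used: the `Segment` tuple, the interface names, and the sample sequence numbers are hypothetical, while the 193B, ~21B, and 31B figures come from the notes above. It filters to a single interface first, because de-duplicating on sequence number alone would also hide genuine retransmissions.

```python
from collections import Counter
from typing import NamedTuple

TLS_OVERHEAD = 21        # approximate per-record overhead quoted in the notes
COMMAND_WS_LEN = 193     # hostname command as a WS Text frame, per the notes
COMMAND_TLS_LEN = COMMAND_WS_LEN + TLS_OVERHEAD  # 214
PING_TLS_LEN = 31        # observed size of the server's WS Ping records


class Segment(NamedTuple):
    iface: str   # interface this copy was captured on (hypothetical names)
    seq: int     # TCP sequence number of the first payload byte
    length: int  # TCP payload length in bytes


def classify(length: int) -> str:
    """Label a segment by payload size alone (a heuristic, not a decode)."""
    if length == COMMAND_TLS_LEN:
        return "command"
    if length == PING_TLS_LEN:
        return "ping"
    return "other"


def summarize(segments, iface):
    """Count record types seen on one interface only."""
    return Counter(classify(s.length) for s in segments if s.iface == iface)


def retransmitted(segments, iface):
    """Sequence numbers that carry payload more than once on one interface."""
    counts = Counter(s.seq for s in segments if s.iface == iface and s.length > 0)
    return sorted(seq for seq, n in counts.items() if n > 1)


# Hypothetical NPM->agent stream: ping, command, ping, command, command.
capture = []
seq = 1000
for length in [31, 214, 31, 214, 214]:
    # `tcpdump -i any` sees each segment once per interface it crosses.
    for iface in ("veth", "bridge", "host"):
        capture.append(Segment(iface, seq, length))
    seq += length

print(summarize(capture, "host"))
print(retransmitted(capture, "host"))
```

With the triplicated sample, one interface yields three command-sized and two ping-sized segments and an empty retransmission list, which is the pattern described in the capture.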

## What this seems to rule out

- NPM swallowing the frame (it forwarded all 3).
- WAN black-hole / wrong-source SNAT on the server->agent direction (a wrong source IP would cause the agent
  NOT to ACK -> retransmissions; we saw clean ACKs, no retransmits).
- The frame never reaching the agent.

## pfSense facts (operator suspected an SNAT/return-path issue)

- Outbound NAT mode = hybrid; multi-WAN (WAN + FIBER).
- DNAT: wan:443 -> 172.16.3.20:18443 (NPM). (There is also a near-duplicate "Emby on Fiber" wan:443->.20:18443.)
- Manual outbound-NAT (SNAT) exists for src 172.16.3.10 -> 72.194.62.5, but there is NO explicit SNAT pinning
  NPM (172.16.3.20) -> 72.194.62.10; NPM replies rely on pf state/reply-to + automatic outbound NAT.
- BUT the capture shows the agent ACKing the server->agent command frames, which argues the reply path is
  currently delivering with an acceptable source IP. So the SNAT config gap, while real, does not appear to be
  dropping these particular server->agent frames.

## Agent self-logs

- The agent uploads logs to the server (agent->server works), but the last uploaded batch for PST-SERVER is
  from 22:21, the moment it updated to 0.6.63. At that point it successfully applied a ConfigUpdate and sent
  inventory, and at 22:23 ONE command succeeded and was acked. No fresh logs have arrived since, because log
  upload is server-triggered and so depends on the broken channel. We therefore cannot see the agent's
  current-state logs remotely.

## The contradiction to resolve

The command bytes provably reach the agent's TCP and are ACKed, the agent processes the server's PING control
frames on the SAME connection (so its read loop is alive), yet it does not parse/ack/execute the command TEXT
frames. It worked once at 22:23 right after the 0.6.63 update, then stopped.
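
For reference, Ping and Text frames share one header layout and differ only in the opcode nibble of the first byte, so both are read off the same byte stream before being dispatched. A minimal sketch of that layout per RFC 6455, assuming unmasked, unfragmented server-to-client frames; the helper names and the 189-byte payload (chosen so the frame totals 193 bytes) are illustrative and not taken from the agent's code:

```python
import struct

OPCODES = {
    0x0: "continuation", 0x1: "text", 0x2: "binary",
    0x8: "close", 0x9: "ping", 0xA: "pong",
}


def build_frame(opcode: int, payload: bytes) -> bytes:
    """Build an unmasked, unfragmented frame (server-to-client direction)."""
    first = 0x80 | opcode  # FIN=1, RSV bits clear
    n = len(payload)
    if n <= 125:
        header = struct.pack("!BB", first, n)
    elif n <= 0xFFFF:
        header = struct.pack("!BBH", first, 126, n)   # 16-bit extended length
    else:
        header = struct.pack("!BBQ", first, 127, n)   # 64-bit extended length
    return header + payload


def parse_header(frame: bytes):
    """Return (opcode name, is_control, header length, payload length)."""
    first, second = frame[0], frame[1]
    opcode = first & 0x0F
    masked = bool(second & 0x80)
    length = second & 0x7F
    offset = 2
    if length == 126:
        (length,) = struct.unpack_from("!H", frame, offset)
        offset += 2
    elif length == 127:
        (length,) = struct.unpack_from("!Q", frame, offset)
        offset += 8
    if masked:
        offset += 4  # masking key, present on client-to-server frames
    # Opcodes 0x8 and above are control frames; below that are data frames.
    return OPCODES.get(opcode, "reserved"), opcode >= 0x8, offset, length


ping = build_frame(0x9, b"12345678")   # hypothetical 8-byte ping payload
text = build_frame(0x1, b"x" * 189)    # hypothetical 189-byte command payload
print(parse_header(ping), len(ping))
print(parse_header(text), len(text))
```

Control frames are commonly answered inside the WebSocket library itself, while data frames are handed on to application code, which is one place where the two paths can diverge after a shared read loop.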

## Questions (each model, concise)

1. Given the command bytes are TCP-ACKed by the agent but never produce a CommandAck, and the same connection
   still processes pings: what agent-side mechanisms can make a WS client ACK TCP + handle Ping control frames
   but NOT deliver Text data frames to the application? (e.g., TLS record vs WS-frame reassembly state; a stuck
   read between the TLS layer and the WS dispatch; a per-connection WS read buffer that desyncs after a certain
   point; recv-buffer filling because the app stopped reading while the kernel keeps ACKing; a panic/dropped
   task in the command dispatch path; etc.) Rank them.
2. Is "kernel ACKs while the application stopped reading" plausible here, and how would the pings still get
   answered if the app stopped reading? (Reconcile carefully.)
3. Could anything OTHER than the agent still explain TCP-ACKed-but-not-processed (e.g., a middlebox at the
   agent's own site doing TCP proxying/termination that ACKs then mangles; TLS-inspection at the agent's UDR;
   GRO/LRO/TSO offload on the server-grade NICs corrupting reassembly only for these hosts)? Weigh these.
4. The single most decisive next test. We CAN: capture at any point in OUR infra (pfSense/NPM/origin), read the
   server DB, read the agent's LAST uploaded logs, change pfSense/NPM/server config, force the agent to
   reconnect (server-side eviction), restart NPM. We CANNOT push config to the broken agents or get on-site
   quickly. What single action best discriminates "agent app" vs "agent-site middlebox/offload" vs "still NPM/pf"?
5. Does the "worked at 22:23 right after update, never since" timing point anywhere specific?
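
As background for question 2: the kernel accepts and acknowledges inbound TCP data independently of whether the process ever reads it, until the receive buffer fills. A small loopback sketch of that behaviour, which assumes nothing about the agent itself; `queued_without_read` is a hypothetical helper, and `MSG_PEEK` is used only to observe the kernel buffer without consuming it:

```python
import socket
import time


def queued_without_read(payload: bytes, timeout: float = 2.0) -> int:
    """Send `payload` to a peer whose application never consumes it.

    Returns how many bytes are sitting in the peer's kernel receive
    buffer, measured with MSG_PEEK so that nothing is actually read.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # loopback, ephemeral port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    try:
        client.sendall(payload)       # returns without any recv() on the peer
        peer.setblocking(False)
        deadline = time.monotonic() + timeout
        queued = 0
        while queued < len(payload) and time.monotonic() < deadline:
            try:
                queued = len(peer.recv(len(payload), socket.MSG_PEEK))
            except BlockingIOError:
                pass                  # nothing queued yet
            time.sleep(0.01)
        return queued
    finally:
        client.close()
        peer.close()
        listener.close()


print(queued_without_read(b"x" * 214))
```

A Pong, by contrast, is normally generated in user space by the WebSocket library, so it only goes out while the process is still reading from the socket.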