claudetools/tmp/rmm-diag-round4.md

# GuruRMM command-delivery diagnosis — ROUND 4 (new packet-capture evidence)

Quorum of two models, continuing. New hard evidence has localized the failure further. Please
update your hypothesis ranking and pick the single best next action.

## Recap (established)
- Agent path (NO Cloudflare proxying; agents use the grey-cloud, DNS-only rmm-api record): `agent -> endpoint gateway (UniFi UDR) ->
  internet -> public IP 72.194.62.10 -> edge firewall DNAT -> NPM (Nginx Proxy Manager, host .20, Docker
  on an Unraid box; TERMINATES TLS) -> http://172.16.3.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust server)`.
- Symptom: affected agents (PST-SERVER and PST-SERVER2 at the same site/UDR; GTS-PEDRO-H at a different site; all v0.6.63)
  heartbeat fine (agent->server OK) and do not trip their 90s no-inbound reconnect, but never ACK or execute
  server->agent commands (neither tiny 80B nor 4KB ones). ~40 other agents on the SAME path get commands fine.

## NEW EVIDENCE — packet capture on the origin host (.30), both legs simultaneously
We dispatched 5 uniquely-marked tiny `hostname-FINALMARKn-www` commands to the stuck PST-SERVER while
capturing on .30 (loopback + LAN). Findings (payloads are PLAINTEXT — WS permessage-deflate is NOT in use,
so frames are directly readable; agent->server frames are masked as expected):

- On the **loopback :3001** leg (Rust server <-> origin nginx): the command Text frame is present, e.g.:
  `localhost.3001 > localhost.54036  {"type":"command","payload":{"id":"c109e270...","command":"hostname-FINALMARK1-www",...}}`
  => **The Rust server DID emit the WS Text frame.**
- On the **.30 -> .20:80** leg (origin nginx -> NPM): the SAME command frame is present, e.g.:
  `gururmm.http > 172.16.3.20.45614  {"type":"command",...FINALMARK1...}`
  => **The origin nginx DID forward the frame out toward NPM (.20).**
- So the frame provably traverses the Rust server AND the origin nginx and is sent to NPM. It still never
  reaches the agent (no ACK, no execution).
- (Minor: the marker appeared 13x on loopback vs 8x on the LAN leg over the window — possibly retransmit/
  re-offer timing; not yet interpreted. We did NOT capture on NPM (.20) or on the agent.)
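
One way to start interpreting the 13x vs 8x marker counts is to separate TCP-level retransmissions (same flow, same sequence number) from genuinely distinct sends (new sequence numbers) on each leg. The following is a minimal Python sketch, assuming each capture was printed as text with `tcpdump -A` in its default header format and that a command frame fits on one payload line. `count_markers` is a hypothetical helper written for illustration, not existing tooling, and the regexes would need adjusting for other tcpdump options.

```python
import re
from collections import defaultdict

# Any packet header line as printed by tcpdump (assumed default text format).
PACKET = re.compile(r"^\S+ IP6? (?P<src>\S+) > (?P<dst>\S+): (?P<rest>.*)$")
# Data-carrying segments print a "seq start:end" range; pure ACKs do not.
SEQ_RANGE = re.compile(r"\bseq (\d+):(\d+)")
# The test markers used in this round.
MARKER = re.compile(r"hostname-FINALMARK(\d+)-www")


def count_markers(dump_text):
    """Map marker number -> segments carrying it, and how many are distinct sends."""
    total = defaultdict(int)      # marker -> segments that contain it
    distinct = defaultdict(set)   # marker -> set of (src, dst, seq_start)
    current = None                # key of the segment whose payload is being read
    counted = set()               # markers already counted for this segment
    for line in dump_text.splitlines():
        header = PACKET.match(line)
        if header:
            seq = SEQ_RANGE.search(header["rest"])
            current = (header["src"], header["dst"], int(seq[1])) if seq else None
            counted = set()
            continue
        if current is None:
            continue
        for n in MARKER.findall(line):
            if n not in counted:
                counted.add(n)
                total[n] += 1
                distinct[n].add(current)
    return {n: {"segments": total[n], "distinct": len(distinct[n])} for n in total}
```

If `segments` exceeds `distinct` for a marker on one leg, the surplus is likely TCP retransmission on that leg. If `distinct` itself differs between the two legs, the extra copies were separate application-level sends, which would fit the re-offer explanation better than retransmission.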

## What this rules in/out
- Refuted: "Rust send is lying" and "origin nginx (.30) swallows/buffers the frame" — the frame leaves both.
- The death is now localized to: **NPM (.20) itself, OR the NPM->agent leg (public internet / the endpoint's
  UDR gateway), OR the agent.**

## Constraints on next tests
- We CAN now capture on the NPM box (.20, Unraid host; we have root). The NPM->agent leg is TLS (encrypted),
  so on .20 we could see whether NPM emits TLS application-data toward the agent's WAN IP (98.190.129.150 for
  PST) shortly after receiving each command, and whether those packets are ACK'd or retransmitted (a black-hole
  signature) — but we cannot read the frame contents on that leg.
- We have a CATCH-22 for any "point the agent at a bypass" test: the only way to reconfigure an affected
  agent's server URL is to send it a command/config — which is the very channel that's broken. There is no
  known second, WORKING agent at the PST site. (On-site access would be required to re-point an agent there.)
- We can freely add config to NPM (via its API/advanced-config) and to the origin nginx, and reload.
- ~40 other agents work fine through this identical NPM. The affected set is a subset.
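
The flow/retransmit analysis described above for a .20 capture could be automated roughly as follows. This is a sketch under stated assumptions: the capture is printed as text with `tcpdump -n` (numeric addresses), sequence numbers do not wrap during the window, and the NPM-side address and port in the usage example (`172.16.3.20.443`) are illustrative, since Docker NAT on the Unraid host may present a different endpoint. `summarize_flows` and its verdict labels are hypothetical names, not existing tooling.

```python
import re
from collections import defaultdict

# Packet header line as printed by `tcpdump -n` (assumed default text format).
PACKET = re.compile(r"^\S+ IP (?P<src>\S+) > (?P<dst>\S+): (?P<rest>.*)$")
SEQ_RANGE = re.compile(r"\bseq (\d+):(\d+)")   # present only on data segments
ACK = re.compile(r"\back (\d+)")               # cumulative ACK number


def summarize_flows(dump_text, agent_ip):
    """Summarize data sent toward agent_ip and how much of it the agent ACKed."""
    prefix = agent_ip + "."   # endpoints print as "ip.port" with -n
    flows = defaultdict(
        lambda: {"sent": set(), "retransmits": 0, "max_end": 0, "max_ack": 0}
    )
    for line in dump_text.splitlines():
        m = PACKET.match(line)
        if not m:
            continue
        src, dst, rest = m["src"], m["dst"], m["rest"]
        if dst.startswith(prefix):            # NPM -> agent direction
            seq = SEQ_RANGE.search(rest)
            if seq is None:
                continue                      # pure ACK, no payload
            span = (int(seq[1]), int(seq[2]))
            flow = flows[(src, dst)]
            if span in flow["sent"]:
                flow["retransmits"] += 1      # same bytes sent again
            else:
                flow["sent"].add(span)
            flow["max_end"] = max(flow["max_end"], span[1])
        elif src.startswith(prefix):          # agent -> NPM direction
            ack = ACK.search(rest)
            if ack:
                flow = flows[(dst, src)]
                flow["max_ack"] = max(flow["max_ack"], int(ack[1]))
    report = {}
    for key, flow in flows.items():
        unacked = max(0, flow["max_end"] - flow["max_ack"])
        if not flow["sent"]:
            verdict = "no-data-sent"          # nothing emitted toward the agent
        elif unacked == 0:
            verdict = "acked"                 # agent's TCP stack received everything
        elif flow["retransmits"] > 0:
            verdict = "black-hole-like"       # retransmitted and still unacknowledged
        else:
            verdict = "inconclusive"
        report[key] = {
            "segments": len(flow["sent"]),
            "retransmits": flow["retransmits"],
            "unacked_bytes": unacked,
            "verdict": verdict,
        }
    return report
```

Calling `summarize_flows(dump_text, "98.190.129.150")` on a window that brackets one command dispatch would give one entry per TCP connection to the PST WAN IP. Because this leg is TLS, the result can only be tied to a specific command by timing, so it is most useful when a single marked command is sent in an otherwise quiet window.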

## Please answer (each model, concise)
1. Updated ranked hypotheses given the frame reaches NPM but not the agent. Where do you now put: NPM
   buffering/relay defect; NPM->WAN/UDR black-hole (H3); agent-side defect (H2)? What does "subset of agents
   on the same NPM" imply about NPM-internal vs per-connection-WAN causes?
2. The single most decisive next test we can actually run (given we can capture on .20 but CANNOT easily
   reconfigure the broken agents). For each outcome, what it tells us. Specifically: is a .20 capture of the
   NPM->agent leg (encrypted; flow/retransmit analysis only) decisive enough, or is there a better move?
3. Does this new evidence change the verdict on the operator's "remove NPM / NAT straight to .30" idea — is it
   now more justified, or is there still a cheaper test that isolates NPM specifically?