claudetools/tmp/rmm-diag-round5.md

# GuruRMM command-delivery diagnosis — ROUND 5

Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest
useful diagnostic plan; we have several more rounds available, so think hard about what would actually
discriminate the remaining hypotheses.

## Where we are (established by packet capture)
- Agent path (NO Cloudflare): `agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager,
  Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust)`.
- A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded
  by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec).
- So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated.
- A capture on the NPM box of the NPM->agent leg is in progress (results not yet in).

## NEW: the affected population vs the working majority (this is the key analytical puzzle)
Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites:
- PST-SERVER     98.190.129.150   (site A; UniFi UDR gateway)
- PST-SERVER2    64.139.88.249    (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway)
- GTS-PEDRO-H    68.230.27.220    (site C, different ISP again)
All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine.

So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT:
- a single site's gateway (three different gateways/ISPs),
- the NPM proxy globally (40 work through it),
- the agent version globally (40 work on 0.6.63).
What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old
connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config
pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.)

## NEW: NPM internal config (dumped from the container)
- NPM proxy host for rmm-api/rmm: `http2 on;`, WS headers present (`proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection $http_connection; proxy_http_version 1.1;`), `include block-exploits.conf;`,
  forwards `http://172.16.3.30:80`. NPM's `proxy.conf` has **no `proxy_buffering off;` and no
  `proxy_read_timeout` override** -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s).
- NPM error log: many `"an upstream response is buffered to a temporary file while reading upstream"` warnings,
  but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the
  WS connections and NONE mentioning the affected agents' IPs.
- NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which
  complicates per-connection correlation there (we are capturing on the host instead).

## Questions for round 5
1. Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent
   Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2).
   The three-different-paths fact seems to argue strongly against independent per-site network black-holes.
   What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable.
2. Does `http2 on` at NPM combined with default `proxy_buffering on` and a WS upgrade create any known failure
   mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a
   known nginx/NPM bug where the *upgraded* connection's downstream (server->client) frames can stall under
   specific conditions while pings/heartbeats pass?
3. Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host;
   read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a
   brief reconnect of all ~200 agents]; add/remove NPM config like `proxy_buffering off` or `http2 off` and
   reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config
   to the broken agents nor get on-site easily). Order them by (information gained / risk).
4. Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would
   "stuck agents recover after NPM restart" vs "they do not" each prove?