# GuruRMM command-delivery diagnosis — ROUND 5 Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest useful diagnostic plan; we have several more rounds available, so think hard about what would actually discriminate the remaining hypotheses. ## Where we are (established by packet capture) - Agent path (NO Cloudflare): `agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager, Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust)`. - A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec). - So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated. - A capture on the NPM box of the NPM->agent leg is in progress (results not yet in). ## NEW: the affected population vs the working majority (this is the key analytical puzzle) Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites: - PST-SERVER 98.190.129.150 (site A; UniFi UDR gateway) - PST-SERVER2 64.139.88.249 (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway) - GTS-PEDRO-H 68.230.27.220 (site C, different ISP again) All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine. So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT: - a single site's gateway (three different gateways/ISPs), - the NPM proxy globally (40 work through it), - the agent version globally (40 work on 0.6.63). What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.) ## NEW: NPM internal config (dumped from the container) - NPM proxy host for rmm-api/rmm: `http2 on;`, WS headers present (`proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $http_connection; proxy_http_version 1.1;`), `include block-exploits.conf;`, forwards `http://172.16.3.30:80`. NPM's `proxy.conf` has **no `proxy_buffering off;` and no `proxy_read_timeout` override** -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s). - NPM error log: many `"an upstream response is buffered to a temporary file while reading upstream"` warnings, but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the WS connections and NONE mentioning the affected agents' IPs. - NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which complicates per-connection correlation there (we are capturing on the host instead). ## Questions for round 5 1. Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2). The three-different-paths fact seems to argue strongly against independent per-site network black-holes. What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable. 2. Does `http2 on` at NPM combined with default `proxy_buffering on` and a WS upgrade create any known failure mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a known nginx/NPM bug where the *upgraded* connection's downstream (server->client) frames can stall under specific conditions while pings/heartbeats pass? 3. Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host; read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a brief reconnect of all ~200 agents]; add/remove NPM config like `proxy_buffering off` or `http2 off` and reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config to the broken agents nor get on-site easily). Order them by (information gained / risk). 4. Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would "stuck agents recover after NPM restart" vs "they do not" each prove?