GuruRMM command-delivery diagnosis — ROUND 5
Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest useful diagnostic plan; we have several more rounds available, so think hard about what would actually discriminate the remaining hypotheses.
Where we are (established by packet capture)
- Agent path (NO Cloudflare):
agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager, Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust).
- A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec).
- So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated.
- A capture on the NPM box of the NPM->agent leg is in progress (results not yet in).
NEW: the affected population vs the working majority (this is the key analytical puzzle)
Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites:
- PST-SERVER 98.190.129.150 (site A; UniFi UDR gateway)
- PST-SERVER2 64.139.88.249 (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway)
- GTS-PEDRO-H 68.230.27.220 (site C, different ISP again)
All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine.
So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT:
- a single site's gateway (three different gateways/ISPs),
- the NPM proxy globally (40 work through it),
- the agent version globally (40 work on 0.6.63).
What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.)
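The "common to these three, absent from the 40" test can be run mechanically once the agents' DB rows are exported. The following is a minimal sketch only: the column names (os_class, enrollment_cohort) and all values except the 0.6.63 version are made-up placeholders, not real GuruRMM schema or data, and the function name is hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]  # one agent's candidate attributes (hashable values only)


def discriminating_attributes(affected: List[Row], working: List[Row]) -> Row:
    """Return attribute/value pairs held by EVERY affected agent and by NO working agent."""
    if not affected:
        return {}
    # Start from the first affected agent and keep only pairs shared by all of them.
    shared = set(affected[0].items())
    for row in affected[1:]:
        shared &= set(row.items())
    # Discard any pair that also appears on a working agent.
    for row in working:
        shared -= set(row.items())
    return dict(shared)


# Placeholder rows: "X", "Y", "c1", "c2" are illustrative values, NOT findings.
affected = [
    {"agent_version": "0.6.63", "os_class": "X", "enrollment_cohort": "c1"},
    {"agent_version": "0.6.63", "os_class": "X", "enrollment_cohort": "c2"},
    {"agent_version": "0.6.63", "os_class": "X", "enrollment_cohort": "c1"},
]
working = [
    {"agent_version": "0.6.63", "os_class": "Y", "enrollment_cohort": "c1"},
    {"agent_version": "0.6.63", "os_class": "Y", "enrollment_cohort": "c2"},
]

print(discriminating_attributes(affected, working))
```

With these placeholder rows only os_class survives: agent_version is shared by the working agents and enrollment_cohort is not uniform across the affected ones. An empty result on real data would mean none of the exported columns discriminates, which argues for a runtime or connection-state attribute rather than a stored one.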
NEW: NPM internal config (dumped from the container)
- NPM proxy host for rmm-api/rmm: http2 on;, WS headers present (proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $http_connection; proxy_http_version 1.1;), include block-exploits.conf;, forwards to http://172.16.3.30:80. NPM's proxy.conf has no proxy_buffering off; and no proxy_read_timeout override -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s).
- NPM error log: many "an upstream response is buffered to a temporary file while reading upstream" warnings, but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the WS connections and NONE mentioning the affected agents' IPs.
- NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which complicates per-connection correlation there (we are capturing on the host instead).
Questions for round 5
- Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2). The three-different-paths fact seems to argue strongly against independent per-site network black-holes. What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable.
- Does http2 on at NPM combined with default proxy_buffering on and a WS upgrade create any known failure mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a known nginx/NPM bug where the upgraded connection's downstream (server->client) frames can stall under specific conditions while pings/heartbeats pass?
- Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host; read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a brief reconnect of all ~200 agents]; add/remove NPM config like proxy_buffering off or http2 off and reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config to the broken agents nor get on-site easily). Order them by (information gained / risk).
- Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would "stuck agents recover after NPM restart" vs "they do not" each prove?