Files

Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

4.5 KiB

Raw Blame History

GuruRMM command-delivery diagnosis — ROUND 5

Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest useful diagnostic plan; we have several more rounds available, so think hard about what would actually discriminate the remaining hypotheses.

Where we are (established by packet capture)

Agent path (NO Cloudflare): agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager, Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust).
A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec).
So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated.
A capture on the NPM box of the NPM->agent leg is in progress (results not yet in).

NEW: the affected population vs the working majority (this is the key analytical puzzle)

Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites:

PST-SERVER 98.190.129.150 (site A; UniFi UDR gateway)
PST-SERVER2 64.139.88.249 (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway)
GTS-PEDRO-H 68.230.27.220 (site C, different ISP again) All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine.

So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT:

a single site's gateway (three different gateways/ISPs),
the NPM proxy globally (40 work through it),
the agent version globally (40 work on 0.6.63). What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.)

NEW: NPM internal config (dumped from the container)

NPM proxy host for rmm-api/rmm: http2 on;, WS headers present (proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $http_connection; proxy_http_version 1.1;), include block-exploits.conf;, forwards http://172.16.3.30:80. NPM's proxy.conf has no proxy_buffering off; and no proxy_read_timeout override -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s).
NPM error log: many "an upstream response is buffered to a temporary file while reading upstream" warnings, but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the WS connections and NONE mentioning the affected agents' IPs.
NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which complicates per-connection correlation there (we are capturing on the host instead).

Questions for round 5

Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2). The three-different-paths fact seems to argue strongly against independent per-site network black-holes. What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable.
Does http2 on at NPM combined with default proxy_buffering on and a WS upgrade create any known failure mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a known nginx/NPM bug where the upgraded connection's downstream (server->client) frames can stall under specific conditions while pings/heartbeats pass?
Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host; read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a brief reconnect of all ~200 agents]; add/remove NPM config like proxy_buffering off or http2 off and reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config to the broken agents nor get on-site easily). Order them by (information gained / risk).
Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would "stuck agents recover after NPM restart" vs "they do not" each prove?

4.5 KiB Raw Blame History