57 lines
4.5 KiB
Markdown
57 lines
4.5 KiB
Markdown
# GuruRMM command-delivery diagnosis — ROUND 5
|
|
|
|
Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest
|
|
useful diagnostic plan; we have several more rounds available, so think hard about what would actually
|
|
discriminate the remaining hypotheses.
|
|
|
|
## Where we are (established by packet capture)
|
|
- Agent path (NO Cloudflare): `agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager,
|
|
Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust)`.
|
|
- A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded
|
|
by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec).
|
|
- So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated.
|
|
- A capture on the NPM box of the NPM->agent leg is in progress (results not yet in).
|
|
|
|
## NEW: the affected population vs the working majority (this is the key analytical puzzle)
|
|
Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites:
|
|
- PST-SERVER 98.190.129.150 (site A; UniFi UDR gateway)
|
|
- PST-SERVER2 64.139.88.249 (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway)
|
|
- GTS-PEDRO-H 68.230.27.220 (site C, different ISP again)
|
|
All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine.
|
|
|
|
So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT:
|
|
- a single site's gateway (three different gateways/ISPs),
|
|
- the NPM proxy globally (40 work through it),
|
|
- the agent version globally (40 work on 0.6.63).
|
|
What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old
|
|
connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config
|
|
pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.)
|
|
|
|
## NEW: NPM internal config (dumped from the container)
|
|
- NPM proxy host for rmm-api/rmm: `http2 on;`, WS headers present (`proxy_set_header Upgrade $http_upgrade;
|
|
proxy_set_header Connection $http_connection; proxy_http_version 1.1;`), `include block-exploits.conf;`,
|
|
forwards `http://172.16.3.30:80`. NPM's `proxy.conf` has **no `proxy_buffering off;` and no
|
|
`proxy_read_timeout` override** -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s).
|
|
- NPM error log: many `"an upstream response is buffered to a temporary file while reading upstream"` warnings,
|
|
but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the
|
|
WS connections and NONE mentioning the affected agents' IPs.
|
|
- NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which
|
|
complicates per-connection correlation there (we are capturing on the host instead).
|
|
|
|
## Questions for round 5
|
|
1. Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent
|
|
Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2).
|
|
The three-different-paths fact seems to argue strongly against independent per-site network black-holes.
|
|
What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable.
|
|
2. Does `http2 on` at NPM combined with default `proxy_buffering on` and a WS upgrade create any known failure
|
|
mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a
|
|
known nginx/NPM bug where the *upgraded* connection's downstream (server->client) frames can stall under
|
|
specific conditions while pings/heartbeats pass?
|
|
3. Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host;
|
|
read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a
|
|
brief reconnect of all ~200 agents]; add/remove NPM config like `proxy_buffering off` or `http2 off` and
|
|
reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config
|
|
to the broken agents nor get on-site easily). Order them by (information gained / risk).
|
|
4. Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would
|
|
"stuck agents recover after NPM restart" vs "they do not" each prove?
|