Files
claudetools/tmp/rmm-diag-round5.md
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

4.5 KiB

GuruRMM command-delivery diagnosis — ROUND 5

Quorum of two models, continuing. New config + population evidence. Re-rank and propose the deepest useful diagnostic plan; we have several more rounds available, so think hard about what would actually discriminate the remaining hypotheses.

Where we are (established by packet capture)

  • Agent path (NO Cloudflare): agent -> site gateway -> internet -> public IP -> NPM (Nginx Proxy Manager, Docker bridge container on Unraid host .20, TERMINATES TLS) -> http://.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust).
  • A marked command Text frame provably (a) leaves the Rust server (loopback :3001 capture) and (b) is forwarded by the origin nginx out toward NPM (.30->.20:80 capture). It never reaches the agent (no CommandAck, no exec).
  • So the failure is in the NPM(.20)->agent segment. Everything upstream is exonerated.
  • A capture on the NPM box of the NPM->agent leg is in progress (results not yet in).

NEW: the affected population vs the working majority (this is the key analytical puzzle)

Affected agents and their PUBLIC IPs (from agent-reported metrics) — note THREE DIFFERENT ISPs/sites:

  • PST-SERVER 98.190.129.150 (site A; UniFi UDR gateway)
  • PST-SERVER2 64.139.88.249 (site B, different physical site/ISP from PST-SERVER; also a UniFi gateway)
  • GTS-PEDRO-H 68.230.27.220 (site C, different ISP again) All three are v0.6.63. ~40 OTHER agents on the SAME NPM + SAME 0.6.63 receive commands fine.

So the root cause must be something COMMON to these three and ABSENT from the 40 working ones. It is NOT:

  • a single site's gateway (three different gateways/ISPs),
  • the NPM proxy globally (40 work through it),
  • the agent version globally (40 work on 0.6.63). What could the shared factor be? (e.g., all three are WINDOWS SERVERS rather than workstations? long-lived/old connections? a particular enrollment/update cohort? NIC/TCP-offload on server hardware? a specific policy/config pushed to them? connection idle pattern? Please reason about what class of shared attribute fits.)

NEW: NPM internal config (dumped from the container)

  • NPM proxy host for rmm-api/rmm: http2 on;, WS headers present (proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $http_connection; proxy_http_version 1.1;), include block-exploits.conf;, forwards http://172.16.3.30:80. NPM's proxy.conf has no proxy_buffering off; and no proxy_read_timeout override -> nginx DEFAULTS apply (proxy_buffering on; proxy_read_timeout 60s).
  • NPM error log: many "an upstream response is buffered to a temporary file while reading upstream" warnings, but ALL are for NON-WebSocket HTTP requests (/downloads/ the agent .exe, /api/ GETs, /assets/*). NONE for the WS connections and NONE mentioning the affected agents' IPs.
  • NPM container is bridge-networked; inside the container, client source IPs are NAT-masked by Docker, which complicates per-connection correlation there (we are capturing on the host instead).

Questions for round 5

  1. Given three DIFFERENT ISPs/sites all show the identical "control frames + heartbeats fine, server->agent Text frames never delivered" signature, re-rank: NPM-relay/state vs per-site-gateway (H3) vs agent-side (H2). The three-different-paths fact seems to argue strongly against independent per-site network black-holes. What shared attribute best explains "these 3 but not the other 40"? Be concrete and falsifiable.
  2. Does http2 on at NPM combined with default proxy_buffering on and a WS upgrade create any known failure mode for server->client data frames on long-lived connections (vs client->server which works)? Is there a known nginx/NPM bug where the upgraded connection's downstream (server->client) frames can stall under specific conditions while pings/heartbeats pass?
  3. Propose the most discriminating SEQUENCE of tests for the next 2-3 rounds (we can: capture on .20 host; read NPM/agent TCP socket stats via ss on the host; restart the NPM container as a one-shot probe [costs a brief reconnect of all ~200 agents]; add/remove NPM config like proxy_buffering off or http2 off and reload; query the affected agents' DB rows for any shared policy/enrollment attribute; we CANNOT push config to the broken agents nor get on-site easily). Order them by (information gained / risk).
  4. Specifically: is restarting the NPM container a worthwhile one-shot probe now, and exactly what would "stuck agents recover after NPM restart" vs "they do not" each prove?