Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 2 (important topology correction)

You analyzed this in round 1 (your prior answer is included at the bottom). A FACTUAL CORRECTION to the architecture has come to light that invalidates part of the round-1 premise. Please re-evaluate from scratch where needed. Be willing to discard your prior top hypothesis.

CRITICAL CORRECTION to the ingress topology

In round 1 we told you: "agent -> Cloudflare -> nginx -> Rust server". That was wrong for the AGENT path. Verified facts now:

The agent's hard-coded WebSocket URL is wss://rmm-api.azcomputerguru.com/ws (from agent source: DEFAULT_SERVER_URL, and the agent config default). The installer and enrollment also use rmm-api.azcomputerguru.com.
DNS / Cloudflare proxy status for the two hostnames (both A records point to the same public IP, 72.194.62.10):
- rmm.azcomputerguru.com -> proxied = true (orange cloud; goes THROUGH Cloudflare) — this is the human DASHBOARD.
- rmm-api.azcomputerguru.com -> proxied = false (grey cloud; DNS-only, BYPASSES Cloudflare) — this is what the AGENTS use.
Therefore Cloudflare is NOT in the agent's path at all. All round-1 hypotheses about Cloudflare WAF / CF WebSocket buffering / CF frame handling are moot for agent command delivery. (Cloudflare only fronts the dashboard.)
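
The proxied/DNS-only split can be spot-checked from any machine by resolving each hostname and testing the answers against Cloudflare's published IPv4 ranges. The following is a minimal Python sketch under stated assumptions: the range list is a snapshot that may be stale, and is_cloudflare_ip / classify_host are hypothetical helper names, not part of GuruRMM. A proxied (orange-cloud) name should resolve to Cloudflare addresses, while the grey-cloud name should resolve straight to 72.194.62.10.

```python
import ipaddress
import socket

# Snapshot of Cloudflare's published IPv4 ranges. ASSUMPTION: this list may be
# out of date; refresh it from Cloudflare's published list before relying on it.
CLOUDFLARE_V4 = [
    "173.245.48.0/20", "103.21.244.0/22", "103.22.200.0/22", "103.31.4.0/22",
    "141.101.64.0/18", "108.162.192.0/18", "190.93.240.0/20", "188.114.96.0/20",
    "197.234.240.0/22", "198.41.128.0/17", "162.158.0.0/15", "104.16.0.0/13",
    "104.24.0.0/14", "172.64.0.0/13", "131.0.72.0/22",
]
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in CLOUDFLARE_V4]


def is_cloudflare_ip(ip: str) -> bool:
    """Return True if ip falls inside one of the listed Cloudflare ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in _NETWORKS)


def classify_host(hostname: str, resolver=socket.gethostbyname_ex) -> dict:
    """Resolve hostname (IPv4 only) and map each address to a Cloudflare flag.

    resolver is injectable so the logic can be exercised without network access.
    """
    _name, _aliases, addresses = resolver(hostname)
    return {ip: is_cloudflare_ip(ip) for ip in addresses}


# Usage (requires DNS access):
#   classify_host("rmm.azcomputerguru.com")      # expect Cloudflare addresses
#   classify_host("rmm-api.azcomputerguru.com")  # expect 72.194.62.10, not Cloudflare
```

If the rmm-api name ever comes back with an address flagged True, the grey-cloud premise above no longer holds and the round-1 Cloudflare hypotheses would need to be reinstated.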

The ACTUAL agent path (verified)

agent (endpoint, e.g. PST-SERVER behind a UniFi UDR/"Cloud Gateway Ultra"-class gateway)
  -> endpoint LAN/NAT
  -> public internet
  -> 72.194.62.10  (public IP; this is the NPM box)
  -> NPM = "Nginx Proxy Manager" on host 172.16.3.20  (terminates TLS; one nginx layer)
       NPM proxy-host settings for rmm-api: forward http://172.16.3.30:80, websockets=ON,
       http2_support=ON, block_exploits=ON (NPM "Block Common Exploits"), caching=OFF, advanced_config=EMPTY
  -> http://172.16.3.30:80   (the ORIGIN nginx; a SECOND nginx layer)  -- PLAINTEXT HTTP over the LAN here
  -> proxy_pass http://127.0.0.1:3001   (the Rust server)

So there are TWO nginx proxy layers in series (NPM on .20, then origin nginx on .30), no CDN.

Origin nginx /ws block (verbatim):

location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}

(The block has no explicit proxy_buffering off; directive and no proxy_send_timeout. NPM's generated config for the proxy host is the standard NPM template with websockets and block_exploits enabled; we have not yet dumped its exact directives.)
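
Once NPM's generated config has been dumped, the same presence/absence check can be applied to both nginx layers mechanically. The sketch below is a deliberately naive Python helper (audit_directives and the WATCHED list are hypothetical names chosen for illustration). It does not understand nginx block scoping or includes, so it assumes the caller passes in only the text of the block of interest.

```python
import re

# Directives relevant to long-lived WebSocket proxying that this brief mentions.
WATCHED = (
    "proxy_http_version",
    "proxy_buffering",
    "proxy_read_timeout",
    "proxy_send_timeout",
)

# Matches simple one-line directives of the form "name args;".
# Block openers such as "location /ws {" and comment lines do not match.
DIRECTIVE_RE = re.compile(r"^[ \t]*([a-z_0-9]+)[ \t]+([^;{}\n]*);", re.MULTILINE)

# The origin /ws block quoted above, used as sample input.
ORIGIN_WS_BLOCK = """
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
"""


def audit_directives(config_text: str, watched=WATCHED) -> dict:
    """Map each watched directive to its list of argument strings, or None if absent."""
    found = {}
    for name, args in DIRECTIVE_RE.findall(config_text):
        found.setdefault(name, []).append(args.strip())
    return {name: found.get(name) for name in watched}


if __name__ == "__main__":
    for name, values in audit_directives(ORIGIN_WS_BLOCK).items():
        print(f"{name}: {values if values is not None else 'NOT SET'}")
```

Running the same function over the NPM-generated block on .20 would show whether the two layers disagree on buffering or timeouts, which is the comparison we have not yet made.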

What this changes / things to reconsider

The "asymmetry" symptom is unchanged and is still the core puzzle. For affected agents, the server keeps receiving inbound traffic (heartbeats arrive, last_seen stays fresh), and the agent does NOT trip its 90s no-inbound reconnect, so it appears to be receiving at least the server's 30s WS Ping frames. Yet server->agent COMMAND (Text) frames are never acked or executed; this holds for both ~4 KB commands AND ~80 B hostname commands. One command DID succeed earlier (right after the agent updated/reconnected). A server-forced eviction followed by a fresh reconnect did NOT restore command delivery (a tiny command on the fresh connection still never acked), but note we have NO confirmation that the agent received anything server->agent on that fresh connection either.
Affected: PST-SERVER + PST-SERVER2 (same site/UDR), and GTS-PEDRO-H (different site). ~40 other agents deliver commands fine through this SAME NPM+nginx path.
Because rmm-api is grey-cloud, the agent's TLS connection goes straight from the endpoint's gateway to the public IP — there is no CDN normalizing MTU/keepalive/buffering on the path; the endpoint's own gateway (e.g. PST's UDR) is directly in the path.
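
Since the symptom distinguishes Ping frames (apparently delivered) from Text frames (apparently not), it helps to be able to label frames seen on the plaintext .20->.30:80 leg by opcode and size. The following Python sketch parses a WebSocket frame header following the RFC 6455 layout. It assumes the input bytes are TCP payload already extracted from a capture and aligned to a frame boundary, and it ignores the RSV bits (so it does not detect per-message compression). parse_frame_header is a hypothetical helper, not GuruRMM code.

```python
from typing import NamedTuple

# Opcode values defined by RFC 6455; anything else is reported as "reserved".
OPCODES = {
    0x0: "continuation", 0x1: "text", 0x2: "binary",
    0x8: "close", 0x9: "ping", 0xA: "pong",
}


class FrameHeader(NamedTuple):
    fin: bool          # final fragment of a message
    opcode: str        # human-readable opcode name
    masked: bool       # True for client->server frames, False for server->client
    payload_len: int   # declared payload length in bytes
    header_len: int    # bytes occupied by the header, including any mask key


def parse_frame_header(data: bytes) -> FrameHeader:
    """Parse the header of a single WebSocket frame starting at data[0]."""
    if len(data) < 2:
        raise ValueError("need at least 2 bytes for a frame header")
    b0, b1 = data[0], data[1]
    fin = bool(b0 & 0x80)
    opcode = OPCODES.get(b0 & 0x0F, "reserved")
    masked = bool(b1 & 0x80)
    length = b1 & 0x7F
    offset = 2
    if length == 126:            # 16-bit extended length follows
        if len(data) < 4:
            raise ValueError("truncated 16-bit extended length")
        length = int.from_bytes(data[2:4], "big")
        offset = 4
    elif length == 127:          # 64-bit extended length follows
        if len(data) < 10:
            raise ValueError("truncated 64-bit extended length")
        length = int.from_bytes(data[2:10], "big")
        offset = 10
    if masked:
        offset += 4              # 4-byte masking key precedes the payload
    return FrameHeader(fin, opcode, masked, length, offset)


if __name__ == "__main__":
    # Unmasked server->agent frames: an empty Ping and an 80-byte Text frame.
    print(parse_frame_header(b"\x89\x00"))
    print(parse_frame_header(b"\x81\x50" + b"x" * 80))
```

On the plaintext leg, server->agent frames are the unmasked ones, so a capture that shows unmasked ping headers but no unmasked text headers during a marked dispatch would localize the loss upstream of NPM rather than at the endpoint gateway.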

A note on one earlier "test"

We tried to tcpdump the origin loopback (nginx->Rust on 127.0.0.1:3001) during a marked command dispatch and saw 0 packets — but that capture almost certainly FAILED (the backgrounded tcpdump was killed when the SSH session closed), so treat it as NO DATA, not as evidence that the loopback is silent.
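
To avoid repeating that failure mode, the capture can be started in its own session so that it is not tied to the SSH login that launched it. A minimal Python sketch, with assumptions labeled: the interface name, port, and output path are illustrative values, build_tcpdump_cmd and launch_detached are hypothetical helpers, and the script is assumed to run with enough privilege to capture packets.

```python
import subprocess


def build_tcpdump_cmd(interface: str, port: int, pcap_path: str) -> list:
    """Build a tcpdump argv that writes packets for one TCP port to a pcap file."""
    return [
        "tcpdump",
        "-i", interface,   # capture interface, e.g. "lo" for the loopback leg
        "-n",              # do not resolve addresses to names
        "-U",              # flush each packet to the file as it is captured
        "-s", "0",         # capture full packets rather than a truncated snaplen
        "-w", pcap_path,   # write raw packets to this file
        "tcp", "port", str(port),
    ]


def launch_detached(cmd: list) -> subprocess.Popen:
    """Start cmd in a new session, detached from the caller's terminal.

    start_new_session=True makes the child call setsid(), so it has no
    controlling terminal and should not receive the hangup that is sent
    when the SSH session closes.
    """
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


# Usage on the .30 host (path is an illustrative assumption):
#   proc = launch_detached(build_tcpdump_cmd("lo", 3001, "/tmp/rmm-loopback.pcap"))
#   print(proc.pid)   # record the PID so the capture can be stopped later
```

Confirming that the pcap file is growing before dispatching the marked command would turn the next attempt into real data instead of another NO DATA result.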

Please now provide (round 2)

Your REVISED ranked hypotheses given Cloudflare is out of the agent path and there are two nginx layers + the endpoint gateway directly in the path. Explicitly say which round-1 hypotheses you drop or demote.
The single most diagnostic next test, given these constraints: we can run packet captures on the .30 host (both the plaintext .20->.30:80 leg and the 127.0.0.1:3001 loopback); we can read NPM's generated nginx config on .20; we can query the UDR; we currently CANNOT get shell/logs on the affected endpoints. For each plausible outcome, state what it tells you.

YOUR ROUND-1 ANSWER (for reference)

ROUND1
