Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 2 (important topology correction)

You analyzed this in round 1 (your prior answer is included at the bottom). A FACTUAL CORRECTION to the architecture has come to light that invalidates part of the round-1 premise. Please re-evaluate from scratch where needed. Be willing to discard your prior top hypothesis.

CRITICAL CORRECTION to the ingress topology

In round 1 we told you: "agent -> Cloudflare -> nginx -> Rust server". That was wrong for the AGENT path. Verified facts now:

The agent's hard-coded WebSocket URL is wss://rmm-api.azcomputerguru.com/ws (from agent source: DEFAULT_SERVER_URL, and the agent config default). The installer and enrollment also use rmm-api.azcomputerguru.com.
DNS / Cloudflare proxy status for the two hostnames (the A record for each points to the same public IP, 72.194.62.10):
- rmm.azcomputerguru.com -> proxied = true (orange cloud; goes THROUGH Cloudflare) — this is the human DASHBOARD.
- rmm-api.azcomputerguru.com -> proxied = false (grey cloud; DNS-only, BYPASSES Cloudflare) — this is what the AGENTS use.
Therefore Cloudflare is NOT in the agent's path at all. All round-1 hypotheses about Cloudflare WAF / CF WebSocket buffering / CF frame handling are moot for agent command delivery. (Cloudflare only fronts the dashboard.)
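
The orange-cloud / grey-cloud distinction can be re-checked from any outside resolver: a proxied hostname publicly resolves to Cloudflare edge addresses, while a DNS-only hostname resolves to the record's own IP. A minimal sketch of that check follows. The range list is a recalled snapshot of Cloudflare's published IPv4 ranges (an assumption, possibly stale; refresh it from Cloudflare's published list before relying on it), and `is_cloudflare_ip` and `classify` are hypothetical helper names.

```python
import ipaddress

# ASSUMPTION: snapshot of Cloudflare's published IPv4 ranges; may be out of date.
CLOUDFLARE_V4 = [
    "173.245.48.0/20", "103.21.244.0/22", "103.22.200.0/22", "103.31.4.0/22",
    "141.101.64.0/18", "108.162.192.0/18", "190.93.240.0/20", "188.114.96.0/20",
    "197.234.240.0/22", "198.41.128.0/17", "162.158.0.0/15", "104.16.0.0/13",
    "104.24.0.0/14", "172.64.0.0/13", "131.0.72.0/22",
]
_NETS = [ipaddress.ip_network(cidr) for cidr in CLOUDFLARE_V4]


def is_cloudflare_ip(ip):
    """True if the address falls inside one of the listed Cloudflare ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in _NETS)


def classify(resolved_ips):
    """Classify a hostname from the addresses a public resolver returned for it."""
    flags = [is_cloudflare_ip(ip) for ip in resolved_ips]
    if not flags:
        return "no-answer"
    if all(flags):
        return "proxied"    # orange cloud: traffic enters via Cloudflare's edge
    if not any(flags):
        return "dns-only"   # grey cloud: traffic goes straight to the listed IP
    return "mixed"


# The address would normally come from socket.getaddrinfo() or dig; it is
# hard-coded here so the sketch runs offline.
print(classify(["72.194.62.10"]))
```

If rmm-api.azcomputerguru.com resolves publicly to 72.194.62.10, this reports `dns-only`, which matches the corrected topology.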

The ACTUAL agent path (verified)

agent (endpoint, e.g. PST-SERVER behind a UniFi UDR / "Cloud Gateway Ultra"-class gateway)
  -> endpoint LAN/NAT
  -> public internet
  -> 72.194.62.10  (public IP; this is the NPM box)
  -> NPM = "Nginx Proxy Manager" on host 172.16.3.20  (terminates TLS; one nginx layer)
       NPM proxy-host settings for rmm-api: forward http://172.16.3.30:80, websockets=ON,
       http2_support=ON, block_exploits=ON (NPM "Block Common Exploits"), caching=OFF, advanced_config=EMPTY
  -> http://172.16.3.30:80   (the ORIGIN nginx; a SECOND nginx layer)  -- PLAINTEXT HTTP over the LAN here
  -> proxy_pass http://127.0.0.1:3001   (the Rust server)

So there are TWO nginx proxy layers in series (NPM on .20, then origin nginx on .30), no CDN.

Origin nginx /ws block (verbatim):

location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}

(The block has no explicit proxy_buffering off; directive and no proxy_send_timeout. NPM's generated config for the proxy host is the standard NPM template with websockets enabled + block_exploits; we have not yet dumped its exact directives.)
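
Once NPM's generated config is dumped, the same audit can be run on both layers. The sketch below is a rough line-based reader, not a real nginx parser: it lists which WebSocket-relevant directives a location block sets and which it leaves to defaults. The choice of directives to check and the names `parse_directives` and `audit` are assumptions made for illustration.

```python
# Origin /ws block, copied verbatim from the section above.
ORIGIN_WS_BLOCK = """
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
"""

# ASSUMPTION: these are the directives worth comparing between the two layers.
CHECKED_DIRECTIVES = ("proxy_http_version", "proxy_read_timeout",
                      "proxy_send_timeout", "proxy_buffering")
REQUIRED_HEADERS = ("Upgrade", "Connection")


def parse_directives(block):
    """Map directive name -> list of argument strings (one per occurrence)."""
    directives = {}
    for raw in block.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments (naive: ignores quoting)
        if not line.endswith(";"):
            continue                          # skips "location ... {" and "}"
        name, _, args = line[:-1].partition(" ")
        directives.setdefault(name, []).append(args.strip())
    return directives


def audit(block):
    """Report checked directives and upgrade headers the block does not set."""
    directives = parse_directives(block)
    headers = {a.split()[0] for a in directives.get("proxy_set_header", []) if a}
    return {
        "absent_directives": [d for d in CHECKED_DIRECTIVES if d not in directives],
        "absent_headers": [h for h in REQUIRED_HEADERS if h not in headers],
    }


print(audit(ORIGIN_WS_BLOCK))
```

An absent directive is not in itself a fault, since nginx then applies its default; the point is to diff the NPM layer on .20 against the origin layer on .30 with the same yardstick.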

What this changes / things to reconsider

The "asymmetry" symptom is unchanged and still the core puzzle. For affected agents the server keeps receiving inbound traffic (heartbeats arrive, last_seen stays fresh), and the agent does NOT trip its 90s no-inbound reconnect, so it appears to be receiving at least the server's 30s WS Ping frames. Yet server->agent COMMAND (Text) frames are never acked or executed; this holds for both ~4 KB commands AND ~80 B hostname commands. One command DID succeed earlier (right after the agent updated/reconnected). A server-forced eviction followed by a fresh reconnect did NOT restore command delivery (a tiny command on the fresh connection still never acked). Note, however, that we have NO confirmation the agent received anything server->agent on that fresh connection either.
Affected: PST-SERVER + PST-SERVER2 (same site/UDR), and GTS-PEDRO-H (different site). ~40 other agents deliver commands fine through this SAME NPM+nginx path.
Because rmm-api is grey-cloud, the agent's TLS connection goes straight from the endpoint's gateway to the public IP — there is no CDN normalizing MTU/keepalive/buffering on the path; the endpoint's own gateway (e.g. PST's UDR) is directly in the path.
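
The inference that affected agents still receive Pings rests on timer arithmetic: a 30s Ping interval keeps a 90s no-inbound deadline from ever expiring. The sketch below models only that arithmetic. It assumes the agent resets its deadline on any inbound frame, Pings included, which the symptom implies but which has not been confirmed against the agent source; `first_watchdog_trip` is a hypothetical name.

```python
def first_watchdog_trip(inbound_times, deadline=90.0, horizon=600.0):
    """Return the time (seconds after connect) at which a no-inbound deadline
    would first fire, or None if it never fires before `horizon`.

    ASSUMPTION: every inbound frame, including a WS Ping, resets the deadline.
    """
    last = 0.0                       # the connection itself counts as activity
    for t in sorted(inbound_times):
        if t > horizon:
            break
        if t - last > deadline:      # gap too long: deadline fired before t
            return last + deadline
        last = t
    if horizon - last > deadline:    # silence after the last inbound frame
        return last + deadline
    return None


pings_every_30s = [30.0 * i for i in range(1, 21)]       # 30s .. 600s
print(first_watchdog_trip(pings_every_30s))              # Pings arriving
print(first_watchdog_trip([]))                           # nothing arriving
print(first_watchdog_trip([30.0, 60.0, 90.0, 120.0]))    # Pings stop at 120s
```

Under this model an agent that stays connected for many minutes without reconnecting is either receiving some inbound frame at least every 90s or is not running the timer as designed, which are the two readings the round-1 answer could not separate.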

A note on one earlier "test"

We tried to tcpdump the origin loopback (nginx->Rust on 127.0.0.1:3001) during a marked command dispatch and saw 0 packets — but that capture almost certainly FAILED (the backgrounded tcpdump was killed when the SSH session closed), so treat it as NO DATA, not as evidence that the loopback is silent.
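
To avoid repeating that failure mode, the capture can be started in its own session so that closing the SSH session does not take it down. A minimal sketch, assuming a POSIX host with tcpdump installed and sufficient privileges; the output path is a placeholder, and `build_capture_argv` and `start_detached` are hypothetical helper names.

```python
import subprocess


def build_capture_argv(interface, port, out_path):
    """argv for a packet-buffered tcpdump capture of one TCP port."""
    return ["tcpdump", "-i", interface, "-n", "-U", "-w", out_path,
            "tcp", "port", str(port)]


def start_detached(argv):
    """Start a process in a new session with no ties to the caller's terminal."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,   # setsid(): a terminal hangup does not reach it
    )


if __name__ == "__main__":
    # Loopback leg (nginx -> Rust). Placeholder output path; adjust as needed.
    argv = build_capture_argv("lo", 3001, "/tmp/ws-loopback.pcap")
    proc = start_detached(argv)
    print("capture started, pid", proc.pid)
```

Recording the printed pid makes it possible to confirm later, from a new SSH session, that the capture is still running before a marked command is dispatched, and to stop it deliberately afterwards.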

Please now provide (round 2)

Your REVISED ranked hypotheses given Cloudflare is out of the agent path and there are two nginx layers + the endpoint gateway directly in the path. Explicitly say which round-1 hypotheses you drop or demote.
The single most diagnostic next test and, for each plausible outcome, what it tells you. Constraints: we can run packet captures on the .30 host (both the plaintext .20->.30:80 leg and the 127.0.0.1:3001 loopback), read NPM's generated nginx config on .20, and query the UDR; we currently CANNOT get shell/logs on the affected endpoints.
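
Because the .20->.30:80 leg is plaintext, WebSocket frames in a capture of that leg can be classified directly, which shows whether a command's Text frame leaves the origin toward NPM at all. The sketch below decodes a frame header following the RFC 6455 layout. It assumes the bytes come from a reassembled TCP stream after the HTTP 101 upgrade and that no extension such as permessage-deflate is in use; neither has been verified for this deployment. `parse_frame_header` is a hypothetical name.

```python
import struct

OPCODES = {0x0: "continuation", 0x1: "text", 0x2: "binary",
           0x8: "close", 0x9: "ping", 0xA: "pong"}


def parse_frame_header(data):
    """Decode the header of one WebSocket frame (RFC 6455, section 5.2)."""
    if len(data) < 2:
        raise ValueError("need at least 2 bytes for a frame header")
    b0, b1 = data[0], data[1]
    opcode = b0 & 0x0F
    masked = bool(b1 & 0x80)        # client->server frames are masked
    length = b1 & 0x7F
    offset = 2
    if length == 126:               # 16-bit extended payload length
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
    elif length == 127:             # 64-bit extended payload length
        (length,) = struct.unpack_from(">Q", data, offset)
        offset += 8
    if masked:
        offset += 4                 # masking key precedes the payload
    return {
        "fin": bool(b0 & 0x80),
        "opcode": OPCODES.get(opcode, hex(opcode)),
        "masked": masked,
        "payload_len": length,
        "header_len": offset,
    }


# Server->agent frames are unmasked: an empty Ping, then a 4096-byte Text frame.
print(parse_frame_header(b"\x89\x00"))
print(parse_frame_header(b"\x81\x7e" + struct.pack(">H", 4096)))
```

In a capture taken during a marked dispatch, an unmasked `text` frame of roughly the command's size heading toward .20 would place the loss downstream of the origin (NPM, the public path, the endpoint gateway, or the agent), while Pings without any such Text frame would point back at the origin nginx or the Rust server.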

YOUR ROUND-1 ANSWER (for reference)

GuruRMM — agent command delivery failure: diagnostic dossier (neutral)

The evidence suggests a highly selective delivery failure where outbound (Server -> Agent) WebSocket Text frames are being dropped, mangled, or ignored, while WebSocket Control frames (Ping) and inbound (Agent -> Server) traffic remain functional.

1. Candidate Root-Cause Hypotheses

| Hypothesis | Reasoning / Evidence FOR | Reasoning / Evidence AGAINST |
| --- | --- | --- |
| H1: MTU / Fragmentation (Large Frame Blackhole) | Common in site-to-site tunnels (UniFi UDR). Large commands (4KB) + overhead might exceed path MTU, causing fragmentation. If the path drops fragments, the Text frame never arrives. | Refuted by T2/T3/T9: Tiny `hostname` commands (~80B) also failed. MTU issues usually spare small packets. |
| H2: Application-Layer Deadlock (Agent Rx Loop) | If the Agent's `handle_server_message` or the internal `tx` (to the executor) blocks or panics without crashing the read loop, the agent stops processing Text frames but continues to receive Pings (keeping the 90s deadline alive). | Weakened by T7/T8: A server-forced eviction and fresh reconnect (fresh Agent state) did not restore functionality. |
| H3: Intermediate Proxy / Firewall (Text-Frame Filtering) | Cloudflare or a local "Security Gateway" (UniFi) may be inspecting WebSocket payloads. If the Command JSON pattern matches a signature or if the `Upgrade` handshake is being "half-policed," Text frames might be dropped. | The fleet baseline (40 agents working) suggests it's not a global Cloudflare issue, but could be specific to "enhanced" security profiles at certain sites. |
| H4: Protocol Desync (ConfigUpdate Serialization) | On connect, the Server sends a `ConfigUpdate` (multi-KB). If this frame is malformed or if `tokio-tungstenite` on the server-side has a buffer issue with specific frame sequences, the Agent's parser might be stuck/errored for that specific stream. | Supported by T8: T8 confirms that even after a reconnect, delivery fails. This points to a failure during or immediately after the handshake/ConfigUpdate sequence. |
| H5: Nginx/Proxy Buffer "Clogging" | `proxy_read_timeout` is long, but `proxy_buffers` or internal memory limits on the origin/nginx could be hit for specific connections if the socket becomes "write-blocked" from the origin to Nginx. | Against: Nginx logs would typically show 499s or buffer overflows; `send_to` on the server returns `success`, meaning it's reaching the OS/loopback buffer. |

2. The Verdict on the Evidence

Inference Check: The claim that "the agent was receiving SOME inbound frames" (T4) is an unverified inference. While the 90s timeout didn't trigger, we cannot distinguish between "the agent received a Ping" and "the agent's internal timer is somehow stalled/incorrect."
Contradiction: T8 is the most troubling. If a fresh connection (after eviction) doesn't fix it, the issue is either environmental (Network/Path) or intrinsic to the Agent's 0.6.63 startup logic (e.g., it crashes its own Text handler immediately upon receiving the first ConfigUpdate or Command).

3. The SINGLE Most Diagnostic Test

Action: Dispatch a WebSocket Ping from the Server to a stuck agent and explicitly log the receipt of the corresponding Pong.

Reasoning:

If Pong is received: We confirm the path is fully transparent to WebSocket Control frames. This isolates the failure to Text-frame serialization, size, or content inspection (H3/H4).
If Pong is NOT received: The agent is "ghosting" the server entirely. This means the 90s timeout in the agent (T4) is the actual failure point — the agent should have reconnected but didn't. This points to an Agent-side Rx loop deadlock (H2).

Outcome-based path:

Pong Received: Inspect Nginx/Cloudflare for "Web Application Firewall" (WAF) drops or "WebSocket inspection" settings.
Pong NOT Received: The agent is dead-locked. Investigate the tokio task management in the Agent's 0.6.63 read loop.
