Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00

GuruRMM command-delivery diagnosis — ROUND 2 (important topology correction)

You analyzed this in round 1 (your prior answer is included at the bottom). A FACTUAL CORRECTION to the architecture has come to light that invalidates part of the round-1 premise. Please re-evaluate from scratch where needed. Be willing to discard your prior top hypothesis.

CRITICAL CORRECTION to the ingress topology

In round 1 we told you: "agent -> Cloudflare -> nginx -> Rust server". That was wrong for the AGENT path. Verified facts now:

The agent's hard-coded WebSocket URL is wss://rmm-api.azcomputerguru.com/ws (from agent source: DEFAULT_SERVER_URL, and the agent config default). The installer and enrollment also use rmm-api.azcomputerguru.com.
DNS / Cloudflare proxy status for the two hostnames (both A-record to the same public IP 72.194.62.10):
- rmm.azcomputerguru.com -> proxied = true (orange cloud; goes THROUGH Cloudflare) — this is the human DASHBOARD.
- rmm-api.azcomputerguru.com -> proxied = false (grey cloud; DNS-only, BYPASSES Cloudflare) — this is what the AGENTS use.
Therefore Cloudflare is NOT in the agent's path at all. All round-1 hypotheses about Cloudflare WAF / CF WebSocket buffering / CF frame handling are moot for agent command delivery. (Cloudflare only fronts the dashboard.)
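One way to re-check the proxied / DNS-only split is to test whether a resolved address falls inside a Cloudflare range. The sketch below is a minimal illustration, assuming a partial, possibly outdated sample of Cloudflare's published IPv4 ranges; the function name and the sample list are illustrative and should be refreshed from Cloudflare's own list before relying on the result.

```python
import ipaddress

# ASSUMPTION: partial sample of Cloudflare's published IPv4 ranges.
# It may be incomplete or out of date; refresh it before trusting a negative result.
CLOUDFLARE_V4_SAMPLE = [
    "173.245.48.0/20",
    "103.21.244.0/22",
    "141.101.64.0/18",
    "108.162.192.0/18",
    "162.158.0.0/15",
    "104.16.0.0/13",
    "104.24.0.0/14",
    "172.64.0.0/13",
]


def is_cloudflare_ip(ip, ranges=CLOUDFLARE_V4_SAMPLE):
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


# The public IP named above is not inside the sampled Cloudflare ranges.
origin_is_cloudflare = is_cloudflare_ip("72.194.62.10")
```

Running the same check on whatever a public resolver returns for each hostname should show the dashboard name landing in a Cloudflare range and the agent name resolving straight to the origin IP, if the proxy flags are as described.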

The ACTUAL agent path (verified)

agent (endpoint, e.g. PST-SERVER behind a UniFi UDR/"Cloudflare-Ultra"-class gateway)
  -> endpoint LAN/NAT
  -> public internet
  -> 72.194.62.10  (public IP; this is the NPM box)
  -> NPM = "Nginx Proxy Manager" on host 172.16.3.20  (terminates TLS; one nginx layer)
       NPM proxy-host settings for rmm-api: forward http://172.16.3.30:80, websockets=ON,
       http2_support=ON, block_exploits=ON (NPM "Block Common Exploits"), caching=OFF, advanced_config=EMPTY
  -> http://172.16.3.30:80   (the ORIGIN nginx; a SECOND nginx layer)  -- PLAINTEXT HTTP over the LAN here
  -> proxy_pass http://127.0.0.1:3001   (the Rust server)

So there are TWO nginx proxy layers in series (NPM on .20, then origin nginx on .30), no CDN.

Origin nginx /ws block (verbatim):

location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}

(No explicit proxy_buffering off;, no proxy_send_timeout. NPM's generated config for the proxy host is the standard NPM template with websockets enabled + block_exploits; we have not yet dumped its exact directives.)
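When NPM's generated config is dumped, the same directive check can be applied to both layers mechanically. The following is a minimal sketch, assuming simple single-line `name value;` directives; the helper name and the directive list are illustrative, and the sample text is the origin /ws block quoted above.

```python
import re

# Directives of interest for a long-lived WebSocket proxy hop.
DIRECTIVES = (
    "proxy_http_version",
    "proxy_read_timeout",
    "proxy_send_timeout",
    "proxy_buffering",
)

ORIGIN_WS_BLOCK = """
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
"""


def find_directives(conf_text, names=DIRECTIVES):
    """Map each directive name to its first configured value, or None if absent."""
    found = {}
    for name in names:
        pattern = r"^\s*" + re.escape(name) + r"\s+([^;]+);"
        match = re.search(pattern, conf_text, re.MULTILINE)
        found[name] = match.group(1).strip() if match else None
    return found


origin_settings = find_directives(ORIGIN_WS_BLOCK)
```

A `None` entry only means the directive is not set in the text that was scanned; nginx then falls back to its default or to a value inherited from an enclosing block or include, which this sketch does not follow.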

What this changes / things to reconsider

The "asymmetry" symptom is unchanged and still the core puzzle: for affected agents the server keeps receiving inbound (heartbeats, last_seen fresh), the agent does NOT trip its 90s no-inbound reconnect (so it appears to be receiving at least the server's 30s WS Ping frames), yet server->agent COMMAND (Text) frames are never acked/executed — both ~4 KB commands AND ~80 B hostname commands. One command DID succeed earlier (right after the agent updated/reconnected). A server-forced eviction + fresh reconnect did NOT restore command delivery (a tiny command on the fresh connection still never acked) — but note we have NO confirmation the agent received anything server->agent on that fresh connection either.
Affected: PST-SERVER + PST-SERVER2 (same site/UDR), and GTS-PEDRO-H (different site). ~40 other agents deliver commands fine through this SAME NPM+nginx path.
Because rmm-api is grey-cloud, the agent's TLS connection goes straight from the endpoint's gateway to the public IP — there is no CDN normalizing MTU/keepalive/buffering on the path; the endpoint's own gateway (e.g. PST's UDR) is directly in the path.

A note on one earlier "test"

We tried to tcpdump the origin loopback (nginx->Rust on 127.0.0.1:3001) during a marked command dispatch and saw 0 packets — but that capture almost certainly FAILED (the backgrounded tcpdump was killed when the SSH session closed), so treat it as NO DATA, not as evidence that the loopback is silent.
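To avoid repeating that failure, the capture can be started in its own session so it is not tied to the SSH terminal. This is a sketch under stated assumptions: the loopback interface is named `lo`, the output path is hypothetical, and the process has permission to capture. Host policies that kill user processes at logout could still end it.

```python
import subprocess


def tcpdump_argv(interface, port, out_path):
    """Build a tcpdump command line that writes matching packets to a pcap file."""
    return [
        "tcpdump",
        "-i", interface,   # capture interface
        "-n",              # do not resolve names
        "-U",              # flush each packet to the file as it is written
        "-w", out_path,    # write raw packets to this file
        "tcp", "port", str(port),
    ]


def start_detached_capture(interface, port, out_path):
    """Start tcpdump in a new session so a terminal hangup does not signal it."""
    return subprocess.Popen(
        tcpdump_argv(interface, port, out_path),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # runs setsid() in the child
    )


# Hypothetical output path; nothing is launched by building the argument list.
loopback_argv = tcpdump_argv("lo", 3001, "/tmp/rmm-loopback.pcap")
```

Checking that the pcap file keeps growing after the SSH session is closed is a cheap way to confirm the capture survived before dispatching the marked command.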

Please now provide (round 2)

Your REVISED ranked hypotheses given Cloudflare is out of the agent path and there are two nginx layers + the endpoint gateway directly in the path. Explicitly say which round-1 hypotheses you drop or demote.
The single most diagnostic next test (we can run packet captures on the .30 host on the plaintext .20->.30:80 leg and on the 127.0.0.1:3001 loopback; we can read NPM's generated nginx config on .20; we can query the UDR; we currently CANNOT get shell/logs on the affected endpoints). For each plausible outcome, what it tells you.
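Because the .20->.30:80 leg is plaintext and server-to-client WebSocket frames are unmasked under RFC 6455, frame opcodes in that direction can be read straight from a capture. The sketch below parses frame headers from an already reassembled server-to-agent byte stream. It assumes the bytes start immediately after the HTTP 101 response headers; the function name is illustrative, and payloads may be unreadable if a compression extension was negotiated, although opcodes remain visible.

```python
import struct

OPCODES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


def parse_frames(data):
    """Return (opcode_name, fin, masked, payload_len) for each complete frame."""
    frames = []
    pos = 0
    while pos + 2 <= len(data):
        b0, b1 = data[pos], data[pos + 1]
        fin = bool(b0 & 0x80)
        opcode = b0 & 0x0F
        masked = bool(b1 & 0x80)
        length = b1 & 0x7F
        header = 2
        if length == 126:            # 16-bit extended payload length
            if pos + 4 > len(data):
                break
            (length,) = struct.unpack_from("!H", data, pos + 2)
            header = 4
        elif length == 127:          # 64-bit extended payload length
            if pos + 10 > len(data):
                break
            (length,) = struct.unpack_from("!Q", data, pos + 2)
            header = 10
        if masked:                   # 4-byte masking key (client-to-server frames)
            header += 4
        if pos + header + length > len(data):
            break                    # incomplete trailing frame
        frames.append((OPCODES.get(opcode, hex(opcode)), fin, masked, length))
        pos += header + length
    return frames


# Synthetic stream: one Ping, one 80-byte Text, one 4096-byte Text.
sample = (
    b"\x89\x00"
    + b"\x81\x50" + b"x" * 80
    + b"\x81\x7e" + struct.pack("!H", 4096) + b"y" * 4096
)
sample_frames = parse_frames(sample)
```

Applied to the real capture, a stream that shows Ping frames but no Text frame around the dispatch time would place the loss at or before the origin nginx; a stream that shows both would push the fault out to NPM, the WAN path, or the agent.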

YOUR ROUND-1 ANSWER (for reference)

Diagnostic analysis: GuruRMM command delivery failure

What the evidence actually proves

Established facts:

Agent → server path is healthy for affected agents: last_seen stays current, Heartbeats arrive, re-offer logic fires on Heartbeat.
Server believes it is dispatching: send_to() returns true, reaper logs re-delivery, no enqueue failures.
Affected agents never send CommandAck and never execute commands (large and tiny).
Most of the fleet (~40 agents, 397 commands in 6h) is unaffected on the same server build.
PST-SERVER had one successful acked command at 22:23:40 on 0.6.63; failure begins after the 00:22:03 reconnect.
Eviction → reconnect works mechanically on the server side, but commands still fail on the fresh connection (T8).

Unverified inferences (treat as hypotheses, not facts):

“Agent is receiving server Pings” — inferred only from no 90s agent-side reconnect (T4). No Pong logging on the server; server code ignores inbound Pong.
“send_to success means the command was sent” — it only means the in-memory channel accepted the message.
“ConfigUpdate was delivered on reconnect” — server logged Sent ConfigUpdate; no agent-side confirmation.
PST-SERVER, PST-SERVER2, and GTS-PEDRO-H share one root cause — GTS failed with attempts=1 at a 60s timeout, not the 10-attempt undeliverable pattern; it had been idle since 06-06. That may be a second failure mode.

The core puzzle: This is not a dead connection in the usual sense. It looks like a live, asymmetric channel: inbound app traffic works; outbound command delivery does not — yet the connection is stable enough that the agent does not trigger its 90s “no inbound activity” reconnect.

Ranked root-cause hypotheses

1. Server outbound Text writes succeed at enqueue but fail (or stall) at the socket — while Ping writes still work via `select!` interleaving

For:

send_to() only checks tx.send() on a (likely unbounded) channel, not socket delivery.
Send task uses one loop for Text and Ping; a stuck sender.send(Text).await does not necessarily kill the task immediately; Ping ticks can still fire when not blocked inside a Text send.
Would explain: heartbeats in, no CommandAck, no execution, reaper climbing attempts, “online” agent.
Fresh reconnect (T8) could still fail if the underlying TCP/WebSocket path is half-open for application Text but still passes occasional Ping frames.

Against:

If the send task were permanently blocked on Text, Pings would stop and the agent should reconnect every ~90s. T4 shows a 30+ minute stable connection — so either Pings are getting through, or the 90s inference is wrong.
After eviction, a brand-new TCP session should reset middlebox state; T8 still fails, which weakens “stuck write on old socket” unless the pathology is immediate on every new session at that site.

Verdict: Strong, but needs socket-level write confirmation. The contradiction around 90s stability is the main gap.
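The gap between "enqueue succeeded" and "bytes were written" can be shown with a toy model. This is a Python analogue for illustration only, not the server's Rust code; the class and method names are hypothetical stand-ins for the per-agent channel and send task.

```python
import queue


class Connection:
    """Toy per-agent outbound channel with a separate drain step."""

    def __init__(self):
        self.outbox = queue.Queue()   # unbounded, like the suspected channel
        self.bytes_written = 0

    def send_to(self, message):
        """Enqueue only: returns True whether or not anything drains the queue."""
        self.outbox.put_nowait(message)
        return True

    def drain_once(self, socket_write):
        """Write one queued message via socket_write; return False if none queued."""
        try:
            message = self.outbox.get_nowait()
        except queue.Empty:
            return False
        data = message.encode()
        socket_write(data)            # may block or raise on a real socket
        self.bytes_written += len(data)
        return True


conn = Connection()
accepted = conn.send_to("hostname")   # True even though nothing has been written
pending_before = conn.outbox.qsize()
written_before = conn.bytes_written
```

In this model `accepted` says nothing about delivery until `drain_once` has run with a working writer, which is why the hypothesis calls for confirmation at the socket write rather than at the enqueue.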

2. Agent receives inbound frames (at least Pings) but server → agent Text frames are dropped or never parsed — agent-side or on-path selective loss

For:

Best fit for “connection alive, heartbeats out, commands never acked/executed.”
Agent ACKs on receipt before execution; no ACK + no execution ⇒ Command handler likely never ran on a parsed Command.
Tiny hostname (~80 B) fails (T2, T3, T8) — rules out payload size as the primary filter.
PST-SERVER regressed exactly at reconnect — first thing on a new session is AuthAck, ConfigUpdate, then command re-offers. We do not know whether any post-reconnect Text reached the agent.

Against:

WebSocket middleboxes rarely drop only Text opcode frames while passing Ping. Possible but unusual unless something is inspecting JSON content.
Does not by itself explain GTS-PEDRO-H at a different site unless this is a broader server bug in Text serialization for certain agents.

Verdict: Very plausible. “Ping works, Text doesn’t” is an inference, not a measurement — but the symptom shape matches.

3. Agent-side bug or state corruption in 0.6.63 (Command path broken; Heartbeat/metrics path fine)

For:

PST-SERVER2 has never acked a command on 0.6.63 — suggests a persistent local pathology, not a transient network blip.
PST-SERVER worked once, then failed after reconnect — consistent with handshake/ConfigUpdate leaving the command handler or ACK channel in a bad state.
Agent architecture likely separates heartbeat/metrics tasks from the read/command path; one can work while the other is broken.
Explains T8: new server connection does not help if the agent process state is wrong.

Against:

Hard to explain why 40 other agents on 0.6.63 are fine unless trigger is site-specific policy/config in ConfigUpdate.
Without agent logs, this is hard to distinguish from hypothesis #2.

Verdict: High plausibility, especially for PST-SERVER2. May combine with #2 (bad config triggers bug).

4. Site/gateway pathology (UniFi UDR “Cloudflare-Ultra” class) affecting server → client WebSocket application data

For:

PST-SERVER and PST-SERVER2 share site and gateway; both broken, overlapping timeframe.
UniFi + aggressive DNS/CF integration could plausibly affect long-lived WSS in non-obvious ways.

Against:

GTS-PEDRO-H is a different site and also fails — unless that’s unrelated (see #6).
Fleet-wide Cloudflare + nginx path works for 40 agents; edge config would need to be connection- or path-specific.
Tiny frames also fail — less consistent with typical MTU/DPI size limits.

Verdict: Plausible for PST pair alone; weak as a single explanation for all three agents.

5. Stale/wrong connection routing in server in-memory agent map (commands enqueued to a dead sender; heartbeats handled elsewhere)

For:

Classic split-brain pattern for connection registries.
send_to returning true on a queued-but-never-drained channel matches symptoms.

Against:

Eviction explicitly closed the connection and re-ran handshake (T7); map should have been replaced.
Single nginx → single Rust server process; no load-balancer ambiguity described.
If receive task is dead, last_seen would not update — it does.

Verdict: Lower plausibility after T7/T8, unless there is a subtle duplicate-registration bug not ruled out by eviction.

6. Multiple independent failure modes grouped by coincidence

For:

GTS-PEDRO-H: 6-day command gap, attempts=1, 60s timeout — different from PST’s climb to 10 “undeliverable.”
Pre-06-12 successes with acked=false are expected (pre-CommandAck).
One test on GTS (T9) is thin evidence to merge it with PST.

Against:

All three are 0.6.63 and fail the same operator test (hostname, no ack).

Verdict: Do not assume one root cause for all three without more data. GTS may be a red herring or a second bug.

7. Command JSON / server message construction bug affecting only certain agents

For:

Could cause agent to silently drop unparsable Text while still handling other message types — if those other types were received, which we have not confirmed post-reconnect.

Against:

Same server code path completes 397 commands for other agents.
Would need agent-specific command content or policy embedding in the Command frame.

Verdict: Low unless ConfigUpdate policy differs per agent and corrupts something downstream.

Single most diagnostic next step

Instrument the server per-connection send task and receive loop for one affected agent (PST-SERVER): log (with monotonic timestamps and connection_id) every completed sender.send() by frame type (Text vs Ping), Text byte length and command_id, and every inbound Pong correlated to the preceding Ping; then dispatch one tiny hostname command.

No agent access required. Run for 2–3 minutes on the stuck agent.
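The logging described here could take roughly the following shape. This is a sketch in Python rather than the server's Rust, assuming hypothetical names throughout; it records each completed send with a monotonic timestamp and matches each inbound Pong to the oldest unanswered Ping.

```python
import time


class FrameLog:
    """Per-connection record of completed sends and Ping/Pong round trips."""

    def __init__(self, connection_id, clock=time.monotonic):
        self.connection_id = connection_id
        self.clock = clock
        self.records = []
        self._pending_pings = []      # send times of Pings with no Pong yet

    def sent(self, frame_type, length=0, command_id=None):
        """Call only after the socket write for this frame has completed."""
        now = self.clock()
        self.records.append({
            "t": now, "conn": self.connection_id, "dir": "out",
            "type": frame_type, "len": length, "command_id": command_id,
        })
        if frame_type == "ping":
            self._pending_pings.append(now)

    def pong_received(self):
        """Record an inbound Pong; return its round-trip time, or None if unmatched."""
        now = self.clock()
        rtt = None
        if self._pending_pings:
            rtt = now - self._pending_pings.pop(0)
        self.records.append({
            "t": now, "conn": self.connection_id, "dir": "in",
            "type": "pong", "rtt": rtt,
        })
        return rtt

    def unanswered_pings(self):
        return len(self._pending_pings)


# Deterministic demo clock standing in for time.monotonic.
ticks = iter([0.0, 0.5, 30.0, 31.0])
log = FrameLog("conn-1", clock=lambda: next(ticks))
log.sent("ping")
first_rtt = log.pong_received()
log.sent("text", length=80, command_id="cmd-1")
log.sent("ping")
```

A growing `unanswered_pings()` count on an agent that still looks online would directly contradict the "agent receives Pings" inference.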

| Outcome | Interpretation |
|---|---|
| Ping writes succeed + inbound Pongs observed; Text write for `hostname` logs success; still no `CommandAck` | Server → socket → nginx path delivers control frames; agent is not processing server Text (agent bug, or on-path Text opcode loss after origin). Strongest signal for #2 / #3. |
| Ping writes succeed + Pongs observed; Text write errors or never completes | Server-side or loopback socket pathology on application Text (#1). |
| Ping writes logged but no inbound Pongs | The “agent receives Pings” inference was wrong; connection is stranger than modeled: revisit half-open TCP, nginx connection mapping, or agent timeout logic (#1 / #5). |
| Neither Ping writes nor Pongs, yet agent stays “online” and heartbeats continue | Split connection or duplicate session: inbound and outbound are not the same WebSocket the send task thinks (#5). Escalate to connection registry + nginx upstream connection audit. |
| Text write succeeds on fresh post-eviction `connection_id`, same as above | Rules out “stale socket only”; points to persistent agent-side or site-path Text loss (#2/#3/#4). |

That one test separates “server thinks it sent” from “bytes left the process,” validates or falsifies the Ping inference, and narrows the fault domain to origin socket / agent / middlebox without needing endpoint access first.

Secondary high-value data (if the above is already partially instrumented): nginx tcpdump or OpenResty debug on loopback :3001 during a hostname dispatch — confirms whether Text frames leave the Rust process into nginx. That does not prove WAN delivery, but it collapses #1 vs #2/#3 on the origin host in one capture.
