# GuruRMM command-delivery diagnosis — ROUND 2 (important topology correction)

You analyzed this in round 1 (your prior answer is included at the bottom). A FACTUAL CORRECTION
to the architecture has come to light that invalidates part of the round-1 premise. Please
re-evaluate from scratch where needed. Be willing to discard your prior top hypothesis.

## CRITICAL CORRECTION to the ingress topology

In round 1 we told you: "agent -> Cloudflare -> nginx -> Rust server". **That was wrong for the
AGENT path.** Verified facts now:

- The agent's hard-coded WebSocket URL is **`wss://rmm-api.azcomputerguru.com/ws`** (from agent source:
  `DEFAULT_SERVER_URL`, and the agent config default). The installer and enrollment also use
  `rmm-api.azcomputerguru.com`.
- DNS / Cloudflare proxy status for the two hostnames (both have A records pointing to the same public IP, 72.194.62.10):
  - `rmm.azcomputerguru.com`     -> **proxied = true**  (orange cloud; goes THROUGH Cloudflare)  — this is the human DASHBOARD.
  - `rmm-api.azcomputerguru.com` -> **proxied = false** (grey cloud; **DNS-only, BYPASSES Cloudflare**) — this is what the AGENTS use.
- Therefore **Cloudflare is NOT in the agent's path at all.** All round-1 hypotheses about Cloudflare
  WAF / CF WebSocket buffering / CF frame handling are moot for agent command delivery. (Cloudflare only
  fronts the dashboard.)
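
The grey-cloud claim can be re-checked from any outside host by resolving each hostname and testing whether the answer falls inside Cloudflare's address space. A minimal sketch, assuming the IPv4 ranges below are still current (they are a snapshot of Cloudflare's published list and may be stale; re-fetch before relying on them). `is_cloudflare_ip` and `resolve_v4` are hypothetical helper names, not part of GuruRMM.

```python
import ipaddress
import socket

# ASSUMPTION: snapshot of Cloudflare's published IPv4 ranges; may be out of date.
CLOUDFLARE_V4 = [ipaddress.ip_network(n) for n in (
    "173.245.48.0/20", "103.21.244.0/22", "103.22.200.0/22", "103.31.4.0/22",
    "141.101.64.0/18", "108.162.192.0/18", "190.93.240.0/20", "188.114.96.0/20",
    "197.234.240.0/22", "198.41.128.0/17", "162.158.0.0/15", "104.16.0.0/13",
    "104.24.0.0/14", "172.64.0.0/13", "131.0.72.0/22",
)]


def is_cloudflare_ip(addr: str) -> bool:
    """True if addr lies inside one of the snapshot Cloudflare ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CLOUDFLARE_V4)


def resolve_v4(hostname: str) -> list:
    """Resolve a hostname to its unique IPv4 addresses (needs working DNS)."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def proxied_report(hostnames) -> dict:
    """Map each hostname to (addresses, any_address_is_cloudflare)."""
    report = {}
    for host in hostnames:
        addrs = resolve_v4(host)
        report[host] = (addrs, any(is_cloudflare_ip(a) for a in addrs))
    return report
```

If the topology above is right, `proxied_report(["rmm.azcomputerguru.com", "rmm-api.azcomputerguru.com"])` should flag only the dashboard hostname as Cloudflare-fronted, and the agent hostname should resolve straight to 72.194.62.10.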

## The ACTUAL agent path (verified)

```
agent (endpoint, e.g. PST-SERVER behind a UniFi UDR/"Cloudflare-Ultra"-class gateway)
  -> endpoint LAN/NAT
  -> public internet
  -> 72.194.62.10  (public IP; this is the NPM box)
  -> NPM = "Nginx Proxy Manager" on host 172.16.3.20  (terminates TLS; one nginx layer)
       NPM proxy-host settings for rmm-api: forward http://172.16.3.30:80, websockets=ON,
       http2_support=ON, block_exploits=ON (NPM "Block Common Exploits"), caching=OFF, advanced_config=EMPTY
  -> http://172.16.3.30:80   (the ORIGIN nginx; a SECOND nginx layer)  -- PLAINTEXT HTTP over the LAN here
  -> proxy_pass http://127.0.0.1:3001   (the Rust server)
```

So there are **TWO nginx proxy layers** in series (NPM on .20, then origin nginx on .30), no CDN.

Origin nginx `/ws` block (verbatim):
```
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
```
(The block sets neither `proxy_buffering off;` nor `proxy_send_timeout`, so nginx's defaults apply; `proxy_send_timeout`
defaults to 60s. NPM's generated config for the proxy host is presumably the standard NPM template with websockets
enabled + block_exploits, but we have not yet dumped its exact directives.)
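
Since the same question (which proxy directives are set, which fall back to defaults) has to be asked of both nginx layers, a small checker may help. This is only a sketch: it does line-based matching, not real nginx parsing, and `directive_values` is a hypothetical helper. The sample text is the origin `/ws` block quoted above.

```python
import re

# The origin /ws block exactly as quoted in this document.
ORIGIN_WS = """
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
"""

# Directives worth comparing between the two layers.
CHECKED = ("proxy_http_version", "proxy_read_timeout",
           "proxy_send_timeout", "proxy_buffering")


def directive_values(conf_text: str, names=CHECKED) -> dict:
    """Return {directive: value or None} for simple one-line directives.

    Naive on purpose: it ignores block nesting and treats any '#' as the
    start of a comment, so use it as a quick diff aid, not as a validator.
    """
    found = {name: None for name in names}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        match = re.match(r"(\w+)\s+(.*?);$", line)
        if match and match.group(1) in found:
            found[match.group(1)] = match.group(2)
    return found


def unset_directives(conf_text: str, names=CHECKED) -> list:
    """Directives from `names` that the config leaves at nginx defaults."""
    values = directive_values(conf_text, names)
    return [name for name in names if values[name] is None]
```

Running `unset_directives` on the NPM-generated file from .20, once it has been dumped, gives a like-for-like comparison with the origin block.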

## What this changes / things to reconsider

- The "asymmetry" symptom is unchanged and still the core puzzle. For affected agents:
  - the server keeps receiving inbound traffic (heartbeats arrive, last_seen stays fresh);
  - the agent does NOT trip its 90s no-inbound reconnect, so it appears to be receiving at least the
    server's 30s WS Ping frames;
  - yet server->agent COMMAND (Text) frames are never acked/executed, for both ~4 KB commands AND
    ~80 B `hostname` commands.
- One command DID succeed earlier (right after the agent updated/reconnected). A server-forced eviction +
  fresh reconnect did NOT restore command delivery (a tiny command on the fresh connection still never
  acked), but note we have NO confirmation the agent received anything server->agent on that fresh
  connection either.
- Affected: PST-SERVER + PST-SERVER2 (same site/UDR), and GTS-PEDRO-H (different site). ~40 other agents
  deliver commands fine through this SAME NPM+nginx path.
- Because rmm-api is grey-cloud, the agent's TLS connection goes straight from the endpoint's gateway to the
  public IP — there is no CDN normalizing MTU/keepalive/buffering on the path; the endpoint's own gateway
  (e.g. PST's UDR) is directly in the path.
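
The "agent is not tripping its 90s reconnect" inference can be made concrete with a small model of the watchdog. This is a sketch under a stated assumption: that ANY inbound frame, Ping included, resets the agent's timer. The agent's actual logic has not been confirmed here, and `first_watchdog_trip` is a hypothetical name.

```python
def first_watchdog_trip(inbound_times, timeout=90.0, start=0.0, end=None):
    """Return the time the no-inbound watchdog would first fire, or None.

    inbound_times: seconds at which the agent received any frame.
    ASSUMPTION: every inbound frame (including WS Ping) resets the timer.
    """
    last = start
    for t in sorted(inbound_times):
        if t - last > timeout:
            return last + timeout   # gap exceeded before this frame arrived
        last = t
    if end is not None and end - last > timeout:
        return last + timeout       # silence after the final frame
    return None


# Server Pings every 30s for 30 minutes, all delivered: the watchdog never fires.
all_pings = [30.0 * k for k in range(1, 61)]
stable = first_watchdog_trip(all_pings, end=1800.0)

# Nothing delivered server->agent at all: the watchdog fires at t=90s.
silent = first_watchdog_trip([], end=1800.0)
```

Under that assumption, a connection that stays up for 30+ minutes only shows that at least one inbound frame of some kind arrived in every 90s window. It does not show which frame types arrived, nor that every Ping got through.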

## A note on one earlier "test"

We tried to `tcpdump` the origin loopback (nginx->Rust on 127.0.0.1:3001) during a marked command dispatch and
saw 0 packets — but that capture almost certainly FAILED (the backgrounded tcpdump was killed when the SSH
session closed), so treat it as NO DATA, not as evidence that the loopback is silent.
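
One way to keep a repeat capture from dying with the SSH session is to start `tcpdump` in its own session, writing to a file. A minimal sketch, assuming a POSIX host, root privileges, and that the loopback interface is named `lo`; the helper names and file paths are hypothetical.

```python
import subprocess


def build_capture_cmd(interface: str, port: int, pcap_path: str) -> list:
    """tcpdump argv: no name resolution, full packets, flush per packet."""
    return ["tcpdump", "-i", interface, "-n", "-U", "-s", "0",
            "-w", pcap_path, "tcp", "port", str(port)]


def start_detached_capture(interface, port, pcap_path, log_path):
    """Launch tcpdump in a new session so a terminal hangup does not reach it.

    stderr goes to log_path so a failed start is visible afterwards instead
    of looking like an empty capture.
    """
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            build_capture_cmd(interface, port, pcap_path),
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=log,
            start_new_session=True,  # calls setsid() in the child
        )
```

For example, `proc = start_detached_capture("lo", 3001, "/tmp/ws-loopback.pcap", "/tmp/ws-loopback.log")`, then confirm `proc.poll() is None` a second later and check the log before dispatching the marked command. An empty pcap only counts as data if the capture is known to have been running.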

## Please now provide (round 2)

1. Your REVISED ranked hypotheses given Cloudflare is out of the agent path and there are two nginx layers +
   the endpoint gateway directly in the path. Explicitly say which round-1 hypotheses you drop or demote.
2. The single most diagnostic next test and, for each plausible outcome, what it tells you. What we can
   and cannot do:
   - we can run packet captures on the .30 host, both on the plaintext .20->.30:80 leg and on the
     127.0.0.1:3001 loopback;
   - we can read NPM's generated nginx config on .20;
   - we can query the UDR;
   - we currently CANNOT get shell/logs on the affected endpoints.
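
Because both capturable legs are plaintext, WebSocket frame headers can be read straight out of the reassembled server->agent byte stream (everything after the HTTP 101 response), which separates Ping frames from Text frames by opcode. A minimal sketch of an RFC 6455 header walker; it assumes the caller has already reassembled one direction of the TCP stream, and `parse_frames` is a hypothetical helper.

```python
import struct

OPCODES = {0x0: "continuation", 0x1: "text", 0x2: "binary",
           0x8: "close", 0x9: "ping", 0xA: "pong"}


def parse_frames(buf: bytes) -> list:
    """Walk WebSocket frame headers in a byte stream (RFC 6455 framing).

    Returns one dict per complete frame; stops at the first partial frame.
    Payloads are skipped, not unmasked or decompressed.
    """
    frames = []
    pos = 0
    while len(buf) - pos >= 2:
        b0, b1 = buf[pos], buf[pos + 1]
        opcode = b0 & 0x0F
        masked = bool(b1 & 0x80)       # client->server frames are masked
        length = b1 & 0x7F
        header = 2
        if length == 126:              # 16-bit extended payload length
            if len(buf) - pos < 4:
                break
            (length,) = struct.unpack_from("!H", buf, pos + 2)
            header = 4
        elif length == 127:            # 64-bit extended payload length
            if len(buf) - pos < 10:
                break
            (length,) = struct.unpack_from("!Q", buf, pos + 2)
            header = 10
        if masked:
            header += 4                # masking key follows the length
        if len(buf) - pos < header + length:
            break
        frames.append({
            "opcode": OPCODES.get(opcode, hex(opcode)),
            "fin": bool(b0 & 0x80),
            "masked": masked,
            "length": length,
        })
        pos += header + length
    return frames


# An empty Ping followed by a 5-byte unmasked Text frame.
sample = parse_frames(b"\x89\x00" + b"\x81\x05hello")
```

Comparing the opcode sequence seen on the 127.0.0.1:3001 loopback with the one seen on the .30->.20 leg during a marked `hostname` dispatch shows whether a Text frame left the Rust process and whether the origin nginx forwarded it.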

---
## YOUR ROUND-1 ANSWER (for reference)

# Diagnostic analysis: GuruRMM command delivery failure

## What the evidence actually proves

**Established facts:**
- Agent → server path is healthy for affected agents: `last_seen` stays current, Heartbeats arrive, re-offer logic fires on Heartbeat.
- Server believes it is dispatching: `send_to()` returns true, reaper logs re-delivery, no enqueue failures.
- Affected agents never send `CommandAck` and never execute commands (large and tiny).
- Most of the fleet (~40 agents, 397 commands in 6h) is unaffected on the same server build.
- PST-SERVER had one successful acked command at 22:23:40 on 0.6.63; failure begins after the 00:22:03 reconnect.
- Eviction → reconnect works mechanically on the server side, but commands still fail on the fresh connection (T8).

**Unverified inferences (treat as hypotheses, not facts):**
- “Agent is receiving server Pings” — inferred only from *no* 90s agent-side reconnect (T4). No Pong logging on the server; server code ignores inbound `Pong`.
- “`send_to` success means the command was sent” — it only means the in-memory channel accepted the message.
- “ConfigUpdate was delivered on reconnect” — server logged `Sent ConfigUpdate`; no agent-side confirmation.
- PST-SERVER, PST-SERVER2, and GTS-PEDRO-H share one root cause — GTS failed with `attempts=1` at a 60s timeout, not the 10-attempt undeliverable pattern; it had been idle since 06-06. That may be a second failure mode.
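
The gap between "enqueue succeeded" and "bytes were written" can be illustrated with a Python analogy. This is not the server's Rust code: the unbounded `asyncio.Queue` stands in for the channel, the local `send_to` for the server function of the same name, and the message string is a placeholder.

```python
import asyncio


def send_to(queue: asyncio.Queue, message: str) -> bool:
    """Report success as soon as the in-memory queue accepts the message."""
    try:
        queue.put_nowait(message)
        return True
    except asyncio.QueueFull:
        return False


async def demo():
    # Unbounded queue with NO consumer task draining it to a socket.
    queue = asyncio.Queue()
    accepted = send_to(queue, "hostname command (placeholder)")
    # The caller sees success even though nothing was ever written out.
    return accepted, queue.qsize()


result = asyncio.run(demo())
```

Here `result` is `(True, 1)`: the send "succeeded" while the message is still sitting in memory, which is why a true return value says nothing about socket delivery.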

**The core puzzle:** This is not a dead connection in the usual sense. It looks like a **live, asymmetric channel**: inbound app traffic works; outbound *command delivery* does not — yet the connection is stable enough that the agent does not trigger its 90s “no inbound activity” reconnect.

---

## Ranked root-cause hypotheses

### 1. Server outbound Text writes succeed at enqueue but fail (or stall) at the socket — while Ping writes still work via `select!` interleaving
**For:**
- `send_to()` only checks `tx.send()` on a (likely unbounded) channel, not socket delivery.
- Send task uses one loop for Text and Ping; a **stuck `sender.send(Text).await`** does not necessarily kill the task immediately; Ping ticks can still fire when not blocked inside a Text send.
- Would explain: heartbeats in, no CommandAck, no execution, reaper climbing attempts, “online” agent.
- Fresh reconnect (T8) could still fail if the underlying TCP/WebSocket path is half-open for application Text but still passes occasional Ping frames.

**Against:**
- If the send task were **permanently** blocked on Text, Pings would stop and the agent should reconnect every ~90s. T4 shows a 30+ minute stable connection — so either Pings are getting through, or the 90s inference is wrong.
- After eviction, a brand-new TCP session should reset middlebox state; T8 still fails, which weakens “stuck write on old socket” unless the pathology is immediate on every new session at that site.

**Verdict:** Strong, but needs socket-level write confirmation. The contradiction around 90s stability is the main gap.

---

### 2. Agent receives inbound frames (at least Pings) but server → agent **Text** frames are dropped or never parsed — agent-side or on-path selective loss
**For:**
- Best fit for “connection alive, heartbeats out, commands never acked/executed.”
- Agent ACKs on receipt *before* execution; no ACK + no execution ⇒ Command handler likely never ran on a parsed `Command`.
- Tiny `hostname` (~80 B) fails (T2, T3, T8) — rules out payload size as the primary filter.
- PST-SERVER regressed exactly at reconnect — first thing on a new session is `AuthAck`, `ConfigUpdate`, then command re-offers. We do not know whether *any* post-reconnect Text reached the agent.

**Against:**
- WebSocket middleboxes rarely drop only `Text` opcode frames while passing `Ping`. Possible but unusual unless something is inspecting JSON content.
- Does not by itself explain GTS-PEDRO-H at a different site unless this is a broader server bug in Text serialization for certain agents.

**Verdict:** Very plausible. “Ping works, Text doesn’t” is an inference, not a measurement — but the symptom shape matches.

---

### 3. Agent-side bug or state corruption in 0.6.63 (Command path broken; Heartbeat/metrics path fine)
**For:**
- PST-SERVER2 has **never** acked a command on 0.6.63 — suggests a persistent local pathology, not a transient network blip.
- PST-SERVER worked once, then failed after reconnect — consistent with handshake/`ConfigUpdate` leaving the command handler or ACK channel in a bad state.
- Agent architecture likely separates heartbeat/metrics tasks from the read/command path; one can work while the other is broken.
- Explains T8: new server connection does not help if the agent process state is wrong.

**Against:**
- Hard to explain why 40 other agents on 0.6.63 are fine unless trigger is site-specific policy/config in `ConfigUpdate`.
- Without agent logs, this is hard to distinguish from hypothesis #2.

**Verdict:** High plausibility, especially for PST-SERVER2. May combine with #2 (bad config triggers bug).

---

### 4. Site/gateway pathology (UniFi UDR “Cloudflare-Ultra” class) affecting server → client WebSocket application data
**For:**
- PST-SERVER and PST-SERVER2 share site and gateway; both broken, overlapping timeframe.
- UniFi + aggressive DNS/CF integration could plausibly affect long-lived WSS in non-obvious ways.

**Against:**
- GTS-PEDRO-H is a different site and also fails — unless that’s unrelated (see #6).
- Fleet-wide Cloudflare + nginx path works for 40 agents; edge config would need to be connection- or path-specific.
- Tiny frames also fail — less consistent with typical MTU/DPI size limits.

**Verdict:** Plausible for PST pair alone; weak as a single explanation for all three agents.

---

### 5. Stale/wrong connection routing in server in-memory agent map (commands enqueued to a dead sender; heartbeats handled elsewhere)
**For:**
- Classic split-brain pattern for connection registries.
- `send_to` returning true on a queued-but-never-drained channel matches symptoms.

**Against:**
- Eviction explicitly closed the connection and re-ran handshake (T7); map should have been replaced.
- Single nginx → single Rust server process; no load-balancer ambiguity described.
- If receive task is dead, `last_seen` would not update — it does.

**Verdict:** Lower plausibility after T7/T8, unless there is a subtle duplicate-registration bug not ruled out by eviction.

---

### 6. Multiple independent failure modes grouped by coincidence
**For:**
- GTS-PEDRO-H: 6-day command gap, `attempts=1`, 60s timeout — different from PST’s climb to 10 “undeliverable.”
- Pre-06-12 successes with `acked=false` are expected (pre-CommandAck).
- One test on GTS (T9) is thin evidence to merge it with PST.

**Against:**
- All three are 0.6.63 and fail the same operator test (`hostname`, no ack).

**Verdict:** Do not assume one root cause for all three without more data. GTS may be a red herring or a second bug.

---

### 7. Command JSON / server message construction bug affecting only certain agents
**For:**
- Could cause agent to silently drop unparsable `Text` while still handling other message types — if those other types were received, which we have not confirmed post-reconnect.

**Against:**
- Same server code path completes 397 commands for other agents.
- Would need agent-specific command content or policy embedding in the `Command` frame.

**Verdict:** Low unless `ConfigUpdate` policy differs per agent and corrupts something downstream.

---

## Single most diagnostic next step

**Instrument the server per-connection send task and receive loop for one affected agent (PST-SERVER): log (with monotonic timestamps and `connection_id`) every completed `sender.send()` by frame type (`Text` vs `Ping`), Text byte length and `command_id`, and every inbound `Pong` correlated to the preceding `Ping`; then dispatch one tiny `hostname` command.**

No agent access required. Run for 2–3 minutes on the stuck agent.

| Outcome | Interpretation |
|--------|----------------|
| **Ping writes succeed + inbound Pongs observed; Text write for `hostname` logs success; still no `CommandAck`** | Server → socket → nginx path delivers control frames; **agent is not processing server Text** (agent bug, or on-path Text opcode loss after origin). Strongest signal for #2 / #3. |
| **Ping writes succeed + Pongs observed; Text write errors or never completes** | **Server-side or loopback socket pathology** on application Text (#1). |
| **Ping writes logged but no inbound Pongs** | The “agent receives Pings” inference was wrong; connection is stranger than modeled — revisit half-open TCP, nginx connection mapping, or agent timeout logic (#1 / #5). |
| **Neither Ping writes nor Pongs, yet agent stays “online” and heartbeats continue** | **Split connection or duplicate session**: inbound and outbound are not the same WebSocket the send task thinks (#5). Escalate to connection registry + nginx upstream connection audit. |
| **Same as the first row (Ping writes + Pongs observed, Text write logs success, no `CommandAck`), but on a fresh post-eviction `connection_id`** | Rules out “stale socket only”; points to **persistent agent-side or site-path Text loss** (#2/#3/#4). |

That one test separates “server thinks it sent” from “bytes left the process,” validates or falsifies the Ping inference, and narrows the fault domain to **origin socket / agent / middlebox** without needing endpoint access first.
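
The Ping/Pong correlation in the instrumentation step could be computed from the logged events roughly as follows. A sketch only: it assumes each server Ping carries a unique payload (RFC 6455 requires the Pong to echo the Ping's application data), which may not be how the server currently sends Pings, and `correlate_pings` is a hypothetical helper.

```python
def correlate_pings(pings, pongs):
    """Match logged Pings to Pongs by payload.

    pings: list of (t_sent, payload) from the send task.
    pongs: list of (t_received, payload) from the receive loop.
    Returns {payload: round-trip seconds, or None if no Pong followed}.
    ASSUMPTION: every Ping payload is unique (e.g. a counter).
    """
    first_pong = {}
    for t_recv, payload in pongs:
        first_pong.setdefault(payload, t_recv)   # keep the earliest Pong
    result = {}
    for t_sent, payload in pings:
        t_recv = first_pong.get(payload)
        if t_recv is None or t_recv < t_sent:
            result[payload] = None               # unanswered Ping
        else:
            result[payload] = t_recv - t_sent
    return result


# Two Pings sent 30s apart; only the first was answered.
rtts = correlate_pings(
    pings=[(0.0, b"1"), (30.0, b"2")],
    pongs=[(0.25, b"1")],
)
```

A run of `None` values for an agent that still shows fresh `last_seen` would be the measured form of the "Ping writes logged but no inbound Pongs" row.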

**Secondary high-value data (if the above is already partially instrumented):** a `tcpdump` capture (or nginx debug logging) on the origin loopback `:3001` during a `hostname` dispatch, which confirms whether Text frames leave the Rust process into nginx. That does not prove WAN delivery, but it separates #1 from #2/#3 on the origin host in one capture.