Files
claudetools/tmp/r1-grok.txt
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38
2026-06-12 05:58:05 -07:00

140 lines
9.8 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Diagnostic analysis: GuruRMM command delivery failure
## What the evidence actually proves
**Established facts:**
- Agent → server path is healthy for affected agents: `last_seen` stays current, Heartbeats arrive, re-offer logic fires on Heartbeat.
- Server believes it is dispatching: `send_to()` returns true, reaper logs re-delivery, no enqueue failures.
- Affected agents never send `CommandAck` and never execute commands (large and tiny).
- Most of the fleet (~40 agents, 397 commands in 6h) is unaffected on the same server build.
- PST-SERVER had one successful acked command at 22:23:40 on 0.6.63; failure begins after the 00:22:03 reconnect.
- Eviction → reconnect works mechanically on the server side, but commands still fail on the fresh connection (T8).
**Unverified inferences (treat as hypotheses, not facts):**
- “Agent is receiving server Pings” — inferred only from *no* 90s agent-side reconnect (T4). No Pong logging on the server; server code ignores inbound `Pong`.
- “`send_to` success means the command was sent” — it only means the in-memory channel accepted the message.
- “ConfigUpdate was delivered on reconnect” — server logged `Sent ConfigUpdate`; no agent-side confirmation.
- PST-SERVER, PST-SERVER2, and GTS-PEDRO-H share one root cause — GTS failed with `attempts=1` at a 60s timeout, not the 10-attempt undeliverable pattern; it had been idle since 06-06. That may be a second failure mode.
**The core puzzle:** This is not a dead connection in the usual sense. It looks like a **live, asymmetric channel**: inbound app traffic works; outbound *command delivery* does not — yet the connection is stable enough that the agent does not trigger its 90s “no inbound activity” reconnect.
---
## Ranked root-cause hypotheses
### 1. Server outbound Text writes succeed at enqueue but fail (or stall) at the socket — while Ping writes still work via `select!` interleaving
**For:**
- `send_to()` only checks `tx.send()` on an (likely unbounded) channel, not socket delivery.
- Send task uses one loop for Text and Ping; a **stuck `sender.send(Text).await`** does not necessarily kill the task immediately; Ping ticks can still fire when not blocked inside a Text send.
- Would explain: heartbeats in, no CommandAck, no execution, reaper climbing attempts, “online” agent.
- Fresh reconnect (T8) could still fail if the underlying TCP/WebSocket path is half-open for application Text but still passes occasional Ping frames.
**Against:**
- If the send task were **permanently** blocked on Text, Pings would stop and the agent should reconnect every ~90s. T4 shows a 30+ minute stable connection — so either Pings are getting through, or the 90s inference is wrong.
- After eviction, a brand-new TCP session should reset middlebox state; T8 still fails, which weakens “stuck write on old socket” unless the pathology is immediate on every new session at that site.
**Verdict:** Strong, but needs socket-level write confirmation. The contradiction around 90s stability is the main gap.
---
### 2. Agent receives inbound frames (at least Pings) but server → agent **Text** frames are dropped or never parsed — agent-side or on-path selective loss
**For:**
- Best fit for “connection alive, heartbeats out, commands never acked/executed.”
- Agent ACKs on receipt *before* execution; no ACK + no execution ⇒ Command handler likely never ran on a parsed `Command`.
- Tiny `hostname` (~80 B) fails (T2, T3, T8) — rules out payload size as the primary filter.
- PST-SERVER regressed exactly at reconnect — first thing on a new session is `AuthAck`, `ConfigUpdate`, then command re-offers. We do not know whether *any* post-reconnect Text reached the agent.
**Against:**
- WebSocket middleboxes rarely drop only `Text` opcode frames while passing `Ping`. Possible but unusual unless something is inspecting JSON content.
- Does not by itself explain GTS-PEDRO-H at a different site unless this is a broader server bug in Text serialization for certain agents.
**Verdict:** Very plausible. “Ping works, Text doesnt” is an inference, not a measurement — but the symptom shape matches.
---
### 3. Agent-side bug or state corruption in 0.6.63 (Command path broken; Heartbeat/metrics path fine)
**For:**
- PST-SERVER2 has **never** acked a command on 0.6.63 — suggests a persistent local pathology, not a transient network blip.
- PST-SERVER worked once, then failed after reconnect — consistent with handshake/`ConfigUpdate` leaving the command handler or ACK channel in a bad state.
- Agent architecture likely separates heartbeat/metrics tasks from the read/command path; one can work while the other is broken.
- Explains T8: new server connection does not help if the agent process state is wrong.
**Against:**
- Hard to explain why 40 other agents on 0.6.63 are fine unless trigger is site-specific policy/config in `ConfigUpdate`.
- Without agent logs, this is hard to distinguish from hypothesis #2.
**Verdict:** High plausibility, especially for PST-SERVER2. May combine with #2 (bad config triggers bug).
---
### 4. Site/gateway pathology (UniFi UDR “Cloudflare-Ultra” class) affecting server → client WebSocket application data
**For:**
- PST-SERVER and PST-SERVER2 share site and gateway; both broken, overlapping timeframe.
- UniFi + aggressive DNS/CF integration could plausibly affect long-lived WSS in non-obvious ways.
**Against:**
- GTS-PEDRO-H is a different site and also fails — unless thats unrelated (see #6).
- Fleet-wide Cloudflare + nginx path works for 40 agents; edge config would need to be connection- or path-specific.
- Tiny frames also fail — less consistent with typical MTU/DPI size limits.
**Verdict:** Plausible for PST pair alone; weak as a single explanation for all three agents.
---
### 5. Stale/wrong connection routing in server in-memory agent map (commands enqueued to a dead sender; heartbeats handled elsewhere)
**For:**
- Classic split-brain pattern for connection registries.
- `send_to` returning true on a queued-but-never-drained channel matches symptoms.
**Against:**
- Eviction explicitly closed the connection and re-ran handshake (T7); map should have been replaced.
- Single nginx → single Rust server process; no load-balancer ambiguity described.
- If receive task is dead, `last_seen` would not update — it does.
**Verdict:** Lower plausibility after T7/T8, unless there is a subtle duplicate-registration bug not ruled out by eviction.
---
### 6. Multiple independent failure modes grouped by coincidence
**For:**
- GTS-PEDRO-H: 6-day command gap, `attempts=1`, 60s timeout — different from PSTs climb to 10 “undeliverable.”
- Pre-06-12 successes with `acked=false` are expected (pre-CommandAck).
- One test on GTS (T9) is thin evidence to merge it with PST.
**Against:**
- All three are 0.6.63 and fail the same operator test (`hostname`, no ack).
**Verdict:** Do not assume one root cause for all three without more data. GTS may be a red herring or a second bug.
---
### 7. Command JSON / server message construction bug affecting only certain agents
**For:**
- Could cause agent to silently drop unparsable `Text` while still handling other message types — if those other types were received, which we have not confirmed post-reconnect.
**Against:**
- Same server code path completes 397 commands for other agents.
- Would need agent-specific command content or policy embedding in the `Command` frame.
**Verdict:** Low unless `ConfigUpdate` policy differs per agent and corrupts something downstream.
---
## Single most diagnostic next step
**Instrument the server per-connection send task and receive loop for one affected agent (PST-SERVER): log (with monotonic timestamps and `connection_id`) every completed `sender.send()` by frame type (`Text` vs `Ping`), Text byte length and `command_id`, and every inbound `Pong` correlated to the preceding `Ping`; then dispatch one tiny `hostname` command.**
No agent access required. Run for 23 minutes on the stuck agent.
| Outcome | Interpretation |
|--------|----------------|
| **Ping writes succeed + inbound Pongs observed; Text write for `hostname` logs success; still no `CommandAck`** | Server → socket → nginx path delivers control frames; **agent is not processing server Text** (agent bug, or on-path Text opcode loss after origin). Strongest signal for #2 / #3. |
| **Ping writes succeed + Pongs observed; Text write errors or never completes** | **Server-side or loopback socket pathology** on application Text (#1). |
| **Ping writes logged but no inbound Pongs** | The “agent receives Pings” inference was wrong; connection is stranger than modeled — revisit half-open TCP, nginx connection mapping, or agent timeout logic (#1 / #5). |
| **Neither Ping writes nor Pongs, yet agent stays “online” and heartbeats continue** | **Split connection or duplicate session**: inbound and outbound are not the same WebSocket the send task thinks (#5). Escalate to connection registry + nginx upstream connection audit. |
| **Text write succeeds on fresh post-eviction `connection_id`, same as above** | Rules out “stale socket only”; points to **persistent agent-side or site-path Text loss** (#2/#3/#4). |
That one test separates “server thinks it sent” from “bytes left the process,” validates or falsifies the Ping inference, and narrows the fault domain to **origin socket / agent / middlebox** without needing endpoint access first.
**Secondary high-value data (if the above is already partially instrumented):** nginx `tcpdump` or OpenResty debug on loopback `:3001` during a `hostname` dispatch — confirms whether Text frames leave the Rust process into nginx. That does not prove WAN delivery, but it collapses #1 vs #2/#3 on the origin host in one capture.