140 lines
9.8 KiB
Plaintext
140 lines
9.8 KiB
Plaintext
# Diagnostic analysis: GuruRMM command delivery failure
|
||
|
||
## What the evidence actually proves
|
||
|
||
**Established facts:**
|
||
- Agent → server path is healthy for affected agents: `last_seen` stays current, Heartbeats arrive, re-offer logic fires on Heartbeat.
|
||
- Server believes it is dispatching: `send_to()` returns true, reaper logs re-delivery, no enqueue failures.
|
||
- Affected agents never send `CommandAck` and never execute commands (large and tiny).
|
||
- Most of the fleet (~40 agents, 397 commands in 6h) is unaffected on the same server build.
|
||
- PST-SERVER had one successful acked command at 22:23:40 on 0.6.63; failure begins after the 00:22:03 reconnect.
|
||
- Eviction → reconnect works mechanically on the server side, but commands still fail on the fresh connection (T8).
|
||
|
||
**Unverified inferences (treat as hypotheses, not facts):**
|
||
- “Agent is receiving server Pings” — inferred only from *no* 90s agent-side reconnect (T4). No Pong logging on the server; server code ignores inbound `Pong`.
|
||
- “`send_to` success means the command was sent” — it only means the in-memory channel accepted the message.
|
||
- “ConfigUpdate was delivered on reconnect” — server logged `Sent ConfigUpdate`; no agent-side confirmation.
|
||
- PST-SERVER, PST-SERVER2, and GTS-PEDRO-H share one root cause — GTS failed with `attempts=1` at a 60s timeout, not the 10-attempt undeliverable pattern; it had been idle since 06-06. That may be a second failure mode.
|
||
|
||
**The core puzzle:** This is not a dead connection in the usual sense. It looks like a **live, asymmetric channel**: inbound app traffic works; outbound *command delivery* does not — yet the connection is stable enough that the agent does not trigger its 90s “no inbound activity” reconnect.
|
||
|
||
---
|
||
|
||
## Ranked root-cause hypotheses
|
||
|
||
### 1. Server outbound Text writes succeed at enqueue but fail (or stall) at the socket — while Ping writes still work via `select!` interleaving
|
||
**For:**
|
||
- `send_to()` only checks `tx.send()` on an (likely unbounded) channel, not socket delivery.
|
||
- Send task uses one loop for Text and Ping; a **stuck `sender.send(Text).await`** does not necessarily kill the task immediately; Ping ticks can still fire when not blocked inside a Text send.
|
||
- Would explain: heartbeats in, no CommandAck, no execution, reaper climbing attempts, “online” agent.
|
||
- Fresh reconnect (T8) could still fail if the underlying TCP/WebSocket path is half-open for application Text but still passes occasional Ping frames.
|
||
|
||
**Against:**
|
||
- If the send task were **permanently** blocked on Text, Pings would stop and the agent should reconnect every ~90s. T4 shows a 30+ minute stable connection — so either Pings are getting through, or the 90s inference is wrong.
|
||
- After eviction, a brand-new TCP session should reset middlebox state; T8 still fails, which weakens “stuck write on old socket” unless the pathology is immediate on every new session at that site.
|
||
|
||
**Verdict:** Strong, but needs socket-level write confirmation. The contradiction around 90s stability is the main gap.
|
||
|
||
---
|
||
|
||
### 2. Agent receives inbound frames (at least Pings) but server → agent **Text** frames are dropped or never parsed — agent-side or on-path selective loss
|
||
**For:**
|
||
- Best fit for “connection alive, heartbeats out, commands never acked/executed.”
|
||
- Agent ACKs on receipt *before* execution; no ACK + no execution ⇒ Command handler likely never ran on a parsed `Command`.
|
||
- Tiny `hostname` (~80 B) fails (T2, T3, T8) — rules out payload size as the primary filter.
|
||
- PST-SERVER regressed exactly at reconnect — first thing on a new session is `AuthAck`, `ConfigUpdate`, then command re-offers. We do not know whether *any* post-reconnect Text reached the agent.
|
||
|
||
**Against:**
|
||
- WebSocket middleboxes rarely drop only `Text` opcode frames while passing `Ping`. Possible but unusual unless something is inspecting JSON content.
|
||
- Does not by itself explain GTS-PEDRO-H at a different site unless this is a broader server bug in Text serialization for certain agents.
|
||
|
||
**Verdict:** Very plausible. “Ping works, Text doesn’t” is an inference, not a measurement — but the symptom shape matches.
|
||
|
||
---
|
||
|
||
### 3. Agent-side bug or state corruption in 0.6.63 (Command path broken; Heartbeat/metrics path fine)
|
||
**For:**
|
||
- PST-SERVER2 has **never** acked a command on 0.6.63 — suggests a persistent local pathology, not a transient network blip.
|
||
- PST-SERVER worked once, then failed after reconnect — consistent with handshake/`ConfigUpdate` leaving the command handler or ACK channel in a bad state.
|
||
- Agent architecture likely separates heartbeat/metrics tasks from the read/command path; one can work while the other is broken.
|
||
- Explains T8: new server connection does not help if the agent process state is wrong.
|
||
|
||
**Against:**
|
||
- Hard to explain why 40 other agents on 0.6.63 are fine unless trigger is site-specific policy/config in `ConfigUpdate`.
|
||
- Without agent logs, this is hard to distinguish from hypothesis #2.
|
||
|
||
**Verdict:** High plausibility, especially for PST-SERVER2. May combine with #2 (bad config triggers bug).
|
||
|
||
---
|
||
|
||
### 4. Site/gateway pathology (UniFi UDR “Cloudflare-Ultra” class) affecting server → client WebSocket application data
|
||
**For:**
|
||
- PST-SERVER and PST-SERVER2 share site and gateway; both broken, overlapping timeframe.
|
||
- UniFi + aggressive DNS/CF integration could plausibly affect long-lived WSS in non-obvious ways.
|
||
|
||
**Against:**
|
||
- GTS-PEDRO-H is a different site and also fails — unless that’s unrelated (see #6).
|
||
- Fleet-wide Cloudflare + nginx path works for 40 agents; edge config would need to be connection- or path-specific.
|
||
- Tiny frames also fail — less consistent with typical MTU/DPI size limits.
|
||
|
||
**Verdict:** Plausible for PST pair alone; weak as a single explanation for all three agents.
|
||
|
||
---
|
||
|
||
### 5. Stale/wrong connection routing in server in-memory agent map (commands enqueued to a dead sender; heartbeats handled elsewhere)
|
||
**For:**
|
||
- Classic split-brain pattern for connection registries.
|
||
- `send_to` returning true on a queued-but-never-drained channel matches symptoms.
|
||
|
||
**Against:**
|
||
- Eviction explicitly closed the connection and re-ran handshake (T7); map should have been replaced.
|
||
- Single nginx → single Rust server process; no load-balancer ambiguity described.
|
||
- If receive task is dead, `last_seen` would not update — it does.
|
||
|
||
**Verdict:** Lower plausibility after T7/T8, unless there is a subtle duplicate-registration bug not ruled out by eviction.
|
||
|
||
---
|
||
|
||
### 6. Multiple independent failure modes grouped by coincidence
|
||
**For:**
|
||
- GTS-PEDRO-H: 6-day command gap, `attempts=1`, 60s timeout — different from PST’s climb to 10 “undeliverable.”
|
||
- Pre-06-12 successes with `acked=false` are expected (pre-CommandAck).
|
||
- One test on GTS (T9) is thin evidence to merge it with PST.
|
||
|
||
**Against:**
|
||
- All three are 0.6.63 and fail the same operator test (`hostname`, no ack).
|
||
|
||
**Verdict:** Do not assume one root cause for all three without more data. GTS may be a red herring or a second bug.
|
||
|
||
---
|
||
|
||
### 7. Command JSON / server message construction bug affecting only certain agents
|
||
**For:**
|
||
- Could cause agent to silently drop unparsable `Text` while still handling other message types — if those other types were received, which we have not confirmed post-reconnect.
|
||
|
||
**Against:**
|
||
- Same server code path completes 397 commands for other agents.
|
||
- Would need agent-specific command content or policy embedding in the `Command` frame.
|
||
|
||
**Verdict:** Low unless `ConfigUpdate` policy differs per agent and corrupts something downstream.
|
||
|
||
---
|
||
|
||
## Single most diagnostic next step
|
||
|
||
**Instrument the server per-connection send task and receive loop for one affected agent (PST-SERVER): log (with monotonic timestamps and `connection_id`) every completed `sender.send()` by frame type (`Text` vs `Ping`), Text byte length and `command_id`, and every inbound `Pong` correlated to the preceding `Ping`; then dispatch one tiny `hostname` command.**
|
||
|
||
No agent access required. Run for 2–3 minutes on the stuck agent.
|
||
|
||
| Outcome | Interpretation |
|
||
|--------|----------------|
|
||
| **Ping writes succeed + inbound Pongs observed; Text write for `hostname` logs success; still no `CommandAck`** | Server → socket → nginx path delivers control frames; **agent is not processing server Text** (agent bug, or on-path Text opcode loss after origin). Strongest signal for #2 / #3. |
|
||
| **Ping writes succeed + Pongs observed; Text write errors or never completes** | **Server-side or loopback socket pathology** on application Text (#1). |
|
||
| **Ping writes logged but no inbound Pongs** | The “agent receives Pings” inference was wrong; connection is stranger than modeled — revisit half-open TCP, nginx connection mapping, or agent timeout logic (#1 / #5). |
|
||
| **Neither Ping writes nor Pongs, yet agent stays “online” and heartbeats continue** | **Split connection or duplicate session**: inbound and outbound are not the same WebSocket the send task thinks (#5). Escalate to connection registry + nginx upstream connection audit. |
|
||
| **Text write succeeds on fresh post-eviction `connection_id`, same as above** | Rules out “stale socket only”; points to **persistent agent-side or site-path Text loss** (#2/#3/#4). |
|
||
|
||
That one test separates “server thinks it sent” from “bytes left the process,” validates or falsifies the Ping inference, and narrows the fault domain to **origin socket / agent / middlebox** without needing endpoint access first.
|
||
|
||
**Secondary high-value data (if the above is already partially instrumented):** nginx `tcpdump` or OpenResty debug on loopback `:3001` during a `hostname` dispatch — confirms whether Text frames leave the Rust process into nginx. That does not prove WAN delivery, but it collapses #1 vs #2/#3 on the origin host in one capture.
|