| name | description | metadata |
|---|---|---|
| project_gururmm_dispatch_hang_fix | GuruRMM fleet-wide command-dispatch hang root cause + fix (send_to try_send, 9dae20c) and the still-missing eviction | |
On 2026-06-22 the live GuruRMM server (172.16.3.30:3001) hung on every POST /api/agents/:id/command: requests timed out after 30s+ for all agents, while GET requests still worked. Command dispatch was down fleet-wide.
Root cause: AgentConnections::send_to (server src/ws/mod.rs) did a blocking tx.send(msg).await on a bounded (capacity 100) per-agent mpsc channel. When an agent socket is black-holed or half-open, its WS writer stops draining the channel, the channel fills, and send().await blocks forever. send_command holds state.agents.read().await across that await, so the read guard is never released. The next agent (re)connect then calls .write().await, which cannot be granted while that read guard is held; because tokio's RwLock is write-preferring, every later read (that is, every later dispatch) queues behind the waiting writer. One dead socket wedged the whole fleet. The recovery path "evict non-delivering connections" (7c578fd) had been reverted (80df458), leaving no escape hatch.
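The queuing behaviour described above can be illustrated with a toy model. This is only a sketch of a fair (write-preferring) lock's grant rule, not tokio's implementation; `FairLock`, `Req`, and `demo_wedge` are hypothetical names introduced for illustration, and the model has no release path because the stuck reader in this incident never releases.

```rust
use std::collections::VecDeque;

/// Kind of lock request (hypothetical type for this illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Req {
    Read,
    Write,
}

/// Toy model of a fair (write-preferring) read-write lock.
/// It only tracks who holds the lock and who is waiting, in FIFO order.
#[derive(Default)]
pub struct FairLock {
    active_readers: usize,
    writer_active: bool,
    waiters: VecDeque<Req>,
}

impl FairLock {
    /// Returns true if the request is granted immediately, false if it must wait.
    pub fn request(&mut self, req: Req) -> bool {
        // Fairness rule: nobody may jump ahead of a request that is already waiting.
        let nobody_waiting = self.waiters.is_empty();
        let granted = match req {
            Req::Read => nobody_waiting && !self.writer_active,
            Req::Write => nobody_waiting && !self.writer_active && self.active_readers == 0,
        };
        if granted {
            match req {
                Req::Read => self.active_readers += 1,
                Req::Write => self.writer_active = true,
            }
        } else {
            self.waiters.push_back(req);
        }
        granted
    }

    /// Number of requests currently queued.
    pub fn waiting(&self) -> usize {
        self.waiters.len()
    }
}

/// Replays the incident: a dispatch takes a read guard and never releases it,
/// an agent reconnect asks for the write lock, then another dispatch arrives.
pub fn demo_wedge() -> (bool, bool, bool, usize) {
    let mut lock = FairLock::default();
    let stuck_dispatch = lock.request(Req::Read); // granted, then blocks in send().await
    let reconnect = lock.request(Req::Write); // must wait for the stuck reader
    let next_dispatch = lock.request(Req::Read); // queues behind the waiting writer
    (stuck_dispatch, reconnect, next_dispatch, lock.waiting())
}
```

In this model the second read is refused even though only a reader holds the lock, which is the step that turns one stuck send into a fleet-wide stall.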
Fix (9dae20c, on main, deployed): send_to now uses non-blocking try_send. A full or closed channel returns "not delivered"; the command stays persisted and is re-offered by redispatch_pending_commands (on reconnect) and by the reaper (requeue_undelivered_commands). The failure stays local to the affected agent. Verified live: another agent ran a command end-to-end in ~5s.
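A minimal sketch of the non-blocking pattern follows. It is not the code from 9dae20c: it uses the standard library's bounded `sync_channel` as a stand-in for the tokio mpsc channel so that it runs without external crates, and the `Delivery` enum, the `String` message type, and the demo functions are assumptions made for illustration.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Outcome of one dispatch attempt (hypothetical type, not from the GuruRMM source).
#[derive(Debug, PartialEq)]
pub enum Delivery {
    Queued,
    NotDelivered,
}

/// Hand `msg` to one agent's outbound queue without ever waiting.
pub fn send_to(tx: &SyncSender<String>, msg: String) -> Delivery {
    match tx.try_send(msg) {
        Ok(()) => Delivery::Queued,
        // Full: the writer has stopped draining. Disconnected: the writer is gone.
        // Either way the caller gets an answer immediately and can leave the
        // command persisted for a later retry.
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
            Delivery::NotDelivered
        }
    }
}

/// Fill a capacity-2 queue that nobody drains, then attempt one more send.
pub fn demo_full_queue() -> Delivery {
    let (tx, _rx) = sync_channel::<String>(2); // _rx stays alive but never receives
    assert_eq!(send_to(&tx, "cmd-1".to_string()), Delivery::Queued);
    assert_eq!(send_to(&tx, "cmd-2".to_string()), Delivery::Queued);
    send_to(&tx, "cmd-3".to_string())
}

/// Send into a queue whose receiving side has already been dropped.
pub fn demo_closed_queue() -> Delivery {
    let (tx, rx) = sync_channel::<String>(2);
    drop(rx);
    send_to(&tx, "cmd-1".to_string())
}
```

Both demo functions return `Delivery::NotDelivered` instead of blocking, so a caller holding a shared lock would release it promptly.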
Still open / watch: the proper per-connection eviction of a black-holed socket is still absent (only reverted code existed). A truly half-open agent will keep heartbeating online while its server→agent channel silently drops messages (commands dispatch as running but never return → reaper fails them on timeout). If this recurs, finish the eviction/keepalive-drop work rather than relying on try_send alone.
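One possible shape for the missing eviction check is sketched below. It is an assumption, not the reverted 7c578fd code: `ConnHealth`, its methods, and the 30-second threshold are hypothetical, and the sketch only decides when a connection looks stalled. It treats a connection as evictable when messages have been queued for it but its writer has made no progress for `max_stall`.

```rust
use std::time::{Duration, Instant};

/// Per-connection delivery bookkeeping (hypothetical type for illustration).
pub struct ConnHealth {
    pending: usize,
    /// Start of the current stretch with queued messages and no writer progress.
    stalled_since: Option<Instant>,
}

impl ConnHealth {
    pub fn new() -> Self {
        Self { pending: 0, stalled_since: None }
    }

    /// Call when a message is queued for this connection.
    pub fn on_enqueue(&mut self, now: Instant) {
        self.pending += 1;
        if self.stalled_since.is_none() {
            self.stalled_since = Some(now);
        }
    }

    /// Call when the writer has written one queued message to the socket.
    pub fn on_write_ok(&mut self, now: Instant) {
        self.pending = self.pending.saturating_sub(1);
        // Progress resets the stall clock; an empty queue clears it entirely.
        self.stalled_since = if self.pending == 0 { None } else { Some(now) };
    }

    /// True when queued messages have seen no progress for at least `max_stall`.
    pub fn should_evict(&self, now: Instant, max_stall: Duration) -> bool {
        match self.stalled_since {
            Some(since) => now.duration_since(since) >= max_stall,
            None => false,
        }
    }
}

/// Walks one connection through: queued, still within budget, stalled, drained.
pub fn demo_eviction() -> (bool, bool, bool) {
    let t0 = Instant::now();
    let max_stall = Duration::from_secs(30); // assumed threshold
    let mut conn = ConnHealth::new();
    conn.on_enqueue(t0);
    let early = conn.should_evict(t0 + Duration::from_secs(5), max_stall);
    let late = conn.should_evict(t0 + Duration::from_secs(31), max_stall);
    conn.on_write_ok(t0 + Duration::from_secs(32));
    let after_drain = conn.should_evict(t0 + Duration::from_secs(100), max_stall);
    (early, late, after_drain)
}
```

A successful socket write is not proof the agent received the message, so a real implementation would likely pair a check like this with a keepalive or acknowledgement timeout.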
Deploy model: merging to gururmm main triggers the webhook build on .30 (rebuild + systemctl restart, auto-rollback if the binary won't start). See the gururmm-build skill. Pairs with project_guruscan_in_test_paused.