claudetools/.claude/memory/project_gururmm_dispatch_hang_fix.md at 373883fb4889b83cd4eae46c70b41cd2b4c38d7f


Howard Enos 26aa5034f1 sync: auto-sync from HOWARD-HOME at 2026-06-22 14:04:53

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-22 14:04:53

2026-06-22 14:05:26 -07:00

2.0 KiB


name: project_gururmm_dispatch_hang_fix

description: GuruRMM fleet-wide command-dispatch hang root cause + fix (send_to try_send, 9dae20c) and the still-missing eviction

metadata:
  type: project

On 2026-06-22 the live GuruRMM server (172.16.3.30:3001) hung on every POST /api/agents/:id/command (30s+ timeouts, all agents; GET worked) — command dispatch was down fleet-wide.

Root cause: AgentConnections::send_to (server src/ws/mod.rs) did a blocking tx.send(msg).await on a bounded (cap 100) per-agent mpsc channel. When an agent socket is black-holed/half-open, its WS writer stops draining the channel → it fills → send().await blocks forever. send_command holds state.agents.read().await across that await, so the read guard is never released. The next agent (re)connect's .write().await then waits on that guard, and because tokio's RwLock is write-preferring, every later read (i.e. every later dispatch) queues behind the pending writer. One dead socket wedged the whole fleet. The recovery path "evict non-delivering connections" (7c578fd) had been reverted (80df458), leaving no escape hatch.
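The fill-up step of the root cause can be reproduced in isolation. The sketch below is not the server code: it uses std's `sync_channel` as a stand-in for the tokio bounded mpsc channel, and `fill_until_full` is a hypothetical helper name. It counts how many messages a bounded channel accepts when nothing drains the receiving end, which is the point at which a blocking send would wait indefinitely.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Fill a bounded channel whose receiver is alive but never read, and
/// return how many messages were accepted before the channel reported Full.
/// (Hypothetical helper; std's sync_channel stands in for tokio's mpsc.)
fn fill_until_full(capacity: usize) -> usize {
    // `_rx` stays in scope so the channel is "connected" but undrained,
    // like a WS writer task that has stopped pulling from its queue.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    loop {
        match tx.try_send(0) {
            Ok(()) => accepted += 1,
            // A blocking send would park here until the receiver drains,
            // which a black-holed connection never does.
            Err(TrySendError::Full(_)) => return accepted,
            Err(TrySendError::Disconnected(_)) => {
                unreachable!("receiver is still alive in this sketch")
            }
        }
    }
}
```

Calling `fill_until_full(100)` returns 100: the queue absorbs exactly its capacity, and the next message is the one that would block.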

Fix (9dae20c, on main, deployed): send_to now uses non-blocking try_send — a full/closed channel returns "not delivered"; the command stays persisted and is re-offered by redispatch_pending_commands (reconnect) + the reaper requeue_undelivered_commands. Failure stays local. Verified live (other agent ran a command end-to-end in ~5s).
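A minimal sketch of the non-blocking shape described above, under stated assumptions: the `Delivery` enum, the `connect` helper, and the `HashMap` registry are illustrative and not taken from src/ws/mod.rs, and std's `SyncSender::try_send` stands in for the tokio sender. It only shows that a full or closed queue yields "not delivered" immediately, so one stuck agent cannot hold up dispatch to another.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Outcome of one dispatch attempt (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Delivery {
    Queued,
    NotDelivered,
}

/// Simplified stand-in for the server's connection registry.
struct AgentConnections {
    senders: HashMap<String, SyncSender<String>>,
}

impl AgentConnections {
    fn new() -> Self {
        Self { senders: HashMap::new() }
    }

    /// Register an agent with a bounded queue and return the receiving end.
    fn connect(&mut self, agent_id: &str, capacity: usize) -> Receiver<String> {
        let (tx, rx) = sync_channel(capacity);
        self.senders.insert(agent_id.to_string(), tx);
        rx
    }

    /// Never waits: an unknown agent, a full queue, or a closed queue all
    /// report NotDelivered, leaving the caller to rely on the persisted
    /// command being re-offered later.
    fn send_to(&self, agent_id: &str, msg: String) -> Delivery {
        match self.senders.get(agent_id) {
            Some(tx) => match tx.try_send(msg) {
                Ok(()) => Delivery::Queued,
                Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                    Delivery::NotDelivered
                }
            },
            None => Delivery::NotDelivered,
        }
    }
}
```

With a capacity-1 queue that is never drained, the first `send_to` returns `Queued` and the second returns `NotDelivered`, while sends to a healthy agent continue to succeed.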

Still open / watch: the proper per-connection eviction of a black-holed socket is still absent (only reverted code existed). A truly half-open agent will keep heartbeating online while its server→agent channel silently drops messages (commands dispatch as running but never return → reaper fails them on timeout). If this recurs, finish the eviction/keepalive-drop work rather than relying on try_send alone.
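One possible shape for the missing eviction, offered only as a sketch: count consecutive failed deliveries per connection and drop the connection once a threshold is reached. This is an assumption about how the work could be finished, not a reconstruction of the reverted 7c578fd; `EvictionTracker`, `max_failures`, and the method names are all hypothetical, and a keepalive-based drop would look different.

```rust
use std::collections::HashMap;

/// Hypothetical per-connection delivery bookkeeping.
struct EvictionTracker {
    max_failures: u32,
    // agent id -> consecutive failed deliveries
    failures: HashMap<String, u32>,
}

impl EvictionTracker {
    fn new(max_failures: u32) -> Self {
        Self { max_failures, failures: HashMap::new() }
    }

    /// Start tracking a (re)connected agent with a clean slate.
    fn register(&mut self, agent_id: &str) {
        self.failures.insert(agent_id.to_string(), 0);
    }

    fn is_connected(&self, agent_id: &str) -> bool {
        self.failures.contains_key(agent_id)
    }

    /// Record one delivery outcome. Returns true if this call evicted the
    /// connection; unknown agents are ignored and return false.
    fn record(&mut self, agent_id: &str, delivered: bool) -> bool {
        let count = match self.failures.get_mut(agent_id) {
            Some(count) => count,
            None => return false,
        };
        if delivered {
            *count = 0; // any success resets the streak
            return false;
        }
        *count += 1;
        if *count >= self.max_failures {
            // Dropping the entry models closing the socket so the agent
            // is forced to reconnect and pick up pending commands.
            self.failures.remove(agent_id);
            return true;
        }
        false
    }
}
```

With `max_failures` set to 3, two failures followed by a success reset the streak, and only three failures in a row evict the connection.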

Deploy model: merging to gururmm main triggers the webhook build on .30 (rebuild + systemctl restart, auto-rollback if the binary won't start). See the gururmm-build skill. Pairs with project_guruscan_in_test_paused.
