sync: auto-sync from HOWARD-HOME at 2026-06-22 14:04:53
Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-22 14:04:53
# GuruRMM fleet dispatch-hang fix + GuruScan Emsisoft test (paused) - 2026-06-22

## User

- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech

## Session Summary

Resumed the GuruScan multi-engine malware-scanner verification on test VM DESKTOP-MS42HNC
(the open item from yesterday: run the heavy Emsisoft full `/f=C:\` engine to confirm full
scan + full removal + automated lifecycle, after HitmanPro's 36-threat lifecycle was already
proven). Updated this machine's `guru-scan` submodule from a 3-week-old `2f8fbcd` to the
hardened `fb09102`, confirmed the test VM was back online, and launched the automated
`scan-one Emsisoft` harness, which died instantly at the first step with
`[ERROR] Dispatch failed`.

Tracing that failure uncovered a production incident much larger than the scan: the live
GuruRMM server (`172.16.3.30:3001`) was hanging on **every** `POST /api/agents/:id/command`
(30s+ timeouts) while `GET` worked. Reproduced on two independent online agents: command
dispatch was down **fleet-wide**. Root-caused it in the deployed `main` code:
`AgentConnections::send_to` did a blocking `tx.send(msg).await` on a bounded (cap 100)
per-agent mpsc channel. A black-holed/half-open agent socket stops its WS writer from
draining the channel, the channel fills, and the send blocks forever. `send_command` holds
`state.agents.read().await` across that await, so the next agent (re)connect's
`.write().await` waits on the stuck read guard, and because tokio's `RwLock` is
write-preferring, every subsequent dispatch's `.read().await` queues behind that pending
writer. One dead socket wedged the whole fleet. The recovery path ("evict non-delivering
connections", `7c578fd`) had been reverted in `80df458`, leaving no escape hatch.

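The two ingredients of that wedge can be reproduced in miniature. The sketch below is not the GuruRMM code: it swaps in the standard library's `sync_channel` and `std::sync::RwLock` for the tokio types, and the names `reproduce_wedge` and `agent-1` are hypothetical. It only illustrates that a bounded channel nobody drains accepts exactly `capacity` messages, and that a writer cannot take the lock while the dispatch path still holds its read guard.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::RwLock;

/// Returns (messages accepted before the channel was full,
///          whether a writer could take the lock while dispatch held its read guard).
fn reproduce_wedge(capacity: usize) -> (usize, bool) {
    // `_rx` stays alive but is never read, like a WS writer stuck on a half-open socket.
    let (tx, _rx) = sync_channel::<u32>(capacity);

    // Hypothetical stand-in for the per-agent connection map.
    let agents: RwLock<HashMap<String, SyncSender<u32>>> = RwLock::new(HashMap::new());
    agents.write().unwrap().insert("agent-1".to_string(), tx);

    // Dispatch path: take the read guard, then send while still holding it.
    let guard = agents.read().unwrap();
    let tx = &guard["agent-1"];
    let mut accepted = 0;
    while tx.try_send(accepted as u32).is_ok() {
        accepted += 1;
    }
    // A blocking send would park right here, with `guard` still held.

    // Reconnect path: needs the write lock, but the read guard is still outstanding.
    let writer_can_proceed = agents.try_write().is_ok();
    drop(guard);
    (accepted, writer_can_proceed)
}
```

Under these assumptions, `reproduce_wedge(100)` reports 100 accepted messages and a writer that cannot proceed, which mirrors the state the server was in once one agent's channel had filled.
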
Before touching anything, preserved the "HELD" guru-rmm working tree (detached HEAD at the
SPEC-030 uninstall-engine commit `4ca1d06` plus 8 uncommitted files of unrelated
DoS-hardening, dashboard, and docs work) onto branch `wip/held-2026-06-22` (commit `3a6277a`)
and pushed it to origin for durable backup; nothing was lost. Then, with Howard's clearance to
follow the RMM update rules, fixed `send_to` to use non-blocking `try_send` (a full/closed
channel returns "not delivered"; the command stays persisted and is re-offered by
`redispatch_pending_commands` on reconnect and the reaper `requeue_undelivered_commands`).
Verified with `cargo check` (via the `gururmm-build` skill), committed `9dae20c`, merged to
`main`, and pushed, which triggers the webhook build+restart on `.30`. Dispatch recovered
about 2.5 min later; verified a command ran end-to-end on a healthy second agent
(DESKTOP-QNP3ON5) in ~5s.

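The shape of that fix can be sketched as follows. This is a minimal illustration, not the deployed `send_to`: it uses the standard library's `SyncSender::try_send` in place of tokio's, and `AgentLink`, its in-memory `pending` queue, and `redispatch_pending` are hypothetical stand-ins for the persisted command store and the redispatch-on-reconnect path.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for one agent's connection entry.
struct AgentLink {
    tx: SyncSender<String>,
    /// Commands that were accepted by the server but not handed to the socket writer.
    pending: VecDeque<String>,
}

impl AgentLink {
    fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (AgentLink { tx, pending: VecDeque::new() }, rx)
    }

    /// Non-blocking dispatch: `true` if the writer queue took the command,
    /// `false` ("not delivered") if the channel is full or closed.
    /// An undelivered command is kept so it can be re-offered later.
    fn send_to(&mut self, cmd: &str) -> bool {
        match self.tx.try_send(cmd.to_string()) {
            Ok(()) => true,
            Err(TrySendError::Full(cmd)) | Err(TrySendError::Disconnected(cmd)) => {
                self.pending.push_back(cmd);
                false
            }
        }
    }

    /// Re-offer kept commands in order; stops at the first one that still
    /// cannot be delivered. Returns how many were delivered.
    fn redispatch_pending(&mut self) -> usize {
        let mut delivered = 0;
        while let Some(cmd) = self.pending.pop_front() {
            match self.tx.try_send(cmd) {
                Ok(()) => delivered += 1,
                Err(TrySendError::Full(cmd)) | Err(TrySendError::Disconnected(cmd)) => {
                    self.pending.push_front(cmd);
                    break;
                }
            }
        }
        delivered
    }
}
```

The point of the design is that the caller always gets an answer immediately: a stuck agent costs one `false` return rather than a parked task holding a lock.
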
The test VM itself then needed recovery: it went silent at the exact moment of the server
restart and didn't reconnect (its `online` status was a stale grace-period value). Howard
restarted the GuruRMM agent service on it; the agent reconnected and, after draining the
redispatched backlog, executed commands cleanly (~1s round-trips). The ~480 junk commands
created by the hung-dispatch attempts self-drained to `failed` via the reaper (they won't
re-run). Relaunched the real `scan-one Emsisoft` run: setup restored 516 malware samples and
the detached, no-cap `GuruScan-one` scheduled task reached `state=Running`, so the full scan
was underway. Howard then rebooted DESKTOP-MS42HNC to test other software, interrupting the
scan. Per his direction, the Emsisoft verification is marked **IN TEST / paused** to resume
later; the background harness was stopped, scratch files cleaned, and state saved.

## Key Decisions

- **Fixed + deployed the server bug rather than just reporting it:** Howard explicitly cleared
"following the rules for updating the RMM and following Mike's requests do what is needed."
The merge-to-main deploy is the documented path (`gururmm-build` skill) and the only way to
unblock; auto-rollback protects against a bad binary.
- **Minimal hotfix = non-blocking `try_send`, signature unchanged:** kept `send_to` as
`async fn` so all ~24 callers stay untouched; only the body changed. A black-holed agent now
degrades to "command queued" (re-offered on reconnect / by the reaper) instead of wedging the
fleet. Did NOT re-implement the reverted eviction (larger change, reverted for a reason); left
as a documented follow-up.
- **Preserved the HELD guru-rmm work before any git surgery:** branched the dangling SPEC-030
commit + uncommitted edits onto `wip/held-2026-06-22` and pushed it, so deploying a clean
main-based fix never risked the in-progress work. Did not bundle that unrelated WIP into the
production deploy.
- **Did not manually fight the junk-command backlog:** the server's reaper fails undeliverable
commands on its own (`pending → failed`), so manual per-id cancel was redundant and raced it.
- **Marked Emsisoft verification IN TEST / paused:** VM rebooted mid-scan for other testing;
resume cleanly via the self-restoring harness.

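The reaper behavior that made manual cleanup unnecessary can be pictured as a simple state transition. The sketch below is an assumption-laden illustration, not the server's reaper: the `Command` and `Status` types, the `age` field, the age threshold, and the shortened failure reason are all hypothetical.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Pending,
    Completed,
    Failed(String),
}

/// Hypothetical command record; `age` is the time since it was created.
struct Command {
    id: u32,
    status: Status,
    age: Duration,
}

/// Move every still-pending command older than `max_age` to `Failed`.
/// Completed or already-failed commands are left untouched.
/// Returns how many commands were reaped.
fn reap_undeliverable(cmds: &mut [Command], max_age: Duration) -> usize {
    let mut reaped = 0;
    for cmd in cmds.iter_mut() {
        if cmd.status == Status::Pending && cmd.age > max_age {
            cmd.status = Status::Failed("agent unreachable".to_string());
            reaped += 1;
        }
    }
    reaped
}
```

Because the transition only ever touches `Pending` rows, a reaped command ends up in a terminal state and is not re-offered, which matches the observation that the junk commands did not re-run.
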
## Problems Encountered

- **Fleet-wide command-dispatch hang:** blocking bounded-channel send in `send_to` + read-lock
held across the await → blocked `RwLock` writer, with every later reader queued behind it.
Fixed with `try_send` (`9dae20c`), deployed, verified end-to-end on a second agent.
- **Bash cwd parked in a submodule:** an earlier `cd projects/msp-tools/guru-rmm` persisted, so
relative paths failed (`guru-scan` "not found"). Switched to absolute `C:/claudetools/...` paths.
- **Test VM silent after server restart:** DESKTOP-MS42HNC didn't reconnect on its own (laptop
VM, sleeps). Howard restarted the agent service; it recovered. A first post-reconnect test
command timed out transiently while the agent drained the redispatch backlog, then went healthy.
- **Scan interrupted by reboot:** Howard rebooted the VM to test other software; the in-flight
Emsisoft scan was lost. Marked paused; harness self-restores samples on resume.

## Configuration Changes

Modified (guru-rmm submodule, committed + deployed):
- `server/src/ws/mod.rs`: `AgentConnections::send_to` now uses non-blocking `tx.try_send(msg)`
instead of `tx.send(msg).await`; added a detailed comment on the fleet-wide-wedge root cause.
Commit `9dae20c` on `main` (pushed → webhook build+deploy).

Preserved (guru-rmm submodule, pushed, NOT deployed):
- Branch `wip/held-2026-06-22` (`3a6277a` on top of `4ca1d06`): the held SPEC-030 uninstall
prototype + uncommitted DoS-hardening (`cap_field`/`cap_vec` in `ws/mod.rs`, `agents.rs`),
dashboard edits, docs, `script-library/time-date/`, migration `060_alert_mutes_agent_id_index.sql`.

Submodule pointers:
- `guru-scan` updated to hardened `fb09102` (was `2f8fbcd`).
- `guru-rmm` submodule moved from the HELD detached state to `main@9dae20c`.

Memory (main repo):
- `.claude/memory/project_guruscan_in_test_paused.md` (new) + index line.
- `.claude/memory/project_gururmm_dispatch_hang_fix.md` (new) + index line.

## Credentials & Secrets

- None created or discovered. GuruRMM API creds read from vault `infrastructure/gururmm-server.sops.yaml`
(`credentials.gururmm-api.admin-email` / `admin-password`).
- **DESKTOP-MS42HNC still has Windows Defender RTP + Tamper Protection DISABLED** (Howard, at the
console, for malware testing); must be re-enabled during final cleanup.

## Infrastructure & Servers

- **GuruRMM server:** `http://172.16.3.30:3001`; deploy = merge to `gururmm` `main` → Gitea
webhook on `.30` rebuilds (`cargo build --release`, SQLX_OFFLINE) +
`systemctl stop/start gururmm-server`, auto-rollback if the new binary won't start. Single
server binary (`/opt/gururmm/gururmm-server`), no beta/prod split for the server. Build log
`/var/log/gururmm-build-server.log`.
- **DESKTOP-MS42HNC:** agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`; client AZ Computer Guru,
site Howard-VM; flaky laptop test VM (sleeps/reboots). Has malware sample set restored to
`C:\Users\Owner\Desktop\malware-samples-master` (516 files) + zip in Downloads; Defender disabled.
- **DESKTOP-QNP3ON5:** agent id `ba173f0c-19e8-488d-834c-1b6f6dfd5699`; used as the healthy
control agent to verify dispatch recovery (command completed in ~5s).
- GuruScan endpoint paths: `C:\GuruScan\` (module), `C:\GuruScan\downloads\` (scanner EXEs),
`C:\ScanLogs\<scanid>\` (logs).

## Commands & Outputs

- Verify server compiles the deploy way: `bash .claude/skills/gururmm-build/scripts/verify.sh server --check` → `[OK] server check passed`.
- Deploy: merge `fix/ws-send-to-nonblocking-dispatch-hang` → `main`, `git push origin main` (`b04caf2..9dae20c`).
- Recovery verified: dispatch POST `HTTP=200` in ~0.1s (was `HTTP=000` 30s timeout); `Write-Output` on QNP3ON5 `status=completed` in ~5s.
- Resume the paused scan (hands-off, self-restores samples):
`bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`
- Reaper behavior observed: undeliverable backlog `pending → failed` ("agent unreachable — server-side reaper"), ~480 commands self-cleared.

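The "responds quickly versus hangs until timeout" check used for recovery verification can be approximated with a small probe. This is a hedged sketch using only the standard library, not the tooling used in this session: `probe_post` is a hypothetical helper, it sends an empty JSON body with no authentication (so a live server would presumably answer with an auth or validation error instead of 200), and it only reads the status line.

```rust
use std::io::{BufRead, BufReader, Error, ErrorKind, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// POST an empty JSON body to `path` and return the HTTP status code.
/// A server that accepts the connection but never answers yields an
/// I/O error once `timeout` elapses, instead of blocking forever.
fn probe_post(addr: SocketAddr, path: &str, timeout: Duration) -> std::io::Result<u16> {
    let mut stream = TcpStream::connect_timeout(&addr, timeout)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;

    // Build the whole request first so it goes out in a single write.
    let body = "{}";
    let request = format!(
        "POST {} HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\n\
         Content-Length: {}\r\nConnection: close\r\n\r\n{}",
        path,
        addr,
        body.len(),
        body
    );
    stream.write_all(request.as_bytes())?;

    // The status line looks like "HTTP/1.1 200 OK"; the code is the second token.
    let mut status_line = String::new();
    BufReader::new(stream).read_line(&mut status_line)?;
    status_line
        .split_whitespace()
        .nth(1)
        .and_then(|code| code.parse::<u16>().ok())
        .ok_or_else(|| Error::new(ErrorKind::InvalidData, "malformed HTTP status line"))
}
```

Any `Ok(code)` means the dispatch endpoint is answering; an `Err` after the full timeout is the signature of the hang described in this log.
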
## Pending / Incomplete Tasks

- **[IN TEST / PAUSED] Emsisoft full-scan/removal verification on DESKTOP-MS42HNC:** VM rebooted
mid-scan. Resume with the harness above; expect Emsisoft to clear well beyond HitmanPro's 36
(incl. the `.js` droppers HitmanPro ignores). Then confirm `results.json` `total_threats` /
`reboot_required` + reboot-cleanup task.
- **Clean up DESKTOP-MS42HNC:** remove samples + zip + EICAR + test tasks, clear scanner
quarantine, **re-enable Windows Defender RTP + Tamper Protection**.
- **guru-scan gitlink in main repo:** intentionally bumped to `fb09102` this session; no
longer held (guru-rmm is unblocked).
- **Follow-up (server):** the proper black-hole connection eviction / keepalive-drop is still
absent (was reverted in `80df458`). If a half-open agent recurs (heartbeats `online` but
commands dispatch `running` → reaper-fail on timeout), finish that work rather than relying on
`try_send` alone. Held DoS-hardening on `wip/held-2026-06-22` is unreviewed/unmerged.

## Reference Information

- guru-rmm fix commit: `9dae20c` on `main` (pushed to `https://git.azcomputerguru.com/azcomputerguru/gururmm.git`).
- Preserved held work: branch `wip/held-2026-06-22` (`3a6277a` over `4ca1d06`), pushed to origin.
- Related prior commits: `7c578fd` (evict non-delivering connections, reverted by `80df458`),
`ca1657b` (reaper re-delivers black-holed commands), `b93f2ef` (durable command-ACK foundation).
- guru-scan hardened commit: `fb09102`.
- Build model + verifier: `gururmm-build` skill; `projects/msp-tools/guru-rmm/docs/BUILD.md`.
- Yesterday's GuruScan hardening log: `session-logs/2026-06/2026-06-22-howard-guruscan-hardening.md`.