sync: auto-sync from HOWARD-HOME at 2026-06-22 14:04:53

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-22 14:04:53
2026-06-22 14:05:24 -07:00
parent 48286e80e0
commit 26aa5034f1
5 changed files with 194 additions and 1 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -170,3 +170,5 @@
 - [AI-auth product boundary](project_ai_auth_product_boundary.md) — ClaudeTools/ClaudeTools 3.0 = internal-only, per-person subscription OAuth ok; GuruRMM = sellable, customer brings own API key (never ACG's subscription); backend dev = internal. Anthropic ToS bans subscription auth in third-party products.
 - [RMM SYSTEM context can't see user mapped drives](feedback_rmm_system_context_mapped_drives.md) — RMM runs as SYSTEM; `Test-Path F:\` etc. is False even when the user's mapped/redirected drive exists. Diagnose mapped-drive/redirect issues in `context:user_session`. Elevated apps (e.g. QB DB Server Manager "unable to retrieve root folder") need `EnableLinkedConnections=1` + reboot.
 - [AD2 = Dataforth-ops fork](project_ad2_dataforth_fork.md) — branch ad2 = main + thin Dataforth layer; keep fork edits ADDITIVE (Dataforth context in clients/dataforth/CLAUDE.dataforth.md, NOT .claude/CLAUDE.md); rebase onto main directly when sync.sh self-lock hits; no vault/jq/sops/age on this box.
+- [GuruScan verification IN TEST / paused](project_guruscan_in_test_paused.md) — multi-engine scanner verify on DESKTOP-MS42HNC paused 2026-06-22 (VM rebooted mid-Emsisoft run); HitmanPro done (36 removed), Emsisoft full-scan unverified; resume `guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`; Defender RTP/Tamper still off on VM
+- [GuruRMM fleet dispatch-hang fix](project_gururmm_dispatch_hang_fix.md) — blocking send_to on a full bounded channel to one black-holed agent wedged ALL command dispatch; fixed with try_send (9dae20c, deployed); proper black-hole eviction still missing (was reverted in 80df458) — finish it if it recurs
--- a/.claude/memory/project_gururmm_dispatch_hang_fix.md
+++ b/.claude/memory/project_gururmm_dispatch_hang_fix.md
@@ -0,0 +1,16 @@
+---
+name: project_gururmm_dispatch_hang_fix
+description: GuruRMM fleet-wide command-dispatch hang root cause + fix (send_to try_send, 9dae20c) and the still-missing eviction
+metadata:
+  type: project
+---
+
+On 2026-06-22 the live GuruRMM server (`172.16.3.30:3001`) hung on **every** `POST /api/agents/:id/command` (30s+ timeouts, all agents; GET worked) — command dispatch was down fleet-wide.
+
+**Root cause:** `AgentConnections::send_to` (server `src/ws/mod.rs`) did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel. A black-holed/half-open agent socket stops its WS writer draining the channel → it fills → `send().await` blocks forever. `send_command` holds `state.agents.read().await` across that await, so the next agent (re)connect's `.write().await` starves tokio's write-preferring `RwLock`, queuing all later dispatches behind it. **One dead socket wedged the whole fleet.** The recovery path "evict non-delivering connections" (`7c578fd`) had been **reverted** (`80df458`), leaving no escape hatch.
+
+**Fix (`9dae20c`, on `main`, deployed):** `send_to` now uses non-blocking `try_send` — a full/closed channel returns "not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands` (reconnect) + the reaper `requeue_undelivered_commands`. Failure stays local. Verified live (other agent ran a command end-to-end in ~5s).
+
+**Still open / watch:** the proper per-connection eviction of a black-holed socket is still absent (only reverted code existed). A truly half-open agent will keep heartbeating `online` while its server→agent channel silently drops messages (commands dispatch as `running` but never return → reaper fails them on timeout). If this recurs, finish the eviction/keepalive-drop work rather than relying on `try_send` alone.
+
+Deploy model: merging to `gururmm` `main` triggers the webhook build on `.30` (rebuild + `systemctl` restart, auto-rollback if the binary won't start). See the `gururmm-build` skill. Pairs with [[project_guruscan_in_test_paused]].
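
Reviewer note: the blocking-versus-non-blocking behaviour described in the memory file above can be sketched as follows. This is a minimal illustration and not the code in `server/src/ws/mod.rs`. It assumes the standard library's bounded `sync_channel` as a stand-in for the server's async bounded mpsc channel, uses a capacity of 2 instead of 100, and the `Delivery` enum and this free-standing `send_to` are hypothetical names introduced only for the example.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Outcome of one delivery attempt (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    NotDelivered(&'static str),
}

/// Non-blocking send: returns immediately instead of waiting for the
/// receiver to drain a full queue.
fn send_to(tx: &SyncSender<String>, msg: String) -> Delivery {
    match tx.try_send(msg) {
        Ok(()) => Delivery::Delivered,
        // Queue is full: the consumer has stopped draining it.
        Err(TrySendError::Full(_)) => Delivery::NotDelivered("channel full"),
        // Receiver is gone: the connection was torn down.
        Err(TrySendError::Disconnected(_)) => Delivery::NotDelivered("channel closed"),
    }
}

fn main() {
    // Capacity 2 stands in for the per-agent capacity of 100.
    // `_rx` is kept alive but never read, mimicking a black-holed writer.
    let (tx, _rx) = sync_channel::<String>(2);

    println!("{:?}", send_to(&tx, "cmd-1".to_string()));
    println!("{:?}", send_to(&tx, "cmd-2".to_string()));
    // The queue is now full; a blocking send would wait here forever.
    println!("{:?}", send_to(&tx, "cmd-3".to_string()));

    // A dropped receiver is reported as a closed channel.
    let (tx_closed, rx_closed) = sync_channel::<String>(2);
    drop(rx_closed);
    println!("{:?}", send_to(&tx_closed, "cmd-4".to_string()));
}
```

In this sketch the third send returns straight away with a "not delivered" result, so a caller holding a shared lock would release it promptly. Under the same conditions a blocking send would never return, which is the wedge the commit describes.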
--- a/.claude/memory/project_guruscan_in_test_paused.md
+++ b/.claude/memory/project_guruscan_in_test_paused.md
@@ -0,0 +1,17 @@
+---
+name: project_guruscan_in_test_paused
+description: GuruScan multi-engine scan verification is IN TEST / paused on DESKTOP-MS42HNC (resume steps + state)
+metadata:
+  type: project
+---
+
+GuruScan (multi-engine malware scanner, `projects/msp-tools/guru-scan`, hardened at `fb09102`) is **IN TEST — PAUSED** as of 2026-06-22. Verifying full-scan + full-removal + automated lifecycle on test VM **DESKTOP-MS42HNC** (agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`, client AZ Computer Guru / site Howard-VM — a flaky laptop VM that sleeps/reboots).
+
+State:
+- **HitmanPro** lifecycle already verified (36 threats detected+removed, reboot-cleanup task fires).
+- **Emsisoft** full `/f=C:\` run (the heavy ~80-min engine) was launched with 516 samples staged but **interrupted by a VM reboot — NOT yet verified**. This is the open item.
+- Resume hands-off: `bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft` (self-restores samples, detached no-cap, reports removal + results.json + cleanup task).
+
+Cleanup still owed on the VM once testing is done: remove samples/zip/EICAR/test tasks, clear scanner quarantine, and **re-enable Windows Defender RTP + Tamper Protection** (disabled at the console for malware testing).
+
+Blocker that surfaced + was fixed this session: a fleet-wide RMM command-dispatch hang — see [[project_gururmm_dispatch_hang_fix]].
--- a/.disp.json
+++ b/.disp.json
@@ -1 +0,0 @@
-{"command_type":"powershell","command":"Write-Output HELLO_FROM_TEST","timeout_seconds":30}
--- a/session-logs/2026-06/2026-06-22-howard-gururmm-dispatch-hang-fix-guruscan-paused.md
+++ b/session-logs/2026-06/2026-06-22-howard-gururmm-dispatch-hang-fix-guruscan-paused.md
@@ -0,0 +1,159 @@
+# GuruRMM fleet dispatch-hang fix + GuruScan Emsisoft test (paused) — 2026-06-22
+
+## User
+- **User:** Howard Enos (howard)
+- **Machine:** Howard-Home
+- **Role:** tech
+
+## Session Summary
+
+Resumed the GuruScan multi-engine malware-scanner verification on test VM DESKTOP-MS42HNC
+(the open item from yesterday: run the heavy Emsisoft full `/f=C:\` engine to confirm full
+scan + full removal + automated lifecycle, after HitmanPro's 36-threat lifecycle was already
+proven). Updated this machine's `guru-scan` submodule from a 3-week-old `2f8fbcd` to the
+hardened `fb09102`, confirmed the test VM was back online, and launched the automated
+`scan-one Emsisoft` harness — which died instantly at the first step with `[ERROR] Dispatch
+failed`.
+
+Tracing that failure uncovered a production incident much larger than the scan: the live
+GuruRMM server (`172.16.3.30:3001`) was hanging on **every** `POST /api/agents/:id/command`
+(30s+ timeouts) while `GET` worked. Reproduced on two independent online agents — command
+dispatch was down **fleet-wide**. Root-caused it in the deployed `main` code: `AgentConnections::send_to`
+did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel; a
+black-holed/half-open agent socket stops its WS writer draining the channel, it fills, and
+the send blocks forever. `send_command` holds `state.agents.read().await` across that await,
+so the next agent (re)connect's `.write().await` starves tokio's write-preferring `RwLock`,
+queuing all subsequent dispatches behind it — one dead socket wedged the whole fleet. The
+recovery path ("evict non-delivering connections", `7c578fd`) had been reverted in `80df458`,
+leaving no escape hatch.
+
+Before touching anything, preserved the "HELD" guru-rmm working tree (detached HEAD at the
+SPEC-030 uninstall-engine commit `4ca1d06` plus 8 uncommitted files of unrelated DoS-hardening,
+dashboard, and docs work) onto branch `wip/held-2026-06-22` (commit `3a6277a`) and pushed it
+to origin for durable backup — nothing lost. Then, with Howard's clearance to follow the RMM
+update rules, fixed `send_to` to use non-blocking `try_send` (a full/closed channel returns
+"not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands`
+on reconnect and the reaper `requeue_undelivered_commands`). Verified with `cargo check`
+(via the `gururmm-build` skill), committed `9dae20c`, merged to `main`, and pushed — which
+triggers the webhook build+restart on `.30`. Dispatch recovered ~2.5 min later; verified a
+command ran end-to-end on a healthy second agent (DESKTOP-QNP3ON5) in ~5s.
+
+The test VM itself then needed recovery: it went silent at the exact moment of the server
+restart and didn't reconnect (its `online` status was a stale grace-period value). Howard restarted the
+GuruRMM agent service on it; the agent reconnected and, after draining the redispatched
+backlog, executed commands cleanly (~1s round-trips). The ~480 junk commands created by the
+hung-dispatch attempts self-drained to `failed` via the reaper (won't re-run). Relaunched the
+real `scan-one Emsisoft` run: setup restored 516 malware samples and the detached, no-cap
+`GuruScan-one` scheduled task reached `state=Running` — the full scan was underway. Howard then
+rebooted DESKTOP-MS42HNC to test other software, interrupting the scan. Per his direction, the
+Emsisoft verification is marked **IN TEST / paused** to resume later; the background harness was
+stopped, scratch files cleaned, and state saved.
+
+## Key Decisions
+
+- **Fixed + deployed the server bug rather than just reporting it** — Howard explicitly cleared
+  "following the rules for updating the RMM and following Mike's requests do what is needed."
+  The merge-to-main deploy is the documented path (`gururmm-build` skill) and the only way to
+  unblock; auto-rollback protects a bad binary.
+- **Minimal hotfix = non-blocking `try_send`, signature unchanged** — kept `send_to` as
+  `async fn` so all ~24 callers stay untouched; only the body changed. A black-holed agent now
+  degrades to "command queued" (re-offered on reconnect / by the reaper) instead of wedging the
+  fleet. Did NOT re-implement the reverted eviction (larger change, reverted for a reason) — left
+  as a documented follow-up.
+- **Preserved the HELD guru-rmm work before any git surgery** — branched the dangling SPEC-030
+  commit + uncommitted edits onto `wip/held-2026-06-22` and pushed it, so deploying a clean
+  main-based fix never risked the in-progress work. Did not bundle that unrelated WIP into the
+  production deploy.
+- **Did not manually fight the junk-command backlog** — the server's reaper fails undeliverable
+  commands on its own (`pending → failed`), so manual per-id cancel was redundant and raced it.
+- **Marked Emsisoft verification IN TEST / paused** — VM rebooted mid-scan for other testing;
+  resume cleanly via the self-restoring harness.
+
+## Problems Encountered
+
+- **Fleet-wide command-dispatch hang** — blocking bounded-channel send in `send_to` + read-lock
+  held across the await → RwLock writer starvation. Fixed with `try_send` (`9dae20c`), deployed,
+  verified end-to-end on a second agent.
+- **Bash cwd parked in a submodule** — an earlier `cd projects/msp-tools/guru-rmm` persisted, so
+  relative paths failed (`guru-scan` "not found"). Switched to absolute `C:/claudetools/...` paths.
+- **Test VM silent after server restart** — DESKTOP-MS42HNC didn't reconnect on its own (laptop
+  VM, sleeps). Howard restarted the agent service; it recovered. A first post-reconnect test
+  command timed out transiently while the agent drained the redispatch backlog, then went healthy.
+- **Scan interrupted by reboot** — Howard rebooted the VM to test other software; the in-flight
+  Emsisoft scan was lost. Marked paused; harness self-restores samples on resume.
+
+## Configuration Changes
+
+Modified (guru-rmm submodule, committed + deployed):
+- `server/src/ws/mod.rs` — `AgentConnections::send_to` now uses non-blocking `tx.try_send(msg)`
+  instead of `tx.send(msg).await`; added a detailed comment on the fleet-wide-wedge root cause.
+  Commit `9dae20c` on `main` (pushed → webhook build+deploy).
+
+Preserved (guru-rmm submodule, pushed, NOT deployed):
+- Branch `wip/held-2026-06-22` (`3a6277a` on top of `4ca1d06`) — the held SPEC-030 uninstall
+  prototype + uncommitted DoS-hardening (`cap_field`/`cap_vec` in `ws/mod.rs`, `agents.rs`),
+  dashboard edits, docs, `script-library/time-date/`, migration `060_alert_mutes_agent_id_index.sql`.
+
+Submodule pointer:
+- `guru-scan` updated to hardened `fb09102` (was `2f8fbcd`).
+- `guru-rmm` submodule moved from the HELD detached state to `main@9dae20c`.
+
+Memory (main repo):
+- `.claude/memory/project_guruscan_in_test_paused.md` (new) + index line.
+- `.claude/memory/project_gururmm_dispatch_hang_fix.md` (new) + index line.
+
+## Credentials & Secrets
+
+- None created or discovered. GuruRMM API creds read from vault `infrastructure/gururmm-server.sops.yaml`
+  (`credentials.gururmm-api.admin-email` / `admin-password`).
+- **DESKTOP-MS42HNC still has Windows Defender RTP + Tamper Protection DISABLED** (Howard, at the
+  console, for malware testing) — must be re-enabled during final cleanup.
+
+## Infrastructure & Servers
+
+- **GuruRMM server** — `http://172.16.3.30:3001`; deploy = merge to `gururmm` `main` → Gitea
+  webhook on `.30` rebuilds (`cargo build --release`, SQLX_OFFLINE) + `systemctl stop/start
+  gururmm-server`, auto-rollback if the new binary won't start. Single server binary
+  (`/opt/gururmm/gururmm-server`), no beta/prod split for the server. Build log
+  `/var/log/gururmm-build-server.log`.
+- **DESKTOP-MS42HNC** — agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`; client AZ Computer Guru,
+  site Howard-VM; flaky laptop test VM (sleeps/reboots). Has malware sample set restored to
+  `C:\Users\Owner\Desktop\malware-samples-master` (516 files) + zip in Downloads; Defender disabled.
+- **DESKTOP-QNP3ON5** — agent id `ba173f0c-19e8-488d-834c-1b6f6dfd5699`; used as the healthy
+  control agent to verify dispatch recovery (command completed in ~5s).
+- GuruScan endpoint paths: `C:\GuruScan\` (module), `C:\GuruScan\downloads\` (scanner EXEs),
+  `C:\ScanLogs\<scanid>\` (logs).
+
+## Commands & Outputs
+
+- Verify server compiles the deploy way: `bash .claude/skills/gururmm-build/scripts/verify.sh server --check` → `[OK] server check passed`.
+- Deploy: merge `fix/ws-send-to-nonblocking-dispatch-hang` → `main`, `git push origin main` (`b04caf2..9dae20c`).
+- Recovery verified: dispatch POST `HTTP=200` in ~0.1s (was `HTTP=000` 30s timeout); `Write-Output` on QNP3ON5 `status=completed` in ~5s.
+- Resume the paused scan (hands-off, self-restores samples):
+  `bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`
+- Reaper behavior observed: undeliverable backlog `pending → failed` ("agent unreachable — server-side reaper"), ~480 commands self-cleared.
+
+## Pending / Incomplete Tasks
+
+- **[IN TEST / PAUSED] Emsisoft full-scan/removal verification on DESKTOP-MS42HNC** — VM rebooted
+  mid-scan. Resume with the harness above; expect Emsisoft to clear well beyond HitmanPro's 36
+  (incl. the `.js` droppers HitmanPro ignores). Then confirm `results.json` `total_threats` /
+  `reboot_required` + reboot-cleanup task.
+- **Clean up DESKTOP-MS42HNC** — remove samples + zip + EICAR + test tasks, clear scanner
+  quarantine, **re-enable Windows Defender RTP + Tamper Protection**.
+- **guru-scan gitlink in main repo** — intentionally bumped to `fb09102` this session; held no
+  longer (guru-rmm is unblocked).
+- **Follow-up (server):** the proper black-hole connection eviction / keepalive-drop is still
+  absent (was reverted in `80df458`). If a half-open agent recurs (heartbeats `online` but
+  commands dispatch `running` → reaper-fail on timeout), finish that work rather than relying on
+  `try_send` alone. Held DoS-hardening on `wip/held-2026-06-22` is unreviewed/unmerged.
+
+## Reference Information
+
+- guru-rmm fix commit: `9dae20c` on `main` (pushed to `https://git.azcomputerguru.com/azcomputerguru/gururmm.git`).
+- Preserved held work: branch `wip/held-2026-06-22` (`3a6277a` over `4ca1d06`), pushed to origin.
+- Related prior commits: `7c578fd` (evict non-delivering connections — reverted by `80df458`),
+  `ca1657b` (reaper re-delivers black-holed commands), `b93f2ef` (durable command-ACK foundation).
+- guru-scan hardened commit: `fb09102`.
+- Build model + verifier: `gururmm-build` skill; `projects/msp-tools/guru-rmm/docs/BUILD.md`.
+- Yesterday's GuruScan hardening log: `session-logs/2026-06/2026-06-22-howard-guruscan-hardening.md`.