sync: auto-sync from HOWARD-HOME at 2026-06-22 14:04:53

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-22 14:04:53
2026-06-22 14:05:24 -07:00
parent 48286e80e0
commit 26aa5034f1
5 changed files with 194 additions and 1 deletions


@@ -170,3 +170,5 @@
- [AI-auth product boundary](project_ai_auth_product_boundary.md) — ClaudeTools/ClaudeTools 3.0 = internal-only, per-person subscription OAuth ok; GuruRMM = sellable, customer brings own API key (never ACG's subscription); backend dev = internal. Anthropic ToS bans subscription auth in third-party products.
- [RMM SYSTEM context can't see user mapped drives](feedback_rmm_system_context_mapped_drives.md) — RMM runs as SYSTEM; `Test-Path F:\` etc. is False even when the user's mapped/redirected drive exists. Diagnose mapped-drive/redirect issues in `context:user_session`. Elevated apps (e.g. QB DB Server Manager "unable to retrieve root folder") need `EnableLinkedConnections=1` + reboot.
- [AD2 = Dataforth-ops fork](project_ad2_dataforth_fork.md) — branch ad2 = main + thin Dataforth layer; keep fork edits ADDITIVE (Dataforth context in clients/dataforth/CLAUDE.dataforth.md, NOT .claude/CLAUDE.md); rebase onto main directly when sync.sh self-lock hits; no vault/jq/sops/age on this box.
- [GuruScan verification IN TEST / paused](project_guruscan_in_test_paused.md) — multi-engine scanner verify on DESKTOP-MS42HNC paused 2026-06-22 (VM rebooted mid-Emsisoft run); HitmanPro done (36 removed), Emsisoft full-scan unverified; resume `guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`; Defender RTP/Tamper still off on VM
- [GuruRMM fleet dispatch-hang fix](project_gururmm_dispatch_hang_fix.md) — blocking send_to on a full bounded channel to one black-holed agent wedged ALL command dispatch; fixed with try_send (9dae20c, deployed); proper black-hole eviction still missing (was reverted in 80df458) — finish it if it recurs


@@ -0,0 +1,16 @@
---
name: project_gururmm_dispatch_hang_fix
description: GuruRMM fleet-wide command-dispatch hang root cause + fix (send_to try_send, 9dae20c) and the still-missing eviction
metadata:
type: project
---
On 2026-06-22 the live GuruRMM server (`172.16.3.30:3001`) hung on **every** `POST /api/agents/:id/command` (30s+ timeouts, all agents; GET worked) — command dispatch was down fleet-wide.
**Root cause:** `AgentConnections::send_to` (server `src/ws/mod.rs`) did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel. A black-holed/half-open agent socket stops its WS writer draining the channel → it fills → `send().await` blocks forever. `send_command` holds `state.agents.read().await` across that await, so the next agent (re)connect's `.write().await` queues behind the stuck reader, and tokio's write-preferring `RwLock` then parks every later read, and with it every later dispatch, behind that waiting writer. **One dead socket wedged the whole fleet.** The recovery path "evict non-delivering connections" (`7c578fd`) had been **reverted** (`80df458`), leaving no escape hatch.
**Fix (`9dae20c`, on `main`, deployed):** `send_to` now uses non-blocking `try_send` — a full/closed channel returns "not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands` (reconnect) + the reaper `requeue_undelivered_commands`. Failure stays local. Verified live (other agent ran a command end-to-end in ~5s).
**Still open / watch:** the proper per-connection eviction of a black-holed socket is still absent (only reverted code existed). A truly half-open agent will keep heartbeating `online` while its server→agent channel silently drops messages (commands dispatch as `running` but never return → reaper fails them on timeout). If this recurs, finish the eviction/keepalive-drop work rather than relying on `try_send` alone.
Deploy model: merging to `gururmm` `main` triggers the webhook build on `.30` (rebuild + `systemctl` restart, auto-rollback if the binary won't start). See the `gururmm-build` skill. Pairs with [[project_guruscan_in_test_paused]].


@@ -0,0 +1,17 @@
---
name: project_guruscan_in_test_paused
description: GuruScan multi-engine scan verification is IN TEST / paused on DESKTOP-MS42HNC (resume steps + state)
metadata:
type: project
---
GuruScan (multi-engine malware scanner, `projects/msp-tools/guru-scan`, hardened at `fb09102`) is **IN TEST — PAUSED** as of 2026-06-22. Verifying full-scan + full-removal + automated lifecycle on test VM **DESKTOP-MS42HNC** (agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`, client AZ Computer Guru / site Howard-VM — a flaky laptop VM that sleeps/reboots).
State:
- **HitmanPro** lifecycle already verified (36 threats detected+removed, reboot-cleanup task fires).
- **Emsisoft** full `/f=C:\` run (the heavy ~80-min engine) was launched with 516 samples staged but **interrupted by a VM reboot — NOT yet verified**. This is the open item.
- Resume hands-off: `bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft` (self-restores samples, detached no-cap, reports removal + results.json + cleanup task).
Cleanup still owed on the VM once testing is done: remove samples/zip/EICAR/test tasks, clear scanner quarantine, and **re-enable Windows Defender RTP + Tamper Protection** (disabled at the console for malware testing).
Blocker that surfaced + was fixed this session: a fleet-wide RMM command-dispatch hang — see [[project_gururmm_dispatch_hang_fix]].


@@ -1 +0,0 @@
{"command_type":"powershell","command":"Write-Output HELLO_FROM_TEST","timeout_seconds":30}


@@ -0,0 +1,159 @@
# GuruRMM fleet dispatch-hang fix + GuruScan Emsisoft test (paused) — 2026-06-22
## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
## Session Summary
Resumed the GuruScan multi-engine malware-scanner verification on test VM DESKTOP-MS42HNC
(the open item from yesterday: run the heavy Emsisoft full `/f=C:\` engine to confirm full
scan + full removal + automated lifecycle, after HitmanPro's 36-threat lifecycle was already
proven). Updated this machine's `guru-scan` submodule from a 3-week-old `2f8fbcd` to the
hardened `fb09102`, confirmed the test VM was back online, and launched the automated
`scan-one Emsisoft` harness — which died instantly at the first step with `[ERROR] Dispatch
failed`.
Tracing that failure uncovered a production incident much larger than the scan: the live
GuruRMM server (`172.16.3.30:3001`) was hanging on **every** `POST /api/agents/:id/command`
(30s+ timeouts) while `GET` worked. Reproduced on two independent online agents — command
dispatch was down **fleet-wide**. Root-caused it in the deployed `main` code: `AgentConnections::send_to`
did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel; a
black-holed/half-open agent socket stops its WS writer draining the channel, it fills, and
the send blocks forever. `send_command` holds `state.agents.read().await` across that await,
so the next agent (re)connect's `.write().await` queues behind the stuck reader, and tokio's
write-preferring `RwLock` then parks every later read (and so every later dispatch) behind that
waiting writer: one dead socket wedged the whole fleet. The
recovery path ("evict non-delivering connections", `7c578fd`) had been reverted in `80df458`,
leaving no escape hatch.
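
The failure mode above comes down to the difference between a waiting send and a non-blocking one on a bounded queue. The sketch below is a minimal, std-only illustration that uses `std::sync::mpsc::sync_channel` in place of the tokio mpsc channel the server uses; `Delivery`, `offer`, and `fill_until_refused` are hypothetical names for illustration, not identifiers from `src/ws/mod.rs`.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Outcome of offering one message to an agent's outbound queue.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum Delivery {
    /// The message was placed in the queue.
    Queued,
    /// The queue is at capacity: nothing is draining it.
    Full,
    /// The receiving side has gone away.
    Closed,
}

/// Non-blocking offer: never waits for space to appear in the queue.
pub fn offer(tx: &SyncSender<String>, msg: String) -> Delivery {
    match tx.try_send(msg) {
        Ok(()) => Delivery::Queued,
        Err(TrySendError::Full(_)) => Delivery::Full,
        Err(TrySendError::Disconnected(_)) => Delivery::Closed,
    }
}

/// Simulates a black-holed agent: the receiver exists but never reads.
/// Returns how many messages were accepted and the first refusal seen.
pub fn fill_until_refused(capacity: usize) -> (usize, Delivery) {
    // Keep the receiver alive (but idle) for the whole function.
    let (tx, _rx_never_drained) = sync_channel::<String>(capacity);
    let mut accepted = 0;
    loop {
        match offer(&tx, format!("cmd-{accepted}")) {
            Delivery::Queued => accepted += 1,
            refused => return (accepted, refused),
        }
    }
}
```

With a capacity of 100 and no reader, `fill_until_refused(100)` accepts 100 messages and then reports `Delivery::Full` instead of waiting. A blocking send at that point would wait indefinitely, which is the hang described above.
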
Before touching anything, preserved the "HELD" guru-rmm working tree (detached HEAD at the
SPEC-030 uninstall-engine commit `4ca1d06` plus 8 uncommitted files of unrelated DoS-hardening
+ dashboard + docs work) onto branch `wip/held-2026-06-22` (commit `3a6277a`) and pushed it
to origin for durable backup — nothing lost. Then, with Howard's clearance to follow the RMM
update rules, fixed `send_to` to use non-blocking `try_send` (a full/closed channel returns
"not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands`
on reconnect and the reaper `requeue_undelivered_commands`). Verified with `cargo check`
(via the `gururmm-build` skill), committed `9dae20c`, merged to `main`, and pushed — which
triggers the webhook build+restart on `.30`. Dispatch recovered ~2.5 min later; verified a
command ran end-to-end on a healthy second agent (DESKTOP-QNP3ON5) in ~5s.
The test VM itself then needed recovery: it went silent at the exact moment of the server
restart and didn't reconnect (its `online` status was only a stale grace-period value). Howard restarted the
GuruRMM agent service on it; the agent reconnected and, after draining the redispatched
backlog, executed commands cleanly (~1s round-trips). The ~480 junk commands created by the
hung-dispatch attempts self-drained to `failed` via the reaper (won't re-run). Relaunched the
real `scan-one Emsisoft` run: setup restored 516 malware samples and the detached, no-cap
`GuruScan-one` scheduled task reached `state=Running` — the full scan was underway. Howard then
rebooted DESKTOP-MS42HNC to test other software, interrupting the scan. Per his direction, the
Emsisoft verification is marked **IN TEST / paused** to resume later; the background harness was
stopped, scratch files cleaned, and state saved.
## Key Decisions
- **Fixed + deployed the server bug rather than just reporting it** — Howard explicitly cleared
"following the rules for updating the RMM and following Mike's requests do what is needed."
The merge-to-main deploy is the documented path (`gururmm-build` skill) and the only way to
unblock; auto-rollback protects a bad binary.
- **Minimal hotfix = non-blocking `try_send`, signature unchanged** — kept `send_to` as
`async fn` so all ~24 callers stay untouched; only the body changed. A black-holed agent now
degrades to "command queued" (re-offered on reconnect / by the reaper) instead of wedging the
fleet. Did NOT re-implement the reverted eviction (larger change, reverted for a reason) — left
as a documented follow-up.
- **Preserved the HELD guru-rmm work before any git surgery** — branched the dangling SPEC-030
commit + uncommitted edits onto `wip/held-2026-06-22` and pushed it, so deploying a clean
main-based fix never risked the in-progress work. Did not bundle that unrelated WIP into the
production deploy.
- **Did not manually fight the junk-command backlog** — the server's reaper fails undeliverable
commands on its own (`pending → failed`), so manual per-id cancel was redundant and raced it.
- **Marked Emsisoft verification IN TEST / paused** — VM rebooted mid-scan for other testing;
resume cleanly via the self-restoring harness.
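
The "degrades to command queued" decision can be sketched as follows. This is a simplified model under stated assumptions, not the server's code: `PendingStore` is a hypothetical in-memory stand-in for the persisted command table, `redispatch_pending` only loosely mirrors the role the log gives to `redispatch_pending_commands`, and a std `sync_channel` stands in for the per-agent tokio channel.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical stand-in for the persisted set of undelivered commands.
#[derive(Default)]
pub struct PendingStore {
    undelivered: VecDeque<String>,
}

/// Both error variants hand the rejected message back to the caller.
fn rejected(e: TrySendError<String>) -> String {
    match e {
        TrySendError::Full(cmd) | TrySendError::Disconnected(cmd) => cmd,
    }
}

impl PendingStore {
    /// Offer a command without blocking. If the queue is full or closed,
    /// keep the command so it can be offered again later.
    pub fn dispatch(&mut self, tx: &SyncSender<String>, cmd: String) -> bool {
        match tx.try_send(cmd) {
            Ok(()) => true,
            Err(e) => {
                self.undelivered.push_back(rejected(e));
                false
            }
        }
    }

    /// Re-offer kept commands, oldest first, to a fresh queue (for example
    /// after a reconnect). Stops at the first refusal; returns the number
    /// of commands handed over.
    pub fn redispatch_pending(&mut self, tx: &SyncSender<String>) -> usize {
        let mut delivered = 0;
        while let Some(cmd) = self.undelivered.pop_front() {
            match tx.try_send(cmd) {
                Ok(()) => delivered += 1,
                Err(e) => {
                    // Put the command back where it was and give up for now.
                    self.undelivered.push_front(rejected(e));
                    break;
                }
            }
        }
        delivered
    }

    /// Number of commands still waiting for a successful hand-off.
    pub fn pending_len(&self) -> usize {
        self.undelivered.len()
    }
}
```

The point of the design is that a refusal stays local to one agent: the caller gets `false` immediately and the command waits in the store. Messages already sitting in a dead queue are not recovered by this sketch; in the log that case is left to the reaper.
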
## Problems Encountered
- **Fleet-wide command-dispatch hang** — blocking bounded-channel send in `send_to` + read-lock
held across the await → RwLock writer starvation. Fixed with `try_send` (`9dae20c`), deployed,
verified end-to-end on a second agent.
- **Bash cwd parked in a submodule** — an earlier `cd projects/msp-tools/guru-rmm` persisted, so
relative paths failed (`guru-scan` "not found"). Switched to absolute `C:/claudetools/...` paths.
- **Test VM silent after server restart** — DESKTOP-MS42HNC didn't reconnect on its own (laptop
VM, sleeps). Howard restarted the agent service; it recovered. A first post-reconnect test
command timed out transiently while the agent drained the redispatch backlog, then went healthy.
- **Scan interrupted by reboot** — Howard rebooted the VM to test other software; the in-flight
Emsisoft scan was lost. Marked paused; harness self-restores samples on resume.
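
The first problem above has two ingredients: a send that can wait, and a read lock held while it waits. The log only describes changing the first. As a complementary illustration (an assumption, not something `9dae20c` is stated to do), the sketch below shows the second ingredient being removed by cloning the sender out of the registry and releasing the lock before sending. `Registry`, `sender_for`, and this `send_to` are hypothetical, std-only stand-ins.

```rust
use std::collections::HashMap;
use std::sync::mpsc::SyncSender;
use std::sync::RwLock;

/// Hypothetical connection registry: agent id -> outbound queue sender.
pub type Registry = RwLock<HashMap<String, SyncSender<String>>>;

/// Look up an agent's sender, holding the read lock only for the lookup.
pub fn sender_for(registry: &Registry, agent_id: &str) -> Option<SyncSender<String>> {
    let guard = registry.read().unwrap();
    let tx = guard.get(agent_id).cloned();
    // Release the read lock explicitly before any send is attempted.
    drop(guard);
    tx
}

/// Offer a message to one agent. Returns false if the agent is unknown
/// or its queue is full or closed. No lock is held during the send.
pub fn send_to(registry: &Registry, agent_id: &str, msg: String) -> bool {
    match sender_for(registry, agent_id) {
        Some(tx) => tx.try_send(msg).is_ok(),
        None => false,
    }
}
```

Because the guard is dropped before the send, a slow or dead agent cannot keep a reader parked in front of a writer that is trying to register a new connection.
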
## Configuration Changes
Modified (guru-rmm submodule, committed + deployed):
- `server/src/ws/mod.rs`: `AgentConnections::send_to` now uses non-blocking `tx.try_send(msg)`
instead of `tx.send(msg).await`; added a detailed comment on the fleet-wide-wedge root cause.
Commit `9dae20c` on `main` (pushed → webhook build+deploy).
Preserved (guru-rmm submodule, pushed, NOT deployed):
- Branch `wip/held-2026-06-22` (`3a6277a` on top of `4ca1d06`) — the held SPEC-030 uninstall
prototype + uncommitted DoS-hardening (`cap_field`/`cap_vec` in `ws/mod.rs`, `agents.rs`),
dashboard edits, docs, `script-library/time-date/`, migration `060_alert_mutes_agent_id_index.sql`.
Submodule pointer:
- `guru-scan` updated to hardened `fb09102` (was `2f8fbcd`).
- `guru-rmm` submodule moved from the HELD detached state to `main@9dae20c`.
Memory (main repo):
- `.claude/memory/project_guruscan_in_test_paused.md` (new) + index line.
- `.claude/memory/project_gururmm_dispatch_hang_fix.md` (new) + index line.
## Credentials & Secrets
- None created or discovered. GuruRMM API creds read from vault `infrastructure/gururmm-server.sops.yaml`
(`credentials.gururmm-api.admin-email` / `admin-password`).
- **DESKTOP-MS42HNC still has Windows Defender RTP + Tamper Protection DISABLED** (Howard, at the
console, for malware testing) — must be re-enabled during final cleanup.
## Infrastructure & Servers
- **GuruRMM server** — `http://172.16.3.30:3001`; deploy = merge to `gururmm` `main` → Gitea
webhook on `.30` rebuilds (`cargo build --release`, SQLX_OFFLINE) + `systemctl stop/start
gururmm-server`, auto-rollback if the new binary won't start. Single server binary
(`/opt/gururmm/gururmm-server`), no beta/prod split for the server. Build log
`/var/log/gururmm-build-server.log`.
- **DESKTOP-MS42HNC** — agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`; client AZ Computer Guru,
site Howard-VM; flaky laptop test VM (sleeps/reboots). Has malware sample set restored to
`C:\Users\Owner\Desktop\malware-samples-master` (516 files) + zip in Downloads; Defender disabled.
- **DESKTOP-QNP3ON5** — agent id `ba173f0c-19e8-488d-834c-1b6f6dfd5699`; used as the healthy
control agent to verify dispatch recovery (command completed in ~5s).
- GuruScan endpoint paths: `C:\GuruScan\` (module), `C:\GuruScan\downloads\` (scanner EXEs),
`C:\ScanLogs\<scanid>\` (logs).
## Commands & Outputs
- Verify server compiles the deploy way: `bash .claude/skills/gururmm-build/scripts/verify.sh server --check` → `[OK] server check passed`.
- Deploy: merge `fix/ws-send-to-nonblocking-dispatch-hang` → `main`, `git push origin main` (`b04caf2..9dae20c`).
- Recovery verified: dispatch POST `HTTP=200` in ~0.1s (was `HTTP=000` 30s timeout); `Write-Output` on QNP3ON5 `status=completed` in ~5s.
- Resume the paused scan (hands-off, self-restores samples):
`bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`
- Reaper behavior observed: undeliverable backlog `pending → failed` ("agent unreachable — server-side reaper"), ~480 commands self-cleared.
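
The observed `pending → failed` reaper behavior can be pictured as a periodic sweep over command records. The sketch below is a guess at the general shape only: the real thresholds, fields, and rules are not given in this log, and `Status`, `Command`, `reap_undeliverable`, and the reason text are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical command lifecycle states, reduced to what the sweep needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    Completed,
    Failed(String),
}

/// Hypothetical command record; `age` is time since the command was created.
pub struct Command {
    pub id: u32,
    pub status: Status,
    pub age: Duration,
}

/// Mark every command that has stayed pending longer than `max_pending_age`
/// as failed. Commands in other states are left alone.
/// Returns how many commands were changed.
pub fn reap_undeliverable(commands: &mut [Command], max_pending_age: Duration) -> usize {
    let mut reaped = 0;
    for cmd in commands.iter_mut() {
        if cmd.status == Status::Pending && cmd.age > max_pending_age {
            // Illustrative reason string, not the server's exact message.
            cmd.status = Status::Failed("agent unreachable".to_string());
            reaped += 1;
        }
    }
    reaped
}
```

A sweep like this is why the junk backlog needed no manual cancellation: once a command is marked failed it is no longer a candidate for re-delivery.
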
## Pending / Incomplete Tasks
- **[IN TEST / PAUSED] Emsisoft full-scan/removal verification on DESKTOP-MS42HNC** — VM rebooted
mid-scan. Resume with the harness above; expect Emsisoft to clear well beyond HitmanPro's 36
(incl. the `.js` droppers HitmanPro ignores). Then confirm `results.json` `total_threats` /
`reboot_required` + reboot-cleanup task.
- **Clean up DESKTOP-MS42HNC** — remove samples + zip + EICAR + test tasks, clear scanner
quarantine, **re-enable Windows Defender RTP + Tamper Protection**.
- **guru-scan gitlink in main repo** — intentionally bumped to `fb09102` this session; held no
longer (guru-rmm is unblocked).
- **Follow-up (server):** the proper black-hole connection eviction / keepalive-drop is still
absent (was reverted in `80df458`). If a half-open agent recurs (heartbeats `online` but
commands dispatch `running` → reaper-fail on timeout), finish that work rather than relying on
`try_send` alone. Held DoS-hardening on `wip/held-2026-06-22` is unreviewed/unmerged.
## Reference Information
- guru-rmm fix commit: `9dae20c` on `main` (pushed to `https://git.azcomputerguru.com/azcomputerguru/gururmm.git`).
- Preserved held work: branch `wip/held-2026-06-22` (`3a6277a` over `4ca1d06`), pushed to origin.
- Related prior commits: `7c578fd` (evict non-delivering connections — reverted by `80df458`),
`ca1657b` (reaper re-delivers black-holed commands), `b93f2ef` (durable command-ACK foundation).
- guru-scan hardened commit: `fb09102`.
- Build model + verifier: `gururmm-build` skill; `projects/msp-tools/guru-rmm/docs/BUILD.md`.
- Yesterday's GuruScan hardening log: `session-logs/2026-06/2026-06-22-howard-guruscan-hardening.md`.