sync: auto-sync from HOWARD-HOME at 2026-06-22 14:04:53
Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-22 14:04:53
@@ -170,3 +170,5 @@
- [AI-auth product boundary](project_ai_auth_product_boundary.md) — ClaudeTools/ClaudeTools 3.0 = internal-only, per-person subscription OAuth ok; GuruRMM = sellable, customer brings own API key (never ACG's subscription); backend dev = internal. Anthropic ToS bans subscription auth in third-party products.
- [RMM SYSTEM context can't see user mapped drives](feedback_rmm_system_context_mapped_drives.md) — RMM runs as SYSTEM; `Test-Path F:\` etc. is False even when the user's mapped/redirected drive exists. Diagnose mapped-drive/redirect issues in `context:user_session`. Elevated apps (e.g. QB DB Server Manager "unable to retrieve root folder") need `EnableLinkedConnections=1` + reboot.
- [AD2 = Dataforth-ops fork](project_ad2_dataforth_fork.md) — branch ad2 = main + thin Dataforth layer; keep fork edits ADDITIVE (Dataforth context in clients/dataforth/CLAUDE.dataforth.md, NOT .claude/CLAUDE.md); rebase onto main directly when sync.sh self-lock hits; no vault/jq/sops/age on this box.
- [GuruScan verification IN TEST / paused](project_guruscan_in_test_paused.md) — multi-engine scanner verify on DESKTOP-MS42HNC paused 2026-06-22 (VM rebooted mid-Emsisoft run); HitmanPro done (36 removed), Emsisoft full-scan unverified; resume `guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`; Defender RTP/Tamper still off on VM
- [GuruRMM fleet dispatch-hang fix](project_gururmm_dispatch_hang_fix.md) — blocking send_to on a full bounded channel to one black-holed agent wedged ALL command dispatch; fixed with try_send (9dae20c, deployed); proper black-hole eviction still missing (was reverted in 80df458) — finish it if it recurs
16
.claude/memory/project_gururmm_dispatch_hang_fix.md
Normal file
@@ -0,0 +1,16 @@
---
name: project_gururmm_dispatch_hang_fix
description: GuruRMM fleet-wide command-dispatch hang root cause + fix (send_to try_send, 9dae20c) and the still-missing eviction
metadata:
  type: project
---

On 2026-06-22 the live GuruRMM server (`172.16.3.30:3001`) hung on **every** `POST /api/agents/:id/command` (30s+ timeouts, all agents; GET worked) — command dispatch was down fleet-wide.

**Root cause:** `AgentConnections::send_to` (server `src/ws/mod.rs`) did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel. A black-holed/half-open agent socket stops its WS writer draining the channel → it fills → `send().await` blocks forever. `send_command` holds `state.agents.read().await` across that await, so the next agent (re)connect's `.write().await` waits behind that stuck read guard; because tokio's `RwLock` is write-preferring, every later `read()` then queues behind the waiting writer, and with it every later dispatch. **One dead socket wedged the whole fleet.** The recovery path "evict non-delivering connections" (`7c578fd`) had been **reverted** (`80df458`), leaving no escape hatch.

**Fix (`9dae20c`, on `main`, deployed):** `send_to` now uses non-blocking `try_send`: a full/closed channel returns "not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands` (on reconnect) and by the reaper `requeue_undelivered_commands`. Failure stays local. Verified live: a second agent ran a command end-to-end in ~5s.

**Still open / watch:** the proper per-connection eviction of a black-holed socket is still absent (the only implementation, `7c578fd`, was reverted). A truly half-open agent will keep heartbeating `online` while its server→agent channel silently drops messages (commands dispatch as `running` but never return → reaper fails them on timeout). If this recurs, finish the eviction/keepalive-drop work rather than relying on `try_send` alone.

Deploy model: merging to `gururmm` `main` triggers the webhook build on `.30` (rebuild + `systemctl` restart, auto-rollback if the binary won't start). See the `gururmm-build` skill. Pairs with [[project_guruscan_in_test_paused]].
17
.claude/memory/project_guruscan_in_test_paused.md
Normal file
@@ -0,0 +1,17 @@
---
name: project_guruscan_in_test_paused
description: GuruScan multi-engine scan verification is IN TEST / paused on DESKTOP-MS42HNC (resume steps + state)
metadata:
  type: project
---

GuruScan (multi-engine malware scanner, `projects/msp-tools/guru-scan`, hardened at `fb09102`) is **IN TEST — PAUSED** as of 2026-06-22. Verifying full-scan + full-removal + automated lifecycle on test VM **DESKTOP-MS42HNC** (agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`, client AZ Computer Guru / site Howard-VM — a flaky laptop VM that sleeps/reboots).

State:
- **HitmanPro** lifecycle already verified (36 threats detected+removed, reboot-cleanup task fires).
- **Emsisoft** full `/f=C:\` run (the heavy ~80-min engine) was launched with 516 samples staged but **interrupted by a VM reboot — NOT yet verified**. This is the open item.
- Resume hands-off: `bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft` (self-restores samples, detached no-cap, reports removal + results.json + cleanup task).

Cleanup still owed on the VM once testing is done: remove samples/zip/EICAR/test tasks, clear scanner quarantine, and **re-enable Windows Defender RTP + Tamper Protection** (disabled at the console for malware testing).

Blocker that surfaced + was fixed this session: a fleet-wide RMM command-dispatch hang — see [[project_gururmm_dispatch_hang_fix]].
@@ -1 +0,0 @@
{"command_type":"powershell","command":"Write-Output HELLO_FROM_TEST","timeout_seconds":30}
@@ -0,0 +1,159 @@
# GuruRMM fleet dispatch-hang fix + GuruScan Emsisoft test (paused) — 2026-06-22

## User

- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
## Session Summary

Resumed the GuruScan multi-engine malware-scanner verification on test VM DESKTOP-MS42HNC
(the open item from yesterday: run the heavy Emsisoft full `/f=C:\` engine to confirm full
scan + full removal + automated lifecycle, after HitmanPro's 36-threat lifecycle was already
proven). Updated this machine's `guru-scan` submodule from a 3-week-old `2f8fbcd` to the
hardened `fb09102`, confirmed the test VM was back online, and launched the automated
`scan-one Emsisoft` harness, which died instantly at the first step with `[ERROR] Dispatch
failed`.

Tracing that failure uncovered a production incident much larger than the scan: the live
GuruRMM server (`172.16.3.30:3001`) was hanging on **every** `POST /api/agents/:id/command`
(30s+ timeouts) while `GET` worked. Reproduced on two independent online agents — command
dispatch was down **fleet-wide**. Root-caused it in the deployed `main` code: `AgentConnections::send_to`
did a blocking `tx.send(msg).await` on a bounded (cap 100) per-agent mpsc channel; a
black-holed/half-open agent socket stops its WS writer draining the channel, it fills, and
the send blocks forever. `send_command` holds `state.agents.read().await` across that await,
so the next agent (re)connect's `.write().await` waits behind that stuck read guard; because
tokio's `RwLock` is write-preferring, every subsequent `read()` then queues behind the waiting
writer, and with it every subsequent dispatch. One dead socket wedged the whole fleet. The
recovery path ("evict non-delivering connections", `7c578fd`) had been reverted in `80df458`,
leaving no escape hatch.

Before touching anything, preserved the "HELD" guru-rmm working tree (detached HEAD at the
SPEC-030 uninstall-engine commit `4ca1d06` plus 8 uncommitted files of unrelated DoS-hardening +
dashboard + docs work) onto branch `wip/held-2026-06-22` (commit `3a6277a`) and pushed it
to origin for durable backup — nothing lost. Then, with Howard's clearance to follow the RMM
update rules, fixed `send_to` to use non-blocking `try_send` (a full/closed channel returns
"not delivered"; the command stays persisted and is re-offered by `redispatch_pending_commands`
on reconnect and by the reaper `requeue_undelivered_commands`). Verified with `cargo check`
(via the `gururmm-build` skill), committed `9dae20c`, merged to `main`, and pushed, which
triggers the webhook build+restart on `.30`. Dispatch recovered ~2.5 min later; verified a
command ran end-to-end on a healthy second agent (DESKTOP-QNP3ON5) in ~5s.

The test VM itself then needed recovery: it went silent at the exact moment of the server
restart and didn't reconnect (its `online` status was a stale grace-period value). Howard
restarted the GuruRMM agent service on it; the agent reconnected and, after draining the
redispatched backlog, executed commands cleanly (~1s round-trips). The ~480 junk commands
created by the hung-dispatch attempts self-drained to `failed` via the reaper (won't re-run).
Relaunched the real `scan-one Emsisoft` run: setup restored 516 malware samples and the
detached, no-cap `GuruScan-one` scheduled task reached `state=Running`, so the full scan was
underway. Howard then rebooted DESKTOP-MS42HNC to test other software, interrupting the scan.
Per his direction, the Emsisoft verification is marked **IN TEST / paused** to resume later;
the background harness was stopped, scratch files cleaned, and state saved.
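The root cause and fix in this summary come down to how a send behaves when a bounded queue is full. The following is a minimal sketch, not the server's code: it uses the standard library's `std::sync::mpsc::sync_channel` as a stand-in for the tokio mpsc channel named above, and `Delivery`, `offer`, and `offer_to_stalled_agent` are hypothetical names. It shows that `try_send` reports a full or closed queue immediately, where a blocking send on the same queue would wait for a receiver that never drains.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Outcome of offering one message to an agent's outbound queue.
#[derive(Debug, PartialEq)]
pub enum Delivery {
    Queued,
    NotDelivered(&'static str),
}

/// Non-blocking offer: returns at once whether or not the queue has room.
pub fn offer(tx: &SyncSender<String>, msg: String) -> Delivery {
    match tx.try_send(msg) {
        Ok(()) => Delivery::Queued,
        // Queue is at capacity: report it instead of waiting for space.
        Err(TrySendError::Full(_)) => Delivery::NotDelivered("channel full"),
        // Receiver is gone: the connection's writer has shut down.
        Err(TrySendError::Disconnected(_)) => Delivery::NotDelivered("channel closed"),
    }
}

/// Simulates a black-holed agent: a bounded queue whose receiver never drains.
/// Returns the outcome of `attempts` offers against a queue of capacity `cap`.
pub fn offer_to_stalled_agent(cap: usize, attempts: usize) -> Vec<Delivery> {
    // Keep the receiver alive (but unread) so the queue fills rather than closes.
    let (tx, _rx_never_drained) = sync_channel::<String>(cap);
    (0..attempts)
        .map(|i| offer(&tx, format!("cmd-{}", i)))
        .collect()
}
```

With capacity 2 and a receiver that never reads, the first two offers are queued and every later one comes back as not delivered, so the caller can leave the command persisted instead of waiting.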
## Key Decisions

- **Fixed + deployed the server bug rather than just reporting it** — Howard explicitly cleared
  "following the rules for updating the RMM and following Mike's requests do what is needed."
  The merge-to-main deploy is the documented path (`gururmm-build` skill) and the only way to
  unblock; auto-rollback protects against a bad binary.
- **Minimal hotfix = non-blocking `try_send`, signature unchanged** — kept `send_to` as
  `async fn` so all ~24 callers stay untouched; only the body changed. A black-holed agent now
  degrades to "command queued" (re-offered on reconnect / by the reaper) instead of wedging the
  fleet. Did NOT re-implement the reverted eviction (larger change, reverted for a reason); left
  as a documented follow-up.
- **Preserved the HELD guru-rmm work before any git surgery** — branched the dangling SPEC-030
  commit + uncommitted edits onto `wip/held-2026-06-22` and pushed it, so deploying a clean
  main-based fix never risked the in-progress work. Did not bundle that unrelated WIP into the
  production deploy.
- **Did not manually fight the junk-command backlog** — the server's reaper fails undeliverable
  commands on its own (`pending → failed`), so manual per-id cancel was redundant and raced it.
- **Marked Emsisoft verification IN TEST / paused** — VM rebooted mid-scan for other testing;
  resume cleanly via the self-restoring harness.
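The "command queued, re-offered later" behaviour in the hotfix decision above can be pictured as a small state transition. This is a sketch under assumptions: the log names `redispatch_pending_commands` but does not show it, so `Status`, `Command`, and `redispatch_pending` here are hypothetical, and the `offer` closure stands in for whatever non-blocking send the server performs. The status names mirror the `pending`, `running`, `completed`, and `failed` values mentioned in this log.

```rust
/// Command lifecycle states as they appear in this log.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pending,
    Running,
    Completed,
    Failed,
}

pub struct Command {
    pub id: u32,
    pub status: Status,
}

/// Hypothetical redispatch pass. Offers every pending command through `offer`,
/// which returns true when the agent's queue accepted the message.
/// Returns how many commands were handed off in this pass.
pub fn redispatch_pending<F>(cmds: &mut [Command], mut offer: F) -> usize
where
    F: FnMut(&Command) -> bool,
{
    let mut delivered = 0;
    for cmd in cmds.iter_mut().filter(|c| c.status == Status::Pending) {
        if offer(cmd) {
            // Accepted by the queue: the command is now in flight.
            cmd.status = Status::Running;
            delivered += 1;
        }
        // Not accepted: leave it Pending so a later pass can retry it.
    }
    delivered
}
```

The point of the design is that a refused send changes nothing: the command keeps its persisted `Pending` state, so a reconnect or a later reaper pass can offer it again without the dispatcher ever waiting on the stalled agent.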
## Problems Encountered

- **Fleet-wide command-dispatch hang** — blocking bounded-channel send in `send_to` + read-lock
  held across the await → a queued `RwLock` writer blocked every later reader. Fixed with
  `try_send` (`9dae20c`), deployed, verified end-to-end on a second agent.
- **Bash cwd parked in a submodule** — an earlier `cd projects/msp-tools/guru-rmm` persisted, so
  relative paths failed (`guru-scan` "not found"). Switched to absolute `C:/claudetools/...` paths.
- **Test VM silent after server restart** — DESKTOP-MS42HNC didn't reconnect on its own (laptop
  VM, sleeps). Howard restarted the agent service; it recovered. A first post-reconnect test
  command timed out transiently while the agent drained the redispatch backlog, then went healthy.
- **Scan interrupted by reboot** — Howard rebooted the VM to test other software; the in-flight
  Emsisoft scan was lost. Marked paused; harness self-restores samples on resume.
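The first problem above had two ingredients: a send that could block, and a read lock held while it did. The deployed fix addressed the send; this log does not say the lock scope changed. As a separate illustration only, the sketch below shows the general pattern of copying the sender out from under the guard and releasing the lock before sending. It uses `std::sync::RwLock` and `sync_channel` as stand-ins for the tokio types, and `Registry`, `connect`, and this `send_to` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::RwLock;

/// Hypothetical connection table: agent id -> bounded outbound queue.
pub struct Registry {
    agents: RwLock<HashMap<String, SyncSender<String>>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { agents: RwLock::new(HashMap::new()) }
    }

    /// Registers an agent (takes the write lock) and returns its receiving end.
    pub fn connect(&self, id: &str, cap: usize) -> Receiver<String> {
        let (tx, rx) = sync_channel(cap);
        self.agents.write().unwrap().insert(id.to_string(), tx);
        rx
    }

    /// Offers a message without holding the lock during the send.
    /// Returns true only if the agent exists and its queue accepted the message.
    pub fn send_to(&self, id: &str, msg: String) -> bool {
        let guard = self.agents.read().unwrap();
        let tx = guard.get(id).cloned(); // cheap handle clone
        drop(guard); // release the read lock before touching the queue
        match tx {
            Some(tx) => tx.try_send(msg).is_ok(),
            None => false,
        }
    }
}
```

Because the guard is dropped before the send, a stalled agent's full queue cannot keep the table locked, and a new `connect` call can still take the write lock.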
## Configuration Changes

Modified (guru-rmm submodule, committed + deployed):
- `server/src/ws/mod.rs` — `AgentConnections::send_to` now uses non-blocking `tx.try_send(msg)`
  instead of `tx.send(msg).await`; added a detailed comment on the fleet-wide-wedge root cause.
  Commit `9dae20c` on `main` (pushed → webhook build+deploy).

Preserved (guru-rmm submodule, pushed, NOT deployed):
- Branch `wip/held-2026-06-22` (`3a6277a` on top of `4ca1d06`) — the held SPEC-030 uninstall
  prototype + uncommitted DoS-hardening (`cap_field`/`cap_vec` in `ws/mod.rs`, `agents.rs`),
  dashboard edits, docs, `script-library/time-date/`, migration `060_alert_mutes_agent_id_index.sql`.

Submodule pointers:
- `guru-scan` updated to hardened `fb09102` (was `2f8fbcd`).
- `guru-rmm` submodule moved from the HELD detached state to `main@9dae20c`.

Memory (main repo):
- `.claude/memory/project_guruscan_in_test_paused.md` (new) + index line.
- `.claude/memory/project_gururmm_dispatch_hang_fix.md` (new) + index line.

## Credentials & Secrets

- None created or discovered. GuruRMM API creds read from vault `infrastructure/gururmm-server.sops.yaml`
  (`credentials.gururmm-api.admin-email` / `admin-password`).
- **DESKTOP-MS42HNC still has Windows Defender RTP + Tamper Protection DISABLED** (Howard, at the
  console, for malware testing) — must be re-enabled during final cleanup.

## Infrastructure & Servers

- **GuruRMM server** — `http://172.16.3.30:3001`; deploy = merge to `gururmm` `main` → Gitea
  webhook on `.30` rebuilds (`cargo build --release`, SQLX_OFFLINE) + `systemctl stop/start
  gururmm-server`, auto-rollback if the new binary won't start. Single server binary
  (`/opt/gururmm/gururmm-server`), no beta/prod split for the server. Build log
  `/var/log/gururmm-build-server.log`.
- **DESKTOP-MS42HNC** — agent id `0de89b88-b21d-4647-ab64-96157ba87cc5`; client AZ Computer Guru,
  site Howard-VM; flaky laptop test VM (sleeps/reboots). Has malware sample set restored to
  `C:\Users\Owner\Desktop\malware-samples-master` (516 files) + zip in Downloads; Defender disabled.
- **DESKTOP-QNP3ON5** — agent id `ba173f0c-19e8-488d-834c-1b6f6dfd5699`; used as the healthy
  control agent to verify dispatch recovery (command completed in ~5s).
- GuruScan endpoint paths: `C:\GuruScan\` (module), `C:\GuruScan\downloads\` (scanner EXEs),
  `C:\ScanLogs\<scanid>\` (logs).
## Commands & Outputs

- Verify server compiles the deploy way: `bash .claude/skills/gururmm-build/scripts/verify.sh server --check` → `[OK] server check passed`.
- Deploy: merge `fix/ws-send-to-nonblocking-dispatch-hang` → `main`, `git push origin main` (`b04caf2..9dae20c`).
- Recovery verified: dispatch POST `HTTP=200` in ~0.1s (was `HTTP=000` 30s timeout); `Write-Output` on QNP3ON5 `status=completed` in ~5s.
- Resume the paused scan (hands-off, self-restores samples):
  `bash .claude/scripts/guruscan-agent-test.sh DESKTOP-MS42HNC scan-one Emsisoft`
- Reaper behavior observed: undeliverable backlog `pending → failed` ("agent unreachable — server-side reaper"), ~480 commands self-cleared.
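The reaper behaviour in the last item (`pending → failed` without manual cleanup) can be sketched as a single sweep over the backlog. The server's actual criteria are not shown in this log, so the age-based rule, the `max_age` parameter, and the names `Queued` and `reap_undeliverable` are all assumptions made for illustration.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pending,
    Failed,
}

/// A backlog entry: how long it has waited and its current state.
pub struct Queued {
    pub age: Duration,
    pub status: Status,
}

/// Assumed rule: a command still pending after `max_age` is marked failed.
/// Returns the number of commands failed by this sweep.
pub fn reap_undeliverable(cmds: &mut [Queued], max_age: Duration) -> usize {
    let mut reaped = 0;
    for cmd in cmds.iter_mut() {
        if cmd.status == Status::Pending && cmd.age > max_age {
            cmd.status = Status::Failed;
            reaped += 1;
        }
    }
    reaped
}
```

A sweep like this is idempotent: running it again over the same backlog fails nothing new, which fits the observation that the junk commands cleared on their own and manual cancellation only raced the reaper.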
## Pending / Incomplete Tasks

- **[IN TEST / PAUSED] Emsisoft full-scan/removal verification on DESKTOP-MS42HNC** — VM rebooted
  mid-scan. Resume with the harness above; expect Emsisoft to clear well beyond HitmanPro's 36
  (incl. the `.js` droppers HitmanPro ignores). Then confirm `results.json` `total_threats` /
  `reboot_required` + reboot-cleanup task.
- **Clean up DESKTOP-MS42HNC** — remove samples + zip + EICAR + test tasks, clear scanner
  quarantine, **re-enable Windows Defender RTP + Tamper Protection**.
- **guru-scan gitlink in main repo** — intentionally bumped to `fb09102` this session; no
  longer held (guru-rmm is unblocked).
- **Follow-up (server):** the proper black-hole connection eviction / keepalive-drop is still
  absent (was reverted in `80df458`). If a half-open agent recurs (heartbeats `online` but
  commands dispatch `running` → reaper-fail on timeout), finish that work rather than relying on
  `try_send` alone. Held DoS-hardening on `wip/held-2026-06-22` is unreviewed/unmerged.
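For the server follow-up above, one common shape for a keepalive-drop is to track when the agent last answered a server-initiated ping and to evict the connection once that answer is overdue. This is not the reverted `7c578fd` code, which this log does not show; `PongTracker`, its methods, and the timeout value are hypothetical, and time is passed in explicitly so the logic can be checked without waiting.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-connection liveness record.
pub struct PongTracker {
    last_pong: Instant,
    timeout: Duration,
}

impl PongTracker {
    /// Starts the clock at connect time.
    pub fn new(connected_at: Instant, timeout: Duration) -> Self {
        PongTracker { last_pong: connected_at, timeout }
    }

    /// Records that the agent answered a ping at `at`.
    pub fn on_pong(&mut self, at: Instant) {
        self.last_pong = at;
    }

    /// True when no pong has arrived within `timeout` of `now`.
    pub fn should_evict(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_pong) > self.timeout
    }
}
```

The reason to key on a reply to a server-to-agent ping, rather than on the agent's own heartbeats, is the half-open case described above: heartbeats can keep arriving while nothing sent toward the agent gets through.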
## Reference Information

- guru-rmm fix commit: `9dae20c` on `main` (pushed to `https://git.azcomputerguru.com/azcomputerguru/gururmm.git`).
- Preserved held work: branch `wip/held-2026-06-22` (`3a6277a` over `4ca1d06`), pushed to origin.
- Related prior commits: `7c578fd` (evict non-delivering connections — reverted by `80df458`),
  `ca1657b` (reaper re-delivers black-holed commands), `b93f2ef` (durable command-ACK foundation).
- guru-scan hardened commit: `fb09102`.
- Build model + verifier: `gururmm-build` skill; `projects/msp-tools/guru-rmm/docs/BUILD.md`.
- Yesterday's GuruScan hardening log: `session-logs/2026-06/2026-06-22-howard-guruscan-hardening.md`.