diff --git a/errorlog.md b/errorlog.md
index d4be5cab..153cea23 100644
--- a/errorlog.md
+++ b/errorlog.md
@@ -43,6 +43,8 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·
 
 2026-07-04 | Howard-Home | screenconnect | ScreenConnect API error [SendCommandToSession]: HTTP 500: {"errorType":"","message":"An session manager fault error occurred while processing your request. Please contact support if the problem persists."} [ctx: cmd=send-command]
 
+2026-07-04 | GURU-5070 | ps-encoded | encode produced empty output [ctx: src=/dev/fd/63]
+
 2026-07-03 | GURU-5070 | agy/gemini-cli | old gemini npm CLI dead on this account: throwIneligibleOrProjectIdError (needs GOOGLE_CLOUD_PROJECT); replaced by Antigravity 'agy' binary [ctx: fix=rewired-to-agy]
 
 2026-07-03 | GURU-5070 | grok | grok returned no text [ctx: mode=text stopReason=Cancelled]
diff --git a/projects/msp-tools/security-assessment b/projects/msp-tools/security-assessment
index 0f6927b6..1a582e4a 160000
--- a/projects/msp-tools/security-assessment
+++ b/projects/msp-tools/security-assessment
@@ -1 +1 @@
-Subproject commit 0f6927b6359616e5be8962175a93e9903bf0ef21
+Subproject commit 1a582e4afac05bc44a50c96fe781238d4a1ff33a
diff --git a/session-logs/2026-07/2026-07-04-mike-gururmm-vss-build-merge-deploy.md b/session-logs/2026-07/2026-07-04-mike-gururmm-vss-build-merge-deploy.md
new file mode 100644
index 00000000..21278bf2
--- /dev/null
+++ b/session-logs/2026-07/2026-07-04-mike-gururmm-vss-build-merge-deploy.md
@@ -0,0 +1,170 @@
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+## Session Summary
+
+Continued and completed the GuruRMM VSS policy-configurator redesign (SPEC-016 / spec
+`vss-policy-config`), building Tasks 2 through 7 on branch `feat/vss-native-com`, running a
+workflow-backed code review, fixing every finding, merging to main, and confirming the deploy on
+the live T1490 canary (NEPTUNE). Work order: Task 2 (the `vss-create` verb + create-only pass +
+scheduled-task registration with both trigger modes + legacy-task migration), Task 3 (set the
+per-volume size cap governor on policy apply), Task 4 (retire the scheduled create+prune pass and
+`prune` — the T1490 surface — reworking the SPEC-025 compliance heal to create-only via a shared
+`create_one_volume`), Task 5 (status/compliance detail reflects the configurator model), Task 6
+(both-target release build), Task 7 (runtime verification). Each task was `cargo check`ed on the
+Pluto build host (stable + legacy) and pushed incrementally.
+
+Runtime-verified the configurator end to end on a Windows Server 2019 box (Pluto, no Falcon) and a
+Win11 Pro client (GURU-5070, Falcon present) using a new hidden `vss-apply-test ` verb
+that drives the real `apply_policy` path. Confirmed: `GuruRMM-VSS-Create` task registered (daily-times
+AND every-N-hours interval triggers both work), legacy `GuruRMM-VSS-Snapshot` removed, size cap set,
+native COM create yields a Persistent/Client-accessible shadow on both SKUs including the Falcon client.
+
+Ran a high-effort workflow-backed code review (23 agents, 19 verified findings) BEFORE merge. It
+caught a genuine data-loss bug: the original design set the machine-global `MaxShadowCopies` registry
+value to `retention_count`, which FIFO-evicts OTHER VSS consumers' (System Restore, third-party
+backup) shadows below that count. Fixed that (dropped MaxShadowCopies management entirely; size cap
+is the sole governor) plus nine other findings (cap-before-create for newly-eligible volumes, a
+`vss-snapshot` alias to bridge the upgrade window, interval-aware compliance window, bounded
+IVssAsync::Wait+Cancel instead of INFINITE, provision error unmasking, ExecutionTimeLimit 1h->2h,
+filesystem task-presence check instead of a per-eval PowerShell spawn, a warning on the now-ignored
+retention_max_age_days). Rebuilt + re-verified on Pluto; both fixes (data-loss + upgrade-gap)
+confirmed at runtime.
+
+Merged `feat/vss-native-com` -> main (merge commit `de30b2b`), which fired the webhook build
+pipeline: version auto-bumped 0.6.75 -> 0.6.76, signed Windows MSI built on Beast, published. KEY
+CORRECTION discovered post-merge: the CI publishes to the **beta** channel, not stable. Stable
+channel is 0.6.66 (233 agents — NOT stuck, by design); beta head is now 0.6.76 (8 agents auto-updated).
+So the merge is a beta canary deploy, not a fleet-wide push; reaching stable is a separate deliberate
+promotion. Set NEPTUNE (stable, 0.6.66) to the beta channel via `PATCH /api/agents/:id/channel` so its
+native binary-swap updater pulled 0.6.76 (updated in 90s, enrollment preserved). Confirmed the
+configurator on NEPTUNE in production: create task present, legacy removed, MaxShadowCopies UNSET (its
+17 existing shadows preserved — the data-loss fix proven in prod), cap 15%, and a live `vss-create`
+created shadows on C: and F: under Falcon with no T1490.
+
+Final thread: while attempting a full beta-cohort health sweep, `sops.exe` on GURU-5070 became
+execution-blocked mid-session by **WDAC / Windows Application Control** (enforced;
+`CodeIntegrityPolicyEnforcementStatus=2`), error "An Application Control policy has blocked this file."
+This is NOT Falcon (user removed Falcon; block persisted). It kills all vault decryption -> RMM API
+auth on this box. Could not complete the beta sweep (DB-direct route blocked by root-only DATABASE_URL,
+no passwordless sudo). Beta head (0.6.76) looks healthy from the observable sample; whether the ~65
+beta agents still on 0.6.75 converge to 0.6.76 is unconfirmed.
+
+## Key Decisions
+
+- **Do NOT write machine-global `MaxShadowCopies`** (code-review #1, Mike chose "drop it"). It is a
+  per-volume cap shared by ALL VSS consumers; lowering it to retention_count evicts other products'
+  recovery points. Size cap (per-volume diff-area) is the sole governor now; retention_count is advisory.
+- **`retention_max_age_days` -> warn when set** (Mike). Age pruning required deletion (T1490); it is
+  unsupported post-pivot, so the agent logs a WARNING rather than silently ignoring it.
+- **Keep a hidden `vss-snapshot` alias -> create pass.** The legacy scheduled task (renamed subcommand
+  after upgrade) would otherwise error until a policy-hash change; the alias bridges the upgrade window.
+- **create_one_volume provisions the size cap BEFORE each create** (strict: skip create if cap fails).
+  Restores the old "never unbounded" invariant for a volume that becomes eligible after policy apply.
+- **Bounded IVssAsync::Wait + Cancel** (was INFINITE) so a wedged DoSnapshotSet can't poison the single
+  shared COM worker thread for the process lifetime; the cancel also prevents an orphaned shadow.
+- **Merge = BETA deploy, not fleet.** The webhook build publishes to the beta channel; stable (0.6.66,
+  233 agents) is untouched. Reaching the fleet is a separate promote-to-stable step (deliberate, Mike-gated).
+- **Update NEPTUNE via channel flip, not manual msiexec.** The agent updater is a server-driven binary
+  swap (download .exe -> verify SHA256 -> sc stop/replace/start), enrollment-preserving. Setting
+  update_channel=beta lets that native path run rather than a risky hand-rolled MSI install.
+- **policy_hash change forces reconcile on upgrade.** Adding `schedule_interval_hours` to policy_hash
+  means the old stored hash never matches the new agent -> forced reconcile on first apply, which
+  registered the create task on NEPTUNE despite its unchanged policy (closes the code-review #3 residual).
+
+## Problems Encountered
+
+- **snapshot_one_volume shared with the SPEC-025 heal.** Removing prune required reworking the heal;
+  extracted a create-only `create_one_volume` used by both the scheduled pass and the heal.
+- **Dead code after removals** (firstrun staggering, volume_jitter_minutes, DEFAULT_RETENTION_COUNT,
+  FIRSTRUN_PREFIX). Removed; both variants build warning-free for vss.rs.
+- **Pluto SSH shell is cmd.exe**, not bash — `tail`/quoting broke remote one-liners. Fixed by driving
+  via `powershell -NoProfile -Command` and filtering locally.
+- **First release build lacked `vss-apply-test`** — I rebuilt without re-syncing Pluto to the commit
+  that added the verb. Fixed by git reset --hard origin/feat/vss-native-com before rebuild.
+- **Client test cap-restore regex didn't match** GURU-5070's shadowstorage format, leaving C: cap at
+  10% instead of prior 1%. Restored manually to 10 GB.
+- **Deploy channel misread.** Initially reported merge as fleet-wide; it is beta-only. 233 agents on
+  0.6.66 are the STABLE fleet on current stable, NOT a stuck cohort (0.6.66.msi.channel=stable,
+  0.6.67+ = beta). Corrected.
+- **`sops.exe` blocked by WDAC mid-session** (not Falcon). Blocks vault/RMM-API auth on GURU-5070.
+  Unresolved at session end; beta cohort sweep left for Mike to run via the server DB (or after WDAC fix).
+- **errorlog.md write also failed** ("could not write D:/claudetools/errorlog.md") — a second permission
+  symptom on this box, noted.
+
+## Configuration Changes
+
+Repo (guru-rmm submodule, `feat/vss-native-com` -> merged to main as `de30b2b`, then CI bumped to `9de44d3`):
+- `agent/src/vss.rs` — create_pass/create_one_volume (create-only + cap-before-create), register_scheduled_task
+  (GuruRMM-VSS-Create, dual trigger modes, legacy removal), ensure_governors (size cap only; age warn),
+  removed prune/run_snapshot_pass/snapshot_one_volume + dead helpers, compliance_window_hours, evaluate_compliance
+  detail (size-cap-fifo model), create_task_present (filesystem stat), read_max_shadow_copies (status only).
+- `agent/src/vss_com.rs` — bounded Wait+Cancel in create_blocking; provision_blocking error unmasking.
+- `agent/src/main.rs` — VssCreate verb, VssSnapshot alias, VssApplyTest hidden verb.
+- `agent/src/transport/mod.rs` — schedule_interval_hours field.
+- `specs/vss-policy-config/plan.md` — Tasks 1-7 + code-review sections marked DONE.
+- CI auto-bumped `agent/Cargo.toml` 0.6.75 -> 0.6.76.
+
+RMM state:
+- NEPTUNE (`b3a9b454-...`) update_channel set to `beta`; updated 0.6.66 -> 0.6.76; VSS reconciled onto
+  GuruRMM-VSS-Create (12:00/18:00), legacy task removed, cap 15%, MaxShadowCopies untouched.
+
+Repo (ClaudeTools main): this session log; submodule pointer -> guru-rmm `9de44d3`.
+
+## Credentials & Secrets
+
+- No new secrets created. Gitea push creds via vault `services/gitea.sops.yaml` (used before the sops block).
+- RMM admin API creds: vault `infrastructure/gururmm-server.sops.yaml` credentials.gururmm-api.admin-email /
+  admin-password (consumed by rmm-auth.sh; BLOCKED now by the WDAC sops issue).
+- RMM server DB: real DATABASE_URL is in root-only `/opt/gururmm/.env` (`postgres://gururmm:@localhost:5432/gururmm`);
+  guru-readable copies are templates with non-working passwords.
+
+## Infrastructure & Servers
+
+- **NEPTUNE** — 172.16.3.11, Win Server 2022, RMM agent `b3a9b454-86eb-491c-ac67-c1f98987d8dc`, Falcon
+  present, now on agent 0.6.76 / beta channel. VSS: C: (cap 279GB/15%, 18 shadows), F: also eligible.
+- **Pluto** (build host) — Administrator@172.16.3.36, Windows Server 2019 Standard (ProductType=3, no Falcon),
+  C:\gururmm checkout, MSVC + Rust stable + 1.77 legacy. `cargo build` from C:\gururmm\agent (no workspace root).
+- **GURU-5070** (this box) — Win11 Pro, RMM agent `819df0c8-...`, Falcon WAS present (Mike removed it this
+  session), agent on 0.6.76/beta. WDAC/Application Control ENFORCED (blocks sops.exe).
+- **RMM server** — guru@172.16.3.30 (Ubuntu). Webhook build pipeline at /opt/gururmm (webhook-handler.py on
+  :9000, build-windows.sh on Beast primary / Pluto fallback). Downloads at /var/www/gururmm/downloads +
+  https://rmm.azcomputerguru.com/downloads. RMM API at http://172.16.3.30:3001. Postgres localhost:5432 db=gururmm.
+- **GURU-BEAST-ROG** ("Beast") — primary Windows build host for the pipeline.
+- Internal Gitea — http://172.16.3.20:3000/azcomputerguru/gururmm.git.
+
+## Commands & Outputs
+
+- Build channel truth: `gururmm-agent-base-0.6.66.msi.channel = stable`; 0.6.67+ = `beta`. Stable fleet = 0.6.66 (233).
+- NEPTUNE channel flip: `curl -X PATCH $RMM/api/agents//channel -d '{"channel":"beta"}'` -> HTTP 204.
+- NEPTUNE update: 0.6.66 -> 0.6.76 in ~90s (native binary-swap updater).
+- NEPTUNE live create: `vss-create pass complete (2/2 volume(s))` — shadows on C: {50171974...} + F: {A324F987...},
+  Persistent/Client-accessible/No-writers/Differential; C: 17 -> 18 shadows; MaxShadowCopies UNSET.
+- sops block: `& sops.exe --version` -> "An Application Control policy has blocked this file";
+  Win32_DeviceGuard CodeIntegrityPolicyEnforcementStatus=2 / UsermodeCodeIntegrityPolicyEnforcementStatus=2.
+- Beta sweep query for Mike (run on .30):
+  `sudo bash -c 'set -a; . /opt/gururmm/.env; psql "$DATABASE_URL"' <<'SQL' ... where update_channel='beta' group by 1 ... SQL`
+
+## Pending / Incomplete Tasks
+
+- **WDAC/sops block on GURU-5070** — unresolved. Blocks all vault decryption + RMM API auth + gitea pushes
+  from this box. Fix: allowlist sops.exe in the WDAC/Smart App Control policy, or disable SAC (one-way on Win11).
+  Not Falcon (removed; block persisted).
+- **Beta cohort convergence unconfirmed** — verify whether ~65 beta agents on 0.6.75 advance to 0.6.76 or stall.
+  Run the psql query above (or resume once sops works).
+- **Promote 0.6.76 -> stable** when beta soak is satisfactory (deliberate step; mechanism TBD — likely re-tag
+  the .msi.channel or a stable pointer; no agent promote script found, only promote-dashboard.sh).
+- **NEPTUNE channel** — left on beta (canary). Revert to stable if desired (stays 0.6.76 either way; no downgrades).
+- Leftover test shadows: {8FEFDAE3} NEPTUNE, {79676B16} GURU-5070, {50171974} NEPTUNE — all FIFO-evict via cap.
+- Parent ClaudeTools submodule pointer -> 9de44d3 (folds into this sync).
+
+## Reference Information
+
+- guru-rmm branch merged: `feat/vss-native-com` -> main `de30b2b` (merge), CI bump `9de44d3`. Agent version 0.6.76.
+- Key commits: Task2 0cdcff5, Task3 4f78513, Task4 8f59706/feeb168, Task5 74e7d6b, review fixes d73c086/691dd62.
+- Code review workflow output: `C:\Users\guru\AppData\Local\Temp\claude\...\tasks\wyosoguyd.output` (10 findings).
+- RMM channel API: `PATCH /api/agents/:id/channel {"channel":"stable"|"beta"}` (server api/mod.rs:306).
+- Spec: `projects/msp-tools/guru-rmm/specs/vss-policy-config/{plan,shape,references,standards}.md`.
+- Published artifact: `gururmm-agent-base-0.6.76.msi` (sha256 f9eee26d6ee61acaee69747d945cbeca0f448120a1013845f0c553d48ac55f1d), channel beta.
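
Reviewer sketch for the "policy_hash change forces reconcile on upgrade" decision in the session log above. This is not the agent's code: the agent is Rust and its hashing routine is not shown in this diff. The field names other than `schedule_interval_hours`, the JSON canonicalization, and the use of SHA-256 are all assumptions, chosen only to illustrate why adding a field to the hashed set makes every previously stored hash stale even when the policy itself is unchanged.

```python
import hashlib
import json

# Hypothetical field sets; the real hashed fields are not listed in the log.
OLD_FIELDS = ("enabled", "schedule_times", "max_size_percent")
NEW_FIELDS = OLD_FIELDS + ("schedule_interval_hours",)


def policy_hash(policy: dict, fields: tuple) -> str:
    """Hash only the named fields in a canonical (sorted-key) JSON form."""
    subset = {name: policy.get(name) for name in fields}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reconcile(stored_hash: str, policy: dict) -> bool:
    """An upgraded agent always hashes with NEW_FIELDS before comparing."""
    return stored_hash != policy_hash(policy, NEW_FIELDS)


# Same policy before and after the upgrade (values are illustrative).
policy = {"enabled": True, "schedule_times": ["12:00", "18:00"], "max_size_percent": 15}
stored_by_old_agent = policy_hash(policy, OLD_FIELDS)
stored_by_new_agent = policy_hash(policy, NEW_FIELDS)
```

Under these assumptions, `needs_reconcile(stored_by_old_agent, policy)` is true on the first apply after the upgrade, and false once the new agent has stored its own hash, which matches the one-time forced reconcile the log reports on NEPTUNE.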