sync: auto-sync from GURU-5070 at 2026-07-04 07:30:32

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-07-04 07:30:32
2026-07-04 07:30:50 -07:00
parent 85c8149495
commit 10b1ea7528
3 changed files with 173 additions and 1 deletions

@@ -43,6 +43,8 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·
2026-07-04 | Howard-Home | screenconnect | ScreenConnect API error [SendCommandToSession]: HTTP 500: {"errorType":"","message":"An session manager fault error occurred while processing your request. Please contact support if the problem persists."} [ctx: cmd=send-command]
2026-07-04 | GURU-5070 | ps-encoded | encode produced empty output [ctx: src=/dev/fd/63]
2026-07-03 | GURU-5070 | agy/gemini-cli | old gemini npm CLI dead on this account: throwIneligibleOrProjectIdError (needs GOOGLE_CLOUD_PROJECT); replaced by Antigravity 'agy' binary [ctx: fix=rewired-to-agy]
2026-07-03 | GURU-5070 | grok | grok returned no text [ctx: mode=text stopReason=Cancelled]

@@ -0,0 +1,170 @@
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Continued and completed the GuruRMM VSS policy-configurator redesign (SPEC-016 / spec
`vss-policy-config`), building Tasks 2 through 7 on branch `feat/vss-native-com`, running a
workflow-backed code review, fixing every finding, merging to main, and confirming the deploy on
the live T1490 canary (NEPTUNE). Work order: Task 2 (the `vss-create` verb + create-only pass +
scheduled-task registration with both trigger modes + legacy-task migration), Task 3 (set the
per-volume size cap governor on policy apply), Task 4 (retire the scheduled create+prune pass and
`prune` — the T1490 surface — reworking the SPEC-025 compliance heal to create-only via a shared
`create_one_volume`), Task 5 (status/compliance detail reflects the configurator model), Task 6
(both-target release build), Task 7 (runtime verification). Each task was `cargo check`ed on the
Pluto build host (stable + legacy) and pushed incrementally.
Runtime-verified the configurator end to end on a Windows Server 2019 box (Pluto, no Falcon) and a
Win11 Pro client (GURU-5070, Falcon present) using a new hidden `vss-apply-test <policy.json>` verb
that drives the real `apply_policy` path. Confirmed: `GuruRMM-VSS-Create` task registered (daily-times
AND every-N-hours interval triggers both work), legacy `GuruRMM-VSS-Snapshot` removed, size cap set,
native COM create yields a Persistent/Client-accessible shadow on both SKUs including the Falcon client.
Ran a high-effort workflow-backed code review (23 agents, 19 verified findings) BEFORE merge. It
caught a genuine data-loss bug: the original design set the machine-global `MaxShadowCopies` registry
value to `retention_count`, which FIFO-evicts OTHER VSS consumers' (System Restore, third-party
backup) shadows below that count. Fixed that (dropped MaxShadowCopies management entirely; size cap
is the sole governor) plus nine other findings (cap-before-create for newly-eligible volumes, a
`vss-snapshot` alias to bridge the upgrade window, interval-aware compliance window, bounded
IVssAsync::Wait+Cancel instead of INFINITE, provision error unmasking, ExecutionTimeLimit 1h->2h,
filesystem task-presence check instead of a per-eval PowerShell spawn, a warning on the now-ignored
retention_max_age_days). Rebuilt + re-verified on Pluto; both fixes (data-loss + upgrade-gap)
confirmed at runtime.
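
The data-loss finding above is easier to see in a toy model. The sketch below is not the agent's code and does not touch VSS; `Shadow`, `evict_to_limit`, and `sample_volume` are hypothetical names, and the count-based FIFO behaviour is assumed from the review's description. It only illustrates why a shared count limit lowered to a small retention_count removes other consumers' shadows first when they happen to be the oldest.

```rust
use std::collections::VecDeque;

/// Hypothetical model of one shadow copy on a volume; `owner` is whoever created it.
#[derive(Debug, Clone, PartialEq)]
struct Shadow {
    owner: &'static str,
    seq: u32, // creation order; lower = older
}

/// Simulates count-based FIFO eviction: drop the oldest shadows until at most
/// `max_copies` remain, regardless of which consumer created them.
fn evict_to_limit(shadows: &mut VecDeque<Shadow>, max_copies: usize) -> Vec<Shadow> {
    let mut evicted = Vec::new();
    while shadows.len() > max_copies {
        if let Some(oldest) = shadows.pop_front() {
            evicted.push(oldest);
        }
    }
    evicted
}

/// Builds a volume holding 17 shadows from three assumed consumers, oldest first.
fn sample_volume() -> VecDeque<Shadow> {
    let owners = ["system-restore", "backup-product", "gururmm"];
    (0..17u32)
        .map(|seq| Shadow { owner: owners[(seq as usize) % owners.len()], seq })
        .collect()
}
```

With a limit of 5 on the 17-shadow sample, the 12 oldest entries go, and they include System Restore and backup-product shadows that the agent never created. A size cap avoids introducing a new count limit on top of what the volume already had.
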
Merged `feat/vss-native-com` -> main (merge commit `de30b2b`), which fired the webhook build
pipeline: version auto-bumped 0.6.75 -> 0.6.76, signed Windows MSI built on Beast, published. KEY
CORRECTION discovered post-merge: the CI publishes to the **beta** channel, not stable. Stable
channel is 0.6.66 (233 agents — NOT stuck, by design); beta head is now 0.6.76 (8 agents auto-updated).
So the merge is a beta canary deploy, not a fleet-wide push; reaching stable is a separate deliberate
promotion. Set NEPTUNE (stable, 0.6.66) to the beta channel via `PATCH /api/agents/:id/channel` so its
native binary-swap updater pulled 0.6.76 (updated in 90s, enrollment preserved). Confirmed the
configurator on NEPTUNE in production: create task present, legacy removed, MaxShadowCopies UNSET (its
17 existing shadows preserved — the data-loss fix proven in prod), cap 15%, and a live `vss-create`
created shadows on C: and F: under Falcon with no T1490.
Final thread: while attempting a full beta-cohort health sweep, `sops.exe` on GURU-5070 became
execution-blocked mid-session by **WDAC / Windows Application Control** (enforced;
`CodeIntegrityPolicyEnforcementStatus=2`), error "An Application Control policy has blocked this file."
This is NOT Falcon (user removed Falcon; block persisted). It kills all vault decryption -> RMM API
auth on this box. Could not complete the beta sweep (DB-direct route blocked by root-only DATABASE_URL,
no passwordless sudo). Beta head (0.6.76) looks healthy from the observable sample; whether the ~65
beta agents still on 0.6.75 converge to 0.6.76 is unconfirmed.
## Key Decisions
- **Do NOT write the machine-wide `MaxShadowCopies` registry value** (code-review #1, Mike chose "drop it"). It caps
the shadow-copy count on every volume and is shared by ALL VSS consumers; lowering it to retention_count evicts other
products' recovery points. Size cap (per-volume diff-area) is the sole governor now; retention_count is advisory.
- **`retention_max_age_days` -> warn when set** (Mike). Age pruning required deletion (T1490); it is
unsupported post-pivot, so the agent logs a WARNING rather than silently ignoring it.
- **Keep a hidden `vss-snapshot` alias -> create pass.** After upgrade the legacy scheduled task still calls the
renamed subcommand and would otherwise error until a policy-hash change; the alias bridges the upgrade window.
- **create_one_volume provisions the size cap BEFORE each create** (strict: skip create if cap fails).
Restores the old "never unbounded" invariant for a volume that becomes eligible after policy apply.
- **Bounded IVssAsync::Wait + Cancel** (was INFINITE) so a wedged DoSnapshotSet can't poison the single
shared COM worker thread for the process lifetime; the cancel also prevents an orphaned shadow.
- **Merge = BETA deploy, not fleet.** The webhook build publishes to the beta channel; stable (0.6.66,
233 agents) is untouched. Reaching the fleet is a separate promote-to-stable step (deliberate, Mike-gated).
- **Update NEPTUNE via channel flip, not manual msiexec.** The agent updater is a server-driven binary
swap (download .exe -> verify SHA256 -> sc stop/replace/start), enrollment-preserving. Setting
update_channel=beta lets that native path run rather than a risky hand-rolled MSI install.
- **policy_hash change forces reconcile on upgrade.** Adding `schedule_interval_hours` to policy_hash
means the old stored hash never matches the new agent -> forced reconcile on first apply, which
registered the create task on NEPTUNE despite its unchanged policy (closes the code-review #3 residual).
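
The forced-reconcile effect in the last decision can be sketched in a few lines. This assumes the hash is computed over the policy fields; the struct names, the fields other than `schedule_interval_hours`, and the use of `DefaultHasher` are illustrative assumptions, not the agent's actual hashing scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical policy shape as an older agent might have hashed it (no interval field).
#[derive(Hash)]
struct PolicyV1 {
    enabled: bool,
    schedule_times: Vec<String>,
    max_size_percent: u8,
}

/// Same policy plus the new field; even `None` feeds extra bytes to the hasher.
#[derive(Hash)]
struct PolicyV2 {
    enabled: bool,
    schedule_times: Vec<String>,
    max_size_percent: u8,
    schedule_interval_hours: Option<u32>,
}

/// Hashes any `Hash` value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Reconcile whenever the stored hash is missing or differs from the current one.
fn needs_reconcile(stored: Option<u64>, current: u64) -> bool {
    stored != Some(current)
}
```

Under these assumptions a hash stored by the old agent will not match what the new agent computes for the same server-side policy, so the first apply after upgrade reconciles even though nothing changed on the server.
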
## Problems Encountered
- **snapshot_one_volume shared with the SPEC-025 heal.** Removing prune required reworking the heal;
extracted a create-only `create_one_volume` used by both the scheduled pass and the heal.
- **Dead code after removals** (firstrun staggering, volume_jitter_minutes, DEFAULT_RETENTION_COUNT,
FIRSTRUN_PREFIX). Removed; both variants build warning-free for vss.rs.
- **Pluto SSH shell is cmd.exe**, not bash — `tail`/quoting broke remote one-liners. Fixed by driving
via `powershell -NoProfile -Command` and filtering locally.
- **First release build lacked `vss-apply-test`** — I rebuilt without re-syncing Pluto to the commit
that added the verb. Fixed by `git reset --hard origin/feat/vss-native-com` before rebuild.
- **Client test cap-restore regex didn't match** GURU-5070's shadowstorage format, leaving C: cap at
10% instead of prior 1%. Restored manually to 10 GB.
- **Deploy channel misread.** Initially reported the merge as fleet-wide; it is beta-only. The 233 agents on
0.6.66 are the STABLE fleet running the current stable release, NOT a stuck cohort
(`gururmm-agent-base-0.6.66.msi.channel` = stable, 0.6.67+ = beta). Corrected.
- **`sops.exe` blocked by WDAC mid-session** (not Falcon). Blocks vault/RMM-API auth on GURU-5070.
Unresolved at session end; beta cohort sweep left for Mike to run via the server DB (or after WDAC fix).
- **errorlog.md write also failed** ("could not write D:/claudetools/errorlog.md") — a second permission
symptom on this box, noted.
## Configuration Changes
Repo (guru-rmm submodule, `feat/vss-native-com` -> merged to main as `de30b2b`, then CI bumped to `9de44d3`):
- `agent/src/vss.rs` — create_pass/create_one_volume (create-only + cap-before-create), register_scheduled_task
(GuruRMM-VSS-Create, dual trigger modes, legacy removal), ensure_governors (size cap only; age warn),
removed prune/run_snapshot_pass/snapshot_one_volume + dead helpers, compliance_window_hours, evaluate_compliance
detail (size-cap-fifo model), create_task_present (filesystem stat), read_max_shadow_copies (status only).
- `agent/src/vss_com.rs` — bounded Wait+Cancel in create_blocking; provision_blocking error unmasking.
- `agent/src/main.rs` — VssCreate verb, VssSnapshot alias, VssApplyTest hidden verb.
- `agent/src/transport/mod.rs` — schedule_interval_hours field.
- `specs/vss-policy-config/plan.md` — Tasks 1-7 + code-review sections marked DONE.
- CI auto-bumped `agent/Cargo.toml` 0.6.75 -> 0.6.76.
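
A minimal sketch of the strict cap-before-create ordering listed for `agent/src/vss.rs` above. The `VssBackend` trait, `CreateOutcome`, `FakeBackend`, and the function signature are hypothetical stand-ins for illustration; the real `create_one_volume` talks to VSS and may be shaped quite differently.

```rust
/// Hypothetical abstraction over the two VSS operations the create pass needs.
trait VssBackend {
    fn set_size_cap(&mut self, volume: &str, percent: u8) -> Result<(), String>;
    fn create_shadow(&mut self, volume: &str) -> Result<String, String>;
}

#[derive(Debug, PartialEq)]
enum CreateOutcome {
    Created(String),
    SkippedCapFailed(String),
    CreateFailed(String),
}

/// Provision the size cap first; if that fails, skip the create entirely so a
/// shadow is never taken on a volume without a cap in place.
fn create_one_volume_sketch<B: VssBackend>(
    backend: &mut B,
    volume: &str,
    cap_percent: u8,
) -> CreateOutcome {
    if let Err(e) = backend.set_size_cap(volume, cap_percent) {
        return CreateOutcome::SkippedCapFailed(e);
    }
    match backend.create_shadow(volume) {
        Ok(id) => CreateOutcome::Created(id),
        Err(e) => CreateOutcome::CreateFailed(e),
    }
}

/// In-memory fake that records calls so the ordering can be checked.
struct FakeBackend {
    cap_ok: bool,
    calls: Vec<String>,
}

impl VssBackend for FakeBackend {
    fn set_size_cap(&mut self, volume: &str, percent: u8) -> Result<(), String> {
        self.calls.push(format!("cap {volume} {percent}%"));
        if self.cap_ok { Ok(()) } else { Err("cap failed".to_string()) }
    }
    fn create_shadow(&mut self, volume: &str) -> Result<String, String> {
        self.calls.push(format!("create {volume}"));
        Ok(format!("shadow-on-{volume}"))
    }
}
```

Running the sketch against a `FakeBackend` with `cap_ok: false` records only the cap call, which is the "skip create if cap fails" behaviour the decision describes.
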
RMM state:
- NEPTUNE (`b3a9b454-...`) update_channel set to `beta`; updated 0.6.66 -> 0.6.76; VSS reconciled onto
GuruRMM-VSS-Create (12:00/18:00), legacy task removed, cap 15%, MaxShadowCopies untouched.
Repo (ClaudeTools main): this session log; submodule pointer -> guru-rmm `9de44d3`.
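
The bounded Wait+Cancel change listed for `agent/src/vss_com.rs` above follows a general pattern: wait with a deadline, and on expiry signal cancellation instead of blocking forever. The sketch below shows that pattern with a plain thread and channel as a stand-in; it does not call `IVssAsync`, and `bounded_wait` and `WaitOutcome` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Completed(u32),
    TimedOutAndCancelled,
    WorkerGone,
}

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// On timeout the shared cancel flag is set so cooperative work can stop,
/// and the caller gets control back instead of waiting indefinitely.
fn bounded_wait<F>(limit: Duration, work: F) -> WaitOutcome
where
    F: FnOnce(Arc<AtomicBool>) -> u32 + Send + 'static,
{
    let cancel = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let flag = Arc::clone(&cancel);
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore send errors.
        let _ = tx.send(work(flag));
    });
    match rx.recv_timeout(limit) {
        Ok(value) => WaitOutcome::Completed(value),
        Err(RecvTimeoutError::Timeout) => {
            cancel.store(true, Ordering::SeqCst);
            WaitOutcome::TimedOutAndCancelled
        }
        // The worker dropped its sender without a result (for example, it panicked).
        Err(RecvTimeoutError::Disconnected) => WaitOutcome::WorkerGone,
    }
}
```

The point of the design is that the caller's thread is released after `limit` no matter what the operation does, which is what keeps a single shared worker usable after one wedged call.
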
## Credentials & Secrets
- No new secrets created. Gitea push creds via vault `services/gitea.sops.yaml` (used before the sops block).
- RMM admin API creds: vault `infrastructure/gururmm-server.sops.yaml` credentials.gururmm-api.admin-email /
admin-password (consumed by rmm-auth.sh; BLOCKED now by the WDAC sops issue).
- RMM server DB: real DATABASE_URL is in root-only `/opt/gururmm/.env` (`postgres://gururmm:<pass>@localhost:5432/gururmm`);
guru-readable copies are templates with non-working passwords.
## Infrastructure & Servers
- **NEPTUNE** — 172.16.3.11, Win Server 2022, RMM agent `b3a9b454-86eb-491c-ac67-c1f98987d8dc`, Falcon
present, now on agent 0.6.76 / beta channel. VSS: C: (cap 279GB/15%, 18 shadows), F: also eligible.
- **Pluto** (build host) — Administrator@172.16.3.36, Windows Server 2019 Standard (ProductType=3, no Falcon),
C:\gururmm checkout, MSVC + Rust stable + 1.77 legacy. `cargo build` from C:\gururmm\agent (no workspace root).
- **GURU-5070** (this box) — Win11 Pro, RMM agent `819df0c8-...`, Falcon WAS present (Mike removed it this
session), agent on 0.6.76/beta. WDAC/Application Control ENFORCED (blocks sops.exe).
- **RMM server** — guru@172.16.3.30 (Ubuntu). Webhook build pipeline at /opt/gururmm (webhook-handler.py on
:9000, build-windows.sh on Beast primary / Pluto fallback). Downloads at /var/www/gururmm/downloads +
https://rmm.azcomputerguru.com/downloads. RMM API at http://172.16.3.30:3001. Postgres localhost:5432 db=gururmm.
- **GURU-BEAST-ROG** ("Beast") — primary Windows build host for the pipeline.
- Internal Gitea — http://172.16.3.20:3000/azcomputerguru/gururmm.git.
## Commands & Outputs
- Build channel truth: `gururmm-agent-base-0.6.66.msi.channel = stable`; 0.6.67+ = `beta`. Stable fleet = 0.6.66 (233).
- NEPTUNE channel flip: `curl -X PATCH $RMM/api/agents/<id>/channel -d '{"channel":"beta"}'` -> HTTP 204.
- NEPTUNE update: 0.6.66 -> 0.6.76 in ~90s (native binary-swap updater).
- NEPTUNE live create: `vss-create pass complete (2/2 volume(s))` — shadows on C: {50171974...} + F: {A324F987...},
Persistent/Client-accessible/No-writers/Differential; C: 17 -> 18 shadows; MaxShadowCopies UNSET.
- sops block: `& sops.exe --version` -> "An Application Control policy has blocked this file";
Win32_DeviceGuard CodeIntegrityPolicyEnforcementStatus=2 / UsermodeCodeIntegrityPolicyEnforcementStatus=2.
- Beta sweep query for Mike (run on .30):
`sudo bash -c 'set -a; . /opt/gururmm/.env; psql "$DATABASE_URL"' <<'SQL' ... where update_channel='beta' group by 1 ... SQL`
## Pending / Incomplete Tasks
- **WDAC/sops block on GURU-5070** — unresolved. Blocks all vault decryption + RMM API auth + gitea pushes
from this box. Fix: allowlist sops.exe in the WDAC/Smart App Control policy, or disable SAC (one-way on Win11).
Not Falcon (removed; block persisted).
- **Beta cohort convergence unconfirmed** — verify whether ~65 beta agents on 0.6.75 advance to 0.6.76 or stall.
Run the psql query above (or resume once sops works).
- **Promote 0.6.76 -> stable** when beta soak is satisfactory (deliberate step; mechanism TBD — likely re-tag
the .msi.channel or a stable pointer; no agent promote script found, only promote-dashboard.sh).
- **NEPTUNE channel** — left on beta (canary). Revert to stable if desired (stays 0.6.76 either way; no downgrades).
- Leftover test shadows: {8FEFDAE3} NEPTUNE, {79676B16} GURU-5070, {50171974} NEPTUNE — all FIFO-evict via cap.
- Parent ClaudeTools submodule pointer -> 9de44d3 (folds into this sync).
## Reference Information
- guru-rmm branch merged: `feat/vss-native-com` -> main `de30b2b` (merge), CI bump `9de44d3`. Agent version 0.6.76.
- Key commits: Task2 0cdcff5, Task3 4f78513, Task4 8f59706/feeb168, Task5 74e7d6b, review fixes d73c086/691dd62.
- Code review workflow output: `C:\Users\guru\AppData\Local\Temp\claude\...\tasks\wyosoguyd.output` (10 findings).
- RMM channel API: `PATCH /api/agents/:id/channel {"channel":"stable"|"beta"}` (server api/mod.rs:306).
- Spec: `projects/msp-tools/guru-rmm/specs/vss-policy-config/{plan,shape,references,standards}.md`.
- Published artifact: `gururmm-agent-base-0.6.76.msi` (sha256 f9eee26d6ee61acaee69747d945cbeca0f448120a1013845f0c553d48ac55f1d), channel beta.