From 738d64efb980cd18fef1a610e30a1bf35c320f91 Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Mon, 25 May 2026 15:52:28 -0700 Subject: [PATCH] sync: auto-sync from GURU-BEAST-ROG at 2026-05-25 15:52:25 Author: Mike Swanson Machine: GURU-BEAST-ROG Timestamp: 2026-05-25 15:52:25 --- ...05-25-beast-gururmm-audit-2-remediation.md | 138 ++++++++++++++++++ 1 file changed, 138 insertions(+) create mode 100644 session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md diff --git a/session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md b/session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md new file mode 100644 index 0000000..9f317f0 --- /dev/null +++ b/session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md @@ -0,0 +1,138 @@ +# Session Log — 2026-05-25 — BEAST: GuruRMM re-audit #2 remediation + deploy coordination + +## User +- **User:** Mike Swanson (mike) +- **Machine:** GURU-BEAST-ROG +- **Role:** admin +- **Session span:** 2026-05-25 (interactive coordinator session on BEAST). GuruRMM project work → + logged to root `session-logs/` per CLAUDE.md (not the submodule). Namespaced (distinct topic; + GURU-5070 already authored `2026-05-25-session.md` today, and this machine has a separate + `2026-05-25-beast-chrome-fetch-and-identity-audit.md`). + +## Session Summary +Picked up GURU-KALI's "RMM re-audit #2" handoff (coord message; 45 findings, 0 CRITICAL, 5 HIGH, +6 MEDIUM) and worked it end to end under the coordinator model: read the full report, claimed a +coord lock, then triaged and remediated via the Coding Agent + mandatory Code Review Agent. The +audit branch (`audit/2026-05-25-rmm-audit-2`) was docs-only — the fixes were recommendations to +implement. All work landed on a new gururmm branch `fix/audit-2-remediation` (pushed, NOT merged, +NOT deployed). + +Triaged all findings into `gururmm/docs/FEATURE_ROADMAP.md` as BUG-002..BUG-012 and brought the +report + UI_GAPS delta onto the branch. Implemented the safe, well-specified HIGH fixes: #1 crash +detection (re-keyed `health.rs` from the never-emitted `update_applied` to `update_success`, +converted that query + `increment_crash_count` to runtime sqlx to avoid a `.sqlx` cache break and +chip at the BUG-007 macro MEDIUM, dropped the paired `total_attempts` double-count, added 2 +`#[sqlx::test]` regression tests) and #5 (`update_channel` — found it IS read by 3 dashboard pages, +so ADDED it to the server SELECTs rather than dropping from TS). Code Review returned +REQUEST-CHANGES on one item: the migration-046 header-comment fix would itself break +`sqlx::migrate!`'s checksum on the already-applied migration — the exact outage class the audit +flagged — so it was reverted; the rest was approved. + +Hardened `build-server.sh` (#2/BUG-003) on the branch — build lock + previous-binary backup + +auto-rollback on failed `is-active` — and documented that the outage-causing script is the +divergent webhook copy at `/opt/gururmm/build-server.sh` (unchecked `git reset`), which still needs +reconciling. Mid-session, Mikes-MacBook-Air sent a deploy-coordination coord message; replied with a +HOLD recommendation and three hazards (046 already applied in prod, the feature is inert, the server +build script is unhardened). + +For #3 (`update_rollouts`), Mike chose option (b): keep the feature inert and clearly labeled now, +move the automated health-gated promotion (a) to the roadmap as a Phase-2 item that must be +re-spec'd. Implemented the label via a code comment at `resolve_agent_channel` + roadmap entries, +and reconciled the four Phase 5/6 docs in the ClaudeTools repo with a "feature inert" banner (this +sync hit a rebase conflict from a concurrent GURU-5070 edit — resolved keeping both sides). + +Finally, with build-server access granted, investigated #4 (Mac builds "40 behind"). The audit was +wrong: the trigger is not broken — `build-mac.sh` is an intentional stub (no Mac build machine), the +webhook invokes it every push and it no-ops, and Mac builds were never implemented in this pipeline. +Corrected BUG-005 from HIGH "trigger broken" to LOW "unimplemented stub — product decision," and +notified GURU-KALI. + +## Key Decisions +- **#5 update_channel: ADD to server, not drop from TS** — grep showed it's read by AgentDetail/ + ClientDetail/SiteDetail; the column already exists (migration 026), so adding to the SELECT is + zero-risk and preserves working UX. +- **Reverted the migration-046 comment fix** — editing an already-applied migration changes its + sqlx checksum and fails `sqlx::migrate!` at startup (BUG-003 outage class). Left 046 byte-for-byte + intact; marked won't-fix. Same hazard blocks the scaffolding-label-in-046-header idea. +- **#3 = option (b)** (Mike) — keep `update_rollouts`/health metrics inert + labeled (code comment + + roadmap), defer automated gating (a) to a Phase-2 roadmap item requiring a full re-spec. Rationale: + (a) adds a failure mode to an already-fragile pipeline, depends on the unmerged/unverified #1 + signal, and is undesigned work — not safe to rush mid-deploy. +- **Recommended MacBook HOLD the deploy** — not a git conflict (work is on a branch), but 046 is + already applied in prod, the feature is inert, and the server build script is unhardened. +- **#4: corrected rather than "fixed"** — verified via SSH the trigger fires and stubs out; Mac was + never built and no runner is configured. Re-triggering (the audit's rec) is a no-op. Did not + provision hardware or advance `last-built-commit-mac` artificially (would falsely claim a build). +- **Used the Coding Agent + Code Review Agent** for the production Rust changes (coordinator model); + did doc/triage edits and the bash hardening directly. + +## Problems Encountered +- **Migration-046 checksum hazard (caught in review)** — the cosmetic header fix would reproduce the + outage. Reverted (`8ef8159`); the applied migration is untouched. +- **Rebase conflict during the ClaudeTools sync** — GURU-5070 concurrently edited `verify-rollout- + system.sh` (renamed host `Saturn`→`gururmm-build`) and a prior auto-sync had bumped the gururmm + submodule pointer. Resolved: kept both the rename and my banner; kept the submodule at `938fa33` + (gururmm `main` tip — a legitimate advance, not my branch). +- **No SSH key access to the build server from BEAST** (`Permission denied (publickey,password)`). + Resolved using the vaulted password over paramiko for read-only commands. +- **`.sqlx` macro cache risk in the #1 fix** — changing an inline SQL literal in a `sqlx::query!` + macro would break the offline build. Resolved by converting that query to runtime sqlx (also + addresses part of BUG-007). + +## Configuration Changes +gururmm repo — branch `fix/audit-2-remediation` (origin `azcomputerguru/gururmm`): +- `server/src/updates/health.rs` — crash-detection fix (`update_success`), runtime-sqlx conversion of + 2 queries, dropped `total_attempts` double-count, 2 new `#[sqlx::test]` tests; 2 stale `.sqlx` + cache files removed. +- `server/src/db/agents.rs` — added `update_channel: Option` to `Agent`/`AgentResponse`/ + `AgentWithDetails` + both list SELECTs. +- `server/src/db/updates.rs` — scaffolding comment at `resolve_agent_channel` (health gating NOT + wired; Phase-2). +- `server/migrations/046_safe_rollout.sql` — header edit made then REVERTED (untouched, intact). +- `build-server.sh` — build lock + binary backup + auto-rollback (BUG-003 hardening). +- `docs/FEATURE_ROADMAP.md` — BUG-002..012; BUG-004 decision (b); BUG-005 corrected; safe-rollout + Phase-2 roadmap item. +- `docs/UI_GAPS.md`, `reports/2026-05-25-rmm-audit-2.md` — brought from the audit branch. + +ClaudeTools repo (main): +- `PHASE_6_TEST_PLAN.md`, `PHASE_5_COMPLETE.md`, `IMPLEMENTATION_SUMMARY.md`, + `verify-rollout-system.sh` — `[WARNING] STATUS` banner (feature inert / Phase-2 aspirational). +- Submodule pointer `projects/msp-tools/guru-rmm` advanced to `938fa33` (gururmm main tip). + +## Credentials & Secrets +- **Build server SSH:** `guru@172.16.3.30:22` (Ubuntu 22.04, "gururmm-build"). Password (and same + sudo password) is vaulted at `infrastructure/gururmm-server.sops.yaml` field `credentials.password` + — value not reproduced here. Retrieved via the vault wrapper and used over paramiko for read-only + commands only; no server changes made. +- No new credentials created or rotated this session. + +## Infrastructure & Servers +- **Build/prod server:** 172.16.3.30 (gururmm-build, Ubuntu 22.04). Gitea: `azcomputerguru/gururmm`. +- **Build pipeline:** `/opt/gururmm/` — `build-shared.sh` (version bump + repo sync from + `/home/guru/gururmm`, change-gate), `build-linux.sh`, `build-windows.sh`, `build-mac.sh` (STUB), + `webhook-handler.py`; logs `/var/log/gururmm-build{,-linux,-windows,-mac}.log`. `build-agents.sh` + is a deprecated wrapper. `last-built-commit-{mac=1ed5596, linux/windows=373c5ce}`. +- **gururmm main tip:** `938fa33` (deployed server was v0.3.22 at audit time / `3dcb30e`). +- **Coord API:** `http://172.16.3.30:8001/api/coord`. + +## Commands & Outputs +- Read audit report: `git -C projects/msp-tools/guru-rmm show origin/audit/2026-05-25-rmm-audit-2:reports/2026-05-25-rmm-audit-2.md`. +- SSH (read-only) via paramiko: `paramiko.SSHClient().connect("172.16.3.30",22,"guru",)`; ran `cat`/`grep`/`git rev-list` only. +- Mac log confirmed stub no-op: `/var/log/gururmm-build-mac.log` → 24+ "Mac build: no build machine configured, skipping" on 2026-05-25. +- Agent gap since last mac build: `git -C /home/guru/gururmm rev-list --count 1ed5596..HEAD` = 48 total, `-- agent/` = 5. +- Build verification (on BEAST): `cargo build` (SQLX_OFFLINE) clean; `#[sqlx::test]` compile clean but need a live DB to run (pending on build server). + +## Pending / Incomplete Tasks +- **Branch `fix/audit-2-remediation` not merged / not deployed** — awaiting Mike's review + merge. +- **Run the 2 `#[sqlx::test]` crash-detection tests against a live DB** (build server) — they only compile offline here. +- **#2 server-side:** reconcile the divergent `/opt/gururmm/build-server.sh` (unchecked `git reset`) with the hardened repo copy, then deploy — needs SSH + sign-off. +- **#3 (a):** safe-rollout automated gating — Phase-2 roadmap item, MUST be re-spec'd before implementation (depends on #1 merged + verified). +- **#4 decision:** (A) ship Mac agents (provision Apple HW / osxcross + implement build-mac.sh + signing) or (B) defer + quiet the freshness alarm (treat stubbed platform as N/A). +- **#6 MEDIUMs + LOWs (BUG-007..012):** open. +- **MacBook deploy:** flagged HOLD; go/no-go is Mike's. + +## Reference Information +- gururmm branch commits: `943edd0` (#1), `98b97bd` (#5), `8818a7f`→reverted by `8ef8159` (046), `5aa6cd9` (triage), `ad261fd` (046 won't-fix), `7146f4b` (build-server harden), `a10fa24` (#3 label), `1bbf5c8` (BUG-005 correction). +- ClaudeTools: `67182e0` (Phase-doc banners), `6a53072` (submodule→938fa33). +- Coord: lock `e2aa72fe` (remediation) + `75db292b` (build-server) both released. Replies sent — GURU-KALI `1742acda` (progress), MacBook `3590db06` (HOLD), GURU-KALI `9846eb32` (BUG-005 corrected). +- Audit report: `gururmm reports/2026-05-25-rmm-audit-2.md`. Bug list: `gururmm docs/FEATURE_ROADMAP.md` "Known Bugs".