sync: auto-sync from GURU-BEAST-ROG at 2026-05-25 15:52:25
Author: Mike Swanson Machine: GURU-BEAST-ROG Timestamp: 2026-05-25 15:52:25
This commit is contained in:
138
session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md
Normal file
138
session-logs/2026-05-25-beast-gururmm-audit-2-remediation.md
Normal file
@@ -0,0 +1,138 @@
|
||||
# Session Log — 2026-05-25 — BEAST: GuruRMM re-audit #2 remediation + deploy coordination
|
||||
|
||||
## User
|
||||
- **User:** Mike Swanson (mike)
|
||||
- **Machine:** GURU-BEAST-ROG
|
||||
- **Role:** admin
|
||||
- **Session span:** 2026-05-25 (interactive coordinator session on BEAST). GuruRMM project work →
|
||||
logged to root `session-logs/` per CLAUDE.md (not the submodule). Namespaced (distinct topic;
|
||||
GURU-5070 already authored `2026-05-25-session.md` today, and this machine has a separate
|
||||
`2026-05-25-beast-chrome-fetch-and-identity-audit.md`).
|
||||
|
||||
## Session Summary
|
||||
Picked up GURU-KALI's "RMM re-audit #2" handoff (coord message; 45 findings, 0 CRITICAL, 5 HIGH,
|
||||
6 MEDIUM) and worked it end to end under the coordinator model: read the full report, claimed a
|
||||
coord lock, then triaged and remediated via the Coding Agent + mandatory Code Review Agent. The
|
||||
audit branch (`audit/2026-05-25-rmm-audit-2`) was docs-only — the fixes were recommendations to
|
||||
implement. All work landed on a new gururmm branch `fix/audit-2-remediation` (pushed, NOT merged,
|
||||
NOT deployed).
|
||||
|
||||
Triaged all findings into `gururmm/docs/FEATURE_ROADMAP.md` as BUG-002..BUG-012 and brought the
|
||||
report + UI_GAPS delta onto the branch. Implemented the safe, well-specified HIGH fixes: #1 crash
|
||||
detection (re-keyed `health.rs` from the never-emitted `update_applied` to `update_success`,
|
||||
converted that query + `increment_crash_count` to runtime sqlx to avoid a `.sqlx` cache break and
|
||||
chip at the BUG-007 macro MEDIUM, dropped the paired `total_attempts` double-count, added 2
|
||||
`#[sqlx::test]` regression tests) and #5 (`update_channel` — found it IS read by 3 dashboard pages,
|
||||
so ADDED it to the server SELECTs rather than dropping from TS). Code Review returned
|
||||
REQUEST-CHANGES on one item: the migration-046 header-comment fix would itself break
|
||||
`sqlx::migrate!`'s checksum on the already-applied migration — the exact outage class the audit
|
||||
flagged — so it was reverted; the rest was approved.
|
||||
|
||||
Hardened `build-server.sh` (#2/BUG-003) on the branch — build lock + previous-binary backup +
|
||||
auto-rollback on failed `is-active` — and documented that the outage-causing script is the
|
||||
divergent webhook copy at `/opt/gururmm/build-server.sh` (unchecked `git reset`), which still needs
|
||||
reconciling. Mid-session, Mikes-MacBook-Air sent a deploy-coordination coord message; replied with a
|
||||
HOLD recommendation and three hazards (046 already applied in prod, the feature is inert, the server
|
||||
build script is unhardened).
|
||||
|
||||
For #3 (`update_rollouts`), Mike chose option (b): keep the feature inert and clearly labeled now,
|
||||
move the automated health-gated promotion (a) to the roadmap as a Phase-2 item that must be
|
||||
re-spec'd. Implemented the label via a code comment at `resolve_agent_channel` + roadmap entries,
|
||||
and reconciled the four Phase 5/6 docs in the ClaudeTools repo with a "feature inert" banner (this
|
||||
sync hit a rebase conflict from a concurrent GURU-5070 edit — resolved keeping both sides).
|
||||
|
||||
Finally, with build-server access granted, investigated #4 (Mac builds "40 behind"). The audit was
|
||||
wrong: the trigger is not broken — `build-mac.sh` is an intentional stub (no Mac build machine), the
|
||||
webhook invokes it every push and it no-ops, and Mac builds were never implemented in this pipeline.
|
||||
Corrected BUG-005 from HIGH "trigger broken" to LOW "unimplemented stub — product decision," and
|
||||
notified GURU-KALI.
|
||||
|
||||
## Key Decisions
|
||||
- **#5 update_channel: ADD to server, not drop from TS** — grep showed it's read by AgentDetail/
|
||||
ClientDetail/SiteDetail; the column already exists (migration 026), so adding to the SELECT is
|
||||
zero-risk and preserves working UX.
|
||||
- **Reverted the migration-046 comment fix** — editing an already-applied migration changes its
|
||||
sqlx checksum and fails `sqlx::migrate!` at startup (BUG-003 outage class). Left 046 byte-for-byte
|
||||
intact; marked won't-fix. Same hazard blocks the scaffolding-label-in-046-header idea.
|
||||
- **#3 = option (b)** (Mike) — keep `update_rollouts`/health metrics inert + labeled (code comment +
|
||||
roadmap), defer automated gating (a) to a Phase-2 roadmap item requiring a full re-spec. Rationale:
|
||||
(a) adds a failure mode to an already-fragile pipeline, depends on the unmerged/unverified #1
|
||||
signal, and is undesigned work — not safe to rush mid-deploy.
|
||||
- **Recommended MacBook HOLD the deploy** — not a git conflict (work is on a branch), but 046 is
|
||||
already applied in prod, the feature is inert, and the server build script is unhardened.
|
||||
- **#4: corrected rather than "fixed"** — verified via SSH the trigger fires and stubs out; Mac was
|
||||
never built and no runner is configured. Re-triggering (the audit's rec) is a no-op. Did not
|
||||
provision hardware or advance `last-built-commit-mac` artificially (would falsely claim a build).
|
||||
- **Used the Coding Agent + Code Review Agent** for the production Rust changes (coordinator model);
|
||||
did doc/triage edits and the bash hardening directly.
|
||||
|
||||
## Problems Encountered
|
||||
- **Migration-046 checksum hazard (caught in review)** — the cosmetic header fix would reproduce the
|
||||
outage. Reverted (`8ef8159`); the applied migration is untouched.
|
||||
- **Rebase conflict during the ClaudeTools sync** — GURU-5070 concurrently edited `verify-rollout-
|
||||
system.sh` (renamed host `Saturn`→`gururmm-build`) and a prior auto-sync had bumped the gururmm
|
||||
submodule pointer. Resolved: kept both the rename and my banner; kept the submodule at `938fa33`
|
||||
(gururmm `main` tip — a legitimate advance, not my branch).
|
||||
- **No SSH key access to the build server from BEAST** (`Permission denied (publickey,password)`).
|
||||
Resolved using the vaulted password over paramiko for read-only commands.
|
||||
- **`.sqlx` macro cache risk in the #1 fix** — changing an inline SQL literal in a `sqlx::query!`
|
||||
macro would break the offline build. Resolved by converting that query to runtime sqlx (also
|
||||
addresses part of BUG-007).
|
||||
|
||||
## Configuration Changes
|
||||
gururmm repo — branch `fix/audit-2-remediation` (origin `azcomputerguru/gururmm`):
|
||||
- `server/src/updates/health.rs` — crash-detection fix (`update_success`), runtime-sqlx conversion of
|
||||
2 queries, dropped `total_attempts` double-count, 2 new `#[sqlx::test]` tests; 2 stale `.sqlx`
|
||||
cache files removed.
|
||||
- `server/src/db/agents.rs` — added `update_channel: Option<String>` to `Agent`/`AgentResponse`/
|
||||
`AgentWithDetails` + both list SELECTs.
|
||||
- `server/src/db/updates.rs` — scaffolding comment at `resolve_agent_channel` (health gating NOT
|
||||
wired; Phase-2).
|
||||
- `server/migrations/046_safe_rollout.sql` — header edit made then REVERTED (untouched, intact).
|
||||
- `build-server.sh` — build lock + binary backup + auto-rollback (BUG-003 hardening).
|
||||
- `docs/FEATURE_ROADMAP.md` — BUG-002..012; BUG-004 decision (b); BUG-005 corrected; safe-rollout
|
||||
Phase-2 roadmap item.
|
||||
- `docs/UI_GAPS.md`, `reports/2026-05-25-rmm-audit-2.md` — brought from the audit branch.
|
||||
|
||||
ClaudeTools repo (main):
|
||||
- `PHASE_6_TEST_PLAN.md`, `PHASE_5_COMPLETE.md`, `IMPLEMENTATION_SUMMARY.md`,
|
||||
`verify-rollout-system.sh` — `[WARNING] STATUS` banner (feature inert / Phase-2 aspirational).
|
||||
- Submodule pointer `projects/msp-tools/guru-rmm` advanced to `938fa33` (gururmm main tip).
|
||||
|
||||
## Credentials & Secrets
|
||||
- **Build server SSH:** `guru@172.16.3.30:22` (Ubuntu 22.04, "gururmm-build"). Password (and same
|
||||
sudo password) is vaulted at `infrastructure/gururmm-server.sops.yaml` field `credentials.password`
|
||||
— value not reproduced here. Retrieved via the vault wrapper and used over paramiko for read-only
|
||||
commands only; no server changes made.
|
||||
- No new credentials created or rotated this session.
|
||||
|
||||
## Infrastructure & Servers
|
||||
- **Build/prod server:** 172.16.3.30 (gururmm-build, Ubuntu 22.04). Gitea: `azcomputerguru/gururmm`.
|
||||
- **Build pipeline:** `/opt/gururmm/` — `build-shared.sh` (version bump + repo sync from
|
||||
`/home/guru/gururmm`, change-gate), `build-linux.sh`, `build-windows.sh`, `build-mac.sh` (STUB),
|
||||
`webhook-handler.py`; logs `/var/log/gururmm-build{,-linux,-windows,-mac}.log`. `build-agents.sh`
|
||||
is a deprecated wrapper. `last-built-commit-{mac=1ed5596, linux/windows=373c5ce}`.
|
||||
- **gururmm main tip:** `938fa33` (deployed server was v0.3.22 at audit time / `3dcb30e`).
|
||||
- **Coord API:** `http://172.16.3.30:8001/api/coord`.
|
||||
|
||||
## Commands & Outputs
|
||||
- Read audit report: `git -C projects/msp-tools/guru-rmm show origin/audit/2026-05-25-rmm-audit-2:reports/2026-05-25-rmm-audit-2.md`.
|
||||
- SSH (read-only) via paramiko: `paramiko.SSHClient().connect("172.16.3.30",22,"guru",<vault pw>)`; ran `cat`/`grep`/`git rev-list` only.
|
||||
- Mac log confirmed stub no-op: `/var/log/gururmm-build-mac.log` → 24+ "Mac build: no build machine configured, skipping" on 2026-05-25.
|
||||
- Agent gap since last mac build: `git -C /home/guru/gururmm rev-list --count 1ed5596..HEAD` = 48 total, `-- agent/` = 5.
|
||||
- Build verification (on BEAST): `cargo build` (SQLX_OFFLINE) clean; `#[sqlx::test]` compile clean but need a live DB to run (pending on build server).
|
||||
|
||||
## Pending / Incomplete Tasks
|
||||
- **Branch `fix/audit-2-remediation` not merged / not deployed** — awaiting Mike's review + merge.
|
||||
- **Run the 2 `#[sqlx::test]` crash-detection tests against a live DB** (build server) — they only compile offline here.
|
||||
- **#2 server-side:** reconcile the divergent `/opt/gururmm/build-server.sh` (unchecked `git reset`) with the hardened repo copy, then deploy — needs SSH + sign-off.
|
||||
- **#3 (a):** safe-rollout automated gating — Phase-2 roadmap item, MUST be re-spec'd before implementation (depends on #1 merged + verified).
|
||||
- **#4 decision:** (A) ship Mac agents (provision Apple HW / osxcross + implement build-mac.sh + signing) or (B) defer + quiet the freshness alarm (treat stubbed platform as N/A).
|
||||
- **#6 MEDIUMs + LOWs (BUG-007..012):** open.
|
||||
- **MacBook deploy:** flagged HOLD; go/no-go is Mike's.
|
||||
|
||||
## Reference Information
|
||||
- gururmm branch commits: `943edd0` (#1), `98b97bd` (#5), `8818a7f`→reverted by `8ef8159` (046), `5aa6cd9` (triage), `ad261fd` (046 won't-fix), `7146f4b` (build-server harden), `a10fa24` (#3 label), `1bbf5c8` (BUG-005 correction).
|
||||
- ClaudeTools: `67182e0` (Phase-doc banners), `6a53072` (submodule→938fa33).
|
||||
- Coord: lock `e2aa72fe` (remediation) + `75db292b` (build-server) both released. Replies sent — GURU-KALI `1742acda` (progress), MacBook `3590db06` (HOLD), GURU-KALI `9846eb32` (BUG-005 corrected).
|
||||
- Audit report: `gururmm reports/2026-05-25-rmm-audit-2.md`. Bug list: `gururmm docs/FEATURE_ROADMAP.md` "Known Bugs".
|
||||
Reference in New Issue
Block a user