sync: auto-sync from GURU-KALI at 2026-05-25 13:49:31

Author: Mike Swanson Machine: GURU-KALI Timestamp: 2026-05-25 13:49:31
2026-05-25 13:49:31 -07:00
parent c5f7c73381
commit 0d1085b145
3 changed files with 198 additions and 114 deletions
--- a/.claude/scheduled_tasks.lock
+++ b/.claude/scheduled_tasks.lock
@@ -1 +1 @@
-{"sessionId":"2158f2e7-8168-4859-b2cf-e0b05d6517b2","pid":18624,"acquiredAt":1779727871169}
+{"sessionId":"eda9a628-252f-4dd7-b4cf-1d987ea11512","pid":16195,"procStart":"259600","acquiredAt":1779740400025}
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm
--- a/session-logs/2026-05-25-session.md
+++ b/session-logs/2026-05-25-session.md
@@ -1200,3 +1200,87 @@ None. Ticket is complete, skill is complete, ESXi cron is configured and persist
  - `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset_v2.py` — final cron setup script (SFTP method)
  - `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset.py` — v1 (heredoc method, superseded)
  - `C:\Users\guru\AppData\Local\Temp\esxi124_hostd_restart.py` — hostd restart + verification
 ---
 ## Update: 13:48 MST — GuruRMM CRITICAL auth fix + run_analysis UX fix DEPLOYED; migration incident recovered (Mike Swanson / GURU-KALI)
 ### Session Summary
 Continuation of the 09:34 GURU-KALI session (audit + submodule fixes). First corrected the CLAUDE.md guru-rmm submodule wording (it called the submodule a "stale reference copy"; it actually tracks the active azcomputerguru/gururmm repo, pinned commit just lags main) — committed `f2ece8e`.
 Then implemented and DEPLOYED the two CRITICAL auth findings from the morning's audit. Root cause: the server has no router-level auth — every route is gated only by whether its handler includes the `AuthUser` extractor, and `metrics.rs` + `logs.rs` omitted it, leaving per-agent and fleet-wide metrics/logs anonymously readable (plus `/logs/analyze` firing an outbound LLM call and `/agents/:id/logs/request` commanding agents). Coding Agent (opus) added `AuthUser` to all 8 handlers, scoping per-agent endpoints to the caller's orgs (matching the `get_agent` pattern), fleet aggregates require-auth + `TODO(authz)`, and `run_analysis` admin-only. Code Review APPROVED. Merged to main (`1d5a08f`), deployed via build-server.sh as v0.3.14, verified anon -> 401 on all six endpoints (login still 422, so public routes intact).
 Mike then asked to fix the run_analysis UX regression (admin-only `/logs/analyze` 403'd non-admin techs doing per-agent analysis). Coding Agent relaxed it: per-agent analysis (agent_id present) -> `authorize_agent_access` org check; fleet (no agent_id) stays admin-only; dashboard hides the fleet Analyze button for non-admins (`useAuth` role check matching backend `is_admin()`). Reviewed APPROVED, merged (`7be2f52`).
 Deploying run_analysis surfaced that main did not compile — the unrelated crash-detection health-monitoring feature (`health.rs`, committed earlier today under the shared azcomputerguru account) had a type error. Per Mike's choice, coordinated with the owner (GURU-5070) via coord message rather than fixing it. This also exposed a hostname issue: I'd addressed the message to the stale `DESKTOP-0O8A1RL` session id (the retired hostname); re-sent to `GURU-5070/claude-main` + a fallback. GURU-5070 launched a fleet-wide identity audit in response; GURU-KALI verified clean (identity.json user=mike/machine=GURU-KALI, git user.name normalized to "Mike Swanson", in known_machines) and replied.
 GURU-5070 committed a health.rs fix (`42790f5`) but it was incomplete — it assumed os_type AND architecture are non-null String; per migrations + .sqlx, os_type IS NOT NULL but architecture is nullable, so `&crashed.architecture` gave E0308. Fixed forward (`646eb0a`: as_deref() on version_to + architecture, &os_type direct) — the first version of this code with a verified-clean cargo check; reviewed, merged. Deploying via build-server.sh then hit a MIGRATION INCIDENT and brief outage: migration 046 (safe_rollout) had been applied to the DB out-of-band (3 tables existed) but never recorded in `_sqlx_migrations`, so the new binary crash-looped on boot ("relation update_rollouts already exists"). Since build-server.sh stops the old service before validating the new binary, the server went down. Database Agent recovered: confirmed all 3 tables empty (0 rows, no FK deps), dropped them, restart -> sqlx ran 046 fresh + recorded it. Server v0.3.22 live; dashboard redeployed; anon -> 401 confirmed; no data lost.
 ### Key Decisions
 - **Coordinate vs. fix the health.rs blocker:** initially coordinated with GURU-5070 (Mike's choice, to avoid stepping on WIP). After their committed fix was still broken and they'd declared "done" (no active WIP), fixed it forward — aligned with Mike's "resume the deploy" intent.
 - **Database recovery = drop empty tables, not checksum-insert:** Database Agent chose dropping the 3 empty tables (letting sqlx re-run 046 and self-record) over manually inserting a `_sqlx_migrations` row — avoids a fragile hand-computed SHA-384 and eliminates any out-of-band schema drift. Safe only because all 3 tables were empty.
 - **Branch-not-main for the audit report; non-main pushes don't build:** verified the webhook builds on `refs/heads/main` only with no path filtering — so the audit branch and feature branches don't trigger builds; merging to main does.
 - **Delegated all code/DB/git through agents (opus for auth/migration/security):** coordinator never hand-edited production code or ran DB writes; mandatory Code Review on every change caught that even my own prescribed health.rs fix was wrong.
 ### Problems Encountered
 - **Self-inflicted git race (first run_analysis server build):** ran build-server.sh right after the merge push, which had triggered the webhook build on the same /home/guru/gururmm repo; concurrent `git reset --hard` left a stale tree and a false build failure. Fix: always check for in-flight builds before build-server.sh; resolved by waiting for idle.
 - **health.rs compile saga (3 attempts):** original .as_ref() tuple (E0277 x3) -> GURU-5070's partial fix (E0308, architecture nullable) -> correct fix `646eb0a` (as_deref on the two Option fields). Root issue: nobody ran a clean `cargo check` before committing the prior attempts.
 - **Migration 046 unrecorded -> crash-loop + outage:** see summary; recovered by Database Agent. Lesson sent to GURU-5070: don't apply migration SQL manually during dev; let the server apply via sqlx.
 - **Coord message misaddressed to retired hostname:** DESKTOP-0O8A1RL is retired (now GURU-5070); re-sent + fallback. Triggered the fleet identity audit.
 - **Public dashboard 403:** Cloudflare bot-mitigation on a server-side curl, not an nginx/deploy fault (origin serves the new bundle at local 200).
 ### Configuration Changes
 - claudetools `f2ece8e` — `.claude/CLAUDE.md` guru-rmm submodule wording corrected.
 - gururmm `1d5a08f` — `server/src/api/metrics.rs` + `logs.rs`: AuthUser on 8 handlers (CRITICAL auth fix).
 - gururmm `7be2f52` — `server/src/api/logs.rs` (run_analysis per-agent authz) + `dashboard/src/pages/Logs.tsx` (hide fleet Analyze for non-admins).
 - gururmm `646eb0a` — `server/src/updates/health.rs`: as_deref() fix for nullable Option fields (follow-up to GURU-5070's `42790f5`).
 - DB: dropped + sqlx-recreated `update_rollouts`, `update_health_metrics`, `agent_update_events`; migration 046 now recorded in `_sqlx_migrations`.
 - Deployed: gururmm-server v0.3.22 (`/opt/gururmm/gururmm-server`); dashboard rebuilt + copied to `/var/www/gururmm/dashboard/` (bundle `index-DUF78gxN.js`).
 - `.claude/current-mode` -> infra during deploy.
 ### Credentials & Secrets
 - No new credentials. Build server DB access via `DATABASE_URL` in `/home/guru/.cargo/env` (build server builds ONLINE, which is why health.rs query! macros validated against the live DB). GuruRMM API admin creds: vault `infrastructure/gururmm-server.sops.yaml`.
 ### Infrastructure & Servers
 - gururmm-server: `172.16.3.30:3001`, systemd `gururmm-server`, binary `/opt/gururmm/gururmm-server` (the `/usr/local/bin` path in old CONTEXT.md is stale). Running **v0.3.22**.
 - Server deploy = MANUAL `sudo /opt/gururmm/build-server.sh` (git reset --hard origin/main -> cargo build --release -> stop/cp/start). NOT triggered by the webhook (webhook = agents only). **Latent bug:** stops the service BEFORE validating the new binary's migrations -> a bad migration causes an outage; also doesn't check `git reset` exit code (race) and has no build lock.
 - Dashboard: nginx serves `/var/www/gururmm/dashboard` (root-owned, server_name _); `/api/` proxied to `:3001`; second vhost `server_name rmm-api.azcomputerguru.com`. Dashboard `API_BASE_URL` defaults to `https://rmm-api.azcomputerguru.com` (no .env), so a plain `npm run build` is correct for prod. Public `rmm.azcomputerguru.com` is behind Cloudflare (IPv6 2606:4700; 403s bare curls via bot-mitigation).
 - DB: PostgreSQL `localhost:5432/gururmm` on .30. `_sqlx_migrations` now at version 46.
 ### Commands & Outputs
 ```bash
 # Server deploy (manual, intended path):
 ssh guru@172.16.3.30 'sudo /opt/gururmm/build-server.sh'   # ~4min build, then stop/cp/start
 # Dashboard deploy:
 ssh guru@172.16.3.30 'cd /home/guru/gururmm/dashboard && npm ci && npm run build && sudo cp -r dist/* /var/www/gururmm/dashboard/'
 # Migration recovery (Database Agent, after confirming 3 tables empty):
 #   BEGIN; <guard: raise if any rows>; DROP TABLE IF EXISTS update_rollouts, update_health_metrics, agent_update_events CASCADE; COMMIT;
 #   then systemctl restart gururmm-server -> sqlx runs 046 fresh + records it
 # Smoke test (auth enforcement live):
 curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/api/metrics/summary   # -> 401
 curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/status                # -> 200
 ```
 ### Pending / Incomplete Tasks
 - **HIGH follow-ups from the audit (not started):** validate Entra SSO ID-token signature (`sso.rs:212`); auth+scope the agent-status SSE (`agents.rs:583`); add `client_id`/`update_channel` to the agent response structs (dead frontend links); org-scope the 3 fleet endpoints (`/metrics/summary`, `/logs`, `/logs/analysis` — TODO(authz), need client_ids-filtered queries); mac build gate stuck (mac builder offline since Pluto outage).
 - **Structural:** add a router-level auth layer so "public" is opt-in (kills the missing-AuthUser bug class).
 - **Hand to GURU-5070 (coord msg 2d518a70):** don't apply migration SQL manually; harden build-server.sh (validate migrations before service swap; check git reset exit; add build lock); `046_safe_rollout.sql` header comment mislabeled "Migration 045".
 - Audit report still only on branch `audit/2026-05-25-rmm-audit` (merge to main when bundling code).
 ### Reference Information
 - gururmm commits: `1d5a08f` (CRITICAL auth), `7be2f52` (run_analysis), `646eb0a` (health fix), `42790f5` (GURU-5070 partial health fix). Audit report: `reports/2026-05-25-rmm-audit.md` on branch `audit/2026-05-25-rmm-audit` (`da1d4ee`).
 - claudetools commits: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
 - Coord: component `gururmm/server` = deployed 0.3.22. Messages: `16aa12fb`/`74a1a3e5` (build-blocked to GURU-5070 + DESKTOP fallback), `b99f718c` (identity check-in reply), `2d518a70` (deploy-done + lessons). DESKTOP-0O8A1RL retired; GURU-5070 is Mike's current session id.
 - Audit tally: 61 findings (2 critical [both now FIXED+deployed], 10 high, 16 medium, 7 low, 26 info).
		`@@ -1 +1 @@`
			`{"sessionId":"2158f2e7-8168-4859-b2cf-e0b05d6517b2","pid":18624,"acquiredAt":1779727871169}`				`{"sessionId":"eda9a628-252f-4dd7-b4cf-1d987ea11512","pid":16195,"procStart":"259600","acquiredAt":1779740400025}`