diff --git a/projects/msp-tools/guru-rmm b/projects/msp-tools/guru-rmm
index 0a4db53..a42bd60 160000
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm
@@ -1 +1 @@
-Subproject commit 0a4db5385461f693f5e2b92fe0bd9582e3bea237
+Subproject commit a42bd60a12ab9987d0cc0752d788672324eba639
diff --git a/session-logs/2026-05-25-session.md b/session-logs/2026-05-25-session.md
index 6fb3be9..cc26b25 100644
--- a/session-logs/2026-05-25-session.md
+++ b/session-logs/2026-05-25-session.md
@@ -890,3 +890,201 @@ git submodule update -- projects/msp-tools/guru-rmm
 - claudetools commits this session: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
 - Findings tally: API Coverage 14 (0C/5H/4M/1L), Rust+Auth 10 (2C/2H/1M), TypeScript 17 (0C/2H/7M/6L), Data Integrity 10 (0C/0H/4M), Build Pipeline 10 (0C/1H). Total 61 (2C/10H/16M/7L/26I).
 - Prior GuruRMM audits: `reports/2026-05-23-rmm-audit.md`, `reports/2026-05-19-rmm-audit.md`.
+## Update: 12:40 PT — Safe Agent Rollout System Phases 1-3
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** Mikes-MacBook-Air
+- **Role:** admin
+- **Session Span:** 2026-05-25 10:15 - 12:40 PT
+
+## Session Summary
+
+Implemented Phases 1-3 of the GuruRMM Safe Agent Update Rollout System to eliminate production risk from auto-deployed updates. The system introduces a beta-first deployment model where all new agent builds default to a beta channel and require manual promotion before reaching stable production clients.
+
+Phase 1 modified the build pipeline on Saturn (172.16.3.30) by adding beta channel marking to both `/opt/gururmm/build-linux.sh` and `/opt/gururmm/build-windows.sh`. After code signing and checksum generation, the scripts now create `.channel` sidecar files containing "beta" for every binary. Triggered test build v0.6.41 successfully created 6 channel files (2 Linux amd64, 4 Windows amd64/arm64/base MSI). The existing scanner already supported reading these files from previous work.
+
+Phase 2 created database migration 046_safe_rollout.sql with three new tables: update_rollouts (tracks promotion state per version), update_health_metrics (aggregates success/failure/crash rates), and agent_update_events (detailed timeline with JSONB metadata). Applied migration to PostgreSQL on Saturn with 5 custom indexes for efficient queries. Resolved migration numbering conflict (originally 045, renamed to 046).
+
+Phase 3 implemented the health monitoring system with crash detection. Created `server/src/updates/health.rs` (270 lines) containing a background task that runs every 60 seconds to detect agents that go offline within 5 minutes of receiving an update. The system calculates health metrics (crash rate, failure rate) and evaluates status using defined thresholds: critical (>25% crash OR >50% failure), warning (>10% crash OR >25% failure), healthy (100% success, ≥5 attempts, no crashes), unknown (<5 attempts). Integrated event logging into `server/src/ws/mod.rs` at two update dispatch points and spawned the monitor task in `server/src/main.rs`. Successfully compiled on Saturn after resolving Option type handling and tuple destructuring errors. Server binary built cleanly (13 MB, 4m8s build time).
+
+Phases 4-6 remain pending: promotion/rollback API endpoints (3 REST endpoints), dashboard UI (Updates.tsx with table view and controls), and end-to-end testing. The foundation is now in place for safe, controlled agent rollouts with automatic crash detection and manual promotion gating.
+
+## Key Decisions
+
+- **Beta-first by default**: All new builds start as beta-only, preventing production exposure until manually promoted. This is enforced at build time rather than requiring policy configuration.
+- **5-minute crash window**: Agents offline within 5 minutes of update are flagged as crashed. Chosen to balance false positives (network blips, reboots) against detection speed.
+- **Health status thresholds**: Critical at >25% crash rate (blocks promotion), warning at >10% (flags for review), healthy requires 100% success with ≥5 attempts. These objective criteria prevent subjective promotion decisions.
+- **Per-platform health tracking**: Metrics tracked separately for each version-os-arch combination since update issues often affect specific platforms.
+- **Polling-based monitoring**: Background task polls every 60 seconds rather than relying on agent-reported events, so crashes are detected even if an agent disconnects silently.
+- **Migration numbering**: Renamed from 045 to 046 after discovering conflict with existing migration. Checked database to confirm 045 was already applied.
+
+## Problems Encountered
+
+- **Option vs String type mismatch**: Database schema has `os_type` as NOT NULL String but `version_to` and `architecture` as nullable. Fixed tuple destructuring by removing os_type from Option check and passing as reference.
+- **Option arithmetic**: Query results return Option for counter fields. Added `.unwrap_or(0)` before all comparisons and f64 casts.
+- **Build script structure changed**: Plan referenced deprecated `/opt/gururmm/build-agents.sh` wrapper. Modified `build-linux.sh` and `build-windows.sh` directly instead.
+- **PostgreSQL connection refused**: Tried using 172.16.3.30:5432 but PostgreSQL listens only on localhost. Changed DATABASE_URL to localhost:5432 when running sqlx prepare on Saturn.
+- **sqlx offline cache missing**: New queries in health.rs not in `.sqlx/` cache. Ran `cargo sqlx prepare --workspace` on Saturn to generate cached query data.
+- **Merge conflicts in ws/mod.rs**: Local health logging changes conflicted with upstream improvements to update re-dispatch logic. Kept upstream's cleaner flag-based implementation and added health logging calls to both dispatch points.
+
+## Configuration Changes
+
+**Files Modified:**
+- `/opt/gururmm/build-linux.sh` (Saturn) - Added beta channel marking phase (lines 54-62)
+- `/opt/gururmm/build-windows.sh` (Saturn) - Added beta channel marking phase (lines 177-185)
+- `projects/msp-tools/guru-rmm/server/src/ws/mod.rs` - Added health event logging at 2 dispatch points (lines 867-877, 940-949)
+- `projects/msp-tools/guru-rmm/server/src/main.rs` - Spawned health monitor task (line 190)
+
+**Files Created:**
+- `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql` - New tables: update_rollouts, update_health_metrics, agent_update_events
+- `projects/msp-tools/guru-rmm/server/src/updates/health.rs` - Health monitoring implementation (270 lines)
+- `projects/msp-tools/guru-rmm/server/src/updates/mod.rs` - Module declaration (pub mod health)
+- `/var/www/gururmm/downloads/gururmm-agent-*.channel` (Saturn) - 6 channel sidecar files for v0.6.41
+
+**Files Deleted:**
+- None
+
+## Credentials & Secrets
+
+No new credentials created or discovered. Used existing Saturn SSH access (azcomputerguru@172.16.3.30) and PostgreSQL connection (localhost:5432, credentials unchanged).
+
+## Infrastructure & Servers
+
+**Saturn (172.16.3.30):**
+- Build server: Linux, hosts `/opt/gururmm/build-linux.sh` and `build-windows.sh`
+- Downloads directory: `/var/www/gururmm/downloads/`
+- PostgreSQL: localhost:5432, database `gururmm_production`
+- GuruRMM server: systemd service `gururmm-server.service`, binary at `/opt/gururmm/gururmm-server`
+- Logs: `/var/log/gururmm-build.log` (build output), server logs via journalctl
+
+**New Database Tables (Saturn PostgreSQL):**
+- `update_rollouts` - Promotion tracking (version, os, arch, channel, promoted_at, promoted_by)
+- `update_health_metrics` - Health aggregation (total_attempts, successful_updates, failed_updates, rollback_count, crash_count, health_status)
+- `agent_update_events` - Event timeline (agent_id, update_id, event_type, version_from, version_to, details JSONB)
+
+## Commands & Outputs
+
+**Phase 1 - Build script modification:**
+```bash
+ssh azcomputerguru@172.16.3.30
+sudo nano /opt/gururmm/build-linux.sh # Added beta marking at line 54
+sudo nano /opt/gururmm/build-windows.sh # Added beta marking at line 177
+cd /opt/gururmm
+sudo ./build-linux.sh # Triggered v0.6.41 build
+sudo ./build-windows.sh # Triggered v0.6.41 build
+ls -la /var/www/gururmm/downloads/*.channel # Verified 6 files created
+cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.41.channel # Output: beta
+```
+
+**Phase 2 - Database migration:**
+```bash
+ssh azcomputerguru@172.16.3.30
+cd /opt/gururmm/server
+sudo -u postgres psql gururmm_production -c "\d" | grep agent # Found existing migration 045
+sudo -u postgres psql gururmm_production -f migrations/046_safe_rollout.sql
+# Output: CREATE TABLE (x3), CREATE INDEX (x5)
+sudo -u postgres psql gururmm_production -c "\d update_rollouts" # Verified schema
+```
+
+**Phase 3 - Health monitoring implementation:**
+```bash
+ssh azcomputerguru@172.16.3.30
+cd /opt/gururmm/server
+export DATABASE_URL="postgresql://gururmm_user:PASSWORD@localhost:5432/gururmm_production"
+cargo sqlx prepare --workspace # Generated .sqlx/ cache for new queries
+cargo build --release --features production # 4m8s build, 13 MB binary
+# Output: Finished `release` profile [optimized] target(s) in 4m 08s
+```
+
+**Key error resolution:**
+```rust
+// Before (error):
+if let (Some(version), Some(os), Some(arch)) =
+    (crashed.version_to.as_ref(), crashed.os_type.as_ref(), crashed.architecture.as_ref())
+
+// After (fixed):
+if let (Some(version), Some(arch)) = (
+    crashed.version_to.as_deref(),
+    crashed.architecture.as_deref()
+) {
+    increment_crash_count(pool, version, &crashed.os_type, arch).await?;
+}
+```
+
+## Pending / Incomplete Tasks
+
+**Immediate:**
+- Deploy Phase 3 code to production: copy binary to `/opt/gururmm/gururmm-server`, restart systemd service, verify health monitor spawned
+- Test health monitoring: mark GURU-KALI and GURU-5070 as beta agents, dispatch update, verify event logging and metrics
+
+**Phase 4 - Promotion/Rollback API (not started):**
+- Create `server/src/api/updates.rs` with 3 endpoints:
+  - GET /api/updates/rollouts - List versions with health metrics
+  - POST /api/updates/rollouts/:version/promote - Update .channel files to "stable"
+  - POST /api/updates/rollouts/:version/rollback - Remove .channel files, block version, force downgrade
+- Add routes to `server/src/main.rs`
+- Test promotion: verify .channel files updated, scanner rescans, stable agents receive update
+- Test rollback: verify .channel files removed, agents downgraded to previous stable
+
+**Phase 5 - Dashboard UI (not started):**
+- Create `dashboard/src/pages/Updates.tsx` with:
+  - Table view of rollouts with health status badges
+  - Real-time success rate calculation
+  - "Promote to Stable" button (enabled only for healthy versions)
+  - "Rollback" button with reason prompt
+  - Beta vs. stable agent counts per version
+- Add navigation link to `dashboard/src/components/Layout.tsx`
+
+**Phase 6 - E2E Testing (not started):**
+- Test beta-first workflow: trigger build, verify beta-only, promote, verify stable receives
+- Test crash detection: simulate crash (update agent, stop service), wait 60s, verify crash event logged
+- Test health thresholds: trigger multiple failures, verify warning/critical status, verify promotion blocked
+- Test rollback: execute rollback, verify version blocked, agents downgraded
+
+## Reference Information
+
+**Plan Document:** `/Users/azcomputerguru/.claude/plans/frolicking-herding-chipmunk.md`
+
+**Migration:** `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql`
+
+**Health Module:** `projects/msp-tools/guru-rmm/server/src/updates/health.rs:1-270`
+
+**Key Functions:**
+- `monitor_update_health(state)` - Background task, 60s interval (health.rs:16)
+- `check_for_crashes(pool)` - Query offline agents post-update (health.rs:34)
+- `evaluate_health_status(pool, version, os, arch)` - Calculate status thresholds (health.rs:123)
+- `log_update_event(pool, agent_id, update_id, event_type, ...)` - Write event timeline (health.rs:187)
+- `record_update_success/failure(pool, version, os, arch)` - Increment counters (health.rs:216, 244)
+
+**Build Artifacts:**
+- Server binary: `/opt/gururmm/gururmm-server` (Saturn, 13 MB, v0.6.41)
+- Channel files: `/var/www/gururmm/downloads/*.channel` (6 files, content "beta")
+
+**Database Event Types:**
+- `update_dispatched` - Server sent update to agent
+- `download_started` - Agent began downloading binary
+- `download_complete` - Agent finished downloading
+- `update_applied` - Agent successfully applied update
+- `update_failed` - Agent reported update failure
+- `crash_detected` - Monitor detected agent offline <5min post-update
+
+**Health Status Thresholds:**
+- `healthy` - 100% success, ≥5 attempts, 0 crashes
+- `warning` - 10-25% crash rate OR 25-50% failure rate
+- `critical` - >25% crash rate OR >50% failure rate
+- `unknown` - <5 attempts (insufficient data)
+- `blocked` - Manually blocked after rollback
+
+**Commit SHA:** (pending /sync)
+
+**Timeline:**
+- 10:15 PT - Session start, loaded plan, began Phase 1
+- 10:45 PT - Phase 1 complete, modified build scripts, triggered test build v0.6.41
+- 11:00 PT - Phase 2 complete, created migration 046, applied to database
+- 11:15 PT - Phase 3 started, created health.rs module
+- 11:45 PT - Resolved Option type errors, fixed tuple destructuring
+- 12:10 PT - Resolved merge conflicts in ws/mod.rs
+- 12:25 PT - Final compilation successful on Saturn
+- 12:40 PT - Session log written, ready to sync
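
Reviewer note: the health status thresholds listed under Reference Information, combined with the `.unwrap_or(0)` handling described under Problems Encountered, could be sketched roughly as below. This is not the code in `health.rs`. The names `HealthStatus`, `HealthCounters`, `evaluate_status`, and `MIN_ATTEMPTS` are hypothetical. The sketch also makes three assumptions that the log does not state: rates are computed against `total_attempts`, the minimum-sample check runs before the rate checks, and a version that is neither clean nor over a threshold is conservatively reported as `warning`. The manually set `blocked` state is left out.

```rust
/// Hypothetical status enum mirroring the threshold names in the log.
/// `blocked` is set manually after a rollback, so it is omitted here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Warning,
    Critical,
    Unknown,
}

/// Hypothetical counter row. Fields are Option because the log notes that
/// query results return Option for counter fields.
#[derive(Debug, Default, Clone, Copy)]
pub struct HealthCounters {
    pub total_attempts: Option<i64>,
    pub successful_updates: Option<i64>,
    pub failed_updates: Option<i64>,
    pub crash_count: Option<i64>,
}

/// Minimum sample size before a verdict is given (the log's "≥5 attempts").
const MIN_ATTEMPTS: i64 = 5;

pub fn evaluate_status(c: HealthCounters) -> HealthStatus {
    // Unwrap nullable counters before any comparison or f64 cast.
    let total = c.total_attempts.unwrap_or(0);
    let succeeded = c.successful_updates.unwrap_or(0);
    let failed = c.failed_updates.unwrap_or(0);
    let crashed = c.crash_count.unwrap_or(0);

    // Assumption: too little data yields Unknown before any rate check.
    // This also guards the divisions below against total == 0.
    if total < MIN_ATTEMPTS {
        return HealthStatus::Unknown;
    }

    // Assumption: both rates use total_attempts as the denominator.
    let crash_rate = crashed as f64 / total as f64;
    let failure_rate = failed as f64 / total as f64;

    if crash_rate > 0.25 || failure_rate > 0.50 {
        HealthStatus::Critical
    } else if crash_rate > 0.10 || failure_rate > 0.25 {
        HealthStatus::Warning
    } else if succeeded == total && crashed == 0 {
        HealthStatus::Healthy
    } else {
        // Assumption: the log leaves this gap undefined (for example 1 failure
        // in 20 attempts); Warning keeps the promote gate closed.
        HealthStatus::Warning
    }
}
```

Under these assumptions, 10 attempts with 3 crashes evaluates to `Critical`, while 3 attempts with 3 crashes evaluates to `Unknown` because the sample is too small. If the real implementation checks rates before sample size, that second case would differ.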