sync: auto-sync from Mikes-MacBook-Air.local at 2026-05-25 12:15:42

Author: Mike Swanson
Machine: Mikes-MacBook-Air.local
Timestamp: 2026-05-25 12:15:42
2026-05-25 12:15:43 -07:00
parent ba8ce9a06e
commit 2161c1507f
2 changed files with 199 additions and 1 deletions


@@ -890,3 +890,201 @@ git submodule update -- projects/msp-tools/guru-rmm
- claudetools commits this session: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
- Findings tally: API Coverage 14 (0C/5H/4M/1L), Rust+Auth 10 (2C/2H/1M), TypeScript 17 (0C/2H/7M/6L), Data Integrity 10 (0C/0H/4M), Build Pipeline 10 (0C/1H). Total 61 (2C/10H/16M/7L/26I).
- Prior GuruRMM audits: `reports/2026-05-23-rmm-audit.md`, `reports/2026-05-19-rmm-audit.md`.
## Update: 12:40 PT — Safe Agent Rollout System Phases 1-3
## User
- **User:** Mike Swanson (mike)
- **Machine:** Mikes-MacBook-Air
- **Role:** admin
- **Session Span:** 2026-05-25 10:15 - 12:40 PT
## Session Summary
Implemented Phases 1-3 of the GuruRMM Safe Agent Update Rollout System to reduce the production risk of auto-deployed updates. The system introduces a beta-first deployment model: every new agent build defaults to a beta channel and must be manually promoted before it reaches stable production clients.
Phase 1 modified the build pipeline on Saturn (172.16.3.30) by adding beta channel marking to both `/opt/gururmm/build-linux.sh` and `/opt/gururmm/build-windows.sh`. After code signing and checksum generation, the scripts now create a `.channel` sidecar file containing "beta" for every binary. Test build v0.6.41 produced 6 channel files (2 Linux amd64, 4 Windows amd64/arm64/base MSI). The existing scanner already supported reading these files, thanks to earlier work.
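A minimal sketch of how a scanner might read a `.channel` sidecar, for illustration only. It is not the existing scanner code. The `Channel` enum, the function names, and the rule that the sidecar name is the binary's file name plus `.channel` are all assumptions; the "stable" value comes from the Phase 4 plan later in this log.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Release channel recorded in a `.channel` sidecar file (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Channel {
    Beta,
    Stable,
}

/// Assumed naming rule: the sidecar sits next to the binary and is named by
/// appending `.channel` to the binary's full file name.
pub fn sidecar_path(binary: &Path) -> PathBuf {
    let mut name = binary.as_os_str().to_os_string();
    name.push(".channel");
    PathBuf::from(name)
}

/// Parse sidecar contents, ignoring surrounding whitespace and newlines.
pub fn parse_channel(contents: &str) -> Option<Channel> {
    match contents.trim() {
        "beta" => Some(Channel::Beta),
        "stable" => Some(Channel::Stable),
        _ => None,
    }
}

/// Read the channel for a binary. Returns `Ok(None)` when the sidecar is
/// missing or holds an unrecognized value; the caller decides what that means.
pub fn read_channel(binary: &Path) -> io::Result<Option<Channel>> {
    match fs::read_to_string(sidecar_path(binary)) {
        Ok(text) => Ok(parse_channel(&text)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

The sketch deliberately returns `None` for a missing sidecar instead of picking a default channel, since this log does not state how the real scanner treats that case.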
Phase 2 created database migration 046_safe_rollout.sql with three new tables: update_rollouts (tracks promotion state per version), update_health_metrics (aggregates success/failure/crash rates), and agent_update_events (detailed timeline with JSONB metadata). Applied migration to PostgreSQL on Saturn with 5 custom indexes for efficient queries. Resolved migration numbering conflict (originally 045, renamed to 046).
Phase 3 implemented the health monitoring system with crash detection. Created `server/src/updates/health.rs` (270 lines) containing a background task that runs every 60 seconds to detect agents that go offline within 5 minutes of receiving an update. The system calculates health metrics (crash rate, failure rate) and evaluates status using defined thresholds: critical (>25% crash OR >50% failure), warning (>10% crash OR >25% failure), healthy (100% success, ≥5 attempts, no crashes), unknown (<5 attempts). Integrated event logging into `server/src/ws/mod.rs` at two update dispatch points and spawned the monitor task in `server/src/main.rs`. Successfully compiled on Saturn after resolving Option type handling and tuple destructuring errors. Server binary built cleanly (13 MB, 4m8s build time).
Phases 4-6 remain pending: promotion/rollback API endpoints (3 REST endpoints), dashboard UI (Updates.tsx with table view and controls), and end-to-end testing. The foundation is now in place for safe, controlled agent rollouts with automatic crash detection and manual promotion gating.
## Key Decisions
- **Beta-first by default**: All new builds start as beta-only, preventing production exposure until manually promoted. This is enforced at build time rather than requiring policy configuration.
- **5-minute crash window**: Agents offline within 5 minutes of update are flagged as crashed. Chosen to balance false positives (network blips, reboots) against detection speed.
- **Health status thresholds**: Critical at >25% crash rate or >50% failure rate (blocks promotion), warning at >10% crash rate or >25% failure rate (flags for review), healthy requires 100% success with ≥5 attempts and no crashes. These objective criteria keep promotion decisions from being subjective.
- **Per-platform health tracking**: Metrics tracked separately for each version-os-arch combination since update issues often affect specific platforms.
- **Polling-based monitoring**: The background task polls every 60 seconds instead of reacting to events, so crashes are still detected when an agent disconnects silently.
- **Migration numbering**: Renamed from 045 to 046 after discovering conflict with existing migration. Checked database to confirm 045 was already applied.
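The 5-minute crash window decision can be illustrated with a small pure function. This is a sketch under assumptions, not the code in `health.rs`: the function name, the use of `SystemTime`, and the strict `<` comparison at exactly five minutes are all illustrative choices.

```rust
use std::time::{Duration, SystemTime};

/// Crash window from the session notes: offline within 5 minutes of an update.
pub const CRASH_WINDOW: Duration = Duration::from_secs(5 * 60);

/// Returns true when the agent went offline at or after the update time and
/// strictly less than `CRASH_WINDOW` later. (Hypothetical helper.)
pub fn is_suspected_crash(updated_at: SystemTime, offline_at: SystemTime) -> bool {
    match offline_at.duration_since(updated_at) {
        Ok(elapsed) => elapsed < CRASH_WINDOW,
        // `offline_at` is earlier than `updated_at`: the disconnect predates
        // the update, so it is not attributed to it.
        Err(_) => false,
    }
}
```

A polling monitor would apply a check like this to each recently updated agent on every 60-second tick.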
## Problems Encountered
- **Option<String> vs String type mismatch**: Database schema has `os_type` as NOT NULL String but `version_to` and `architecture` as nullable. Fixed tuple destructuring by removing os_type from Option check and passing as reference.
- **Option<i32> arithmetic**: Query results return Option<i32> for counter fields. Added `.unwrap_or(0)` before all comparisons and f64 casts.
- **Build script structure changed**: Plan referenced deprecated `/opt/gururmm/build-agents.sh` wrapper. Modified `build-linux.sh` and `build-windows.sh` directly instead.
- **PostgreSQL connection refused**: Tried using 172.16.3.30:5432 but PostgreSQL listens only on localhost. Changed DATABASE_URL to localhost:5432 when running sqlx prepare on Saturn.
- **sqlx offline cache missing**: New queries in health.rs not in `.sqlx/` cache. Ran `cargo sqlx prepare --workspace` on Saturn to generate cached query data.
- **Merge conflicts in ws/mod.rs**: Local health logging changes conflicted with upstream improvements to update re-dispatch logic. Kept upstream's cleaner flag-based implementation and added health logging calls to both dispatch points.
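The `Option<i32>` arithmetic fix described in the list above can be sketched as follows. The `MetricsRow` struct and `rates` function are hypothetical; only the field names (taken from the `update_health_metrics` columns in this log) and the `.unwrap_or(0)` plus `f64` conversion pattern come from the notes. The zero-attempts guard is an added assumption to avoid dividing by zero.

```rust
/// Counter columns as they might come back from a nullable query result.
/// (Hypothetical struct; field names follow `update_health_metrics`.)
#[derive(Debug, Default, Clone, Copy)]
pub struct MetricsRow {
    pub total_attempts: Option<i32>,
    pub failed_updates: Option<i32>,
    pub crash_count: Option<i32>,
}

/// Returns `(crash_rate, failure_rate)` as fractions of total attempts.
pub fn rates(row: &MetricsRow) -> (f64, f64) {
    // Treat NULL counters as zero before any comparison or cast.
    let total = row.total_attempts.unwrap_or(0);
    if total <= 0 {
        // No attempts recorded: report zero rates instead of dividing by zero.
        return (0.0, 0.0);
    }
    let crashes = row.crash_count.unwrap_or(0);
    let failures = row.failed_updates.unwrap_or(0);
    (
        f64::from(crashes) / f64::from(total),
        f64::from(failures) / f64::from(total),
    )
}
```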
## Configuration Changes
**Files Modified:**
- `/opt/gururmm/build-linux.sh` (Saturn) - Added beta channel marking phase (lines 54-62)
- `/opt/gururmm/build-windows.sh` (Saturn) - Added beta channel marking phase (lines 177-185)
- `projects/msp-tools/guru-rmm/server/src/ws/mod.rs` - Added health event logging at 2 dispatch points (lines 867-877, 940-949)
- `projects/msp-tools/guru-rmm/server/src/main.rs` - Spawned health monitor task (line 190)
**Files Created:**
- `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql` - New tables: update_rollouts, update_health_metrics, agent_update_events
- `projects/msp-tools/guru-rmm/server/src/updates/health.rs` - Health monitoring implementation (270 lines)
- `projects/msp-tools/guru-rmm/server/src/updates/mod.rs` - Module declaration (pub mod health)
- `/var/www/gururmm/downloads/gururmm-agent-*.channel` (Saturn) - 6 channel sidecar files for v0.6.41
**Files Deleted:**
- None
## Credentials & Secrets
No new credentials created or discovered. Used existing Saturn SSH access (azcomputerguru@172.16.3.30) and PostgreSQL connection (localhost:5432, credentials unchanged).
## Infrastructure & Servers
**Saturn (172.16.3.30):**
- Build server: Linux, hosts `/opt/gururmm/build-linux.sh` and `build-windows.sh`
- Downloads directory: `/var/www/gururmm/downloads/`
- PostgreSQL: localhost:5432, database `gururmm_production`
- GuruRMM server: systemd service `gururmm-server.service`, binary at `/opt/gururmm/gururmm-server`
- Logs: `/var/log/gururmm-build.log` (build output), server logs via journalctl
**New Database Tables (Saturn PostgreSQL):**
- `update_rollouts` - Promotion tracking (version, os, arch, channel, promoted_at, promoted_by)
- `update_health_metrics` - Health aggregation (total_attempts, successful_updates, failed_updates, rollback_count, crash_count, health_status)
- `agent_update_events` - Event timeline (agent_id, update_id, event_type, version_from, version_to, details JSONB)
## Commands & Outputs
**Phase 1 - Build script modification:**
```bash
ssh azcomputerguru@172.16.3.30
sudo nano /opt/gururmm/build-linux.sh # Added beta marking at line 54
sudo nano /opt/gururmm/build-windows.sh # Added beta marking at line 177
cd /opt/gururmm
sudo ./build-linux.sh # Triggered v0.6.41 build
sudo ./build-windows.sh # Triggered v0.6.41 build
ls -la /var/www/gururmm/downloads/*.channel # Verified 6 files created
cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.41.channel # Output: beta
```
**Phase 2 - Database migration:**
```bash
ssh azcomputerguru@172.16.3.30
cd /opt/gururmm/server
sudo -u postgres psql gururmm_production -c "\d" | grep agent # Found existing migration 045
sudo -u postgres psql gururmm_production -f migrations/046_safe_rollout.sql
# Output: CREATE TABLE (x3), CREATE INDEX (x5)
sudo -u postgres psql gururmm_production -c "\d update_rollouts" # Verified schema
```
**Phase 3 - Health monitoring implementation:**
```bash
ssh azcomputerguru@172.16.3.30
cd /opt/gururmm/server
export DATABASE_URL="postgresql://gururmm_user:PASSWORD@localhost:5432/gururmm_production"
cargo sqlx prepare --workspace # Generated .sqlx/ cache for new queries
cargo build --release --features production # 4m8s build, 13 MB binary
# Output: Finished `release` profile [optimized] target(s) in 4m 08s
```
**Key error resolution:**
```rust
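// Before (error): `os_type` is a NOT NULL column, so it is a plain `String`
// and cannot be matched against `Some(..)`.
if let (Some(version), Some(os), Some(arch)) =
    (crashed.version_to.as_ref(), crashed.os_type.as_ref(), crashed.architecture.as_ref())

// After (fixed): destructure only the nullable fields and pass `os_type` by reference.
if let (Some(version), Some(arch)) = (
    crashed.version_to.as_deref(),
    crashed.architecture.as_deref(),
) {
    increment_crash_count(pool, version, &crashed.os_type, arch).await?;
}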
```
## Pending / Incomplete Tasks
**Immediate:**
- Deploy Phase 3 code to production: copy binary to `/opt/gururmm/gururmm-server`, restart systemd service, verify health monitor spawned
- Test health monitoring: mark GURU-KALI and GURU-5070 as beta agents, dispatch update, verify event logging and metrics
**Phase 4 - Promotion/Rollback API (not started):**
- Create `server/src/api/updates.rs` with 3 endpoints:
  - GET /api/updates/rollouts - List versions with health metrics
  - POST /api/updates/rollouts/:version/promote - Update .channel files to "stable"
  - POST /api/updates/rollouts/:version/rollback - Remove .channel files, block version, force downgrade
- Add routes to `server/src/main.rs`
- Test promotion: verify .channel files updated, scanner rescans, stable agents receive update
- Test rollback: verify .channel files removed, agents downgraded to previous stable
**Phase 5 - Dashboard UI (not started):**
- Create `dashboard/src/pages/Updates.tsx` with:
  - Table view of rollouts with health status badges
  - Real-time success rate calculation
  - "Promote to Stable" button (enabled only for healthy versions)
  - "Rollback" button with reason prompt
  - Beta vs. stable agent counts per version
- Add navigation link to `dashboard/src/components/Layout.tsx`
**Phase 6 - E2E Testing (not started):**
- Test beta-first workflow: trigger build, verify beta-only, promote, verify stable receives
- Test crash detection: simulate crash (update agent, stop service), wait 60s, verify crash event logged
- Test health thresholds: trigger multiple failures, verify warning/critical status, verify promotion blocked
- Test rollback: execute rollback, verify version blocked, agents downgraded
## Reference Information
**Plan Document:** `/Users/azcomputerguru/.claude/plans/frolicking-herding-chipmunk.md`
**Migration:** `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql`
**Health Module:** `projects/msp-tools/guru-rmm/server/src/updates/health.rs:1-270`
**Key Functions:**
- `monitor_update_health(state)` - Background task, 60s interval (health.rs:16)
- `check_for_crashes(pool)` - Query offline agents post-update (health.rs:34)
- `evaluate_health_status(pool, version, os, arch)` - Calculate status thresholds (health.rs:123)
- `log_update_event(pool, agent_id, update_id, event_type, ...)` - Write event timeline (health.rs:187)
- `record_update_success/failure(pool, version, os, arch)` - Increment counters (health.rs:216, 244)
**Build Artifacts:**
- Server binary: `/opt/gururmm/gururmm-server` (Saturn, 13 MB, v0.6.41)
- Channel files: `/var/www/gururmm/downloads/*.channel` (6 files, content "beta")
**Database Event Types:**
- `update_dispatched` - Server sent update to agent
- `download_started` - Agent began downloading binary
- `download_complete` - Agent finished downloading
- `update_applied` - Agent successfully applied update
- `update_failed` - Agent reported update failure
- `crash_detected` - Monitor detected agent offline <5min post-update
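One way to keep the six event type strings above consistent in Rust is an enum with a string mapping in both directions. This is an illustrative sketch: the `UpdateEventType` name and its methods are assumptions and may not match what `health.rs` or the migration actually use.

```rust
use std::fmt;
use std::str::FromStr;

/// Event types stored in `agent_update_events.event_type` (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateEventType {
    UpdateDispatched,
    DownloadStarted,
    DownloadComplete,
    UpdateApplied,
    UpdateFailed,
    CrashDetected,
}

impl UpdateEventType {
    /// The exact string written to the database.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::UpdateDispatched => "update_dispatched",
            Self::DownloadStarted => "download_started",
            Self::DownloadComplete => "download_complete",
            Self::UpdateApplied => "update_applied",
            Self::UpdateFailed => "update_failed",
            Self::CrashDetected => "crash_detected",
        }
    }
}

impl fmt::Display for UpdateEventType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for UpdateEventType {
    type Err = String;

    /// Parse a stored string back into the enum, rejecting unknown values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "update_dispatched" => Ok(Self::UpdateDispatched),
            "download_started" => Ok(Self::DownloadStarted),
            "download_complete" => Ok(Self::DownloadComplete),
            "update_applied" => Ok(Self::UpdateApplied),
            "update_failed" => Ok(Self::UpdateFailed),
            "crash_detected" => Ok(Self::CrashDetected),
            other => Err(format!("unknown update event type: {other}")),
        }
    }
}
```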
**Health Status Thresholds:**
- `healthy` - 100% success, ≥5 attempts, 0 crashes
- `warning` - >10% up to 25% crash rate OR >25% up to 50% failure rate
- `critical` - >25% crash rate OR >50% failure rate
- `unknown` - <5 attempts (insufficient data)
- `blocked` - Manually blocked after rollback
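The thresholds above can be expressed as a pure function, sketched here under stated assumptions. The real `evaluate_health_status` takes a database pool and may order its checks differently. Two choices here are guesses, not facts from this log: the `<5 attempts` check runs first, and a version with ≥5 attempts that has some failures or crashes but stays below the warning thresholds is reported as `Warning`. `Blocked` is set manually and is never returned by this function.

```rust
/// Health states listed in the session notes (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Warning,
    Critical,
    Unknown,
    /// Set manually after a rollback; never computed from metrics here.
    Blocked,
}

/// Classify a version/os/arch combination from its raw counters.
pub fn evaluate(total_attempts: u32, failed_updates: u32, crash_count: u32) -> HealthStatus {
    // Assumption: insufficient data takes precedence over every other rule.
    if total_attempts < 5 {
        return HealthStatus::Unknown;
    }
    let total = f64::from(total_attempts);
    let crash_rate = f64::from(crash_count) / total;
    let failure_rate = f64::from(failed_updates) / total;

    if crash_rate > 0.25 || failure_rate > 0.50 {
        HealthStatus::Critical
    } else if crash_rate > 0.10 || failure_rate > 0.25 {
        HealthStatus::Warning
    } else if failed_updates == 0 && crash_count == 0 {
        // 100% success with no crashes and at least 5 attempts.
        HealthStatus::Healthy
    } else {
        // Assumption: imperfect but below the warning thresholds is still
        // flagged for review; the notes do not define this case.
        HealthStatus::Warning
    }
}
```

Keeping the classification free of I/O makes the thresholds easy to unit test separately from the database queries.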
**Commit SHA:** (pending /sync)
**Timeline:**
- 10:15 PT - Session start, loaded plan, began Phase 1
- 10:45 PT - Phase 1 complete, modified build scripts, triggered test build v0.6.41
- 11:00 PT - Phase 2 complete, created migration 046, applied to database
- 11:15 PT - Phase 3 started, created health.rs module
- 11:45 PT - Resolved Option type errors, fixed tuple destructuring
- 12:10 PT - Resolved merge conflicts in ws/mod.rs
- 12:25 PT - Final compilation successful on Saturn
- 12:40 PT - Session log written, ready to sync