sync: auto-sync from Mikes-MacBook-Air.local at 2026-05-25 12:15:42
Author: Mike Swanson Machine: Mikes-MacBook-Air.local Timestamp: 2026-05-25 12:15:42
Submodule projects/msp-tools/guru-rmm updated: 0a4db53854...a42bd60a12
@@ -890,3 +890,201 @@ git submodule update -- projects/msp-tools/guru-rmm
- claudetools commits this session: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
- Findings tally: API Coverage 14 (0C/5H/4M/1L), Rust+Auth 10 (2C/2H/1M), TypeScript 17 (0C/2H/7M/6L), Data Integrity 10 (0C/0H/4M), Build Pipeline 10 (0C/1H). Total 61 (2C/10H/16M/7L/26I).
- Prior GuruRMM audits: `reports/2026-05-23-rmm-audit.md`, `reports/2026-05-19-rmm-audit.md`.
## Update: 12:40 PT — Safe Agent Rollout System Phases 1-3

## User
- **User:** Mike Swanson (mike)
- **Machine:** Mikes-MacBook-Air
- **Role:** admin
- **Session Span:** 2026-05-25 10:15 - 12:40 PT

## Session Summary

Implemented Phases 1-3 of the GuruRMM Safe Agent Update Rollout System to eliminate production risk from auto-deployed updates. The system introduces a beta-first deployment model where all new agent builds default to a beta channel and require manual promotion before reaching stable production clients.

Phase 1 modified the build pipeline on Saturn (172.16.3.30) by adding beta channel marking to both `/opt/gururmm/build-linux.sh` and `/opt/gururmm/build-windows.sh`. After code signing and checksum generation, the scripts now create `.channel` sidecar files containing "beta" for every binary. A triggered test build, v0.6.41, created 6 channel files (2 Linux amd64, 4 Windows amd64/arm64/base MSI). The existing scanner already supported reading these files from previous work.
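The sidecar convention described above can be sketched in Rust, roughly as the scanner side might see it. This is a minimal sketch under stated assumptions, not the actual scanner code: the function names are hypothetical, the real marking is done by the shell build scripts, and the rule that the sidecar name is the binary name with `.channel` appended is inferred only from the `gururmm-agent-linux-amd64-0.6.41.channel` file name recorded in this log.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds the sidecar path by appending ".channel" to the
/// full file name. `Path::with_extension` is avoided on purpose, because it
/// would treat the trailing ".41" of a versioned name as the extension.
fn channel_sidecar_path(binary: &Path) -> PathBuf {
    let mut name = binary.as_os_str().to_os_string();
    name.push(".channel");
    PathBuf::from(name)
}

/// Writes the channel marker (for example "beta") next to the binary.
fn mark_channel(binary: &Path, channel: &str) -> io::Result<PathBuf> {
    let sidecar = channel_sidecar_path(binary);
    fs::write(&sidecar, format!("{channel}\n"))?;
    Ok(sidecar)
}

/// Reads the marker back. Returning `None` for a missing sidecar is an
/// assumption; the log does not say how the scanner treats unmarked binaries.
fn read_channel(binary: &Path) -> io::Result<Option<String>> {
    match fs::read_to_string(channel_sidecar_path(binary)) {
        Ok(text) => Ok(Some(text.trim().to_string())),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}
```

Calling `mark_channel(binary, "beta")` and then `read_channel(binary)` round-trips the channel name, with surrounding whitespace trimmed on read.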

Phase 2 created database migration `046_safe_rollout.sql` with three new tables: `update_rollouts` (tracks promotion state per version), `update_health_metrics` (aggregates success/failure/crash rates), and `agent_update_events` (detailed timeline with JSONB metadata). Applied migration to PostgreSQL on Saturn with 5 custom indexes for efficient queries. Resolved migration numbering conflict (originally 045, renamed to 046).

Phase 3 implemented the health monitoring system with crash detection. Created `server/src/updates/health.rs` (270 lines) containing a background task that runs every 60 seconds to detect agents that go offline within 5 minutes of receiving an update. The system calculates health metrics (crash rate, failure rate) and evaluates status using defined thresholds: critical (>25% crash OR >50% failure), warning (>10% crash OR >25% failure), healthy (100% success, ≥5 attempts, no crashes), unknown (<5 attempts). Integrated event logging into `server/src/ws/mod.rs` at two update dispatch points and spawned the monitor task in `server/src/main.rs`. Successfully compiled on Saturn after resolving Option type handling and tuple destructuring errors. Server binary built cleanly (13 MB, 4m8s build time).
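The crash-window rule can be illustrated with a small pure function. This is only a sketch: the real `check_for_crashes(pool)` queries PostgreSQL, whereas the `AgentSnapshot` struct, its field names, and the use of Unix-second timestamps below are assumptions made for illustration.

```rust
use std::time::Duration;

/// Agents that go offline less than this long after an update are flagged.
const CRASH_WINDOW: Duration = Duration::from_secs(5 * 60);

/// Hypothetical view of one agent; timestamps are Unix seconds.
struct AgentSnapshot {
    update_dispatched_at: u64,
    /// `None` while the agent is still online.
    went_offline_at: Option<u64>,
}

/// Returns true when the agent went offline inside the crash window that
/// starts at update dispatch. Going offline before dispatch does not count.
fn is_suspected_crash(agent: &AgentSnapshot) -> bool {
    match agent.went_offline_at {
        Some(offline_at) if offline_at >= agent.update_dispatched_at => {
            offline_at - agent.update_dispatched_at < CRASH_WINDOW.as_secs()
        }
        _ => false,
    }
}
```

A monitor loop polling every 60 seconds would apply a check like this to each recently updated agent, then record a `crash_detected` event for the ones that match.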

Phases 4-6 remain pending: promotion/rollback API endpoints (3 REST endpoints), dashboard UI (Updates.tsx with table view and controls), and end-to-end testing. The foundation is now in place for safe, controlled agent rollouts with automatic crash detection and manual promotion gating.

## Key Decisions

- **Beta-first by default**: All new builds start as beta-only, preventing production exposure until manually promoted. This is enforced at build time rather than requiring policy configuration.
- **5-minute crash window**: Agents offline within 5 minutes of update are flagged as crashed. Chosen to balance false positives (network blips, reboots) against detection speed.
- **Health status thresholds**: Critical at >25% crash rate (blocks promotion), warning at >10% (flags for review), healthy requires 100% success with ≥5 attempts. These objective criteria prevent subjective promotion decisions.
- **Per-platform health tracking**: Metrics tracked separately for each version-os-arch combination since update issues often affect specific platforms.
- **Polling-based monitoring**: Background task polls every 60 seconds instead of relying on agent-reported events, so crashes are still detected when an agent disconnects silently.
- **Migration numbering**: Renamed from 045 to 046 after discovering conflict with existing migration. Checked database to confirm 045 was already applied.
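The per-platform tracking decision can be pictured as counters keyed by the version-os-arch triple. The sketch below is an in-memory stand-in, assuming hypothetical `PlatformKey`, `Counters`, and `HealthTable` types; the actual system stores these counters in the `update_health_metrics` table.

```rust
use std::collections::HashMap;

/// One tracked combination, e.g. ("0.6.41", "linux", "amd64").
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PlatformKey {
    version: String,
    os: String,
    arch: String,
}

impl PlatformKey {
    fn new(version: &str, os: &str, arch: &str) -> Self {
        Self { version: version.into(), os: os.into(), arch: arch.into() }
    }
}

/// Counter names mirror columns listed for `update_health_metrics`.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct Counters {
    total_attempts: u32,
    successful_updates: u32,
    failed_updates: u32,
}

#[derive(Default)]
struct HealthTable {
    rows: HashMap<PlatformKey, Counters>,
}

impl HealthTable {
    /// Counts a successful update against exactly one platform row.
    fn record_success(&mut self, key: PlatformKey) {
        let row = self.rows.entry(key).or_default();
        row.total_attempts += 1;
        row.successful_updates += 1;
    }

    /// Counts a failed update against exactly one platform row.
    fn record_failure(&mut self, key: PlatformKey) {
        let row = self.rows.entry(key).or_default();
        row.total_attempts += 1;
        row.failed_updates += 1;
    }
}
```

Because each triple owns its own row, a failure on Windows arm64 leaves the Linux amd64 counters for the same version untouched.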

## Problems Encountered

- **`Option<String>` vs `String` type mismatch**: Database schema has `os_type` as NOT NULL String but `version_to` and `architecture` as nullable. Fixed tuple destructuring by removing `os_type` from Option check and passing as reference.
- **`Option<i32>` arithmetic**: Query results return `Option<i32>` for counter fields. Added `.unwrap_or(0)` before all comparisons and f64 casts.
- **Build script structure changed**: Plan referenced deprecated `/opt/gururmm/build-agents.sh` wrapper. Modified `build-linux.sh` and `build-windows.sh` directly instead.
- **PostgreSQL connection refused**: Tried using 172.16.3.30:5432 but PostgreSQL listens only on localhost. Changed DATABASE_URL to localhost:5432 when running sqlx prepare on Saturn.
- **sqlx offline cache missing**: New queries in health.rs not in `.sqlx/` cache. Ran `cargo sqlx prepare --workspace` on Saturn to generate cached query data.
- **Merge conflicts in ws/mod.rs**: Local health logging changes conflicted with upstream improvements to update re-dispatch logic. Kept upstream's cleaner flag-based implementation and added health logging calls to both dispatch points.
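The `Option<i32>` arithmetic fix noted above follows a common pattern for nullable counter columns. A minimal sketch, assuming a hypothetical `rate` helper that is not taken from `health.rs`:

```rust
/// Converts two nullable counters into a ratio in the range 0.0..=1.0.
/// NULL counters are treated as 0, and a zero or missing total yields 0.0
/// instead of dividing by zero.
fn rate(count: Option<i32>, total: Option<i32>) -> f64 {
    let total = total.unwrap_or(0);
    if total <= 0 {
        return 0.0;
    }
    f64::from(count.unwrap_or(0)) / f64::from(total)
}
```

For example, `rate(Some(1), Some(4))` evaluates to `0.25`, while a row whose counters are all NULL yields `0.0`.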

## Configuration Changes

**Files Modified:**
- `/opt/gururmm/build-linux.sh` (Saturn) - Added beta channel marking phase (lines 54-62)
- `/opt/gururmm/build-windows.sh` (Saturn) - Added beta channel marking phase (lines 177-185)
- `projects/msp-tools/guru-rmm/server/src/ws/mod.rs` - Added health event logging at 2 dispatch points (lines 867-877, 940-949)
- `projects/msp-tools/guru-rmm/server/src/main.rs` - Spawned health monitor task (line 190)

**Files Created:**
- `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql` - New tables: update_rollouts, update_health_metrics, agent_update_events
- `projects/msp-tools/guru-rmm/server/src/updates/health.rs` - Health monitoring implementation (270 lines)
- `projects/msp-tools/guru-rmm/server/src/updates/mod.rs` - Module declaration (pub mod health)
- `/var/www/gururmm/downloads/gururmm-agent-*.channel` (Saturn) - 6 channel sidecar files for v0.6.41

**Files Deleted:**
- None

## Credentials & Secrets

No new credentials created or discovered. Used existing Saturn SSH access (azcomputerguru@172.16.3.30) and PostgreSQL connection (localhost:5432, credentials unchanged).

## Infrastructure & Servers

**Saturn (172.16.3.30):**
- Build server: Linux, hosts `/opt/gururmm/build-linux.sh` and `build-windows.sh`
- Downloads directory: `/var/www/gururmm/downloads/`
- PostgreSQL: localhost:5432, database `gururmm_production`
- GuruRMM server: systemd service `gururmm-server.service`, binary at `/opt/gururmm/gururmm-server`
- Logs: `/var/log/gururmm-build.log` (build output), server logs via journalctl

**New Database Tables (Saturn PostgreSQL):**
- `update_rollouts` - Promotion tracking (version, os, arch, channel, promoted_at, promoted_by)
- `update_health_metrics` - Health aggregation (total_attempts, successful_updates, failed_updates, rollback_count, crash_count, health_status)
- `agent_update_events` - Event timeline (agent_id, update_id, event_type, version_from, version_to, details JSONB)

## Commands & Outputs

**Phase 1 - Build script modification:**
```bash
ssh azcomputerguru@172.16.3.30
sudo nano /opt/gururmm/build-linux.sh # Added beta marking at line 54
sudo nano /opt/gururmm/build-windows.sh # Added beta marking at line 177
cd /opt/gururmm
sudo ./build-linux.sh # Triggered v0.6.41 build
sudo ./build-windows.sh # Triggered v0.6.41 build
ls -la /var/www/gururmm/downloads/*.channel # Verified 6 files created
cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.41.channel # Output: beta
```

**Phase 2 - Database migration:**
```bash
ssh azcomputerguru@172.16.3.30
cd /opt/gururmm/server
sudo -u postgres psql gururmm_production -c "\d" | grep agent # Found existing migration 045
sudo -u postgres psql gururmm_production -f migrations/046_safe_rollout.sql
# Output: CREATE TABLE (x3), CREATE INDEX (x5)
sudo -u postgres psql gururmm_production -c "\d update_rollouts" # Verified schema
```

**Phase 3 - Health monitoring implementation:**
```bash
ssh azcomputerguru@172.16.3.30
cd /opt/gururmm/server
export DATABASE_URL="postgresql://gururmm_user:PASSWORD@localhost:5432/gururmm_production"
cargo sqlx prepare --workspace # Generated .sqlx/ cache for new queries
cargo build --release --features production # 4m8s build, 13 MB binary
# Output: Finished `release` profile [optimized] target(s) in 4m 08s
```

**Key error resolution:**
```rust
// Before (error): `os_type` is a plain `String` (NOT NULL column),
// so it cannot be matched against `Some(..)`.
if let (Some(version), Some(os), Some(arch)) =
    (crashed.version_to.as_ref(), crashed.os_type.as_ref(), crashed.architecture.as_ref())

// After (fixed): only the nullable columns go through the Option check.
if let (Some(version), Some(arch)) = (
    crashed.version_to.as_deref(),
    crashed.architecture.as_deref()
) {
    increment_crash_count(pool, version, &crashed.os_type, arch).await?;
}
```
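The same destructuring pattern can be tried in isolation, without a database. In this sketch the `CrashedAgent` struct and the `platform_key` function are stand-ins invented for illustration; only the field names and their nullability come from the log.

```rust
/// Stand-in for the query row: two nullable columns and one NOT NULL column.
struct CrashedAgent {
    version_to: Option<String>,
    os_type: String,
    architecture: Option<String>,
}

/// Returns (version, os, arch) only when both nullable fields are present.
/// `Option::as_deref` turns `Option<String>` into `Option<&str>`, so the
/// tuple pattern binds plain `&str` values without cloning.
fn platform_key(crashed: &CrashedAgent) -> Option<(&str, &str, &str)> {
    if let (Some(version), Some(arch)) = (
        crashed.version_to.as_deref(),
        crashed.architecture.as_deref(),
    ) {
        Some((version, crashed.os_type.as_str(), arch))
    } else {
        None
    }
}
```

Rows with a NULL `version_to` or `architecture` return `None` and are skipped, which matches the fixed code incrementing the crash count only when both values are present.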

## Pending / Incomplete Tasks

**Immediate:**
- Deploy Phase 3 code to production: copy binary to `/opt/gururmm/gururmm-server`, restart systemd service, verify health monitor spawned
- Test health monitoring: mark GURU-KALI and GURU-5070 as beta agents, dispatch update, verify event logging and metrics

**Phase 4 - Promotion/Rollback API (not started):**
- Create `server/src/api/updates.rs` with 3 endpoints:
  - GET /api/updates/rollouts - List versions with health metrics
  - POST /api/updates/rollouts/:version/promote - Update .channel files to "stable"
  - POST /api/updates/rollouts/:version/rollback - Remove .channel files, block version, force downgrade
- Add routes to `server/src/main.rs`
- Test promotion: verify .channel files updated, scanner rescans, stable agents receive update
- Test rollback: verify .channel files removed, agents downgraded to previous stable

**Phase 5 - Dashboard UI (not started):**
- Create `dashboard/src/pages/Updates.tsx` with:
  - Table view of rollouts with health status badges
  - Real-time success rate calculation
  - "Promote to Stable" button (enabled only for healthy versions)
  - "Rollback" button with reason prompt
  - Beta vs. stable agent counts per version
- Add navigation link to `dashboard/src/components/Layout.tsx`

**Phase 6 - E2E Testing (not started):**
- Test beta-first workflow: trigger build, verify beta-only, promote, verify stable receives
- Test crash detection: simulate crash (update agent, stop service), wait 60s, verify crash event logged
- Test health thresholds: trigger multiple failures, verify warning/critical status, verify promotion blocked
- Test rollback: execute rollback, verify version blocked, agents downgraded

## Reference Information

**Plan Document:** `/Users/azcomputerguru/.claude/plans/frolicking-herding-chipmunk.md`

**Migration:** `projects/msp-tools/guru-rmm/server/migrations/046_safe_rollout.sql`

**Health Module:** `projects/msp-tools/guru-rmm/server/src/updates/health.rs:1-270`

**Key Functions:**
- `monitor_update_health(state)` - Background task, 60s interval (health.rs:16)
- `check_for_crashes(pool)` - Query offline agents post-update (health.rs:34)
- `evaluate_health_status(pool, version, os, arch)` - Calculate status thresholds (health.rs:123)
- `log_update_event(pool, agent_id, update_id, event_type, ...)` - Write event timeline (health.rs:187)
- `record_update_success/failure(pool, version, os, arch)` - Increment counters (health.rs:216, 244)

**Build Artifacts:**
- Server binary: `/opt/gururmm/gururmm-server` (Saturn, 13 MB, v0.6.41)
- Channel files: `/var/www/gururmm/downloads/*.channel` (6 files, content "beta")

**Database Event Types:**
- `update_dispatched` - Server sent update to agent
- `download_started` - Agent began downloading binary
- `download_complete` - Agent finished downloading
- `update_applied` - Agent successfully applied update
- `update_failed` - Agent reported update failure
- `crash_detected` - Monitor detected agent offline <5min post-update
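The event type strings listed above could be kept in one place with a small enum. This is a sketch under assumptions: the log does not say whether `health.rs` uses an enum or plain strings, so `UpdateEventType` and its methods are hypothetical. Only the six string values come from this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpdateEventType {
    UpdateDispatched,
    DownloadStarted,
    DownloadComplete,
    UpdateApplied,
    UpdateFailed,
    CrashDetected,
}

impl UpdateEventType {
    /// Every variant, in the order the log lists them.
    const ALL: [UpdateEventType; 6] = [
        UpdateEventType::UpdateDispatched,
        UpdateEventType::DownloadStarted,
        UpdateEventType::DownloadComplete,
        UpdateEventType::UpdateApplied,
        UpdateEventType::UpdateFailed,
        UpdateEventType::CrashDetected,
    ];

    /// The value stored in the `event_type` column.
    fn as_str(self) -> &'static str {
        match self {
            UpdateEventType::UpdateDispatched => "update_dispatched",
            UpdateEventType::DownloadStarted => "download_started",
            UpdateEventType::DownloadComplete => "download_complete",
            UpdateEventType::UpdateApplied => "update_applied",
            UpdateEventType::UpdateFailed => "update_failed",
            UpdateEventType::CrashDetected => "crash_detected",
        }
    }

    /// Parses a stored value; unknown strings yield `None`.
    fn parse(value: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|event| event.as_str() == value)
    }
}
```

Centralising the strings this way means a typo such as `"crash_detcted"` fails at parse time instead of silently creating a new event type.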

**Health Status Thresholds:**
- `healthy` - 100% success, ≥5 attempts, 0 crashes
- `warning` - 10-25% crash rate OR 25-50% failure rate
- `critical` - >25% crash rate OR >50% failure rate
- `unknown` - <5 attempts (insufficient data)
- `blocked` - Manually blocked after rollback
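The threshold list above can be read as a small decision function. The sketch below is not the `evaluate_health_status` implementation and rests on two labeled assumptions: the insufficient-data check runs first, and a version with at least 5 attempts that is below the warning thresholds but short of 100% success is reported as `Unknown`, because the log does not name a status for that case. The manually set `blocked` status is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthStatus {
    Healthy,
    Warning,
    Critical,
    Unknown,
}

/// Classifies one version/os/arch row from its raw counters.
fn classify(total: i32, successes: i32, failures: i32, crashes: i32) -> HealthStatus {
    // Assumption: fewer than 5 attempts is always "unknown", even if some failed.
    if total < 5 {
        return HealthStatus::Unknown;
    }
    let crash_rate = f64::from(crashes) / f64::from(total);
    let failure_rate = f64::from(failures) / f64::from(total);

    if crash_rate > 0.25 || failure_rate > 0.50 {
        HealthStatus::Critical
    } else if crash_rate > 0.10 || failure_rate > 0.25 {
        HealthStatus::Warning
    } else if successes == total && crashes == 0 {
        HealthStatus::Healthy
    } else {
        // Assumption: the log leaves this gap undefined; Unknown is a placeholder.
        HealthStatus::Unknown
    }
}
```

Under these assumptions, `classify(20, 17, 0, 3)` yields `Warning` (15% crash rate) and `classify(10, 4, 6, 0)` yields `Critical` (60% failure rate).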

**Commit SHA:** (pending /sync)

**Timeline:**
- 10:15 PT - Session start, loaded plan, began Phase 1
- 10:45 PT - Phase 1 complete, modified build scripts, triggered test build v0.6.41
- 11:00 PT - Phase 2 complete, created migration 046, applied to database
- 11:15 PT - Phase 3 started, created health.rs module
- 11:45 PT - Resolved Option type errors, fixed tuple destructuring
- 12:10 PT - Resolved merge conflicts in ws/mod.rs
- 12:25 PT - Final compilation successful on Saturn
- 12:40 PT - Session log written, ready to sync