Phase 6: End-to-End Testing Plan
[WARNING] STATUS — 2026-05-25 re-audit: the safe-rollout gating this plan tests is NOT live.
update_rollouts and update_health_metrics are written but never read to gate promotion (see gururmm docs/FEATURE_ROADMAP.md BUG-004); crash detection was dead code until the BUG-002 fix (branch fix/audit-2-remediation, unmerged). Promotion is currently 100% manual via channel columns. Treat this document as Phase-2 aspirational: passing or failing it does not reflect a working safe-rollout feature. Decision 2026-05-25 (Mike): keep the feature inert and labeled; automation deferred to a roadmap Phase-2 item that must be properly re-spec'd.
Safe Agent Rollout System
Date: 2026-05-25
Version: GuruRMM v0.6.41+
Tester: Mike Swanson
Prerequisites
Environment Setup
- SSH access to gururmm-build (172.16.3.30)
- Access to GuruRMM dashboard (https://rmm.azcomputerguru.com)
- JWT token for API testing
- At least 2 test agents (GURU-KALI, GURU-5070 recommended)
Pre-Test Verification
# On Saturn
ssh azcomputerguru@172.16.3.30
# 1. Verify migration 046 is applied
sudo -u postgres psql gururmm_production -c "\d update_rollouts"
sudo -u postgres psql gururmm_production -c "\d update_health_metrics"
sudo -u postgres psql gururmm_production -c "\d agent_update_events"
# 2. Verify server build is current
cd /opt/gururmm/server
git status # Should show Phase 4 code
cargo build --release --features production
# 3. Verify dashboard build is current
cd /opt/gururmm/dashboard
git status # Should show Phase 5 code
npm run build
# 4. Verify health monitor is running
sudo systemctl status gururmm-server
sudo journalctl -u gururmm-server -n 50 | grep "Health monitoring task spawned"
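The three \d checks above can be collapsed into a single pass/fail check. The following is a minimal sketch, not part of GuruRMM: missing_tables is a hypothetical helper name, and it assumes the tables live in the public schema.

```shell
# missing_tables: read existing table names (one per line) on stdin and
# print every required table that is absent. Prints nothing when all exist.
# The required list mirrors the three tables inspected in the steps above.
missing_tables() (
    required="update_rollouts update_health_metrics agent_update_events"
    existing=$(cat)
    for t in $required; do
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        printf '%s\n' "$existing" | grep -qxF "$t" || printf '%s\n' "$t"
    done
)

# Usage (assumption: tables are in the public schema):
#   sudo -u postgres psql gururmm_production -At \
#     -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" | missing_tables
```

Empty output means all three tables are present; any printed name is a table that still needs the migration.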
Test 1: Beta-First Build Workflow
Objective: Verify new builds default to beta channel and stable agents don't receive them.
Steps
- Trigger a test build
# On Saturn
cd /opt/gururmm
sudo ./build-linux.sh # Will auto-increment to next version
sudo ./build-windows.sh
- Verify .channel files created
cd /var/www/gururmm/downloads
ls -la *.channel | tail -10
# Expected: New version should have .channel files containing "beta"
VERSION=$(ls -t gururmm-agent-linux-amd64-*.tar.gz | head -1 | grep -oP '\d+\.\d+\.\d+')
cat gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Should output: beta
- Mark test agents as beta
# Via API or SQL
curl -X PATCH https://rmm.azcomputerguru.com/api/agents/GURU-KALI-UUID/channel \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"update_channel": "beta"}'
- Verify beta agent receives update
- Open dashboard → Agents → GURU-KALI
- Wait for agent connection (heartbeat every 60s)
- Check agent state for pending update
- Expected: Should see update_available = true
- Verify stable agent does NOT receive update
- Ensure GURU-5070 is on "stable" channel
- Check agent state
- Expected: update_available = false (version not in stable channel)
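The version and channel lookups in the steps above can be wrapped in two small helpers so later tests reuse them. This is a sketch under assumptions: artifact_version and artifact_channel are hypothetical names, and the file layout (a <artifact>.channel marker next to each artifact) is taken from the steps above.

```shell
# artifact_version: print the X.Y.Z version embedded in an artifact file name.
# Prints nothing if the name carries no version.
artifact_version() {
    # sed -n ... p prints only when the substitution matched
    printf '%s\n' "$1" |
        sed -n 's/.*-\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)\..*/\1/p'
}

# artifact_channel: print the channel stored in "<artifact>.channel".
# Returns 1 (and prints nothing) when the marker file does not exist.
artifact_channel() {
    [ -f "$1.channel" ] || return 1
    # Strip whitespace so "beta\n" and "beta" compare equal
    tr -d ' \t\r\n' < "$1.channel"
    echo
}

# Usage:
#   f=/var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.41.tar.gz
#   artifact_version "$f"
#   artifact_channel "$f" || echo "no channel marker"
```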
Success Criteria
- ✅ .channel files exist for new version
- ✅ .channel files contain "beta"
- ✅ Beta agents offered the update
- ✅ Stable agents NOT offered the update
- ✅ Scanner logs show beta/stable filtering
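The filtering these criteria expect can be written down as a small truth table. This is a sketch of the expected behavior, not the scanner's actual code. The rule that beta agents also receive stable builds is an assumption; the plan only states that stable agents must not receive beta builds and that files without a .channel marker are skipped. offered_to is a hypothetical name.

```shell
# offered_to BUILD_CHANNEL AGENT_CHANNEL
# Return 0 if a build on BUILD_CHANNEL should be offered to an agent
# subscribed to AGENT_CHANNEL, 1 otherwise.
# Assumed rule: stable agents get stable only; beta agents get beta and
# stable; an empty build channel (no marker file) is never offered.
offered_to() {
    case "$2:$1" in
        stable:stable | beta:beta | beta:stable) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   offered_to beta stable || echo "stable agent correctly skipped"
```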
Test 2: Health Monitoring & Crash Detection
Objective: Verify health monitor detects crashes and updates metrics.
Steps
- Clear existing health data (optional)
sudo -u postgres psql gururmm_production -c "DELETE FROM update_health_metrics WHERE version = '$VERSION';"
sudo -u postgres psql gururmm_production -c "DELETE FROM agent_update_events WHERE version_to = '$VERSION';"
- Simulate successful update
# On test agent (GURU-KALI)
# Let update complete normally
# Wait 5 minutes
- Check event logging
SELECT event_type, version_to, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
ORDER BY created_at DESC
LIMIT 5;
-- Expected events:
-- - update_dispatched
-- - download_started (if implemented)
-- - download_complete (if implemented)
-- - update_applied
- Check health metrics incremented
SELECT version, total_attempts, successful_updates, failed_updates, crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';
-- Expected:
-- total_attempts = 1
-- successful_updates = 1
-- health_status = 'unknown' (< 5 attempts)
- Simulate crash
# On test agent
# 1. Trigger update dispatch
# 2. Immediately after "update_applied" event, stop agent service
sudo systemctl stop gururmm-agent
# 3. Wait 60-90 seconds for health monitor scan
- Verify crash detection
SELECT event_type, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
AND event_type = 'crash_detected'
ORDER BY created_at DESC;
-- Expected: Should see crash_detected event
SELECT crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';
-- Expected: crash_count incremented, health_status may change
- Check server logs
sudo journalctl -u gururmm-server -n 100 | grep -E "crash|health"
# Expected: "Detected crash: agent X went offline after updating to Y"
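Several steps above say "wait 60-90 seconds" and then check by hand. A polling helper makes that wait explicit and bounded. This is a generic sketch, not GuruRMM tooling: poll_until is a hypothetical name, and the crash_logged example assumes the log line quoted above.

```shell
# poll_until TIMEOUT INTERVAL CMD [ARGS...]
# Run CMD every INTERVAL seconds until it succeeds (return 0) or more than
# TIMEOUT seconds have elapsed (return 1). INTERVAL must be at least 1.
poll_until() (
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        # Exit the subshell as soon as the check passes
        "$@" && exit 0
        sleep "$interval"
        elapsed=$(( $elapsed + $interval ))
    done
    exit 1
)

# Usage (assumes the server logs the "Detected crash" line shown above):
#   crash_logged() { sudo journalctl -u gururmm-server -n 100 | grep -q "Detected crash"; }
#   poll_until 90 10 crash_logged && echo "crash detected within 90s"
```

The function body runs in a subshell, so its variables do not leak into the calling shell.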
Success Criteria
- ✅ Events logged correctly (update_dispatched, update_applied)
- ✅ Health metrics incremented on success
- ✅ Crash detected within 90 seconds
- ✅ crash_detected event logged
- ✅ Crash counter incremented
- ✅ Health status updated based on thresholds
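To check "health status updated based on thresholds" against concrete numbers, the thresholds mentioned in this plan can be restated as a function. This is a sketch under stated assumptions, not the server's logic. The 5-attempt minimum and the 25% crash-rate cutoff come from this plan; computing crash rate as crash_count / total_attempts and the rule for 'warning' are assumptions, and classify_health is a hypothetical name. The 'blocked' status is set explicitly by rollback, so it is not derived here.

```shell
# classify_health TOTAL_ATTEMPTS FAILED_UPDATES CRASH_COUNT
# Print the health status expected for the given counters.
classify_health() (
    total=$1
    failed=$2
    crashes=$3
    if [ "$total" -lt 5 ]; then
        echo unknown                      # fewer than 5 attempts: not enough data
    elif [ $(( $crashes * 100 )) -gt $(( $total * 25 )) ]; then
        echo critical                     # crash rate strictly above 25%
    elif [ $(( $crashes + $failed )) -gt 0 ]; then
        echo warning                      # ASSUMPTION: any failure or crash below the cutoff
    else
        echo healthy
    fi
)

# Usage:
#   classify_health 5 0 0
```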
Test 3: Promotion Workflow
Objective: Verify promotion from beta to stable with health gates.
Steps
- Attempt promotion with insufficient data
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "linux", "arch": "amd64", "force": false}'
# Expected: May succeed (unknown status allows promotion) or fail if health check implemented
- Generate healthy metrics
# Simulate 5+ successful updates
# Option A: Manually insert via SQL (for testing)
# Option B: Trigger real updates on multiple beta agents
# SQL approach for testing:
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET total_attempts = 5,
successful_updates = 5,
failed_updates = 0,
crash_count = 0,
health_status = 'healthy'
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
EOF
- Verify health status
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'
# Expected: health.status = "healthy"
- Promote to stable
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "linux", "arch": "amd64", "force": false}'
# Expected: {"success": true, "message": "Promoted...", "files_updated": 2}
- Verify .channel files updated
cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: stable
- Verify database updated
SELECT channel, promoted_at, promoted_by
FROM update_rollouts
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
-- Expected: channel = 'stable', promoted_at = NOW(), promoted_by = user_id
- Verify stable agents receive update
- Ensure test agent is on "stable" channel
- Wait for scanner rescan (happens immediately after promotion)
- Check agent state
- Expected: update_available = true
- Test force promotion
# Set health to warning
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET health_status = 'warning'
WHERE version = '$VERSION' AND os = 'windows' AND arch = 'amd64';
EOF
# Try promotion without force
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "windows", "arch": "amd64", "force": false}'
# Expected: 403 error with message about health status
# Try with force flag
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "windows", "arch": "amd64", "force": true}'
# Expected: 200 success (overridden health check)
Success Criteria
- ✅ Promotion blocked for unhealthy versions (unless forced)
- ✅ Promotion succeeds for healthy versions
- ✅ .channel files updated from "beta" to "stable"
- ✅ Database rollouts table updated
- ✅ Scanner rescans immediately
- ✅ Stable agents receive update after promotion
- ✅ Force flag overrides health checks
- ✅ Dashboard shows updated channel
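The gate this test expects can be summarized as a predicate, which is handy for deciding the expected HTTP result before each curl call. This describes the behavior under test, not what the server currently does. Treating 'unknown' as promotable follows the note in the first step; refusing 'critical' and 'blocked' without force, and letting force override every status, are assumptions. promotion_allowed is a hypothetical name.

```shell
# promotion_allowed HEALTH_STATUS FORCE
# Return 0 if promotion is expected to succeed, 1 if it is expected to be
# refused (the plan expects a 403 in that case).
promotion_allowed() {
    # ASSUMPTION: force overrides any health status
    [ "$2" = "true" ] && return 0
    case "$1" in
        healthy | unknown) return 0 ;;
        *) return 1 ;;                    # warning, critical, blocked, anything else
    esac
}

# Usage:
#   promotion_allowed warning false || echo "expect 403"
```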
Test 4: Rollback Workflow
Objective: Verify rollback blocks version and force-downgrades agents.
Steps
- Prepare for rollback
# Ensure test agent is running $VERSION (the version being rolled back)
# Verify previous stable version exists
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq '.[] | select(.channel == "stable") | .version'
- Execute rollback
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/rollback \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"os": "linux",
"arch": "amd64",
"reason": "Test rollback: simulating critical bug in version '$VERSION'"
}'
# Expected: {"success": true, "agents_affected": 1, "downgrade_version": "0.6.40"}
- Verify .channel files removed
ls /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: File not found (removed)
- Verify health status blocked
SELECT health_status, last_incident
FROM update_health_metrics
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
-- Expected: health_status = 'blocked', last_incident = reason text
- Verify forced downgrade dispatched
# Check server logs for WebSocket dispatch
sudo journalctl -u gururmm-server -n 100 | grep -i "downgrade\|rollback"
# Check agent receives forced update
# Monitor agent logs for update trigger
- Verify agent downgrades
- Agent should receive UpdateAvailable message with previous version
- Agent should download and install previous version
- Check agent version after completion
- Expected: agent_version = previous stable version
- Verify blocked version not offered again
# Scanner should skip files without .channel files
# Verify version is not in available updates list
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'
# If present, should show channel = null or health.status = "blocked"
Success Criteria
- ✅ .channel files removed
- ✅ Health status set to "blocked"
- ✅ Last incident reason recorded
- ✅ Connected agents receive forced downgrade
- ✅ Agents successfully downgrade to previous stable
- ✅ Blocked version not offered to new agents
- ✅ Dashboard shows blocked status
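To predict the downgrade_version a rollback should report, the previous stable version can be computed from the stable list. This is a sketch, assuming the target is the highest stable version below the one being rolled back; the server may choose differently. previous_stable is a hypothetical name, and sort -V (version sort) requires GNU coreutils or an equivalent.

```shell
# previous_stable VERSION
# Read stable versions (one per line) on stdin and print the highest one
# that sorts below VERSION. Prints an empty line if there is none.
previous_stable() (
    target=$1
    # Add the target to the list, version-sort without duplicates,
    # then print whichever entry comes immediately before the target.
    { cat; printf '%s\n' "$target"; } |
        sort -uV |
        awk -v t="$target" '$0 == t { print prev; exit } { prev = $0 }'
)

# Usage (jq -r prints raw strings without quotes):
#   curl -s https://rmm.azcomputerguru.com/api/updates/rollouts \
#     -H "Authorization: Bearer $TOKEN" |
#     jq -r '.[] | select(.channel == "stable") | .version' |
#     previous_stable "$VERSION"
```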
Test 5: Dashboard UI Testing
Objective: Verify Updates page displays correctly and actions work.
Steps
- Access Updates page
- Navigate to https://rmm.azcomputerguru.com/updates
- Login if needed
- Verify data display
- Table shows all rollout versions
- Columns: Version, OS/Arch, Channel, Health, Success Rate, Agent Counts, Actions
- Health badges color-coded (green/yellow/red/gray)
- Success rate calculated correctly
- Agent counts accurate
- Test promote button
- Enabled for beta + healthy versions only
- Disabled with tooltip for unhealthy versions
- Click opens confirmation dialog
- Confirm triggers API call
- Success toast appears
- Table refreshes with updated data
- Test rollback button
- Always enabled
- Click opens dialog with reason input
- Reason field is required
- Confirm triggers API call
- Success toast shows agent count
- Table refreshes with updated data
- Test error handling
- Shows loading state during fetch
- Shows error message if API fails
- Retry button works
- Shows empty state if no rollouts
- Test auto-refresh
- Data refreshes every 30 seconds
- Refresh doesn't disrupt UI interactions
- Manual refresh button works
Success Criteria
- ✅ All table columns display correct data
- ✅ Health badges use correct colors
- ✅ Promote button only enabled for healthy beta versions
- ✅ Rollback button always enabled
- ✅ Confirmation dialogs work
- ✅ API calls succeed
- ✅ Toasts display success/error
- ✅ Auto-refresh works
- ✅ Responsive on mobile
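To check "success rate calculated correctly", a reference value can be computed from the raw counters and compared with the dashboard. This is a sketch that assumes the rate is successful_updates / total_attempts as a percentage; the dashboard's actual formula and rounding are not specified in this plan. success_rate is a hypothetical name.

```shell
# success_rate SUCCESSFUL_UPDATES TOTAL_ATTEMPTS
# Print the success rate as a percentage with one decimal place,
# or "n/a" when there have been no attempts.
success_rate() {
    awk -v s="$1" -v t="$2" 'BEGIN {
        if (t == 0) print "n/a"           # avoid division by zero
        else printf "%.1f\n", 100 * s / t
    }'
}

# Usage:
#   success_rate 3 4
```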
Test 6: Integration Testing
Objective: Test complete workflows end-to-end.
Workflow 1: New Build → Beta Testing → Promotion → Stable Deployment
- Trigger new build (auto-bumps version)
- Verify .channel files = "beta"
- Mark GURU-KALI as beta agent
- Wait for update dispatch
- Monitor update installation
- Verify success event logged
- Repeat 4 more times for healthy status
- Promote via dashboard
- Verify GURU-5070 (stable) receives update
- Monitor stable deployment
- Verify all agents updated
Expected: Beta testing prevents bad updates from reaching production.
Workflow 2: Critical Bug → Rollback → Fleet Downgrade
- Simulate critical bug discovered post-promotion
- Execute rollback via dashboard
- Verify all agents receive forced downgrade
- Verify agents revert to previous stable
- Verify new agents don't receive blocked version
- Verify health metrics show blocked status
Expected: Rollback protects fleet from bad updates.
Workflow 3: Crash Detection → Auto-Block (Future Enhancement)
- Deploy update to beta agents
- Simulate crash (stop service after update)
- Wait for health monitor (60s)
- Verify crash detected and logged
- Check if crash rate >25%
- Verify health status = "critical"
- Attempt promotion
- Verify promotion blocked
Expected: High crash rates prevent automatic promotion.
Performance Testing
Load Testing
- 100+ agents checking for updates simultaneously
- Scanner performance with 50+ versions
- Health monitor with 1000+ update events
- Dashboard with 20+ rollouts displayed
Stress Testing
- Rapid version releases (5 builds in 10 minutes)
- Mass rollback (100+ agents)
- Concurrent API calls (multiple users promoting/rolling back)
Security Testing
Authentication
- All API endpoints require valid JWT
- Expired tokens rejected
- Invalid tokens rejected
Authorization
- Admin role can promote/rollback
- Non-admin role blocked (if RBAC implemented)
Input Validation
- SQL injection attempts blocked
- XSS attempts in reason field sanitized
- Invalid version strings rejected
- Invalid OS/arch values rejected
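The checks above need a definition of "valid" to test against. The following sketch states one possible definition; it is an assumption, not the server's validation code. valid_version accepts only MAJOR.MINOR.PATCH digits, and valid_target accepts only the two OS/arch pairs that appear in this plan. Both names are hypothetical.

```shell
# valid_version STRING: return 0 only for digits.digits.digits
valid_version() {
    case "$1" in
        # Reject empty input and any character other than digits and dots
        # (this also rejects spaces, quotes, slashes, and newlines)
        "" | *[!0-9.]*) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eqx '[0-9]+\.[0-9]+\.[0-9]+'
}

# valid_target OS ARCH: return 0 only for pairs used in this plan
valid_target() {
    case "$1/$2" in
        linux/amd64 | windows/amd64) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: inputs that fail these checks should be rejected by the API
#   valid_version "0.6.41'; DROP TABLE agents; --" || echo "expect rejection"
```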
File System Security
- .channel files have correct permissions
- Path traversal attempts blocked
- Only authorized processes can modify .channel files
Regression Testing
Existing Functionality
- Agent registration still works
- Heartbeat processing unaffected
- Command execution unaffected
- Metrics collection unaffected
- Alert generation unaffected
- Policy enforcement unaffected
Database Performance
- No slow queries introduced
- Indexes used efficiently
- No lock contention
Documentation Verification
- API endpoints documented
- Database schema documented
- Dashboard user guide accurate
- Admin procedures documented
- Troubleshooting guide created
Sign-Off
Phase 6 Test Results
Tester: ___________________________
Date: ___________________________
Test 1 - Beta-First Workflow: ⬜ PASS ⬜ FAIL
Test 2 - Health Monitoring: ⬜ PASS ⬜ FAIL
Test 3 - Promotion: ⬜ PASS ⬜ FAIL
Test 4 - Rollback: ⬜ PASS ⬜ FAIL
Test 5 - Dashboard UI: ⬜ PASS ⬜ FAIL
Test 6 - Integration: ⬜ PASS ⬜ FAIL
Overall Status: ⬜ APPROVED FOR PRODUCTION ⬜ NEEDS FIXES
Notes:
Blockers/Issues:
Deployment Date: ___________________________