# Phase 6: End-to-End Testing Plan

> **[WARNING] STATUS — 2026-05-25 re-audit: the safe-rollout gating this plan tests is NOT live.** `update_rollouts` and `update_health_metrics` are written but never read to gate promotion (see gururmm `docs/FEATURE_ROADMAP.md` BUG-004); crash detection was dead code until the BUG-002 fix (branch `fix/audit-2-remediation`, unmerged). Promotion is currently 100% manual via channel columns. Treat this document as **Phase-2 aspirational** — passing/failing it does not reflect a working safe-rollout feature. Decision 2026-05-25 (Mike): keep the feature inert and labeled; automation deferred to a roadmap Phase-2 item that must be properly re-spec'd.

## Safe Agent Rollout System

**Date:** 2026-05-25
**Version:** GuruRMM v0.6.41+
**Tester:** Mike Swanson

---

## Prerequisites

### Environment Setup
- [ ] SSH access to gururmm-build (172.16.3.30)
- [ ] Access to GuruRMM dashboard (https://rmm.azcomputerguru.com)
- [ ] JWT token for API testing
- [ ] At least 2 test agents (GURU-KALI, GURU-5070 recommended)

### Pre-Test Verification
```bash
# On Saturn
ssh azcomputerguru@172.16.3.30

# 1. Verify migration 046 is applied
sudo -u postgres psql gururmm_production -c "\d update_rollouts"
sudo -u postgres psql gururmm_production -c "\d update_health_metrics"
sudo -u postgres psql gururmm_production -c "\d agent_update_events"

# 2. Verify server build is current
cd /opt/gururmm/server
git status  # Should show Phase 4 code
cargo build --release --features production

# 3. Verify dashboard build is current
cd /opt/gururmm/dashboard
git status  # Should show Phase 5 code
npm run build

# 4. Verify health monitor is running
sudo systemctl status gururmm-server
sudo journalctl -u gururmm-server -n 50 | grep "Health monitoring task spawned"
```

---

## Test 1: Beta-First Build Workflow

**Objective:** Verify that new builds default to the beta channel and that stable agents do not receive them.

### Steps

1. **Trigger a test build**
   ```bash
   # On Saturn
   cd /opt/gururmm
   sudo ./build-linux.sh  # Will auto-increment to next version
   sudo ./build-windows.sh
   ```

2. **Verify .channel files created**
   ```bash
   cd /var/www/gururmm/downloads
   ls -la *.channel | tail -10

   # Expected: New version should have .channel files containing "beta"
   VERSION=$(ls -t gururmm-agent-linux-amd64-*.tar.gz | head -1 | grep -oP '\d+\.\d+\.\d+')
   cat gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Should output: beta
   ```

3. **Mark test agents as beta**
   ```bash
   # Via API or SQL
   curl -X PATCH https://rmm.azcomputerguru.com/api/agents/GURU-KALI-UUID/channel \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"update_channel": "beta"}'
   ```

4. **Verify beta agent receives update**
   - Open dashboard → Agents → GURU-KALI
   - Wait for agent connection (heartbeat every 60s)
   - Check agent state for pending update
   - Expected: update_available = true

5. **Verify stable agent does NOT receive update**
   - Ensure GURU-5070 is on "stable" channel
   - Check agent state
   - Expected: update_available = false (version not in stable channel)

### Success Criteria
- ✅ .channel files exist for new version
- ✅ .channel files contain "beta"
- ✅ Beta agents offered the update
- ✅ Stable agents NOT offered the update
- ✅ Scanner logs show beta/stable filtering
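
The `.channel` checks in step 2 can be wrapped in a small helper so the same lookup is reusable across Tests 1, 3, and 4. This is a minimal sketch, assuming the sidecar file is always named `<artifact>.channel` as in the commands above; `channel_of` is a hypothetical helper name, not part of GuruRMM.

```bash
# channel_of ARTIFACT: print the channel recorded in ARTIFACT.channel,
# or "none" when the sidecar file does not exist (hypothetical helper).
channel_of() {
    sidecar="$1.channel"
    if [ -f "$sidecar" ]; then
        # Strip whitespace so a trailing newline does not break comparisons.
        tr -d '[:space:]' < "$sidecar"
        echo
    else
        echo "none"
    fi
}
```

For example, `channel_of "/var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz"` should print `beta` right after a build, `stable` after promotion, and `none` after a rollback removes the file.
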

---

## Test 2: Health Monitoring & Crash Detection

**Objective:** Verify that the health monitor detects crashes and updates metrics.

### Steps

1. **Clear existing health data (optional)**
   ```bash
   sudo -u postgres psql gururmm_production -c "DELETE FROM update_health_metrics WHERE version = '$VERSION';"
   sudo -u postgres psql gururmm_production -c "DELETE FROM agent_update_events WHERE version_to = '$VERSION';"
   ```

2. **Simulate successful update**
   ```bash
   # On test agent (GURU-KALI)
   # Let update complete normally
   # Wait 5 minutes
   ```

3. **Check event logging**
   ```sql
   -- $VERSION and GURU-KALI-UUID are placeholders; substitute real values when running in psql
   SELECT event_type, version_to, created_at
   FROM agent_update_events
   WHERE agent_id = 'GURU-KALI-UUID'
   ORDER BY created_at DESC
   LIMIT 5;

   -- Expected events:
   -- - update_dispatched
   -- - download_started (if implemented)
   -- - download_complete (if implemented)
   -- - update_applied
   ```

4. **Check health metrics incremented**
   ```sql
   SELECT version, total_attempts, successful_updates, failed_updates, crash_count, health_status
   FROM update_health_metrics
   WHERE version = '$VERSION';

   -- Expected:
   -- total_attempts = 1
   -- successful_updates = 1
   -- health_status = 'unknown' (< 5 attempts)
   ```

5. **Simulate crash**
   ```bash
   # On test agent
   # 1. Trigger update dispatch
   # 2. Immediately after "update_applied" event, stop agent service
   sudo systemctl stop gururmm-agent
   # 3. Wait 60-90 seconds for health monitor scan
   ```

6. **Verify crash detection**
   ```sql
   SELECT event_type, created_at
   FROM agent_update_events
   WHERE agent_id = 'GURU-KALI-UUID'
     AND event_type = 'crash_detected'
   ORDER BY created_at DESC;

   -- Expected: Should see crash_detected event

   SELECT crash_count, health_status
   FROM update_health_metrics
   WHERE version = '$VERSION';

   -- Expected: crash_count incremented, health_status may change
   ```

7. **Check server logs**
   ```bash
   sudo journalctl -u gururmm-server -n 100 | grep -E "crash|health"
   # Expected: "Detected crash: agent X went offline after updating to Y"
   ```

### Success Criteria
- ✅ Events logged correctly (update_dispatched, update_applied)
- ✅ Health metrics incremented on success
- ✅ Crash detected within 90 seconds
- ✅ crash_detected event logged
- ✅ Crash counter incremented
- ✅ Health status updated based on thresholds
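
The "crash detected within 90 seconds" criterion is easier to check with a polling loop than with a fixed sleep. The sketch below shows one way to do that: `wait_until` is a generic, hypothetical helper, and `has_crash_event` assumes an `AGENT_ID` shell variable holding the test agent's UUID. Per the status banner at the top of this plan, a timeout is the expected outcome on builds without the BUG-002 fix.

```bash
# wait_until TIMEOUT INTERVAL CMD...: rerun CMD every INTERVAL seconds until it
# succeeds (returns 0) or TIMEOUT seconds have elapsed (returns 1).
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}

# Succeeds once a crash_detected row exists for the agent.
# AGENT_ID is an assumed variable; it is not defined anywhere in this plan.
has_crash_event() {
    sudo -u postgres psql gururmm_production -tAc \
        "SELECT 1 FROM agent_update_events WHERE agent_id = '$AGENT_ID' AND event_type = 'crash_detected' LIMIT 1" \
        | grep -q 1
}
```

After stopping the agent in step 5, `wait_until 90 5 has_crash_event && echo detected || echo "timed out"` reports whether the event appeared inside the window.
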

---

## Test 3: Promotion Workflow

**Objective:** Verify promotion from beta to stable with health gates.

### Steps

1. **Attempt promotion with insufficient data**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "linux", "arch": "amd64", "force": false}'

   # Expected: May succeed (unknown status allows promotion) or fail if health check implemented
   ```

2. **Generate healthy metrics**
   ```bash
   # Simulate 5+ successful updates
   # Option A: Manually insert via SQL (for testing)
   # Option B: Trigger real updates on multiple beta agents

   # SQL approach for testing:
   sudo -u postgres psql gururmm_production << EOF
   UPDATE update_health_metrics
   SET total_attempts = 5,
       successful_updates = 5,
       failed_updates = 0,
       crash_count = 0,
       health_status = 'healthy'
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
   EOF
   ```

3. **Verify health status**
   ```bash
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

   # Expected: health.status = "healthy"
   ```

4. **Promote to stable**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "linux", "arch": "amd64", "force": false}'

   # Expected: {"success": true, "message": "Promoted...", "files_updated": 2}
   ```

5. **Verify .channel files updated**
   ```bash
   cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Expected: stable
   ```

6. **Verify database updated**
   ```sql
   SELECT channel, promoted_at, promoted_by
   FROM update_rollouts
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

   -- Expected: channel = 'stable', promoted_at = time of promotion, promoted_by = ID of the promoting user
   ```

7. **Verify stable agents receive update**
   - Ensure test agent is on "stable" channel
   - Wait for scanner rescan (happens immediately after promotion)
   - Check agent state
   - Expected: update_available = true

8. **Test force promotion**
   ```bash
   # Set health to warning
   sudo -u postgres psql gururmm_production << EOF
   UPDATE update_health_metrics
   SET health_status = 'warning'
   WHERE version = '$VERSION' AND os = 'windows' AND arch = 'amd64';
   EOF

   # Try promotion without force
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "windows", "arch": "amd64", "force": false}'

   # Expected: 403 error with message about health status

   # Try with force flag
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "windows", "arch": "amd64", "force": true}'

   # Expected: 200 success (overridden health check)
   ```

### Success Criteria
- ✅ Promotion blocked for unhealthy versions (unless forced)
- ✅ Promotion succeeds for healthy versions
- ✅ .channel files updated from "beta" to "stable"
- ✅ Database rollouts table updated
- ✅ Scanner rescans immediately
- ✅ Stable agents receive update after promotion
- ✅ Force flag overrides health checks
- ✅ Dashboard shows updated channel
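
The promote calls above repeat the same JSON body with different `os`, `arch`, and `force` values, which makes typos easy. A possible way to reduce that is a small payload builder; `promote_payload` is a hypothetical helper, and the allow-listed values are assumptions taken from the examples in this plan rather than from the server's actual validation rules.

```bash
# promote_payload OS ARCH FORCE: print the JSON body for the promote endpoint.
# Inputs are checked against a small allow-list, so the printf below never has
# to escape anything. The allowed values are assumptions drawn from this plan.
promote_payload() {
    case "$1" in
        linux|windows) ;;
        *) echo "bad os: $1" >&2; return 1 ;;
    esac
    case "$2" in
        amd64) ;;
        *) echo "bad arch: $2" >&2; return 1 ;;
    esac
    case "$3" in
        true|false) ;;
        *) echo "bad force: $3" >&2; return 1 ;;
    esac
    printf '{"os": "%s", "arch": "%s", "force": %s}\n' "$1" "$2" "$3"
}
```

It slots into the existing commands as `-d "$(promote_payload windows amd64 true)"`, leaving the URL and headers unchanged.
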

---

## Test 4: Rollback Workflow

**Objective:** Verify that rollback blocks the version and force-downgrades agents.

### Steps

1. **Prepare for rollback**
   ```bash
   # Ensure test agent is running the version being rolled back ($VERSION)
   # Verify previous stable version exists
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq '.[] | select(.channel == "stable") | .version'
   ```

2. **Execute rollback**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/rollback \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "os": "linux",
       "arch": "amd64",
       "reason": "Test rollback: simulating critical bug in version '"$VERSION"'"
     }'

   # Expected: {"success": true, "agents_affected": 1, "downgrade_version": "0.6.40"}
   ```

3. **Verify .channel files removed**
   ```bash
   ls /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Expected: File not found (removed)
   ```

4. **Verify health status blocked**
   ```sql
   SELECT health_status, last_incident
   FROM update_health_metrics
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

   -- Expected: health_status = 'blocked', last_incident = reason text
   ```

5. **Verify forced downgrade dispatched**
   ```bash
   # Check server logs for WebSocket dispatch
   sudo journalctl -u gururmm-server -n 100 | grep -iE "downgrade|rollback"

   # Check agent receives forced update
   # Monitor agent logs for update trigger
   ```

6. **Verify agent downgrades**
   - Agent should receive UpdateAvailable message with previous version
   - Agent should download and install previous version
   - Check agent version after completion
   - Expected: agent_version = previous stable version

7. **Verify blocked version not offered again**
   ```bash
   # Scanner should skip files without .channel files
   # Verify version is not in available updates list
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

   # If present, should show channel = null or health.status = "blocked"
   ```

### Success Criteria
- ✅ .channel files removed
- ✅ Health status set to "blocked"
- ✅ Last incident reason recorded
- ✅ Connected agents receive forced downgrade
- ✅ Agents successfully downgrade to previous stable
- ✅ Blocked version not offered to new agents
- ✅ Dashboard shows blocked status
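
To cross-check the `downgrade_version` returned in step 2, the tester can work out the expected value independently from the stable versions listed in step 1. The sketch below assumes the downgrade target is the highest stable version below the one being rolled back; this plan does not state how the server actually picks it, so treat `previous_stable` as a hypothetical checking aid. It relies on `sort -V` (GNU coreutils version sort).

```bash
# previous_stable VERSION: read stable versions from stdin (one per line, each
# newline-terminated) and print the highest one that sorts below VERSION.
# Prints nothing when no lower version exists.
previous_stable() {
    { cat; printf '%s\n' "$1"; } | sort -Vu |
        awk -v v="$1" '$0 == v { if (prev != "") print prev; exit } { prev = $0 }'
}
```

Piping the step 1 output through `jq -r` and then into `previous_stable "$VERSION"` gives a value to compare against the API response.
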

---

## Test 5: Dashboard UI Testing

**Objective:** Verify that the Updates page displays correctly and its actions work.

### Steps

1. **Access Updates page**
   - Navigate to https://rmm.azcomputerguru.com/updates
   - Login if needed

2. **Verify data display**
   - [ ] Table shows all rollout versions
   - [ ] Columns: Version, OS/Arch, Channel, Health, Success Rate, Agent Counts, Actions
   - [ ] Health badges color-coded (green/yellow/red/gray)
   - [ ] Success rate calculated correctly
   - [ ] Agent counts accurate

3. **Test promote button**
   - [ ] Enabled for beta + healthy versions only
   - [ ] Disabled with tooltip for unhealthy versions
   - [ ] Click opens confirmation dialog
   - [ ] Confirm triggers API call
   - [ ] Success toast appears
   - [ ] Table refreshes with updated data

4. **Test rollback button**
   - [ ] Always enabled
   - [ ] Click opens dialog with reason input
   - [ ] Reason field is required
   - [ ] Confirm triggers API call
   - [ ] Success toast shows agent count
   - [ ] Table refreshes with updated data

5. **Test error handling**
   - [ ] Shows loading state during fetch
   - [ ] Shows error message if API fails
   - [ ] Retry button works
   - [ ] Shows empty state if no rollouts

6. **Test auto-refresh**
   - [ ] Data refreshes every 30 seconds
   - [ ] Refresh doesn't disrupt UI interactions
   - [ ] Manual refresh button works

### Success Criteria
- ✅ All table columns display correct data
- ✅ Health badges use correct colors
- ✅ Promote button only enabled for healthy beta versions
- ✅ Rollback button always enabled
- ✅ Confirmation dialogs work
- ✅ API calls succeed
- ✅ Toasts display success/error
- ✅ Auto-refresh works
- ✅ Responsive on mobile
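
For the "Success rate calculated correctly" check in step 2, the expected figure can be computed from the counters in `update_health_metrics` and compared with what the dashboard shows. This is a sketch under an assumption: it treats success rate as `successful_updates / total_attempts`, expressed as a percentage with one decimal place. The dashboard's actual formula and rounding are not specified in this plan, and `success_rate` is a hypothetical helper.

```bash
# success_rate SUCCESSFUL TOTAL: print the success percentage with one decimal,
# or "n/a" when there have been no attempts (avoids dividing by zero).
success_rate() {
    awk -v ok="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * ok / total
    }'
}
```

For example, `success_rate 4 5` prints `80.0`, and a version with no recorded attempts yields `n/a`.
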

---

## Test 6: Integration Testing

**Objective:** Test complete workflows end-to-end.

### Workflow 1: New Build → Beta Testing → Promotion → Stable Deployment

1. Trigger new build (auto-bumps version)
2. Verify .channel files = "beta"
3. Mark GURU-KALI as beta agent
4. Wait for update dispatch
5. Monitor update installation
6. Verify success event logged
7. Repeat 4 more times for healthy status
8. Promote via dashboard
9. Verify GURU-5070 (stable) receives update
10. Monitor stable deployment
11. Verify all agents updated

**Expected:** Beta testing prevents bad updates from reaching production.

### Workflow 2: Critical Bug → Rollback → Fleet Downgrade

1. Simulate critical bug discovered post-promotion
2. Execute rollback via dashboard
3. Verify all agents receive forced downgrade
4. Verify agents revert to previous stable
5. Verify new agents don't receive blocked version
6. Verify health metrics show blocked status

**Expected:** Rollback protects fleet from bad updates.

### Workflow 3: Crash Detection → Auto-Block (Future Enhancement)

1. Deploy update to beta agents
2. Simulate crash (stop service after update)
3. Wait for health monitor (60s)
4. Verify crash detected and logged
5. Check if crash rate >25%
6. Verify health status = "critical"
7. Attempt promotion
8. Verify promotion blocked

**Expected:** High crash rates prevent automatic promotion.
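
Steps 5 and 6 of Workflow 3 can be checked by hand with the two thresholds this plan mentions: fewer than 5 attempts means `unknown`, and a crash rate above 25% means `critical`. The sketch below encodes only those two rules and assumes crash rate is `crash_count / total_attempts`. The boundaries for `healthy` versus `warning` are not given here, so everything else is reported as `not-critical`. `crash_gate` is a hypothetical tester-side helper, not the server's logic.

```bash
# crash_gate TOTAL_ATTEMPTS CRASH_COUNT: classify a version using only the two
# thresholds stated in this plan. Integer math avoids floating point:
# crashes/attempts > 25%  <=>  crashes*100 > attempts*25.
crash_gate() {
    attempts=$1
    crashes=$2
    if [ "$attempts" -lt 5 ]; then
        echo "unknown"
    elif [ $((crashes * 100)) -gt $((attempts * 25)) ]; then
        echo "critical"
    else
        echo "not-critical"
    fi
}
```

Note that exactly 25% (for example 2 crashes in 8 attempts) does not count as critical under the ">25%" wording.
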

---

## Performance Testing

### Load Testing
- [ ] 100+ agents checking for updates simultaneously
- [ ] Scanner performance with 50+ versions
- [ ] Health monitor with 1000+ update events
- [ ] Dashboard with 20+ rollouts displayed

### Stress Testing
- [ ] Rapid version releases (5 builds in 10 minutes)
- [ ] Mass rollback (100+ agents)
- [ ] Concurrent API calls (multiple users promoting/rolling back)

---

## Security Testing

### Authentication
- [ ] All API endpoints require valid JWT
- [ ] Expired tokens rejected
- [ ] Invalid tokens rejected

### Authorization
- [ ] Admin role can promote/rollback
- [ ] Non-admin role blocked (if RBAC implemented)

### Input Validation
- [ ] SQL injection attempts blocked
- [ ] XSS attempts in reason field sanitized
- [ ] Invalid version strings rejected
- [ ] Invalid OS/arch values rejected
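
When probing the "invalid version strings rejected" item, it helps to have a local reference for what counts as valid. The sketch below assumes versions are strictly `MAJOR.MINOR.PATCH` with digits only, matching the `\d+\.\d+\.\d+` pattern used in Test 1; the server may accept other forms, and `valid_version` is a hypothetical helper.

```bash
# valid_version STRING: succeed only for strings of the form N.N.N (digits only).
valid_version() {
    # Reject anything containing characters other than digits and dots first;
    # this also rules out embedded newlines, quotes, and path separators.
    case "$1" in
        *[!0-9.]*) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}
```

Strings that fail this check, such as `0.6`, `../0.6.41`, or `0.6.41; DROP TABLE x`, make reasonable negative test inputs for the promote and rollback endpoints.
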

### File System Security
- [ ] .channel files have correct permissions
- [ ] Path traversal attempts blocked
- [ ] Only authorized processes can modify .channel files

---

## Regression Testing

### Existing Functionality
- [ ] Agent registration still works
- [ ] Heartbeat processing unaffected
- [ ] Command execution unaffected
- [ ] Metrics collection unaffected
- [ ] Alert generation unaffected
- [ ] Policy enforcement unaffected

### Database Performance
- [ ] No slow queries introduced
- [ ] Indexes used efficiently
- [ ] No lock contention

---

## Documentation Verification

- [ ] API endpoints documented
- [ ] Database schema documented
- [ ] Dashboard user guide accurate
- [ ] Admin procedures documented
- [ ] Troubleshooting guide created

---

## Sign-Off

### Phase 6 Test Results

**Tester:** ___________________________
**Date:** ___________________________

**Test 1 - Beta-First Workflow:** ⬜ PASS ⬜ FAIL
**Test 2 - Health Monitoring:** ⬜ PASS ⬜ FAIL
**Test 3 - Promotion:** ⬜ PASS ⬜ FAIL
**Test 4 - Rollback:** ⬜ PASS ⬜ FAIL
**Test 5 - Dashboard UI:** ⬜ PASS ⬜ FAIL
**Test 6 - Integration:** ⬜ PASS ⬜ FAIL

**Overall Status:** ⬜ APPROVED FOR PRODUCTION ⬜ NEEDS FIXES

**Notes:**
```


```

**Blockers/Issues:**
```


```

**Deployment Date:** ___________________________