test: Add Phase 6 testing plan and verification script
Created comprehensive test plan covering:
- Test 1: Beta-first build workflow
- Test 2: Health monitoring and crash detection
- Test 3: Promotion workflow with health gates
- Test 4: Rollback and force-downgrade
- Test 5: Dashboard UI testing
- Test 6: End-to-end integration testing

Added verification script for Saturn:
- Checks all 5 phases implementation
- Verifies database tables, source files, routes
- Validates build artifacts and service status
- Provides clear pass/fail status

Usage: Run verify-rollout-system.sh on Saturn before Phase 6 testing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
571
PHASE_6_TEST_PLAN.md
Normal file
@@ -0,0 +1,571 @@
# Phase 6: End-to-End Testing Plan
## Safe Agent Rollout System

**Date:** 2026-05-25
**Version:** GuruRMM v0.6.41+
**Tester:** Mike Swanson

---

## Prerequisites

### Environment Setup
- [ ] SSH access to Saturn (172.16.3.30)
- [ ] Access to GuruRMM dashboard (https://rmm.azcomputerguru.com)
- [ ] JWT token for API testing (the curl examples in this plan read it from `$TOKEN`)
- [ ] At least 2 test agents (GURU-KALI, GURU-5070 recommended)

### Pre-Test Verification
```bash
# On Saturn
ssh azcomputerguru@172.16.3.30

# 1. Verify migration 046 is applied
sudo -u postgres psql gururmm_production -c "\d update_rollouts"
sudo -u postgres psql gururmm_production -c "\d update_health_metrics"
sudo -u postgres psql gururmm_production -c "\d agent_update_events"

# 2. Verify server build is current
cd /opt/gururmm/server
git status  # Working tree should be clean and include the Phase 4 code
cargo build --release --features production

# 3. Verify dashboard build is current
cd /opt/gururmm/dashboard
git status  # Working tree should be clean and include the Phase 5 code
npm run build

# 4. Verify health monitor is running
sudo systemctl status gururmm-server
sudo journalctl -u gururmm-server -n 50 | grep "Health monitoring task spawned"
```
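
The three `\d` checks above can be collapsed into one pass/fail check. The following is a minimal sketch, not part of the verification script: `missing_tables` is a hypothetical helper, and the usage line assumes the tables live in the `public` schema.

```bash
# Tables that migration 046 is expected to create (names taken from the checks above).
REQUIRED_TABLES="update_rollouts update_health_metrics agent_update_events"

# Reads existing table names on stdin (one per line) and prints each
# required table that is absent. Prints nothing when all are present.
missing_tables() {
  local existing table
  existing=$(cat)
  for table in $REQUIRED_TABLES; do
    printf '%s\n' "$existing" | grep -qxF "$table" || printf '%s\n' "$table"
  done
}

# Hypothetical usage on Saturn (assumes the public schema):
#   sudo -u postgres psql gururmm_production -Atc \
#     "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" | missing_tables
```

An empty result means all three tables exist; any printed name points at a table the migration did not create.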

---

## Test 1: Beta-First Build Workflow

**Objective:** Verify new builds default to beta channel and stable agents don't receive them.

### Steps

1. **Trigger a test build**
   ```bash
   # On Saturn
   cd /opt/gururmm
   sudo ./build-linux.sh  # Will auto-increment to next version
   sudo ./build-windows.sh
   ```

2. **Verify .channel files created**
   ```bash
   cd /var/www/gururmm/downloads
   ls -la *.channel | tail -10

   # Expected: New version should have .channel files containing "beta"
   VERSION=$(ls -t gururmm-agent-linux-amd64-*.tar.gz | head -1 | grep -oP '\d+\.\d+\.\d+')
   cat gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Should output: beta
   ```

3. **Mark test agents as beta**
   ```bash
   # Via API or SQL
   curl -X PATCH https://rmm.azcomputerguru.com/api/agents/GURU-KALI-UUID/channel \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"update_channel": "beta"}'
   ```

4. **Verify beta agent receives update**
   - Open dashboard → Agents → GURU-KALI
   - Wait for agent connection (heartbeat every 60s)
   - Check agent state for pending update
   - Expected: Should see update_available = true

5. **Verify stable agent does NOT receive update**
   - Ensure GURU-5070 is on "stable" channel
   - Check agent state
   - Expected: update_available = false (version not in stable channel)

### Success Criteria
- ✅ .channel files exist for new version
- ✅ .channel files contain "beta"
- ✅ Beta agents offered the update
- ✅ Stable agents NOT offered the update
- ✅ Scanner logs show beta/stable filtering
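
The first two criteria can be checked mechanically. The sketch below assumes only what step 2 shows: each artifact has a sidecar file named `<artifact>.channel` holding a single word. `channel_of` and `assert_channel` are hypothetical helper names, not part of the build scripts.

```bash
# Prints the channel recorded for an artifact, or "none" when the
# artifact has no .channel sidecar file.
channel_of() {
  local artifact="$1"
  if [ -f "${artifact}.channel" ]; then
    tr -d '[:space:]' < "${artifact}.channel"
    echo
  else
    echo "none"
  fi
}

# Succeeds only if the artifact's channel matches the expected value.
assert_channel() {
  local artifact="$1" expected="$2" actual
  actual=$(channel_of "$artifact")
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $artifact is on $expected"
  else
    echo "FAIL: $artifact is on $actual, expected $expected" >&2
    return 1
  fi
}

# Hypothetical usage after a build:
#   cd /var/www/gururmm/downloads
#   assert_channel "gururmm-agent-linux-amd64-${VERSION}.tar.gz" beta
```

The same helper can be reused after promotion (expecting `stable`) and after rollback (expecting `none`).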

---

## Test 2: Health Monitoring & Crash Detection

**Objective:** Verify health monitor detects crashes and updates metrics.

### Steps

1. **Clear existing health data (optional)**
   ```bash
   sudo -u postgres psql gururmm_production -c "DELETE FROM update_health_metrics WHERE version = '$VERSION';"
   sudo -u postgres psql gururmm_production -c "DELETE FROM agent_update_events WHERE version_to = '$VERSION';"
   ```

2. **Simulate successful update**
   ```bash
   # On test agent (GURU-KALI)
   # Let update complete normally
   # Wait 5 minutes
   ```

3. **Check event logging**
   ```sql
   SELECT event_type, version_to, created_at
   FROM agent_update_events
   WHERE agent_id = 'GURU-KALI-UUID'
   ORDER BY created_at DESC
   LIMIT 5;

   -- Expected events:
   -- - update_dispatched
   -- - download_started (if implemented)
   -- - download_complete (if implemented)
   -- - update_applied
   ```

4. **Check health metrics incremented**
   ```sql
   -- In an interactive psql session, replace $VERSION with the literal version string
   SELECT version, total_attempts, successful_updates, failed_updates, crash_count, health_status
   FROM update_health_metrics
   WHERE version = '$VERSION';

   -- Expected:
   -- total_attempts = 1
   -- successful_updates = 1
   -- health_status = 'unknown' (< 5 attempts)
   ```

5. **Simulate crash**
   ```bash
   # On test agent
   # 1. Trigger update dispatch
   # 2. Immediately after "update_applied" event, stop agent service
   sudo systemctl stop gururmm-agent
   # 3. Wait 60-90 seconds for health monitor scan
   ```

6. **Verify crash detection**
   ```sql
   SELECT event_type, created_at
   FROM agent_update_events
   WHERE agent_id = 'GURU-KALI-UUID'
     AND event_type = 'crash_detected'
   ORDER BY created_at DESC;

   -- Expected: Should see crash_detected event

   SELECT crash_count, health_status
   FROM update_health_metrics
   WHERE version = '$VERSION';

   -- Expected: crash_count incremented, health_status may change
   ```

7. **Check server logs**
   ```bash
   sudo journalctl -u gururmm-server -n 100 | grep -E "crash|health"
   # Expected: "Detected crash: agent X went offline after updating to Y"
   ```

### Success Criteria
- ✅ Events logged correctly (update_dispatched, update_applied)
- ✅ Health metrics incremented on success
- ✅ Crash detected within 90 seconds
- ✅ crash_detected event logged
- ✅ Crash counter incremented
- ✅ Health status updated based on thresholds
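
To predict which `health_status` to expect after a run, the two thresholds this plan states can be written down as a small function. This is a sketch of the expected outcome, not the server's logic: it assumes the crash rate is `crash_count / total_attempts`, and it deliberately does not distinguish "healthy" from "warning" because the plan gives no threshold for that. `classify_health` is a hypothetical name.

```bash
# Classifies a version from raw counters using only the two rules the
# plan states; anything else is reported as "not-critical".
classify_health() {
  local attempts="$1" crashes="$2"
  if [ "$attempts" -lt 5 ]; then
    echo "unknown"        # fewer than 5 attempts: not enough data
  elif [ $((crashes * 100)) -gt $((attempts * 25)) ]; then
    echo "critical"       # crash rate above 25% (assumed: crashes / attempts)
  else
    echo "not-critical"   # healthy or warning; that threshold is not stated
  fi
}

# Example: 8 attempts with 3 crashes is a 37.5% crash rate.
#   classify_health 8 3
```

Comparing this prediction with the `health_status` column after step 6 gives a concrete pass/fail for the last criterion.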

---

## Test 3: Promotion Workflow

**Objective:** Verify promotion from beta to stable with health gates.

### Steps

1. **Attempt promotion with insufficient data**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "linux", "arch": "amd64", "force": false}'

   # Expected: May succeed (unknown status allows promotion) or fail if the health check rejects versions with too little data
   ```

2. **Generate healthy metrics**
   ```bash
   # Simulate 5+ successful updates
   # Option A: Manually insert via SQL (for testing)
   # Option B: Trigger real updates on multiple beta agents

   # SQL approach for testing:
   sudo -u postgres psql gururmm_production << EOF
   UPDATE update_health_metrics
   SET total_attempts = 5,
       successful_updates = 5,
       failed_updates = 0,
       crash_count = 0,
       health_status = 'healthy'
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
   EOF
   ```

3. **Verify health status**
   ```bash
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

   # Expected: health.status = "healthy"
   ```

4. **Promote to stable**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "linux", "arch": "amd64", "force": false}'

   # Expected: {"success": true, "message": "Promoted...", "files_updated": 2}
   ```

5. **Verify .channel files updated**
   ```bash
   cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Expected: stable
   ```

6. **Verify database updated**
   ```sql
   -- In an interactive psql session, replace $VERSION with the literal version string
   SELECT channel, promoted_at, promoted_by
   FROM update_rollouts
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

   -- Expected: channel = 'stable', promoted_at set to the time of promotion, promoted_by = user_id
   ```

7. **Verify stable agents receive update**
   - Ensure test agent is on "stable" channel
   - Wait for scanner rescan (happens immediately after promotion)
   - Check agent state
   - Expected: update_available = true

8. **Test force promotion**
   ```bash
   # Set health to warning
   sudo -u postgres psql gururmm_production << EOF
   UPDATE update_health_metrics
   SET health_status = 'warning'
   WHERE version = '$VERSION' AND os = 'windows' AND arch = 'amd64';
   EOF

   # Try promotion without force
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "windows", "arch": "amd64", "force": false}'

   # Expected: 403 error with message about health status

   # Try with force flag
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"os": "windows", "arch": "amd64", "force": true}'

   # Expected: 200 success (overridden health check)
   ```

### Success Criteria
- ✅ Promotion blocked for unhealthy versions (unless forced)
- ✅ Promotion succeeds for healthy versions
- ✅ .channel files updated from "beta" to "stable"
- ✅ Database rollouts table updated
- ✅ Scanner rescans immediately
- ✅ Stable agents receive update after promotion
- ✅ Force flag overrides health checks
- ✅ Dashboard shows updated channel
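
The health gate exercised in steps 1, 4 and 8 amounts to a small decision table, sketched below as the HTTP status a tester might expect. Only three rows come from this plan (healthy succeeds, warning without force returns 403, force returns 200). Treating "unknown" as allowed and "critical"/"blocked" as rejected are assumptions, and `expected_promote_status` is a hypothetical name.

```bash
# Prints the HTTP status the promote endpoint is expected to return
# for a given health status and force flag.
expected_promote_status() {
  local status="$1" force="$2"
  if [ "$force" = "true" ]; then
    echo 200              # force overrides the health check (step 8)
    return 0
  fi
  case "$status" in
    healthy) echo 200 ;;  # step 4
    unknown) echo 200 ;;  # assumption: step 1 says this "may succeed"
    *)       echo 403 ;;  # warning per step 8; critical/blocked assumed
  esac
}

# The actual status can be captured with curl's --write-out option:
#   curl -s -o /dev/null -w '%{http_code}' -X POST ... /promote
```

If the server turns out to reject "unknown" versions, only the one assumed row needs to change.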

---

## Test 4: Rollback Workflow

**Objective:** Verify rollback blocks version and force-downgrades agents.

### Steps

1. **Prepare for rollback**
   ```bash
   # Ensure the test agent is running the version that will be rolled back ($VERSION)
   # Verify previous stable version exists
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq '.[] | select(.channel == "stable") | .version'
   ```

2. **Execute rollback**
   ```bash
   curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/rollback \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "os": "linux",
       "arch": "amd64",
       "reason": "Test rollback: simulating critical bug in version '"$VERSION"'"
     }'

   # Expected: {"success": true, "agents_affected": 1, "downgrade_version": "0.6.40"}
   ```

3. **Verify .channel files removed**
   ```bash
   ls /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
   # Expected: File not found (removed)
   ```

4. **Verify health status blocked**
   ```sql
   -- In an interactive psql session, replace $VERSION with the literal version string
   SELECT health_status, last_incident
   FROM update_health_metrics
   WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

   -- Expected: health_status = 'blocked', last_incident = reason text
   ```

5. **Verify forced downgrade dispatched**
   ```bash
   # Check server logs for WebSocket dispatch
   sudo journalctl -u gururmm-server -n 100 | grep -iE "downgrade|rollback"

   # Check agent receives forced update
   # Monitor agent logs for update trigger
   ```

6. **Verify agent downgrades**
   - Agent should receive UpdateAvailable message with previous version
   - Agent should download and install previous version
   - Check agent version after completion
   - Expected: agent_version = previous stable version

7. **Verify blocked version not offered again**
   ```bash
   # Scanner should skip files without .channel files
   # Verify version is not in available updates list
   curl https://rmm.azcomputerguru.com/api/updates/rollouts \
     -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

   # If present, should show channel = null or health.status = "blocked"
   ```

### Success Criteria
- ✅ .channel files removed
- ✅ Health status set to "blocked"
- ✅ Last incident reason recorded
- ✅ Connected agents receive forced downgrade
- ✅ Agents successfully downgrade to previous stable
- ✅ Blocked version not offered to new agents
- ✅ Dashboard shows blocked status
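
Before running the rollback it helps to know which `downgrade_version` to expect in the response. The plan does not say how the server picks it; the sketch below assumes it is the highest stable version below the one being rolled back. `previous_version` is a hypothetical helper and relies on `sort -V` (version sort, available in GNU coreutils).

```bash
# Reads version strings on stdin (one per line) and prints the highest
# one that sorts below the target. Prints nothing if there is none.
previous_version() {
  local target="$1"
  { cat; printf '%s\n' "$target"; } | sort -V -u |
    awk -v t="$target" '
      ($0 "") == (t "") { if (prev != "") print prev; exit }
      { prev = $0 }'
}

# Hypothetical usage with the stable versions from step 1:
#   curl -s https://rmm.azcomputerguru.com/api/updates/rollouts \
#     -H "Authorization: Bearer $TOKEN" \
#     | jq -r '.[] | select(.channel == "stable") | .version' \
#     | previous_version "$VERSION"
```

Version sort matters here: a plain lexical sort would place 0.6.10 before 0.6.9 and report the wrong predecessor.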

---

## Test 5: Dashboard UI Testing

**Objective:** Verify Updates page displays correctly and actions work.

### Steps

1. **Access Updates page**
   - Navigate to https://rmm.azcomputerguru.com/updates
   - Log in if needed

2. **Verify data display**
   - [ ] Table shows all rollout versions
   - [ ] Columns: Version, OS/Arch, Channel, Health, Success Rate, Agent Counts, Actions
   - [ ] Health badges color-coded (green/yellow/red/gray)
   - [ ] Success rate calculated correctly
   - [ ] Agent counts accurate

3. **Test promote button**
   - [ ] Enabled for beta + healthy versions only
   - [ ] Disabled with tooltip for unhealthy versions
   - [ ] Click opens confirmation dialog
   - [ ] Confirm triggers API call
   - [ ] Success toast appears
   - [ ] Table refreshes with updated data

4. **Test rollback button**
   - [ ] Always enabled
   - [ ] Click opens dialog with reason input
   - [ ] Reason field is required
   - [ ] Confirm triggers API call
   - [ ] Success toast shows agent count
   - [ ] Table refreshes with updated data

5. **Test error handling**
   - [ ] Shows loading state during fetch
   - [ ] Shows error message if API fails
   - [ ] Retry button works
   - [ ] Shows empty state if no rollouts

6. **Test auto-refresh**
   - [ ] Data refreshes every 30 seconds
   - [ ] Refresh doesn't disrupt UI interactions
   - [ ] Manual refresh button works

### Success Criteria
- ✅ All table columns display correct data
- ✅ Health badges use correct colors
- ✅ Promote button only enabled for healthy beta versions
- ✅ Rollback button always enabled
- ✅ Confirmation dialogs work
- ✅ API calls succeed
- ✅ Toasts display success/error
- ✅ Auto-refresh works
- ✅ Responsive on mobile
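
To check the "Success rate calculated correctly" item, the displayed value needs something to be compared against. The sketch below assumes the dashboard shows `successful_updates / total_attempts` as a percentage; the plan does not define the formula or the rounding, so both are assumptions, and `success_rate` is a hypothetical name.

```bash
# Prints successful/total as a percentage with one decimal place,
# or "n/a" when there have been no attempts.
success_rate() {
  awk -v s="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "n/a" } else { printf "%.1f\n", s * 100 / t }
  }'
}

# Example: 4 successful updates out of 5 attempts.
#   success_rate 4 5
```

The two inputs come straight from the `update_health_metrics` query in Test 2, step 4.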

---

## Test 6: Integration Testing

**Objective:** Test complete workflows end-to-end.

### Workflow 1: New Build → Beta Testing → Promotion → Stable Deployment

1. Trigger new build (auto-bumps version)
2. Verify .channel files = "beta"
3. Mark GURU-KALI as beta agent
4. Wait for update dispatch
5. Monitor update installation
6. Verify success event logged
7. Repeat 4 more times (5 successful updates in total) for healthy status
8. Promote via dashboard
9. Verify GURU-5070 (stable) receives update
10. Monitor stable deployment
11. Verify all agents updated

**Expected:** Beta testing prevents bad updates from reaching production.

### Workflow 2: Critical Bug → Rollback → Fleet Downgrade

1. Simulate critical bug discovered post-promotion
2. Execute rollback via dashboard
3. Verify all agents receive forced downgrade
4. Verify agents revert to previous stable
5. Verify new agents don't receive blocked version
6. Verify health metrics show blocked status

**Expected:** Rollback protects fleet from bad updates.

### Workflow 3: Crash Detection → Auto-Block (Future Enhancement)

1. Deploy update to beta agents
2. Simulate crash (stop service after update)
3. Wait for health monitor (60s)
4. Verify crash detected and logged
5. Check if crash rate >25%
6. Verify health status = "critical"
7. Attempt promotion
8. Verify promotion blocked

**Expected:** High crash rates prevent automatic promotion.
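
Steps 3 and 4 of Workflow 3, like the 90-second criterion in Test 2, depend on waiting for the health monitor. Instead of a fixed sleep, a polling loop can stop as soon as the condition holds and fail once the deadline passes. This is a generic sketch; `wait_until` and the `crash_logged` check in the usage comment are hypothetical names.

```bash
# Runs a check command every <interval> seconds until it succeeds or
# <timeout> seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while true; do
    "$@" && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Hypothetical usage: fail the test if no crash is logged within 90 s.
#   crash_logged() {
#     sudo journalctl -u gururmm-server -n 100 | grep -q "Detected crash"
#   }
#   wait_until 90 5 crash_logged || echo "FAIL: crash not detected in 90s"
```

The check always runs at least once, so a condition that already holds returns immediately.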

---

## Performance Testing

### Load Testing
- [ ] 100+ agents checking for updates simultaneously
- [ ] Scanner performance with 50+ versions
- [ ] Health monitor with 1000+ update events
- [ ] Dashboard with 20+ rollouts displayed

### Stress Testing
- [ ] Rapid version releases (5 builds in 10 minutes)
- [ ] Mass rollback (100+ agents)
- [ ] Concurrent API calls (multiple users promoting/rolling back)

---

## Security Testing

### Authentication
- [ ] All API endpoints require valid JWT
- [ ] Expired tokens rejected
- [ ] Invalid tokens rejected

### Authorization
- [ ] Admin role can promote/rollback
- [ ] Non-admin role blocked (if RBAC implemented)

### Input Validation
- [ ] SQL injection attempts blocked
- [ ] XSS attempts in reason field sanitized
- [ ] Invalid version strings rejected
- [ ] Invalid OS/arch values rejected
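
When building negative test inputs for the last two items, it helps to state what "valid" means. The server's actual rules are not given in this plan, so the sketch below is an assumption: versions are three dot-separated numbers, and the only OS/arch pairs are the ones used in this plan. Both function names are hypothetical.

```bash
# Accepts only MAJOR.MINOR.PATCH made of digits, e.g. 0.6.41.
is_valid_version() {
  case "$1" in
    *[!0-9.]*) return 1 ;;   # any character other than a digit or dot
  esac
  printf '%s\n' "$1" | grep -Eqx '[0-9]+\.[0-9]+\.[0-9]+'
}

# Accepts only the OS/arch pairs exercised in this plan (assumed complete).
is_valid_target() {
  case "$1/$2" in
    linux/amd64|windows/amd64) return 0 ;;
    *) return 1 ;;
  esac
}

# Inputs the API should reject, for example:
#   is_valid_version "../../etc/passwd"  || echo "expect rejection"
#   is_valid_version "0.6.41; DROP TABLE update_rollouts" || echo "expect rejection"
```

Any input these helpers reject is a candidate request that the promote and rollback endpoints should refuse.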

### File System Security
- [ ] .channel files have correct permissions
- [ ] Path traversal attempts blocked
- [ ] Only authorized processes can modify .channel files

---

## Regression Testing

### Existing Functionality
- [ ] Agent registration still works
- [ ] Heartbeat processing unaffected
- [ ] Command execution unaffected
- [ ] Metrics collection unaffected
- [ ] Alert generation unaffected
- [ ] Policy enforcement unaffected

### Database Performance
- [ ] No slow queries introduced
- [ ] Indexes used efficiently
- [ ] No lock contention

---

## Documentation Verification

- [ ] API endpoints documented
- [ ] Database schema documented
- [ ] Dashboard user guide accurate
- [ ] Admin procedures documented
- [ ] Troubleshooting guide created

---

## Sign-Off

### Phase 6 Test Results

**Tester:** ___________________________
**Date:** ___________________________

**Test 1 - Beta-First Workflow:** ⬜ PASS ⬜ FAIL
**Test 2 - Health Monitoring:** ⬜ PASS ⬜ FAIL
**Test 3 - Promotion:** ⬜ PASS ⬜ FAIL
**Test 4 - Rollback:** ⬜ PASS ⬜ FAIL
**Test 5 - Dashboard UI:** ⬜ PASS ⬜ FAIL
**Test 6 - Integration:** ⬜ PASS ⬜ FAIL

**Overall Status:** ⬜ APPROVED FOR PRODUCTION ⬜ NEEDS FIXES

**Notes:**
```


```

**Blockers/Issues:**
```


```

**Deployment Date:** ___________________________