Files
claudetools/PHASE_6_TEST_PLAN.md
Mike Swanson 67182e0684 docs(rollout): reconcile Phase 5/6 docs — safe-rollout gating is inert (BUG-004)
Per the 2026-05-25 re-audit + Mike's decision (option b): the safe-rollout
promotion gating these docs describe/test is NOT live (update_rollouts /
update_health_metrics written-but-never-read; crash detection dead until the
unmerged BUG-002 fix). Added a [WARNING] STATUS banner to the test plan, verify
script, and the two 'complete' summaries so they aren't trusted as validating a
working feature. Automation is a roadmap Phase-2 item requiring a full re-spec.
2026-05-25 15:04:44 -07:00

575 lines
16 KiB
Markdown

# Phase 6: End-to-End Testing Plan
> **[WARNING] STATUS — 2026-05-25 re-audit: the safe-rollout gating this plan tests is NOT live.** `update_rollouts` and `update_health_metrics` are written but never read to gate promotion (see gururmm `docs/FEATURE_ROADMAP.md` BUG-004); crash detection was dead code until the BUG-002 fix (branch `fix/audit-2-remediation`, unmerged). Promotion is currently 100% manual via channel columns. Treat this document as **Phase-2 aspirational** — passing/failing it does not reflect a working safe-rollout feature. Decision 2026-05-25 (Mike): keep the feature inert and labeled; automation deferred to a roadmap Phase-2 item that must be properly re-spec'd.
## Safe Agent Rollout System
**Date:** 2026-05-25
**Version:** GuruRMM v0.6.41+
**Tester:** Mike Swanson
---
## Prerequisites
### Environment Setup
- [ ] SSH access to gururmm-build (172.16.3.30)
- [ ] Access to GuruRMM dashboard (https://rmm.azcomputerguru.com)
- [ ] JWT token for API testing
- [ ] At least 2 test agents (GURU-KALI, GURU-5070 recommended)
### Pre-Test Verification
```bash
# On Saturn
ssh azcomputerguru@172.16.3.30
# 1. Verify migration 046 is applied
sudo -u postgres psql gururmm_production -c "\d update_rollouts"
sudo -u postgres psql gururmm_production -c "\d update_health_metrics"
sudo -u postgres psql gururmm_production -c "\d agent_update_events"
# 2. Verify server build is current
cd /opt/gururmm/server
git status # Should show Phase 4 code
cargo build --release --features production
# 3. Verify dashboard build is current
cd /opt/gururmm/dashboard
git status # Should show Phase 5 code
npm run build
# 4. Verify health monitor is running
sudo systemctl status gururmm-server
sudo journalctl -u gururmm-server -n 50 | grep "Health monitoring task spawned"
```
---
## Test 1: Beta-First Build Workflow
**Objective:** Verify new builds default to beta channel and stable agents don't receive them.
### Steps
1. **Trigger a test build**
```bash
# On Saturn
cd /opt/gururmm
sudo ./build-linux.sh # Will auto-increment to next version
sudo ./build-windows.sh
```
2. **Verify .channel files created**
```bash
cd /var/www/gururmm/downloads
ls -la *.channel | tail -10
# Expected: New version should have .channel files containing "beta"
VERSION=$(ls -t gururmm-agent-linux-amd64-*.tar.gz | head -1 | grep -oP '\d+\.\d+\.\d+')
cat gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Should output: beta
```
3. **Mark test agents as beta**
```bash
# Via API or SQL
curl -X PATCH https://rmm.azcomputerguru.com/api/agents/GURU-KALI-UUID/channel \
-H "Authorization: Bearer $TOKEN" \
-d '{"update_channel": "beta"}'
```
4. **Verify beta agent receives update**
- Open dashboard → Agents → GURU-KALI
- Wait for agent connection (heartbeat every 60s)
- Check agent state for pending update
- Expected: Should see update_available = true
5. **Verify stable agent does NOT receive update**
- Ensure GURU-5070 is on "stable" channel
- Check agent state
- Expected: update_available = false (version not in stable channel)
### Success Criteria
- ✅ .channel files exist for new version
- ✅ .channel files contain "beta"
- ✅ Beta agents offered the update
- ✅ Stable agents NOT offered the update
- ✅ Scanner logs show beta/stable filtering
---
## Test 2: Health Monitoring & Crash Detection
**Objective:** Verify health monitor detects crashes and updates metrics.
### Steps
1. **Clear existing health data (optional)**
```sql
sudo -u postgres psql gururmm_production -c "DELETE FROM update_health_metrics WHERE version = '$VERSION';"
sudo -u postgres psql gururmm_production -c "DELETE FROM agent_update_events WHERE version_to = '$VERSION';"
```
2. **Simulate successful update**
```bash
# On test agent (GURU-KALI)
# Let update complete normally
# Wait 5 minutes
```
3. **Check event logging**
```sql
SELECT event_type, version_to, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
ORDER BY created_at DESC
LIMIT 5;
# Expected events:
# - update_dispatched
# - download_started (if implemented)
# - download_complete (if implemented)
# - update_applied
```
4. **Check health metrics incremented**
```sql
SELECT version, total_attempts, successful_updates, failed_updates, crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';
# Expected:
# total_attempts = 1
# successful_updates = 1
# health_status = 'unknown' (< 5 attempts)
```
5. **Simulate crash**
```bash
# On test agent
# 1. Trigger update dispatch
# 2. Immediately after "update_applied" event, stop agent service
sudo systemctl stop gururmm-agent
# 3. Wait 60-90 seconds for health monitor scan
```
6. **Verify crash detection**
```sql
SELECT event_type, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
AND event_type = 'crash_detected'
ORDER BY created_at DESC;
# Expected: Should see crash_detected event
SELECT crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';
# Expected: crash_count incremented, health_status may change
```
7. **Check server logs**
```bash
sudo journalctl -u gururmm-server -n 100 | grep -E "crash|health"
# Expected: "Detected crash: agent X went offline after updating to Y"
```
### Success Criteria
- ✅ Events logged correctly (update_dispatched, update_applied)
- ✅ Health metrics incremented on success
- ✅ Crash detected within 90 seconds
- ✅ crash_detected event logged
- ✅ Crash counter incremented
- ✅ Health status updated based on thresholds
---
## Test 3: Promotion Workflow
**Objective:** Verify promotion from beta to stable with health gates.
### Steps
1. **Attempt promotion with insufficient data**
```bash
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "linux", "arch": "amd64", "force": false}'
# Expected: May succeed (unknown status allows promotion) or fail if health check implemented
```
2. **Generate healthy metrics**
```bash
# Simulate 5+ successful updates
# Option A: Manually insert via SQL (for testing)
# Option B: Trigger real updates on multiple beta agents
# SQL approach for testing:
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET total_attempts = 5,
successful_updates = 5,
failed_updates = 0,
crash_count = 0,
health_status = 'healthy'
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
EOF
```
3. **Verify health status**
```bash
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq '.[] | select(.version == "'$VERSION'")'
# Expected: health.status = "healthy"
```
4. **Promote to stable**
```bash
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"os": "linux", "arch": "amd64", "force": false}'
# Expected: {"success": true, "message": "Promoted...", "files_updated": 2}
```
5. **Verify .channel files updated**
```bash
cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: stable
```
6. **Verify database updated**
```sql
SELECT channel, promoted_at, promoted_by
FROM update_rollouts
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
# Expected: channel = 'stable', promoted_at = NOW(), promoted_by = user_id
```
7. **Verify stable agents receive update**
- Ensure test agent is on "stable" channel
- Wait for scanner rescan (happens immediately after promotion)
- Check agent state
- Expected: update_available = true
8. **Test force promotion**
```bash
# Set health to warning
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET health_status = 'warning'
WHERE version = '$VERSION' AND os = 'windows' AND arch = 'amd64';
EOF
# Try promotion without force
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-d '{"os": "windows", "arch": "amd64", "force": false}'
# Expected: 403 error with message about health status
# Try with force flag
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
-H "Authorization: Bearer $TOKEN" \
-d '{"os": "windows", "arch": "amd64", "force": true}'
# Expected: 200 success (overridden health check)
```
### Success Criteria
- ✅ Promotion blocked for unhealthy versions (unless forced)
- ✅ Promotion succeeds for healthy versions
- ✅ .channel files updated from "beta" to "stable"
- ✅ Database rollouts table updated
- ✅ Scanner rescans immediately
- ✅ Stable agents receive update after promotion
- ✅ Force flag overrides health checks
- ✅ Dashboard shows updated channel
---
## Test 4: Rollback Workflow
**Objective:** Verify rollback blocks version and force-downgrades agents.
### Steps
1. **Prepare for rollback**
```bash
# Ensure test agent is running the rollback target version
# Verify previous stable version exists
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq '.[] | select(.channel == "stable") | .version'
```
2. **Execute rollback**
```bash
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/rollback \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"os": "linux",
"arch": "amd64",
"reason": "Test rollback: simulating critical bug in version '$VERSION'"
}'
# Expected: {"success": true, "agents_affected": 1, "downgrade_version": "0.6.40"}
```
3. **Verify .channel files removed**
```bash
ls /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: File not found (removed)
```
4. **Verify health status blocked**
```sql
SELECT health_status, last_incident
FROM update_health_metrics
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
# Expected: health_status = 'blocked', last_incident = reason text
```
5. **Verify forced downgrade dispatched**
```bash
# Check server logs for WebSocket dispatch
sudo journalctl -u gururmm-server -n 100 | grep -i "downgrade\|rollback"
# Check agent receives forced update
# Monitor agent logs for update trigger
```
6. **Verify agent downgrades**
- Agent should receive UpdateAvailable message with previous version
- Agent should download and install previous version
- Check agent version after completion
- Expected: agent_version = previous stable version
7. **Verify blocked version not offered again**
```bash
# Scanner should skip files without .channel files
# Verify version is not in available updates list
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
-H "Authorization: Bearer $TOKEN" | jq '.[] | select(.version == "'$VERSION'")'
# If present, should show channel = null or health.status = "blocked"
```
### Success Criteria
- ✅ .channel files removed
- ✅ Health status set to "blocked"
- ✅ Last incident reason recorded
- ✅ Connected agents receive forced downgrade
- ✅ Agents successfully downgrade to previous stable
- ✅ Blocked version not offered to new agents
- ✅ Dashboard shows blocked status
---
## Test 5: Dashboard UI Testing
**Objective:** Verify Updates page displays correctly and actions work.
### Steps
1. **Access Updates page**
- Navigate to https://rmm.azcomputerguru.com/updates
- Login if needed
2. **Verify data display**
- [ ] Table shows all rollout versions
- [ ] Columns: Version, OS/Arch, Channel, Health, Success Rate, Agent Counts, Actions
- [ ] Health badges color-coded (green/yellow/red/gray)
- [ ] Success rate calculated correctly
- [ ] Agent counts accurate
3. **Test promote button**
- [ ] Enabled for beta + healthy versions only
- [ ] Disabled with tooltip for unhealthy versions
- [ ] Click opens confirmation dialog
- [ ] Confirm triggers API call
- [ ] Success toast appears
- [ ] Table refreshes with updated data
4. **Test rollback button**
- [ ] Always enabled
- [ ] Click opens dialog with reason input
- [ ] Reason field is required
- [ ] Confirm triggers API call
- [ ] Success toast shows agent count
- [ ] Table refreshes with updated data
5. **Test error handling**
- [ ] Shows loading state during fetch
- [ ] Shows error message if API fails
- [ ] Retry button works
- [ ] Shows empty state if no rollouts
6. **Test auto-refresh**
- [ ] Data refreshes every 30 seconds
- [ ] Refresh doesn't disrupt UI interactions
- [ ] Manual refresh button works
### Success Criteria
- ✅ All table columns display correct data
- ✅ Health badges use correct colors
- ✅ Promote button only enabled for healthy beta versions
- ✅ Rollback button always enabled
- ✅ Confirmation dialogs work
- ✅ API calls succeed
- ✅ Toasts display success/error
- ✅ Auto-refresh works
- ✅ Responsive on mobile
---
## Test 6: Integration Testing
**Objective:** Test complete workflows end-to-end.
### Workflow 1: New Build → Beta Testing → Promotion → Stable Deployment
1. Trigger new build (auto-bumps version)
2. Verify .channel files = "beta"
3. Mark GURU-KALI as beta agent
4. Wait for update dispatch
5. Monitor update installation
6. Verify success event logged
7. Repeat 4 more times for healthy status
8. Promote via dashboard
9. Verify GURU-5070 (stable) receives update
10. Monitor stable deployment
11. Verify all agents updated
**Expected:** Beta testing prevents bad updates from reaching production.
### Workflow 2: Critical Bug → Rollback → Fleet Downgrade
1. Simulate critical bug discovered post-promotion
2. Execute rollback via dashboard
3. Verify all agents receive forced downgrade
4. Verify agents revert to previous stable
5. Verify new agents don't receive blocked version
6. Verify health metrics show blocked status
**Expected:** Rollback protects fleet from bad updates.
### Workflow 3: Crash Detection → Auto-Block (Future Enhancement)
1. Deploy update to beta agents
2. Simulate crash (stop service after update)
3. Wait for health monitor (60s)
4. Verify crash detected and logged
5. Check if crash rate >25%
6. Verify health status = "critical"
7. Attempt promotion
8. Verify promotion blocked
**Expected:** High crash rates prevent automatic promotion.
---
## Performance Testing
### Load Testing
- [ ] 100+ agents checking for updates simultaneously
- [ ] Scanner performance with 50+ versions
- [ ] Health monitor with 1000+ update events
- [ ] Dashboard with 20+ rollouts displayed
### Stress Testing
- [ ] Rapid version releases (5 builds in 10 minutes)
- [ ] Mass rollback (100+ agents)
- [ ] Concurrent API calls (multiple users promoting/rolling back)
---
## Security Testing
### Authentication
- [ ] All API endpoints require valid JWT
- [ ] Expired tokens rejected
- [ ] Invalid tokens rejected
### Authorization
- [ ] Admin role can promote/rollback
- [ ] Non-admin role blocked (if RBAC implemented)
### Input Validation
- [ ] SQL injection attempts blocked
- [ ] XSS attempts in reason field sanitized
- [ ] Invalid version strings rejected
- [ ] Invalid OS/arch values rejected
### File System Security
- [ ] .channel files have correct permissions
- [ ] Path traversal attempts blocked
- [ ] Only authorized processes can modify .channel files
---
## Regression Testing
### Existing Functionality
- [ ] Agent registration still works
- [ ] Heartbeat processing unaffected
- [ ] Command execution unaffected
- [ ] Metrics collection unaffected
- [ ] Alert generation unaffected
- [ ] Policy enforcement unaffected
### Database Performance
- [ ] No slow queries introduced
- [ ] Indexes used efficiently
- [ ] No lock contention
---
## Documentation Verification
- [ ] API endpoints documented
- [ ] Database schema documented
- [ ] Dashboard user guide accurate
- [ ] Admin procedures documented
- [ ] Troubleshooting guide created
---
## Sign-Off
### Phase 6 Test Results
**Tester:** ___________________________
**Date:** ___________________________
**Test 1 - Beta-First Workflow:** ⬜ PASS ⬜ FAIL
**Test 2 - Health Monitoring:** ⬜ PASS ⬜ FAIL
**Test 3 - Promotion:** ⬜ PASS ⬜ FAIL
**Test 4 - Rollback:** ⬜ PASS ⬜ FAIL
**Test 5 - Dashboard UI:** ⬜ PASS ⬜ FAIL
**Test 6 - Integration:** ⬜ PASS ⬜ FAIL
**Overall Status:** ⬜ APPROVED FOR PRODUCTION ⬜ NEEDS FIXES
**Notes:**
```
```
**Blockers/Issues:**
```
```
**Deployment Date:** ___________________________