claudetools/PHASE_6_TEST_PLAN.md
Mike Swanson 67182e0684 docs(rollout): reconcile Phase 5/6 docs — safe-rollout gating is inert (BUG-004)
Per the 2026-05-25 re-audit + Mike's decision (option b): the safe-rollout
promotion gating these docs describe/test is NOT live (update_rollouts /
update_health_metrics written-but-never-read; crash detection dead until the
unmerged BUG-002 fix). Added a [WARNING] STATUS banner to the test plan, verify
script, and the two 'complete' summaries so they aren't trusted as validating a
working feature. Automation is a roadmap Phase-2 item requiring a full re-spec.
2026-05-25 15:04:44 -07:00

Phase 6: End-to-End Testing Plan

[WARNING] STATUS — 2026-05-25 re-audit: the safe-rollout gating this plan tests is NOT live. update_rollouts and update_health_metrics are written but never read to gate promotion (see gururmm docs/FEATURE_ROADMAP.md BUG-004); crash detection was dead code until the BUG-002 fix (branch fix/audit-2-remediation, unmerged). Promotion is currently 100% manual via channel columns. Treat this document as Phase-2 aspirational — passing/failing it does not reflect a working safe-rollout feature. Decision 2026-05-25 (Mike): keep the feature inert and labeled; automation deferred to a roadmap Phase-2 item that must be properly re-spec'd.

Safe Agent Rollout System

Date: 2026-05-25
Version: GuruRMM v0.6.41+
Tester: Mike Swanson


Prerequisites

Environment Setup

  • SSH access to gururmm-build (172.16.3.30)
  • Access to GuruRMM dashboard (https://rmm.azcomputerguru.com)
  • JWT token for API testing
  • At least 2 test agents (GURU-KALI, GURU-5070 recommended)

Pre-Test Verification

# On Saturn
ssh azcomputerguru@172.16.3.30

# 1. Verify migration 046 is applied
sudo -u postgres psql gururmm_production -c "\d update_rollouts"
sudo -u postgres psql gururmm_production -c "\d update_health_metrics"
sudo -u postgres psql gururmm_production -c "\d agent_update_events"

# 2. Verify server build is current
cd /opt/gururmm/server
git status  # Working tree should be clean
git log -1 --oneline  # Latest commit should include the Phase 4 code
cargo build --release --features production

# 3. Verify dashboard build is current
cd /opt/gururmm/dashboard
git status  # Working tree should be clean
git log -1 --oneline  # Latest commit should include the Phase 5 code
npm run build

# 4. Verify health monitor is running
sudo systemctl status gururmm-server
# Search the whole current boot: a startup message can fall outside the last 50 lines
sudo journalctl -u gururmm-server -b | grep "Health monitoring task spawned"

Test 1: Beta-First Build Workflow

Objective: Verify new builds default to beta channel and stable agents don't receive them.

Steps

  1. Trigger a test build
# On Saturn
cd /opt/gururmm
sudo ./build-linux.sh  # Will auto-increment to next version
sudo ./build-windows.sh
  2. Verify .channel files created
cd /var/www/gururmm/downloads
ls -la *.channel | tail -10

# Expected: New version should have .channel files containing "beta"
VERSION=$(ls -t gururmm-agent-linux-amd64-*.tar.gz | head -1 | grep -oP '\d+\.\d+\.\d+')
cat gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Should output: beta
  3. Mark test agents as beta
# Via API or SQL
curl -X PATCH https://rmm.azcomputerguru.com/api/agents/GURU-KALI-UUID/channel \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"update_channel": "beta"}'
  4. Verify beta agent receives update
  • Open dashboard → Agents → GURU-KALI
  • Wait for agent connection (heartbeat every 60s)
  • Check agent state for pending update
  • Expected: Should see update_available = true
  5. Verify stable agent does NOT receive update
  • Ensure GURU-5070 is on "stable" channel
  • Check agent state
  • Expected: update_available = false (version not in stable channel)

Success Criteria

  • .channel files exist for new version
  • .channel files contain "beta"
  • Beta agents offered the update
  • Stable agents NOT offered the update
  • Scanner logs show beta/stable filtering
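
The .channel checks above can be scripted. This is a minimal sketch, assuming each artifact has a sidecar file named `<artifact>.channel` that holds a single channel name, as the steps describe. `check_channel` is a hypothetical helper, not part of GuruRMM.

```shell
# Hypothetical helper (not part of GuruRMM): compare an artifact's .channel
# sidecar file against the channel this test expects.
# Usage: check_channel ARTIFACT_PATH EXPECTED_CHANNEL
check_channel() (
  artifact=$1
  expected=$2
  sidecar="${artifact}.channel"
  if [ ! -f "$sidecar" ]; then
    echo "MISSING: $sidecar"
    exit 2
  fi
  # Strip whitespace so a trailing newline does not cause a false mismatch.
  actual=$(tr -d '[:space:]' < "$sidecar")
  if [ "$actual" = "$expected" ]; then
    echo "OK: $sidecar = $actual"
  else
    echo "FAIL: $sidecar = $actual (expected $expected)"
    exit 1
  fi
)

# Example (paths taken from the steps above):
#   check_channel "/var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz" beta
```

The function body runs in a subshell, so `exit` only sets the function's return code: 0 on a match, 1 on a mismatch, 2 when the sidecar file is missing.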

Test 2: Health Monitoring & Crash Detection

Objective: Verify health monitor detects crashes and updates metrics.

Steps

  1. Clear existing health data (optional)
sudo -u postgres psql gururmm_production -c "DELETE FROM update_health_metrics WHERE version = '$VERSION';"
sudo -u postgres psql gururmm_production -c "DELETE FROM agent_update_events WHERE version_to = '$VERSION';"
  2. Simulate successful update
# On test agent (GURU-KALI)
# Let update complete normally
# Wait 5 minutes
  3. Check event logging
SELECT event_type, version_to, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
ORDER BY created_at DESC
LIMIT 5;

-- Expected events:
-- - update_dispatched
-- - download_started (if implemented)
-- - download_complete (if implemented)
-- - update_applied
  4. Check health metrics incremented
SELECT version, total_attempts, successful_updates, failed_updates, crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';

-- Expected:
-- total_attempts = 1
-- successful_updates = 1
-- health_status = 'unknown' (< 5 attempts)
  5. Simulate crash
# On test agent
# 1. Trigger update dispatch
# 2. Immediately after "update_applied" event, stop agent service
sudo systemctl stop gururmm-agent
# 3. Wait 60-90 seconds for health monitor scan
  6. Verify crash detection
SELECT event_type, created_at
FROM agent_update_events
WHERE agent_id = 'GURU-KALI-UUID'
AND event_type = 'crash_detected'
ORDER BY created_at DESC;

-- Expected: a crash_detected event is present

SELECT crash_count, health_status
FROM update_health_metrics
WHERE version = '$VERSION';

-- Expected: crash_count incremented; health_status may change
  7. Check server logs
sudo journalctl -u gururmm-server -n 100 | grep -E "crash|health"
# Expected: "Detected crash: agent X went offline after updating to Y"

Success Criteria

  • Events logged correctly (update_dispatched, update_applied)
  • Health metrics incremented on success
  • Crash detected within 90 seconds
  • crash_detected event logged
  • Crash counter incremented
  • Health status updated based on thresholds
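
The "within 90 seconds" criterion is easier to check with a polling loop than by waiting and querying by hand. This is a minimal sketch, assuming any command that exits 0 on success can serve as the check. `wait_for` and `crash_logged` are hypothetical names introduced here for illustration.

```shell
# Hypothetical polling helper: retry a check until it passes or a deadline expires.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Elapsed time is approximate: it counts sleep intervals, not the check's own runtime.
wait_for() (
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo "passed after ${elapsed}s"
      exit 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s"
      exit 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)

# Example check built from the crash-detection query in the steps above:
#   crash_logged() {
#     sudo -u postgres psql gururmm_production -tAc \
#       "SELECT 1 FROM agent_update_events WHERE agent_id = 'GURU-KALI-UUID' AND event_type = 'crash_detected' LIMIT 1" \
#       | grep -q 1
#   }
#   wait_for 90 5 crash_logged
```

A non-zero return after 90 seconds means the criterion failed. Per the status banner, that is the expected outcome until the BUG-002 fix is merged.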

Test 3: Promotion Workflow

Objective: Verify promotion from beta to stable with health gates.

Steps

  1. Attempt promotion with insufficient data
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"os": "linux", "arch": "amd64", "force": false}'

# Expected: May succeed (unknown status allows promotion) or fail if health check implemented
  2. Generate healthy metrics
# Simulate 5+ successful updates
# Option A: Manually insert via SQL (for testing)
# Option B: Trigger real updates on multiple beta agents

# SQL approach for testing:
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET total_attempts = 5,
    successful_updates = 5,
    failed_updates = 0,
    crash_count = 0,
    health_status = 'healthy'
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';
EOF
  3. Verify health status
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
  -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

# Expected: health.status = "healthy"
  4. Promote to stable
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"os": "linux", "arch": "amd64", "force": false}'

# Expected: {"success": true, "message": "Promoted...", "files_updated": 2}
  5. Verify .channel files updated
cat /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: stable
  6. Verify database updated
SELECT channel, promoted_at, promoted_by
FROM update_rollouts
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

-- Expected: channel = 'stable', promoted_at = time of promotion, promoted_by = user_id
  7. Verify stable agents receive update
  • Ensure test agent is on "stable" channel
  • Wait for scanner rescan (happens immediately after promotion)
  • Check agent state
  • Expected: update_available = true
  8. Test force promotion
# Set health to warning
sudo -u postgres psql gururmm_production << EOF
UPDATE update_health_metrics
SET health_status = 'warning'
WHERE version = '$VERSION' AND os = 'windows' AND arch = 'amd64';
EOF

# Try promotion without force
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"os": "windows", "arch": "amd64", "force": false}'

# Expected: 403 error with message about health status

# Try with force flag
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/promote \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"os": "windows", "arch": "amd64", "force": true}'

# Expected: 200 success (overridden health check)

Success Criteria

  • Promotion blocked for unhealthy versions (unless forced)
  • Promotion succeeds for healthy versions
  • .channel files updated from "beta" to "stable"
  • Database rollouts table updated
  • Scanner rescans immediately
  • Stable agents receive update after promotion
  • Force flag overrides health checks
  • Dashboard shows updated channel
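
The promotion expectations above reduce to a small lookup that a test script can compare against the real HTTP status. This is a sketch of the plan's expectations only, not the server's logic. `expected_promote_status` is a hypothetical helper, and the `critical` and `blocked` rows are assumptions, since the steps exercise only `healthy`, `unknown`, and `warning`.

```shell
# Hypothetical lookup: HTTP status this plan expects from the promote endpoint
# for a given health_status and "force" flag.
# Usage: expected_promote_status HEALTH_STATUS FORCE
expected_promote_status() (
  health=$1
  force=$2
  if [ "$force" = "true" ]; then
    echo 200                          # force promotion overrides the health check
    exit 0
  fi
  case "$health" in
    healthy)          echo 200 ;;
    warning)          echo 403 ;;     # blocked unless forced
    critical|blocked) echo 403 ;;     # ASSUMPTION: handled like warning
    unknown)          echo unspecified ;;  # the plan leaves this outcome open
    *) echo "unrecognized health_status: $health" >&2; exit 2 ;;
  esac
)

# Example: compare with the real response code (curl's -w '%{http_code}' prints it):
#   actual=$(curl -s -o /dev/null -w '%{http_code}' -X POST ... )
#   [ "$actual" = "$(expected_promote_status warning false)" ] && echo PASS
```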

Test 4: Rollback Workflow

Objective: Verify rollback blocks version and force-downgrades agents.

Steps

  1. Prepare for rollback
# Ensure test agent is running the rollback target version
# Verify previous stable version exists
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
  -H "Authorization: Bearer $TOKEN" | jq '.[] | select(.channel == "stable") | .version'
  2. Execute rollback
curl -X POST https://rmm.azcomputerguru.com/api/updates/rollouts/$VERSION/rollback \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "os": "linux",
    "arch": "amd64",
    "reason": "Test rollback: simulating critical bug in version '$VERSION'"
  }'

# Expected: {"success": true, "agents_affected": 1, "downgrade_version": "0.6.40"}
  3. Verify .channel files removed
ls /var/www/gururmm/downloads/gururmm-agent-linux-amd64-${VERSION}.tar.gz.channel
# Expected: File not found (removed)
  4. Verify health status blocked
SELECT health_status, last_incident
FROM update_health_metrics
WHERE version = '$VERSION' AND os = 'linux' AND arch = 'amd64';

-- Expected: health_status = 'blocked', last_incident = reason text
  5. Verify forced downgrade dispatched
# Check server logs for WebSocket dispatch
sudo journalctl -u gururmm-server -n 100 | grep -i "downgrade\|rollback"

# Check agent receives forced update
# Monitor agent logs for update trigger
  6. Verify agent downgrades
  • Agent should receive UpdateAvailable message with previous version
  • Agent should download and install previous version
  • Check agent version after completion
  • Expected: agent_version = previous stable version
  7. Verify blocked version not offered again
# Scanner should skip artifacts that have no .channel file
# Verify version is not in available updates list
curl https://rmm.azcomputerguru.com/api/updates/rollouts \
  -H "Authorization: Bearer $TOKEN" | jq --arg v "$VERSION" '.[] | select(.version == $v)'

# If present, should show channel = null or health.status = "blocked"

Success Criteria

  • .channel files removed
  • Health status set to "blocked"
  • Last incident reason recorded
  • Connected agents receive forced downgrade
  • Agents successfully downgrade to previous stable
  • Blocked version not offered to new agents
  • Dashboard shows blocked status
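
To check the `downgrade_version` in the rollback response, the tester needs to know which version counts as "previous stable". This is a minimal sketch, assuming it means the highest stable version below the one being rolled back. The server may choose differently. `previous_stable` is a hypothetical helper and relies on GNU `sort -V`.

```shell
# Hypothetical helper: print the highest version on stdin that sorts below TARGET.
# ASSUMPTION: "previous stable" = highest stable version lower than the rolled-back one.
# Usage: <list of stable versions, one per line> | previous_stable TARGET
previous_stable() (
  target=$1
  # Add TARGET to the list, version-sort it, then take the line just before TARGET.
  { cat; printf '%s\n' "$target"; } \
    | sort -uV \
    | grep -B1 -xF "$target" \
    | head -n 1 \
    | grep -vxF "$target"   # prints nothing (exit 1) when no lower version exists
)

# Example, fed from the rollouts listing used in the steps above:
#   curl -s https://rmm.azcomputerguru.com/api/updates/rollouts \
#     -H "Authorization: Bearer $TOKEN" \
#     | jq -r '.[] | select(.channel == "stable") | .version' \
#     | previous_stable "$VERSION"
```

Version sorting matters here: a plain lexical sort would place 0.6.9 after 0.6.40.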

Test 5: Dashboard UI Testing

Objective: Verify Updates page displays correctly and actions work.

Steps

  1. Access Updates page
  2. Verify data display
  • Table shows all rollout versions
  • Columns: Version, OS/Arch, Channel, Health, Success Rate, Agent Counts, Actions
  • Health badges color-coded (green/yellow/red/gray)
  • Success rate calculated correctly
  • Agent counts accurate
  3. Test promote button
  • Enabled for beta + healthy versions only
  • Disabled with tooltip for unhealthy versions
  • Click opens confirmation dialog
  • Confirm triggers API call
  • Success toast appears
  • Table refreshes with updated data
  4. Test rollback button
  • Always enabled
  • Click opens dialog with reason input
  • Reason field is required
  • Confirm triggers API call
  • Success toast shows agent count
  • Table refreshes with updated data
  5. Test error handling
  • Shows loading state during fetch
  • Shows error message if API fails
  • Retry button works
  • Shows empty state if no rollouts
  6. Test auto-refresh
  • Data refreshes every 30 seconds
  • Refresh doesn't disrupt UI interactions
  • Manual refresh button works

Success Criteria

  • All table columns display correct data
  • Health badges use correct colors
  • Promote button only enabled for healthy beta versions
  • Rollback button always enabled
  • Confirmation dialogs work
  • API calls succeed
  • Toasts display success/error
  • Auto-refresh works
  • Responsive on mobile
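
To check "Success rate calculated correctly", the tester needs an independent value to compare with the dashboard. This is a minimal sketch, assuming the rate is `successful_updates / total_attempts` as a percentage. The dashboard's actual formula and rounding are not confirmed here. `success_rate` is a hypothetical helper.

```shell
# Hypothetical helper: success rate as a percentage with one decimal place.
# ASSUMPTION: rate = successful_updates / total_attempts * 100.
# Usage: success_rate SUCCESSFUL_UPDATES TOTAL_ATTEMPTS
success_rate() (
  successful=$1
  total=$2
  if [ "$total" -eq 0 ]; then
    echo "n/a"          # avoid dividing by zero when nothing has been attempted
    exit 0
  fi
  awk -v s="$successful" -v t="$total" 'BEGIN { printf "%.1f\n", (s / t) * 100 }'
)

# Example: success_rate 4 5
```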

Test 6: Integration Testing

Objective: Test complete workflows end-to-end.

Workflow 1: New Build → Beta Testing → Promotion → Stable Deployment

  1. Trigger new build (auto-bumps version)
  2. Verify .channel files = "beta"
  3. Mark GURU-KALI as beta agent
  4. Wait for update dispatch
  5. Monitor update installation
  6. Verify success event logged
  7. Repeat 4 more times for healthy status
  8. Promote via dashboard
  9. Verify GURU-5070 (stable) receives update
  10. Monitor stable deployment
  11. Verify all agents updated

Expected: Beta testing prevents bad updates from reaching production.

Workflow 2: Critical Bug → Rollback → Fleet Downgrade

  1. Simulate critical bug discovered post-promotion
  2. Execute rollback via dashboard
  3. Verify all agents receive forced downgrade
  4. Verify agents revert to previous stable
  5. Verify new agents don't receive blocked version
  6. Verify health metrics show blocked status

Expected: Rollback protects fleet from bad updates.

Workflow 3: Crash Detection → Auto-Block (Future Enhancement)

  1. Deploy update to beta agents
  2. Simulate crash (stop service after update)
  3. Wait for health monitor (60s)
  4. Verify crash detected and logged
  5. Check if crash rate >25%
  6. Verify health status = "critical"
  7. Attempt promotion
  8. Verify promotion blocked

Expected: A high crash rate marks the version critical and blocks promotion.
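
Workflow 3 uses two thresholds from this plan: fewer than 5 attempts means "unknown", and a crash rate above 25% means "critical". This is a minimal sketch built from those two numbers only. It is not the health monitor's implementation, and it leaves out `warning`, `blocked`, and `failed_updates`, whose rules this plan does not state. `classify_health` is a hypothetical helper.

```shell
# Hypothetical classifier using only the two thresholds this plan mentions.
# Usage: classify_health TOTAL_ATTEMPTS CRASH_COUNT
classify_health() (
  total=$1
  crashes=$2
  if [ "$total" -lt 5 ]; then
    echo unknown                                  # not enough data yet
  elif [ $((crashes * 100)) -gt $((total * 25)) ]; then
    echo critical                                 # crash rate strictly above 25%
  else
    echo healthy                                  # ASSUMPTION: everything else
  fi
)

# Example: classify_health 8 3   (3 crashes in 8 attempts is 37.5%)
```

The comparison uses integer arithmetic (`crashes * 100 > total * 25`), so exactly 25% does not count as critical.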


Performance Testing

Load Testing

  • 100+ agents checking for updates simultaneously
  • Scanner performance with 50+ versions
  • Health monitor with 1000+ update events
  • Dashboard with 20+ rollouts displayed

Stress Testing

  • Rapid version releases (5 builds in 10 minutes)
  • Mass rollback (100+ agents)
  • Concurrent API calls (multiple users promoting/rolling back)
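
The concurrent-call items can be driven from a shell with background jobs. This is a minimal sketch, not a substitute for a real load-testing tool. `run_parallel` is a hypothetical helper that launches N copies of a command at once and prints how many failed.

```shell
# Hypothetical helper: run COMMAND N times concurrently, print the failure count.
# Usage: run_parallel N COMMAND [ARGS...]
run_parallel() (
  n=$1
  shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &          # start one copy in the background
    pids="$pids $!"
    i=$((i + 1))
  done
  failures=0
  for pid in $pids; do
    wait "$pid" || failures=$((failures + 1))   # collect each job's exit status
  done
  echo "$failures"
)

# Example: 100 simultaneous reads of the rollouts endpoint used earlier in this plan
#   run_parallel 100 curl -fsS -H "Authorization: Bearer $TOKEN" \
#     https://rmm.azcomputerguru.com/api/updates/rollouts
```

With `curl -f`, HTTP errors count as failures, so a non-zero result shows how many requests the server rejected or dropped.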

Security Testing

Authentication

  • All API endpoints require valid JWT
  • Expired tokens rejected
  • Invalid tokens rejected

Authorization

  • Admin role can promote/rollback
  • Non-admin role blocked (if RBAC implemented)

Input Validation

  • SQL injection attempts blocked
  • XSS attempts in reason field sanitized
  • Invalid version strings rejected
  • Invalid OS/arch values rejected
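
The "invalid version strings" check needs a list of inputs that should be rejected. This is a minimal sketch, assuming valid versions are three dot-separated numbers, the same shape the `VERSION=` extraction in Test 1 matches. The server's real validation rules are not confirmed here. `is_valid_version` is a hypothetical helper for sorting probe strings into "should pass" and "should be rejected".

```shell
# Hypothetical validator. ASSUMPTION: a valid version is MAJOR.MINOR.PATCH, digits only.
# Usage: is_valid_version STRING   (exit 0 = valid, 1 = invalid)
is_valid_version() {
  case "$1" in
    ''|*[!0-9.]*) return 1 ;;   # empty, or contains anything other than digits and dots
  esac
  # Only digits and dots remain, so this is a single line; check the exact shape.
  printf '%s\n' "$1" | grep -Eqx '[0-9]+\.[0-9]+\.[0-9]+'
}

# Example probes that the API should reject:
#   for v in '../../etc/passwd' "1.2.3'; DROP TABLE update_rollouts;--" '1..2' ''; do
#     is_valid_version "$v" || echo "expect rejection: $v"
#   done
```

The `case` filter runs first so that slashes, quotes, and newlines are rejected before the pattern match, which covers the path traversal and SQL injection probes as well.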

File System Security

  • .channel files have correct permissions
  • Path traversal attempts blocked
  • Only authorized processes can modify .channel files

Regression Testing

Existing Functionality

  • Agent registration still works
  • Heartbeat processing unaffected
  • Command execution unaffected
  • Metrics collection unaffected
  • Alert generation unaffected
  • Policy enforcement unaffected

Database Performance

  • No slow queries introduced
  • Indexes used efficiently
  • No lock contention

Documentation Verification

  • API endpoints documented
  • Database schema documented
  • Dashboard user guide accurate
  • Admin procedures documented
  • Troubleshooting guide created

Sign-Off

Phase 6 Test Results

Tester: ___________________________
Date: ___________________________

Test 1 - Beta-First Workflow: PASS / FAIL
Test 2 - Health Monitoring: PASS / FAIL
Test 3 - Promotion: PASS / FAIL
Test 4 - Rollback: PASS / FAIL
Test 5 - Dashboard UI: PASS / FAIL
Test 6 - Integration: PASS / FAIL

Overall Status: APPROVED FOR PRODUCTION / NEEDS FIXES

Notes:



Blockers/Issues:



Deployment Date: ___________________________