Author: Mike Swanson Machine: DESKTOP-0O8A1RL Timestamp: 2026-05-17 22:07:52
Power Failure Recovery Runbook — ACG Office
Run through these checks IN ORDER after any unplanned power event.
All SSH uses C:\Windows\System32\OpenSSH\ssh.exe (never Git SSH).
0. Confirm you have LAN access
If you are working remotely, nothing on 172.16.x.x is reachable until the Tailscale subnet routes are restored (Step 1). If you are on the office LAN, go straight to Step 1.
1. pfSense — Tailscale subnet routes
What breaks: After reboot, pfSense loses its advertised Tailscale routes (AdvertiseRoutes: null).
Remote machines can no longer reach 172.16.x.x.
Check:
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"
Healthy output: "AdvertiseRoutes": ["172.16.0.0/22"] and "RouteAll": true
Fix:
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"
Verify:
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
# pfsense-2 should NOT show "rx 0" after a few seconds
Test-NetConnection -ComputerName 172.16.3.20 -Port 22
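The health criteria from the Check step can also be tested mechanically. This is a minimal POSIX shell sketch, not part of the original procedure. The function name routes_healthy is hypothetical, and it assumes "tailscale debug prefs" prints JSON containing the AdvertiseRoutes and RouteAll keys as shown above. It is a crude text match, not a JSON parser.

```shell
# Hypothetical helper: reads `tailscale debug prefs` JSON on stdin.
# Returns 0 only if 172.16.0.0/22 is advertised and RouteAll is true.
routes_healthy() {
    # Strip all whitespace so pretty-printed and compact JSON look the same.
    prefs=$(tr -d ' \t\r\n')
    case "$prefs" in
        *'"AdvertiseRoutes":null'*) return 1 ;;   # the post-reboot failure state
    esac
    case "$prefs" in
        *'"AdvertiseRoutes":['*'"172.16.0.0/22"'*) ;;   # route is listed
        *) return 1 ;;
    esac
    case "$prefs" in
        *'"RouteAll":true'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Possible use in a shell session on pfSense: tailscale debug prefs | routes_healthy && echo OK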
2. Jupiter (Unraid) — libvirt / VMs
What breaks: libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot. libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.
Host: 172.16.3.20 (SSH as root, no password — key auth)
2a. Check if libvirt.img is mounted
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"
Healthy: shows /dev/loopN on /etc/libvirt
Broken: no output
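The Healthy/Broken decision above can be expressed as a small filter over the output of mount. This is a sketch only. The function name libvirt_mount_state is hypothetical, and it assumes mount prints lines of the form "/dev/loopN on /etc/libvirt ...".

```shell
# Hypothetical helper: reads `mount` output on stdin and prints
# "healthy" if a loop device is mounted on /etc/libvirt, else "broken".
libvirt_mount_state() {
    if grep -Eq '^/dev/loop[0-9]+ on /etc/libvirt( |$)'; then
        echo healthy
    else
        echo broken
    fi
}
```

Possible use in a shell session on Jupiter: mount | libvirt_mount_state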
2b. Check libvirtd process
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"
2c. Fix — mount image and start libvirtd
# Mount libvirt config image
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
# Note the loop device returned (e.g. /dev/loop4) and substitute it for /dev/loop4 in the mount command below
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"
# Start libvirtd
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"
# Verify VMs came up
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"
Expected VM list:
| Name | Expected State |
|---|---|
| GuruRMM | running |
| Unifi | running |
| OwnCloud | running |
| Claude-Builder | running |
| Windows 7 | shut off |
| Windows Server 2016 | shut off |
| Windows Server 2016_Template | shut off |
2d. Stale socket cleanup (if libvirtd still fails)
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -ld /run/libvirt/libvirt-sock"
# If it shows as a directory (not a socket), remove it:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
# Then retry libvirtd -d
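Because the cleanup uses rm -rf, it may be safer to let the shell confirm the path really is a directory before removing it. This is a minimal sketch with a hypothetical function name (clean_stale_sock). The default path is the one used above.

```shell
# Hypothetical helper: remove the path only when it is a real directory.
# A live socket or a missing path is left untouched.
clean_stale_sock() {
    path="${1:-/run/libvirt/libvirt-sock}"
    if [ -S "$path" ]; then
        echo "socket present"
    elif [ -d "$path" ] && [ ! -L "$path" ]; then
        rm -rf -- "$path"
        echo "removed directory"
    elif [ -e "$path" ]; then
        echo "unexpected file type"
    else
        echo "absent"
    fi
}
```

Run on Jupiter as clean_stale_sock with no argument, then retry libvirtd -d if it reports "removed directory".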
3. Seafile — seahub process
What breaks: Seahub (Django/gunicorn) does not survive container restart cleanly. Containers show "Up" but sync.azcomputerguru.com returns 5xx.
Check:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"
Healthy: 3+ gunicorn worker processes visible
Broken: no gunicorn output
Fix:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"
Verify:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
# Should return 302
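The gunicorn check can be reduced to a count. This is a sketch with a hypothetical function name (seahub_state). The threshold of 3 comes from the "3+" rule above, and the count includes any line mentioning gunicorn, not only workers.

```shell
# Hypothetical helper: reads `ps aux` output on stdin and prints
# "healthy" when at least 3 gunicorn lines are present, else "broken".
seahub_state() {
    # grep -c exits non-zero on zero matches but still prints 0.
    count=$(grep -c 'gunicorn' || true)
    if [ "$count" -ge 3 ]; then
        echo healthy
    else
        echo broken
    fi
}
```

Possible use in a shell session on Jupiter: docker exec seafile ps aux | seahub_state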
4. NPM — iptables port 443 rule
What breaks: The iptables PREROUTING rule that routes :443 → NPM container is added at boot
via /boot/config/go on Jupiter. If that rule is missing (e.g. first boot after it was added),
sync.azcomputerguru.com HTTPS will fail even though NPM is running.
Check:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"
Healthy: DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443
Fix (if missing):
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
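To avoid inserting the rule twice, the check and the fix can be chained. This is a sketch, assuming the listing format shown under "Healthy" above. The function name has_https_dnat is hypothetical.

```shell
# Hypothetical helper: reads `iptables -t nat -L PREROUTING -n` on stdin.
# Returns 0 if a DNAT rule sends tcp/443 to 172.17.0.2:443 (the target used in this runbook).
has_https_dnat() {
    grep -Eq '^DNAT[[:space:]].*tcp dpt:443 to:172\.17\.0\.2:443'
}

# Possible use on Jupiter (adds the rule only when it is missing):
#   iptables -t nat -L PREROUTING -n | has_https_dnat ||
#       iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443
```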
5. NPM — nginx health
What breaks: NPM's nginx may not be serving after a container restart.
Check:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"
Fix (reload nginx config):
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"
6. End-to-End Verification
Run all of these. Any [FAIL] line is a problem.
# Core network
$checks = @(
@{host="172.16.3.20"; port=22; label="Jupiter SSH"},
@{host="172.16.3.20"; port=3000; label="Gitea"},
@{host="172.16.3.30"; port=22; label="GuruRMM VM SSH"},
@{host="172.16.3.30"; port=3001; label="GuruRMM server"},
@{host="172.16.3.30"; port=8001; label="Coord API"},
@{host="172.16.3.20"; port=443; label="NPM HTTPS (via iptables)"},
@{host="172.16.3.20"; port=8082; label="Seafile direct"}
)
foreach ($c in $checks) {
$r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
$status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
Write-Host "$status $($c.label) ($($c.host):$($c.port))"
}
# DNS
Clear-DnsClientCache
$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
$dnsOk = $dns.IPAddress -eq "172.16.3.20"
Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"
# HTTPS end-to-end
$resp = Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction SilentlyContinue
Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"
Infrastructure Reference
| Host | IP | Role |
|---|---|---|
| pfSense | 172.16.0.1 (SSH port 2248) | Router, DNS, Tailscale subnet router |
| Jupiter | 172.16.3.20 | Unraid NAS — hosts all VMs + Docker |
| Uranus | 172.16.3.21 | OwnCloud additional storage (not a proxy) |
| GuruRMM VM | 172.16.3.30 | Linux VM on Jupiter — GuruRMM server, Coord API, MariaDB |
| Pluto | 172.16.3.36 | Windows Server 2019 VM on Jupiter — build server |
| Tailscale range | 172.16.0.0/22 | Advertised via pfSense pfsense-2 node |
Docker containers on Jupiter (172.16.3.20):
| Container | Purpose | Key ports |
|---|---|---|
| npm | Nginx Proxy Manager | 1880 (HTTP), 7818 (admin), 18443 (HTTPS) |
| seafile | Seafile web/app | 8082 (HTTP) |
| seafile-mysql | Seafile DB | internal |
| seafile-elasticsearch | Seafile search | internal |
| seafile-memcached | Seafile cache | internal |
NPM proxy hosts:
| Domain | Backend |
|---|---|
| sync.azcomputerguru.com | 172.16.3.20:8082 (Seafile) |
| rmm.azcomputerguru.com | 172.16.3.30:3001 (GuruRMM) |
| rmm-api.azcomputerguru.com | 172.16.3.30:3001 |
| git.azcomputerguru.com | 172.16.3.20:3000 (Gitea) |
| unifi.azcomputerguru.com | (Unifi VM) |
| emby.azcomputerguru.com | (Emby) |
Known Post-Power-Failure Issue Pattern
Unraid's VM plugin (dynamix.vm.manager) should auto-mount libvirt.img at boot.
When it doesn't, the root cause is usually that the Unraid array came up before emhttp
finished initializing, or the go script ran before the array was fully mounted.
Permanent fix (TODO): Add a user script via Unraid's User Scripts plugin that runs at array start and checks/mounts libvirt.img if not already mounted. This would eliminate the manual step 2c above.
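One possible shape for that user script is sketched below. It is untested against this Unraid install and is not the final fix. It assumes the script is scheduled at array start, that mountpoint and losetup are available on Jupiter, and that the paths from step 2c are correct. The function name ensure_libvirt_mounted is hypothetical.

```shell
#!/bin/bash
# Sketch of a User Scripts entry (assumed to run at array start).
# Paths are taken from step 2c; override via environment for testing.
IMG="${IMG:-/mnt/user/system/libvirt/libvirt.img}"
MNT="${MNT:-/etc/libvirt}"

ensure_libvirt_mounted() {
    # Nothing to do if something is already mounted on the target.
    if mountpoint -q "$MNT" 2>/dev/null; then
        echo "already mounted: $MNT"
        return 0
    fi
    if [ ! -f "$IMG" ]; then
        echo "image not found: $IMG" >&2
        return 1
    fi
    # Same two commands as the manual fix: attach a loop device, then mount it.
    loopdev=$(losetup -f --show "$IMG") || return 1
    mount "$loopdev" "$MNT"
}
```

The script would end with a call to ensure_libvirt_mounted. This sketch only covers the mount; starting libvirtd afterwards remains as in step 2c.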
2026-05-17 Post-Mortem
Root cause: Power flicker at the office. UPS batteries had been disconnected during a rack reorganization, so the units had no backup capacity and shut down on the flicker instead of riding through it.
Resolution: Mike reconnected batteries and restarted UPS units.
Auto-recovery: Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.
Manual intervention required: IX server (neptune/exchange host) did NOT auto-restart — required a physical button press at the rack. Note for future: verify whether this is always the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).
Remote fixes applied: All steps 1–5 above were needed. Total recovery time ~1 hour.
Last updated: 2026-05-17 (documented after power failure recovery)
Checked by: Mike Swanson