# Power Failure Recovery Runbook — ACG Office

Run through these checks IN ORDER after any unplanned power event.
All SSH uses `C:\Windows\System32\OpenSSH\ssh.exe` (never Git SSH).

---

## 0. Confirm you have LAN access

If working remotely, Tailscale must be fixed before anything else can be reached.
If on-site LAN, skip to Step 1.

---

## 1. pfSense — Tailscale subnet routes

**What breaks:** After reboot, pfSense loses its advertised Tailscale routes (`AdvertiseRoutes: null`). Remote machines can no longer reach 172.16.x.x.

**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"
```
Healthy output: `"AdvertiseRoutes": ["172.16.0.0/22"]` and `"RouteAll": true`

**Fix:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"
```

**Verify:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
# pfsense-2 should NOT show "rx 0" after a few seconds
Test-NetConnection -ComputerName 172.16.3.20 -Port 22
```

---

## 2. Jupiter (Unraid) — libvirt / VMs

**What breaks:** libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot. libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.

**Host:** 172.16.3.20 (SSH as root, no password — key auth)

### 2a. Check if libvirt.img is mounted
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"
```
Healthy: shows `/dev/loopN on /etc/libvirt`

Broken: no output

### 2b. Check libvirtd process
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"
```

### 2c. Fix — mount image and start libvirtd
```powershell
# Mount libvirt config image
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
# Note the loop device returned (e.g. /dev/loop4)
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"

# Start libvirtd
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"

# Verify VMs came up
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"
```

**Expected VM list:**

| Name | Expected State |
|------|---------------|
| GuruRMM | running |
| Unifi | running |
| OwnCloud | running |
| Claude-Builder | running |
| Windows 7 | shut off |
| Windows Server 2016 | shut off |
| Windows Server 2016_Template | shut off |

### 2d. Stale socket cleanup (if libvirtd still fails)
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -la /run/libvirt/libvirt-sock"
# If it shows as a directory (not a socket), remove it:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
# Then retry libvirtd -d
```

---

## 3. Seafile — seahub process

**What breaks:** Seahub (Django/gunicorn) does not survive container restart cleanly. Containers show "Up" but sync.azcomputerguru.com returns 5xx.

**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"
```
Healthy: 3+ gunicorn worker processes visible

Broken: no gunicorn output

**Fix:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"
```

**Verify:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
# Should return 302
```

---

## 4. NPM — iptables port 443 rule

**What breaks:** The iptables PREROUTING rule that routes :443 → NPM container is added at boot via `/boot/config/go` on Jupiter. If that rule is missing (e.g. first boot after it was added), sync.azcomputerguru.com HTTPS will fail even though NPM is running.

**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"
```
Healthy: `DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443`

**Fix (if missing):**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
```

---

## 5. NPM — nginx health

**What breaks:** NPM's nginx may not be serving after a container restart.

**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"
```

**Fix (reload nginx config):**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"
```

---

## 6. End-to-End Verification

Run all of these. Any False or non-2xx is a problem.
```powershell
# Core network
$checks = @(
    @{host="172.16.3.20"; port=22; label="Jupiter SSH"},
    @{host="172.16.3.20"; port=3000; label="Gitea"},
    @{host="172.16.3.30"; port=22; label="GuruRMM VM SSH"},
    @{host="172.16.3.30"; port=3001; label="GuruRMM server"},
    @{host="172.16.3.30"; port=8001; label="Coord API"},
    @{host="172.16.3.20"; port=443; label="NPM HTTPS (via iptables)"},
    @{host="172.16.3.20"; port=8082; label="Seafile direct"}
)
foreach ($c in $checks) {
    $r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
    $status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
    Write-Host "$status $($c.label) ($($c.host):$($c.port))"
}

# DNS
Clear-DnsClientCache
$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
$dnsOk = $dns.IPAddress -eq "172.16.3.20"
Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"

# HTTPS end-to-end
$resp = Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction SilentlyContinue
Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"
```

---

## Infrastructure Reference

| Host | IP | Role |
|------|----|------|
| pfSense | 172.16.0.1 (SSH port 2248) | Router, DNS, Tailscale subnet router |
| Jupiter | 172.16.3.20 | Unraid NAS — hosts all VMs + Docker |
| Uranus | 172.16.3.21 | OwnCloud additional storage (not a proxy) |
| GuruRMM VM | 172.16.3.30 | Linux VM on Jupiter — GuruRMM server, Coord API, MariaDB |
| Pluto | 172.16.3.36 | Windows Server 2019 VM on Jupiter — build server |
| Tailscale range | 172.16.0.0/22 | Advertised via pfSense pfsense-2 node |

**Docker containers on Jupiter (172.16.3.20):**

| Container | Purpose | Key ports |
|-----------|---------|-----------|
| npm | Nginx Proxy Manager | 1880 (HTTP), 7818 (admin), 18443 (HTTPS) |
| seafile | Seafile web/app | 8082 (HTTP) |
| seafile-mysql | Seafile DB | internal |
| seafile-elasticsearch | Seafile search | internal |
| seafile-memcached | Seafile cache | internal |

**NPM proxy hosts:**

| Domain | Backend |
|--------|---------|
| sync.azcomputerguru.com | 172.16.3.20:8082 (Seafile) |
| rmm.azcomputerguru.com | 172.16.3.30:3001 (GuruRMM) |
| rmm-api.azcomputerguru.com | 172.16.3.30:3001 |
| git.azcomputerguru.com | 172.16.3.20:3000 (Gitea) |
| unifi.azcomputerguru.com | (Unifi VM) |
| emby.azcomputerguru.com | (Emby) |

---

## Known Post-Power-Failure Issue Pattern

Unraid's VM plugin (`dynamix.vm.manager`) should auto-mount `libvirt.img` at boot. When it doesn't, the root cause is usually that the Unraid array came up before emhttp finished initializing, or the go script ran before the array was fully mounted.

**Permanent fix (TODO):** Add a user script via Unraid's User Scripts plugin that runs at array start and checks/mounts libvirt.img if not already mounted. This would eliminate the manual step 2c above.

---

## 2026-05-17 Post-Mortem

**Root cause:** Power flicker at the office. UPS batteries were disconnected during a rack reorganization move, so units had no backup capacity and shut down on the flicker instead of riding through it.

**Resolution:** Mike reconnected batteries and restarted UPS units.

**Auto-recovery:** Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.

**Manual intervention required:** IX server (neptune/exchange host) did NOT auto-restart — required a physical button press at the rack. Note for future: verify whether this is always the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).

**Remote fixes applied:** All steps 1–5 above were needed. Total recovery time ~1 hour.

---

*Last updated: 2026-05-17 — documented after power failure recovery*
*Checked by: Mike Swanson*
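
---

## Appendix: sketch of the array-start libvirt.img check

The permanent fix listed as a TODO under "Known Post-Power-Failure Issue Pattern" has not been written yet. The script below is a minimal sketch of what it might look like, not a tested implementation. It reuses the image path, mount point, and `losetup`/`mount` commands from step 2c. The function names and the `LIBVIRT_IMG`, `LIBVIRT_MNT`, and `MOUNTS_FILE` overrides are hypothetical, added only so the logic can be exercised without touching a real mount.

```shell
#!/bin/bash
# Sketch only: check whether libvirt.img is mounted and, if not, loop-mount it.
# Paths are the ones used in step 2c of this runbook.
# LIBVIRT_IMG, LIBVIRT_MNT and MOUNTS_FILE are hypothetical overrides for testing.

IMG="${LIBVIRT_IMG:-/mnt/user/system/libvirt/libvirt.img}"
MNT="${LIBVIRT_MNT:-/etc/libvirt}"

# Succeeds if the given path appears as a mount point in the mounts table.
is_mounted() {
    grep -qsF " $1 " "${MOUNTS_FILE:-/proc/mounts}"
}

ensure_libvirt_mounted() {
    if is_mounted "$MNT"; then
        echo "libvirt.img already mounted on $MNT"
        return 0
    fi
    if [ ! -f "$IMG" ]; then
        echo "image not found: $IMG (array not started yet?)" >&2
        return 1
    fi
    # Same two commands as the manual fix: attach a free loop device, then mount it.
    loopdev=$(losetup -f --show "$IMG") || return 1
    mount "$loopdev" "$MNT" || return 1
    echo "mounted $loopdev on $MNT"
}

# Report failure without aborting, so the operator knows to fall back to step 2c.
ensure_libvirt_mounted || echo "libvirt.img check failed; fall back to step 2c" >&2
```

The sketch only mounts the image. It does not start libvirtd, so `libvirtd -d` and the `virsh -c qemu:///system list --all` check from step 2c would still follow, either by hand or as further lines in the same script. It would need to be scheduled to run at array start, as the TODO describes, and verified on Jupiter before being relied on.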