claudetools/.claude/POWER_FAILURE_RUNBOOK.md
Mike Swanson 3baaf91183 sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-17 22:07:52
2026-05-17 22:07:59 -07:00


Power Failure Recovery Runbook — ACG Office

Run through these checks IN ORDER after any unplanned power event. All SSH uses C:\Windows\System32\OpenSSH\ssh.exe (never Git SSH).


0. Confirm you have LAN access

If working remotely, Tailscale must be fixed before anything else can be reached. If on-site LAN, skip to Step 1.


1. pfSense — Tailscale subnet routes

What breaks: After reboot, pfSense loses its advertised Tailscale routes (AdvertiseRoutes: null). Remote machines can no longer reach 172.16.x.x.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"

Healthy output: "AdvertiseRoutes": ["172.16.0.0/22"] and "RouteAll": true

Fix:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"

Verify:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
# pfsense-2 should NOT show "rx 0" after a few seconds
Test-NetConnection -ComputerName 172.16.3.20 -Port 22
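The check-then-fix sequence above could also be run as a single script on pfSense itself. The following is a minimal sketch, not part of the existing setup: `routes_advertised` is a hypothetical helper name, and it assumes `tailscale debug prefs` prints the subnet as a quoted string whenever the route is advertised.

```shell
#!/bin/sh
# Sketch only: the helper name and this script are assumptions, not existing tooling.
SUBNET="172.16.0.0/22"   # subnet taken from this runbook

# Succeeds if the prefs text on stdin contains the quoted subnet.
# When AdvertiseRoutes is null the subnet string is absent, so this fails.
routes_advertised() {
    grep -qF "\"$SUBNET\""
}

# Intended use on pfSense (left commented so sourcing this file changes nothing):
#   tailscale debug prefs | routes_advertised ||
#       tailscale up --advertise-routes="$SUBNET" --accept-routes
```

Matching on the subnet string rather than on the `AdvertiseRoutes` line keeps the check independent of how the JSON is wrapped across lines.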

2. Jupiter (Unraid) — libvirt / VMs

What breaks: libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot. libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.

Host: 172.16.3.20 (SSH as root, no password — key auth)

2a. Check if libvirt.img is mounted

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"

Healthy: shows /dev/loopN on /etc/libvirt
Broken: no output

2b. Check libvirtd process

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"

2c. Fix — mount image and start libvirtd

# Mount libvirt config image
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
# Note the loop device returned (e.g. /dev/loop4) and substitute it for /dev/loop4 in the mount command below
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"

# Start libvirtd
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"

# Verify VMs came up
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"

Expected VM list:

| Name | Expected State |
|---|---|
| GuruRMM | running |
| Unifi | running |
| OwnCloud | running |
| Claude-Builder | running |
| Windows 7 | shut off |
| Windows Server 2016 | shut off |
| Windows Server 2016_Template | shut off |

2d. Stale socket cleanup (if libvirtd still fails)

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -la /run/libvirt/libvirt-sock"
# If it shows as a directory (not a socket), remove it:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
# Then retry libvirtd -d
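The directory-versus-socket decision above can be made explicit before anything is deleted. This is a hedged sketch meant to run on Jupiter; `classify_sock` is a hypothetical helper name, and only the standard `test` operators `-S`, `-d`, and `-e` are used.

```shell
#!/bin/sh
# Sketch only: classify_sock is a name introduced for illustration.

# Prints what kind of filesystem object sits at the given path.
classify_sock() {
    path=$1
    if [ -S "$path" ]; then
        echo socket      # a real socket: leave it alone
    elif [ -d "$path" ]; then
        echo directory   # the broken state described above
    elif [ -e "$path" ]; then
        echo other
    else
        echo missing
    fi
}

# Intended use on Jupiter (commented out so sourcing this file deletes nothing):
#   [ "$(classify_sock /run/libvirt/libvirt-sock)" = directory ] &&
#       rm -rf /run/libvirt/libvirt-sock
```

Guarding the `rm -rf` this way avoids removing a live socket if libvirtd has already recovered.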

3. Seafile — seahub process

What breaks: Seahub (Django/gunicorn) does not survive container restart cleanly. Containers show "Up" but sync.azcomputerguru.com returns 5xx.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"

Healthy: 3+ gunicorn worker processes visible
Broken: no gunicorn output

Fix:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"

Verify:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
# Should return 302
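For repeated checks, the "3+ gunicorn processes" rule can be turned into a yes/no test. The sketch below is an illustration only: `count_gunicorn`, `seahub_healthy`, and `MIN_WORKERS` are hypothetical names, and the threshold of 3 is simply the figure quoted in this step.

```shell
#!/bin/sh
# Sketch only: function and variable names are introduced for illustration.
MIN_WORKERS=3   # the "3+" threshold from this runbook

# Counts lines of ps output on stdin that mention gunicorn.
# grep -c prints 0 and exits non-zero when nothing matches, so the status is discarded.
count_gunicorn() {
    grep -c 'gunicorn' || true
}

# Succeeds when at least MIN_WORKERS gunicorn processes are listed on stdin.
seahub_healthy() {
    [ "$(count_gunicorn)" -ge "$MIN_WORKERS" ]
}

# Intended use on Jupiter:
#   docker exec seafile ps aux | seahub_healthy || echo "seahub down: run the fix above"
```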

4. NPM — iptables port 443 rule

What breaks: The iptables PREROUTING rule that routes :443 → NPM container is added at boot via /boot/config/go on Jupiter. If that rule is missing (e.g. first boot after it was added), sync.azcomputerguru.com HTTPS will fail even though NPM is running.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"

Healthy: DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443

Fix (if missing):

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
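Because `-I` does not look for an existing entry, running the fix twice inserts the rule twice. A minimal sketch of an idempotent variant, using iptables' `-C` (check) option, follows. `ensure_dnat_443` and the `IPTABLES` and `NPM_ADDR` variables are names introduced here for illustration. Note that `-C` only matches a rule with exactly the same specification, so if the rule in /boot/config/go is written differently (an assumption that has not been checked), this would not detect it.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical; the rule itself is the one from this step.
NPM_ADDR="${NPM_ADDR:-172.17.0.2:443}"

# Adds the :443 DNAT rule only if an identical rule is not already present.
ensure_dnat_443() {
    ipt=${IPTABLES:-iptables}
    if "$ipt" -t nat -C PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$NPM_ADDR" 2>/dev/null; then
        echo "present"
    else
        "$ipt" -t nat -I PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$NPM_ADDR" && echo "added"
    fi
}

# Intended use on Jupiter (as root):
#   ensure_dnat_443
```

The `IPTABLES` override exists only so the function can be exercised with a stand-in command before it is pointed at the real firewall.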

5. NPM — nginx health

What breaks: NPM's nginx may not be serving after a container restart.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"

Fix (reload nginx config):

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"

6. End-to-End Verification

Run all of these. Any [FAIL] line is a problem.

# Core network
$checks = @(
    @{host="172.16.3.20"; port=22;   label="Jupiter SSH"},
    @{host="172.16.3.20"; port=3000; label="Gitea"},
    @{host="172.16.3.30"; port=22;   label="GuruRMM VM SSH"},
    @{host="172.16.3.30"; port=3001; label="GuruRMM server"},
    @{host="172.16.3.30"; port=8001; label="Coord API"},
    @{host="172.16.3.20"; port=443;  label="NPM HTTPS (via iptables)"},
    @{host="172.16.3.20"; port=8082; label="Seafile direct"}
)
foreach ($c in $checks) {
    $r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
    $status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
    Write-Host "$status $($c.label) ($($c.host):$($c.port))"
}

# DNS
Clear-DnsClientCache
$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
$dnsOk = $dns.IPAddress -eq "172.16.3.20"
Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"

# HTTPS end-to-end
$resp = Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction SilentlyContinue
Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"

Infrastructure Reference

| Host | IP | Role |
|---|---|---|
| pfSense | 172.16.0.1 (SSH port 2248) | Router, DNS, Tailscale subnet router |
| Jupiter | 172.16.3.20 | Unraid NAS; hosts all VMs + Docker |
| Uranus | 172.16.3.21 | OwnCloud additional storage (not a proxy) |
| GuruRMM VM | 172.16.3.30 | Linux VM on Jupiter: GuruRMM server, Coord API, MariaDB |
| Pluto | 172.16.3.36 | Windows Server 2019 VM on Jupiter (build server) |
| Tailscale range | 172.16.0.0/22 | Advertised via pfSense pfsense-2 node |

Docker containers on Jupiter (172.16.3.20):

| Container | Purpose | Key ports |
|---|---|---|
| npm | Nginx Proxy Manager | 1880 (HTTP), 7818 (admin), 18443 (HTTPS) |
| seafile | Seafile web/app | 8082 (HTTP) |
| seafile-mysql | Seafile DB | internal |
| seafile-elasticsearch | Seafile search | internal |
| seafile-memcached | Seafile cache | internal |

NPM proxy hosts:

| Domain | Backend |
|---|---|
| sync.azcomputerguru.com | 172.16.3.20:8082 (Seafile) |
| rmm.azcomputerguru.com | 172.16.3.30:3001 (GuruRMM) |
| rmm-api.azcomputerguru.com | 172.16.3.30:3001 |
| git.azcomputerguru.com | 172.16.3.20:3000 (Gitea) |
| unifi.azcomputerguru.com | (Unifi VM) |
| emby.azcomputerguru.com | (Emby) |

Known Post-Power-Failure Issue Pattern

Unraid's VM plugin (dynamix.vm.manager) should auto-mount libvirt.img at boot. When it doesn't, the root cause is usually that the Unraid array came up before emhttp finished initializing, or the go script ran before the array was fully mounted.

Permanent fix (TODO): Add a user script via Unraid's User Scripts plugin that runs at array start and checks/mounts libvirt.img if not already mounted. This would eliminate the manual step 2c above.
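One possible shape for that user script is sketched below. It is a starting point under stated assumptions, not a tested fix: the image path and mount point are the ones from step 2c, the function names are hypothetical, and it assumes `/proc/mounts` lists `/etc/libvirt` once the image is mounted.

```shell
#!/bin/bash
# Sketch of an array-start user script; untested on Unraid.
# Paths come from step 2c of this runbook; function names are illustrative.
IMG="${IMG:-/mnt/user/system/libvirt/libvirt.img}"
MNT="${MNT:-/etc/libvirt}"

# True if $1 is a mount point in the mount table file $2 (default /proc/mounts).
is_mounted() {
    awk -v m="$1" '$2 == m { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Loop-mounts the image on MNT unless something is already mounted there.
ensure_libvirt_mounted() {
    if is_mounted "$MNT"; then
        echo "already mounted"
        return 0
    fi
    [ -f "$IMG" ] || { echo "image not found: $IMG" >&2; return 1; }
    loopdev=$(losetup -f --show "$IMG") || return 1
    mount "$loopdev" "$MNT" && echo "mounted $loopdev on $MNT"
}

# In the real user script, the last line would call:
#   ensure_libvirt_mounted
```

Capturing the device name from `losetup -f --show` removes the need to read it off the screen and retype it, which is the error-prone part of the manual procedure. Starting libvirtd afterwards is left out on purpose, since whether the VM manager retries on its own once the mount exists has not been verified.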



2026-05-17 Post-Mortem

Root cause: Power flicker at the office. The UPS batteries had been disconnected during a rack reorganization, so the units had no backup capacity and shut down on the flicker instead of riding through it.

Resolution: Mike reconnected batteries and restarted UPS units.

Auto-recovery: Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.

Manual intervention required: IX server (neptune/exchange host) did NOT auto-restart — required a physical button press at the rack. Note for future: verify whether this is always the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).

Remote fixes applied: All of steps 1-5 above were needed. Total recovery time ~1 hour.


Last updated: 2026-05-17 (documented after power failure recovery)
Checked by: Mike Swanson