Power Failure Recovery Runbook — ACG Office

Run through these checks IN ORDER after any unplanned power event. All SSH commands use the Windows OpenSSH client at C:\Windows\System32\OpenSSH\ssh.exe (never the ssh bundled with Git).

0. Confirm you have LAN access

If working remotely, Tailscale must be fixed before anything else can be reached. If on-site LAN, skip to Step 1.

1. pfSense — Tailscale subnet routes

What breaks: After reboot, pfSense loses its advertised Tailscale routes (AdvertiseRoutes: null). Remote machines can no longer reach 172.16.x.x.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"

Healthy output: "AdvertiseRoutes": ["172.16.0.0/22"] and "RouteAll": true

Fix:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"

Verify:

& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
# pfsense-2 should NOT show "rx 0" after a few seconds
Test-NetConnection -ComputerName 172.16.3.20 -Port 22
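The check and fix above can be folded into one conditional so the routes are only re-advertised when they are actually missing. This is a minimal sketch, assuming it is saved as a script on pfSense and run with /bin/sh, and that `tailscale debug prefs` prints JSON in the shape shown under "Healthy output". The function names `routes_advertised` and `ensure_routes` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the prefs JSON on stdin lists the given
# CIDR ($1) inside its AdvertiseRoutes array. "AdvertiseRoutes": null fails.
routes_advertised() {
    tr -d ' \t\r\n' |
        grep -o '"AdvertiseRoutes":\[[^]]*]' |
        grep -F -q "\"$1\""
}

# Hypothetical wrapper: re-advertise only when the route is missing.
ensure_routes() {
    if tailscale debug prefs | routes_advertised "$1"; then
        echo "routes ok"
    else
        tailscale up --advertise-routes="$1" --accept-routes
    fi
}

# Usage on pfSense:
#   ensure_routes 172.16.0.0/22
```

Stripping whitespace first means the match works whether the JSON is printed on one line or pretty-printed across several.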

2. Jupiter (Unraid) — libvirt / VMs

What breaks: libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot. libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.

Host: 172.16.3.20 (SSH as root, no password — key auth)

2a. Check if libvirt.img is mounted

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"

Healthy: shows /dev/loopN on /etc/libvirt

Broken: no output

2b. Check libvirtd process

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"

2c. Fix — mount image and start libvirtd

# Mount libvirt config image
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
# Note the loop device returned (e.g. /dev/loop4) and substitute it for /dev/loop4 in the mount command below
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"

# Start libvirtd
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"

# Verify VMs came up
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"

Expected VM list:

Name	Expected State
GuruRMM	running
Unifi	running
OwnCloud	running
Claude-Builder	running
Windows 7	shut off
Windows Server 2016	shut off
Windows Server 2016_Template	shut off
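Comparing the list by eye is easy to get wrong at the end of a long recovery. The sketch below, meant to run in a shell on Jupiter, prints any VM from the table that should be running but is not. It assumes the `--state-running --name` options of `virsh list`, which print one running domain name per line. The function name `missing_vms` and the variable `EXPECTED_RUNNING` are hypothetical.

```shell
#!/bin/bash
# VMs expected to be running, copied from the table above.
EXPECTED_RUNNING="GuruRMM
Unifi
OwnCloud
Claude-Builder"

# Hypothetical helper: reads running VM names on stdin (one per line) and
# prints every expected VM that is absent. Prints nothing if all are running.
missing_vms() {
    running=$(cat)
    printf '%s\n' "$EXPECTED_RUNNING" | while IFS= read -r vm; do
        printf '%s\n' "$running" | grep -Fxq -- "$vm" || printf '%s\n' "$vm"
    done
}

# Usage on Jupiter:
#   virsh -c qemu:///system list --state-running --name | missing_vms
```

Matching whole lines with `grep -Fx` keeps names such as "Windows Server 2016" from matching "Windows Server 2016_Template".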

2d. Stale socket cleanup (if libvirtd still fails)

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -la /run/libvirt/libvirt-sock"
# If it shows as a directory (not a socket), remove it:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
# Then retry libvirtd -d
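The same cleanup can be made safer by testing the path type before deleting anything, so a live socket is never removed by mistake. This is a sketch only, assuming a shell on Jupiter; the function name `clean_stale_sock` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: remove the path only when it is a directory (the
# broken case described above). A real socket is left untouched.
clean_stale_sock() {
    sock=${1:-/run/libvirt/libvirt-sock}
    if [ -d "$sock" ]; then
        rm -rf -- "$sock"
        echo "removed stale directory: $sock"
    elif [ -S "$sock" ]; then
        echo "socket present, nothing to clean: $sock"
    else
        echo "no socket at $sock"
    fi
}

# Usage on Jupiter, followed by a retry of libvirtd -d:
#   clean_stale_sock
```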

3. Seafile — seahub process

What breaks: Seahub (Django/gunicorn) does not survive container restart cleanly. Containers show "Up" but sync.azcomputerguru.com returns 5xx.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"

Healthy: 3+ gunicorn worker processes visible

Broken: no gunicorn output

Fix:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"

Verify:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
# Should return 302
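The check and fix above can be combined so seahub is only started when no gunicorn process is found. This is a minimal sketch, assuming a shell on Jupiter and the container name and install path used in the commands above. The function names `gunicorn_count` and `ensure_seahub` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count lines mentioning gunicorn in `ps aux` output
# read from stdin. grep -c exits non-zero on zero matches, hence `|| true`.
gunicorn_count() {
    grep -c 'gunicorn' || true
}

# Hypothetical wrapper: start seahub only when no gunicorn process exists.
ensure_seahub() {
    n=$(docker exec seafile ps aux 2>&1 | gunicorn_count)
    if [ "$n" -gt 0 ]; then
        # Fewer than 3 processes is worth a closer look (see "Healthy" above).
        echo "seahub running ($n gunicorn processes)"
    else
        docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'
    fi
}

# Usage on Jupiter:
#   ensure_seahub
```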

4. NPM — iptables port 443 rule

What breaks: The iptables PREROUTING rule that routes :443 → NPM container is added at boot via /boot/config/go on Jupiter. If that rule is missing (e.g. first boot after it was added), sync.azcomputerguru.com HTTPS will fail even though NPM is running.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"

Healthy: DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443

Fix (if missing):

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
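Running the fix twice inserts the rule twice. The sketch below uses the `-C` (check) option of iptables to insert the rule only when an identical one is absent. It assumes a shell on Jupiter, and that the rule written by /boot/config/go has exactly the same match fields; if that rule carries extra matches, `-C` will not see it. The function name `ensure_dnat_443` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: add the :443 DNAT rule only if an identical rule
# is not already in the nat PREROUTING chain.
ensure_dnat_443() {
    target=${1:-172.17.0.2:443}
    if iptables -t nat -C PREROUTING -p tcp --dport 443 -j DNAT --to-destination "$target" 2>/dev/null; then
        echo "rule present"
    else
        iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination "$target" &&
            echo "rule added"
    fi
}

# Usage on Jupiter:
#   ensure_dnat_443
```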

5. NPM — nginx health

What breaks: NPM's nginx may not be serving after a container restart.

Check:

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"

Fix (reload nginx config):

& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"
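Chaining the two commands avoids reloading a configuration that fails its own syntax test. This is a small sketch, assuming a shell on Jupiter and the container name `npm` used above; the function name `reload_npm_nginx` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reload NPM's nginx only when `nginx -t` passes.
reload_npm_nginx() {
    if docker exec npm nginx -t >/dev/null 2>&1; then
        docker exec npm nginx -s reload && echo "reloaded"
    else
        echo "config test failed; not reloading" >&2
        return 1
    fi
}

# Usage on Jupiter:
#   reload_npm_nginx
```

Note that `nginx -t` only validates the configuration; it does not prove nginx is serving traffic, which the end-to-end checks cover.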

6. End-to-End Verification

Run all of these. Any [FAIL] line is a problem; the HTTPS check passes only on a 200 response.

# Core network
$checks = @(
    @{host="172.16.3.20"; port=22;   label="Jupiter SSH"},
    @{host="172.16.3.20"; port=3000; label="Gitea"},
    @{host="172.16.3.30"; port=22;   label="GuruRMM VM SSH"},
    @{host="172.16.3.30"; port=3001; label="GuruRMM server"},
    @{host="172.16.3.30"; port=8001; label="Coord API"},
    @{host="172.16.3.20"; port=443;  label="NPM HTTPS (via iptables)"},
    @{host="172.16.3.20"; port=8082; label="Seafile direct"}
)
foreach ($c in $checks) {
    $r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
    $status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
    Write-Host "$status $($c.label) ($($c.host):$($c.port))"
}

# DNS
Clear-DnsClientCache
$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
$dnsOk = $dns.IPAddress -eq "172.16.3.20"
Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"

# HTTPS end-to-end
$resp = Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction SilentlyContinue
Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"

Infrastructure Reference

Host	IP	Role
pfSense	172.16.0.1 (SSH port 2248)	Router, DNS, Tailscale subnet router
Jupiter	172.16.3.20	Unraid NAS — hosts all VMs + Docker
Uranus	172.16.3.21	OwnCloud additional storage (not a proxy)
GuruRMM VM	172.16.3.30	Linux VM on Jupiter — GuruRMM server, Coord API, MariaDB
Pluto	172.16.3.36	Windows Server 2019 VM on Jupiter — build server
Tailscale range	172.16.0.0/22	Advertised via pfSense pfsense-2 node

Docker containers on Jupiter (172.16.3.20):

Container	Purpose	Key ports
npm	Nginx Proxy Manager	1880 (HTTP), 7818 (admin), 18443 (HTTPS)
seafile	Seafile web/app	8082 (HTTP)
seafile-mysql	Seafile DB	internal
seafile-elasticsearch	Seafile search	internal
seafile-memcached	Seafile cache	internal

NPM proxy hosts:

Domain	Backend
sync.azcomputerguru.com	172.16.3.20:8082 (Seafile)
rmm.azcomputerguru.com	172.16.3.30:3001 (GuruRMM)
rmm-api.azcomputerguru.com	172.16.3.30:3001
git.azcomputerguru.com	172.16.3.20:3000 (Gitea)
unifi.azcomputerguru.com	(Unifi VM)
emby.azcomputerguru.com	(Emby)

Known Post-Power-Failure Issue Pattern

Unraid's VM plugin (dynamix.vm.manager) should auto-mount libvirt.img at boot. When it doesn't, the root cause is usually that the Unraid array came up before emhttp finished initializing, or the go script ran before the array was fully mounted.

Permanent fix (TODO): Add a user script via Unraid's User Scripts plugin that runs at array start and checks/mounts libvirt.img if not already mounted. This would eliminate the manual step 2c above.
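One possible shape for that user script is sketched below. It is not a tested fix: it assumes the image path and mount point used in step 2c, assumes the script is scheduled at array start in the User Scripts plugin, and uses a hypothetical function name, `ensure_libvirt_mounted`. It deliberately stops at mounting and leaves starting libvirtd to the existing steps.

```shell
#!/bin/bash
# Hypothetical User Scripts entry: mount libvirt.img if it is not mounted.
IMG=/mnt/user/system/libvirt/libvirt.img
MNT=/etc/libvirt

ensure_libvirt_mounted() {
    img=${1:-$IMG}
    mnt=${2:-$MNT}
    # Same test as step 2a: is anything mounted on the config directory?
    if mount | grep -F -q " on $mnt "; then
        echo "already mounted: $mnt"
        return 0
    fi
    if [ ! -f "$img" ]; then
        echo "image not found: $img" >&2
        return 1
    fi
    # Attach the image to the first free loop device and capture its name,
    # instead of hardcoding /dev/loopN.
    loopdev=$(losetup -f --show "$img") || return 1
    mount "$loopdev" "$mnt"
}

# Usage (last line of the user script):
#   ensure_libvirt_mounted
```

Capturing the loop device from `losetup -f --show` removes the manual "note the device" step, and the early return makes the script safe to run on every array start.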

2026-05-17 Post-Mortem

Root cause: Power flicker at the office. The UPS batteries had been disconnected during a rack reorganization, so the units had no backup capacity and shut down on the flicker instead of riding through it.

Resolution: Mike reconnected batteries and restarted UPS units.

Auto-recovery: Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.

Manual intervention required: IX server (neptune/exchange host) did NOT auto-restart — required a physical button press at the rack. Note for future: verify whether this is always the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).

Remote fixes applied: All steps 1–5 above were needed. Total recovery time ~1 hour.

Last updated: 2026-05-17 (documented after power failure recovery)

Checked by: Mike Swanson
