claudetools/.claude/POWER_FAILURE_RUNBOOK.md
Mike Swanson 3baaf91183 sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-17 22:07:52
2026-05-17 22:07:59 -07:00

# Power Failure Recovery Runbook — ACG Office
Run through these checks IN ORDER after any unplanned power event.
Every SSH command uses the Windows OpenSSH client at `C:\Windows\System32\OpenSSH\ssh.exe`, never the `ssh.exe` bundled with Git.
---
## 0. Confirm you have LAN access
If working remotely, Tailscale must be fixed before anything else can be reached.
If on-site LAN, skip to Step 1.
---
## 1. pfSense — Tailscale subnet routes
**What breaks:** After reboot, pfSense loses its advertised Tailscale routes (`AdvertiseRoutes: null`).
Remote machines can no longer reach 172.16.x.x.
**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"
```
Healthy output: `"AdvertiseRoutes": ["172.16.0.0/22"]` and `"RouteAll": true`
**Fix:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"
```
**Verify:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
# pfsense-2 should NOT show "rx 0" after a few seconds
Test-NetConnection -ComputerName 172.16.3.20 -Port 22
```
---
## 2. Jupiter (Unraid) — libvirt / VMs
**What breaks:** libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot.
libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.
**Host:** 172.16.3.20 (SSH as root, no password — key auth)
### 2a. Check if libvirt.img is mounted
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"
```
Healthy: shows `/dev/loopN on /etc/libvirt`
Broken: no output
### 2b. Check libvirtd process
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"
```
### 2c. Fix — mount image and start libvirtd
```powershell
# Mount libvirt config image
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
# Note the loop device returned (e.g. /dev/loop4) and substitute it for /dev/loop4 in the mount command below
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"
# Start libvirtd
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"
# Verify VMs came up
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"
```
**Expected VM list:**
| Name | Expected State |
|------|---------------|
| GuruRMM | running |
| Unifi | running |
| OwnCloud | running |
| Claude-Builder | running |
| Windows 7 | shut off |
| Windows Server 2016 | shut off |
| Windows Server 2016_Template | shut off |
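The comparison against the table above can be scripted. This is a minimal sketch, assuming it runs in a bash shell on Jupiter and that `virsh list --all` prints its usual three-column `Id  Name  State` layout. `EXPECTED_VMS` and `check_vm_states` are names invented here for illustration, not existing tooling.

```bash
#!/bin/bash
# Expected VM states, copied from the table above as "name|state" pairs.
EXPECTED_VMS='GuruRMM|running
Unifi|running
OwnCloud|running
Claude-Builder|running
Windows 7|shut off
Windows Server 2016|shut off
Windows Server 2016_Template|shut off'

# Reads `virsh list --all` output on stdin, prints one [OK]/[FAIL] line
# per expected VM, and returns non-zero if any VM is missing or in the
# wrong state.
check_vm_states() {
    local listing rc=0 name state
    listing=$(cat)
    while IFS='|' read -r name state; do
        # Match "<id or ->  <name>  <state>" exactly, so that
        # "Windows Server 2016" does not match the _Template VM.
        if printf '%s\n' "$listing" | grep -Eq "^ *[-0-9]+ +${name} +${state} *\$"; then
            echo "[OK] $name ($state)"
        else
            echo "[FAIL] $name (want: $state)"
            rc=1
        fi
    done <<EOF
$EXPECTED_VMS
EOF
    return $rc
}
```

On Jupiter it would be used as `virsh -c qemu:///system list --all | check_vm_states`. The expected list is hard-coded, so it has to be updated whenever the table changes.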
### 2d. Stale socket cleanup (if libvirtd still fails)
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -la /run/libvirt/libvirt-sock"
# If it shows as a directory (not a socket), remove it:
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
# Then retry libvirtd -d
```
---
## 3. Seafile — seahub process
**What breaks:** Seahub (Django/gunicorn) does not survive container restart cleanly.
Containers show "Up" but sync.azcomputerguru.com returns 5xx.
**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"
```
Healthy: 3+ gunicorn worker processes visible
Broken: no gunicorn output
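The healthy/broken rule above can be turned into an exit code. This is a minimal sketch, assuming a bash shell on Jupiter; `seahub_workers_ok` is a hypothetical helper name. It counts every line that mentions gunicorn, master process included, so it only approximates "3+ workers".

```bash
#!/bin/bash
# Reads `ps aux` output on stdin and succeeds when at least MIN lines
# mention gunicorn (default 3, per the healthy threshold above).
seahub_workers_ok() {
    local min=${1:-3} count
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the count and discards that exit status.
    count=$(grep -c 'gunicorn' || true)
    [ "$count" -ge "$min" ]
}
```

For example: `docker exec seafile ps aux | seahub_workers_ok || echo "seahub needs a restart"`.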
**Fix:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"
```
**Verify:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
# Should return 302
```
---
## 4. NPM — iptables port 443 rule
**What breaks:** The iptables PREROUTING rule that routes :443 → NPM container is added at boot
via `/boot/config/go` on Jupiter. If that rule is missing (e.g. first boot after it was added),
sync.azcomputerguru.com HTTPS will fail even though NPM is running.
**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"
```
Healthy: `DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443`
**Fix (if missing):**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
```
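The check and the fix can be combined so that running the fix twice does not insert a duplicate rule. This is a minimal sketch, assuming a bash shell on Jupiter; `ensure_https_dnat` is a name invented here. It relies on the `-C` (check) option of `iptables`, which exits non-zero when no matching rule exists.

```bash
#!/bin/bash
# Inserts the :443 DNAT rule only if an identical rule is not already
# present. The default target is the NPM container address shown above.
ensure_https_dnat() {
    local target=${1:-172.17.0.2:443}
    if iptables -t nat -C PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$target" 2>/dev/null; then
        echo "DNAT rule for :443 already present"
    else
        iptables -t nat -I PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$target"
        echo "DNAT rule for :443 added"
    fi
}
```

Calling `ensure_https_dnat` with no argument uses 172.17.0.2:443. If the NPM container comes up with a different bridge address, pass that address as the first argument.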
---
## 5. NPM — nginx health
**What breaks:** NPM's nginx may not be serving after a container restart.
**Check:**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"
```
**Fix (reload nginx config):**
```powershell
& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"
```
---
## 6. End-to-End Verification
Run all of these. Any `[FAIL]` line is a problem.
```powershell
# Core network
$checks = @(
@{host="172.16.3.20"; port=22; label="Jupiter SSH"},
@{host="172.16.3.20"; port=3000; label="Gitea"},
@{host="172.16.3.30"; port=22; label="GuruRMM VM SSH"},
@{host="172.16.3.30"; port=3001; label="GuruRMM server"},
@{host="172.16.3.30"; port=8001; label="Coord API"},
@{host="172.16.3.20"; port=443; label="NPM HTTPS (via iptables)"},
@{host="172.16.3.20"; port=8082; label="Seafile direct"}
)
foreach ($c in $checks) {
$r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
$status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
Write-Host "$status $($c.label) ($($c.host):$($c.port))"
}
# DNS
Clear-DnsClientCache
$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
$dnsOk = $dns.IPAddress -eq "172.16.3.20"
Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"
# HTTPS end-to-end
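$resp = $null  # reset first so a failed request cannot leave a stale result from an earlier run
try { $resp = Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction Stop } catch { }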
Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"
```
---
## Infrastructure Reference
| Host | IP | Role |
|------|----|------|
| pfSense | 172.16.0.1 (SSH port 2248) | Router, DNS, Tailscale subnet router |
| Jupiter | 172.16.3.20 | Unraid NAS — hosts all VMs + Docker |
| Uranus | 172.16.3.21 | OwnCloud additional storage (not a proxy) |
| GuruRMM VM | 172.16.3.30 | Linux VM on Jupiter — GuruRMM server, Coord API, MariaDB |
| Pluto | 172.16.3.36 | Windows Server 2019 VM on Jupiter — build server |
| Tailscale range | 172.16.0.0/22 | Advertised via pfSense pfsense-2 node |
**Docker containers on Jupiter (172.16.3.20):**
| Container | Purpose | Key ports |
|-----------|---------|-----------|
| npm | Nginx Proxy Manager | 1880 (HTTP), 7818 (admin), 18443 (HTTPS) |
| seafile | Seafile web/app | 8082 (HTTP) |
| seafile-mysql | Seafile DB | internal |
| seafile-elasticsearch | Seafile search | internal |
| seafile-memcached | Seafile cache | internal |
**NPM proxy hosts:**
| Domain | Backend |
|--------|---------|
| sync.azcomputerguru.com | 172.16.3.20:8082 (Seafile) |
| rmm.azcomputerguru.com | 172.16.3.30:3001 (GuruRMM) |
| rmm-api.azcomputerguru.com | 172.16.3.30:3001 |
| git.azcomputerguru.com | 172.16.3.20:3000 (Gitea) |
| unifi.azcomputerguru.com | (Unifi VM) |
| emby.azcomputerguru.com | (Emby) |
---
## Known Post-Power-Failure Issue Pattern
Unraid's VM plugin (`dynamix.vm.manager`) should auto-mount `libvirt.img` at boot.
When it doesn't, the root cause is usually that the Unraid array came up before emhttp
finished initializing, or the go script ran before the array was fully mounted.
**Permanent fix (TODO):** Add a user script via Unraid's User Scripts plugin that runs at
array start and checks/mounts libvirt.img if not already mounted. This would eliminate
the manual step 2c above.
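A rough sketch of what such a user script might look like, reusing the commands from steps 2a and 2c. It has not been run on Jupiter; `ensure_libvirt` is a hypothetical function name, and the `IMG` and `MNT` defaults are the paths used earlier in this runbook.

```bash
#!/bin/bash
# Paths from step 2c; override via the environment if they differ.
IMG=${IMG:-/mnt/user/system/libvirt/libvirt.img}
MNT=${MNT:-/etc/libvirt}

# Mounts libvirt.img on $MNT if it is not mounted yet, then starts
# libvirtd if no libvirtd process is running. Safe to run repeatedly.
ensure_libvirt() {
    if mount | grep -q " on $MNT "; then
        echo "libvirt.img already mounted on $MNT"
    else
        [ -f "$IMG" ] || { echo "image not found: $IMG" >&2; return 1; }
        local dev
        # Attach the image to the first free loop device and mount it.
        dev=$(losetup -f --show "$IMG") || return 1
        mount "$dev" "$MNT" || return 1
        echo "mounted $dev on $MNT"
    fi
    pgrep -x libvirtd >/dev/null || libvirtd -d
}
```

The script body would end with a call to `ensure_libvirt`. It does not handle the stale socket case from step 2d, so that cleanup would remain a manual step.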
---
## 2026-05-17 Post-Mortem
**Root cause:** Power flicker at the office. UPS batteries were disconnected during a rack
reorganization move, so units had no backup capacity and shut down on the flicker instead
of riding through it.
**Resolution:** Mike reconnected batteries and restarted UPS units.
**Auto-recovery:** Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.
**Manual intervention required:** IX server (neptune/exchange host) did NOT auto-restart —
required a physical button press at the rack. Note for future: verify whether this is always
the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).
**Remote fixes applied:** All of steps 1 through 5 above were needed. Total recovery time ~1 hour.
---
*Last updated: 2026-05-17 — documented after power failure recovery*
*Checked by: Mike Swanson*