sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-17 22:07:52

Author: Mike Swanson Machine: DESKTOP-0O8A1RL Timestamp: 2026-05-17 22:07:52
2026-05-17 22:07:59 -07:00
parent acb0af9d3a
commit 3baaf91183
4 changed files with 438 additions and 0 deletions
--- a/.claude/POWER_FAILURE_RUNBOOK.md
+++ b/.claude/POWER_FAILURE_RUNBOOK.md
@@ -0,0 +1,257 @@
+# Power Failure Recovery Runbook — ACG Office
+
+Run through these checks IN ORDER after any unplanned power event.
+All SSH uses `C:\Windows\System32\OpenSSH\ssh.exe` (never Git SSH).
+
+---
+
+## 0. Confirm you have LAN access
+
+If working remotely, Tailscale must be fixed before anything else can be reached.
+If on-site LAN, skip to Step 1.
+
+---
+
+## 1. pfSense — Tailscale subnet routes
+
+**What breaks:** After reboot, pfSense loses its advertised Tailscale routes (`AdvertiseRoutes: null`).
+Remote machines can no longer reach 172.16.x.x.
+
+**Check:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale debug prefs" | Select-String "AdvertiseRoutes|RouteAll"
+```
+Healthy output: `"AdvertiseRoutes": ["172.16.0.0/22"]` and `"RouteAll": true`
+
+**Fix:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale up --advertise-routes=172.16.0.0/22 --accept-routes"
+```
+
+**Verify:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" -p 2248 admin@172.16.0.1 "tailscale status"
+# pfsense-2 should NOT show "rx 0" after a few seconds
+Test-NetConnection -ComputerName 172.16.3.20 -Port 22
+```
+
+---
+
+## 2. Jupiter (Unraid) — libvirt / VMs
+
+**What breaks:** libvirt.img (contains /etc/libvirt/ configs) is not loop-mounted on boot.
+libvirtd fails with "socket already in use" or "snapshot dir not a directory". All VMs are down.
+
+**Host:** 172.16.3.20 (SSH as root, no password — key auth)
+
+### 2a. Check if libvirt.img is mounted
+
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount | grep libvirt"
+```
+Healthy: shows `/dev/loopN on /etc/libvirt`
+Broken: no output
+
+### 2b. Check libvirtd process
+
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ps aux | grep libvirtd | grep -v grep"
+```
+
+### 2c. Fix — mount image and start libvirtd
+
+```powershell
+# Mount libvirt config image
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "losetup -f --show /mnt/user/system/libvirt/libvirt.img"
+# Note the loop device returned (e.g. /dev/loop4) and substitute it in the mount command below
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "mount /dev/loop4 /etc/libvirt && ls /etc/libvirt/qemu"
+
+# Start libvirtd
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "libvirtd -d"
+
+# Verify VMs came up
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "virsh -c qemu:///system list --all"
+```
+
+**Expected VM list:**
+| Name | Expected State |
+|------|---------------|
+| GuruRMM | running |
+| Unifi | running |
+| OwnCloud | running |
+| Claude-Builder | running |
+| Windows 7 | shut off |
+| Windows Server 2016 | shut off |
+| Windows Server 2016_Template | shut off |
+
+### 2d. Stale socket cleanup (if libvirtd still fails)
+
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "ls -la /run/libvirt/libvirt-sock"
+# If it shows as a directory (not a socket), remove it:
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "rm -rf /run/libvirt/libvirt-sock"
+# Then retry libvirtd -d
+```
+
+---
+
+## 3. Seafile — seahub process
+
+**What breaks:** Seahub (Django/gunicorn) does not survive container restart cleanly.
+Containers show "Up" but sync.azcomputerguru.com returns 5xx.
+
+**Check:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile ps aux 2>&1 | grep gunicorn | grep -v grep"
+```
+Healthy: 3+ gunicorn worker processes visible
+Broken: no gunicorn output
+
+**Fix:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile bash -c 'cd /opt/seafile/seafile-pro-server-12.0.19 && ./seahub.sh start 2>&1'"
+```
+
+**Verify:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec seafile curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/"
+# Should return 302
+```
+
+---
+
+## 4. NPM — iptables port 443 rule
+
+**What breaks:** The iptables PREROUTING rule that routes :443 → NPM container is added at boot
+via `/boot/config/go` on Jupiter. If that rule is missing (e.g. first boot after it was added),
+sync.azcomputerguru.com HTTPS will fail even though NPM is running.
+
+**Check:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -L PREROUTING -n | grep 'dpt:443'"
+```
+Healthy: `DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443`
+
+**Fix (if missing):**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443"
+```
+
+---
+
+## 5. NPM — nginx health
+
+**What breaks:** NPM's nginx may not be serving after a container restart.
+
+**Check:**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -t 2>&1"
+```
+
+**Fix (reload nginx config):**
+```powershell
+& "C:\Windows\System32\OpenSSH\ssh.exe" root@172.16.3.20 "docker exec npm nginx -s reload"
+```
+
+---
+
+## 6. End-to-End Verification
+
+Run all of these. Any `[FAIL]` line is a problem.
+
+```powershell
+# Core network
+$checks = @(
+    @{host="172.16.3.20"; port=22;   label="Jupiter SSH"},
+    @{host="172.16.3.20"; port=3000; label="Gitea"},
+    @{host="172.16.3.30"; port=22;   label="GuruRMM VM SSH"},
+    @{host="172.16.3.30"; port=3001; label="GuruRMM server"},
+    @{host="172.16.3.30"; port=8001; label="Coord API"},
+    @{host="172.16.3.20"; port=443;  label="NPM HTTPS (via iptables)"},
+    @{host="172.16.3.20"; port=8082; label="Seafile direct"}
+)
+foreach ($c in $checks) {
+    $r = Test-NetConnection -ComputerName $c.host -Port $c.port -WarningAction SilentlyContinue
+    $status = if ($r.TcpTestSucceeded) { "[OK]" } else { "[FAIL]" }
+    Write-Host "$status $($c.label) ($($c.host):$($c.port))"
+}
+
+# DNS
+Clear-DnsClientCache
+$dns = Resolve-DnsName sync.azcomputerguru.com -ErrorAction SilentlyContinue
+$dnsOk = $dns.IPAddress -eq "172.16.3.20"
+Write-Host "$(if ($dnsOk) {'[OK]'} else {'[FAIL]'}) DNS sync.azcomputerguru.com -> $($dns.IPAddress) (want 172.16.3.20)"
+
+# HTTPS end-to-end
+$resp = try { Invoke-WebRequest -Uri "https://sync.azcomputerguru.com/" -UseBasicParsing -ErrorAction Stop } catch { $null }
+Write-Host "$(if ($resp.StatusCode -eq 200) {'[OK]'} else {'[FAIL]'}) sync.azcomputerguru.com HTTPS -> $($resp.StatusCode)"
+```
+
+---
+
+## Infrastructure Reference
+
+| Host | IP | Role |
+|------|----|------|
+| pfSense | 172.16.0.1 (SSH port 2248) | Router, DNS, Tailscale subnet router |
+| Jupiter | 172.16.3.20 | Unraid NAS — hosts all VMs + Docker |
+| Uranus | 172.16.3.21 | OwnCloud additional storage (not a proxy) |
+| GuruRMM VM | 172.16.3.30 | Linux VM on Jupiter — GuruRMM server, Coord API, MariaDB |
+| Pluto | 172.16.3.36 | Windows Server 2019 VM on Jupiter — build server |
+| Tailscale range | 172.16.0.0/22 | Advertised via pfSense pfsense-2 node |
+
+**Docker containers on Jupiter (172.16.3.20):**
+| Container | Purpose | Key ports |
+|-----------|---------|-----------|
+| npm | Nginx Proxy Manager | 1880 (HTTP), 7818 (admin), 18443 (HTTPS) |
+| seafile | Seafile web/app | 8082 (HTTP) |
+| seafile-mysql | Seafile DB | internal |
+| seafile-elasticsearch | Seafile search | internal |
+| seafile-memcached | Seafile cache | internal |
+
+**NPM proxy hosts:**
+| Domain | Backend |
+|--------|---------|
+| sync.azcomputerguru.com | 172.16.3.20:8082 (Seafile) |
+| rmm.azcomputerguru.com | 172.16.3.30:3001 (GuruRMM) |
+| rmm-api.azcomputerguru.com | 172.16.3.30:3001 |
+| git.azcomputerguru.com | 172.16.3.20:3000 (Gitea) |
+| unifi.azcomputerguru.com | (Unifi VM) |
+| emby.azcomputerguru.com | (Emby) |
+
+---
+
+## Known Post-Power-Failure Issue Pattern
+
+Unraid's VM plugin (`dynamix.vm.manager`) should auto-mount `libvirt.img` at boot.
+When it doesn't, the root cause is usually that the Unraid array came up before emhttp
+finished initializing, or the go script ran before the array was fully mounted.
+
+**Permanent fix (TODO):** Add a user script via Unraid's User Scripts plugin that runs at
+array start and checks/mounts libvirt.img if not already mounted. This would eliminate
+the manual step 2c above.
+
+---
+
+## 2026-05-17 Post-Mortem
+
+**Root cause:** Power flicker at the office. UPS batteries were disconnected during a rack
+reorganization move, so units had no backup capacity and shut down on the flicker instead
+of riding through it.
+
+**Resolution:** Mike reconnected batteries and restarted UPS units.
+
+**Auto-recovery:** Jupiter (172.16.3.20) and Uranus (172.16.3.21) started automatically.
+
+**Manual intervention required:** IX server (neptune/exchange host) did NOT auto-restart —
+required a physical button press at the rack. Note for future: verify whether this is always
+the case or was a one-off (BIOS power-on-after-failure setting may need adjustment).
+
+**Remote fixes applied:** All steps 1–5 above were needed. Total recovery time ~1 hour.
+
+---
+
+*Last updated: 2026-05-17 — documented after power failure recovery*
+*Checked by: Mike Swanson*
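
The permanent fix that the runbook above leaves as a TODO (a User Scripts job that mounts `libvirt.img` at array start when it is not already mounted) might look roughly like the sketch below. It is not part of this commit and has not been run on Jupiter. The image path and mount point come from step 2c; `MOUNTS_FILE`, `is_mounted`, and `ensure_libvirt_mounted` are hypothetical names introduced here, and the script assumes `losetup`, `mount`, and `awk` behave on Unraid as they do on stock Linux.

```shell
#!/bin/bash
# Sketch only: mount libvirt.img on /etc/libvirt if nothing is mounted there yet.
# Paths are the ones used in step 2c of the runbook; everything else is illustrative.
IMG="${IMG:-/mnt/user/system/libvirt/libvirt.img}"
MNT="${MNT:-/etc/libvirt}"

# Succeeds when the mount table lists $1 as a mount point.
# MOUNTS_FILE is a hypothetical override so the check can be tried against a sample file.
is_mounted() {
    awk -v m="$1" '$2 == m { found = 1 } END { exit !found }' \
        "${MOUNTS_FILE:-/proc/mounts}" 2>/dev/null
}

ensure_libvirt_mounted() {
    if is_mounted "$MNT"; then
        echo "already mounted: $MNT"
        return 0
    fi
    if [ ! -f "$IMG" ]; then
        echo "image not found: $IMG" >&2
        return 1
    fi
    # losetup -f --show attaches the image to the first free loop device and prints its name
    loopdev="$(losetup -f --show "$IMG")" || return 1
    mount "$loopdev" "$MNT" || return 1
    echo "mounted $loopdev on $MNT"
}

ensure_libvirt_mounted || echo "libvirt.img not mounted; use the manual steps in 2c" >&2
```

The sketch covers only the mount. Starting libvirtd and checking the VM list would still follow step 2c.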
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -1,6 +1,8 @@
 # Memory Index

 ## Reference
+- [ACG Office Network Infrastructure](infra_office_network.md) — IPs, hosts, roles for pfSense/Jupiter/VMs/Docker. Use before assuming what's where; .21 (Uranus) is storage, not a proxy.
+- [Power Failure Runbook](../POWER_FAILURE_RUNBOOK.md) — Step-by-step recovery: Tailscale routes, libvirt/VMs, Seafile, NPM/DNS. Run in order after any power event.
 - [Syncro API — Invoice Verification Pattern](syncro_invoice_verification_pattern.md) - **CRITICAL:** List endpoint (/invoices?customer_id=X) does NOT return ticket linkage. Must query individual invoices (/invoices/{number}) to get ticket_id field. Invoice numbers are strings. Use ticket ID (not number) for comparison. Real case: falsely reported 31 tickets had no invoices (actually 29 had invoices, 2 were Non-Billable).
 - [Approval Workflow: Tools vs Projects](approval-workflow-tools-vs-projects.md) - General tools (remediation-tool, onboard scripts, MSP utilities): Howard can modify OR Claude can execute with Howard/Mike approval. Projects (GuruRMM, etc.): require Mike approval, features→roadmap, bugs→bug list.
 - [Community Forum (Flarum)](reference_community_forum.md) - Flarum forum at community.azcomputerguru.com, API access, database, posting workflow
--- a/.claude/memory/infra_office_network.md
+++ b/.claude/memory/infra_office_network.md
@@ -0,0 +1,30 @@
+---
+name: infra-office-network
+description: ACG office LAN infrastructure — IPs, hosts, roles, and post-power-failure recovery
+metadata:
+  type: project
+---
+
+ACG office LAN is 172.16.0.0/22, routed via Tailscale through pfSense node `pfsense-2` (100.119.153.74).
+
+**Key hosts:**
+| Host | IP | SSH | Role |
+|------|----|-----|------|
+| pfSense | 172.16.0.1 | port 2248, user admin | Router, DNS (Unbound), Tailscale subnet router |
+| Jupiter | 172.16.3.20 | port 22, user root | Unraid NAS — all VMs + Docker containers |
+| Uranus | 172.16.3.21 | (no key) | OwnCloud additional storage only — NOT a proxy |
+| GuruRMM VM | 172.16.3.30 | port 22, user guru | Linux VM on Jupiter — GuruRMM, Coord API, MariaDB, Gitea |
+| Pluto | 172.16.3.36 | (Windows) | Windows Server 2019 VM on Jupiter — MSI build server |
+
+**How to apply:** check these IPs before assuming what's where. .21 is NOT the Seafile proxy; NPM on .20 is.
+
+**Docker on Jupiter (.20):**
+- `npm` — Nginx Proxy Manager (ports 1880/7818/18443)
+- `seafile` + `seafile-mysql` + `seafile-elasticsearch` + `seafile-memcached` — Seafile stack (port 8082)
+- `gitea` — port 3000 (reachable at 172.16.3.20:3000, or via SSH port forward from the GuruRMM VM at .30:3000)
+
+**NPM → 443 routing:** iptables PREROUTING on Jupiter: `dpt:443 → 172.17.0.2:443` (NPM container). Persisted in `/boot/config/go`. DNS `sync.azcomputerguru.com` → 172.16.3.20.
+
+**VMs on Jupiter (virsh):** GuruRMM, Unifi, OwnCloud, Claude-Builder (running); Windows 7, Windows Server 2016, Windows Server 2016_Template (shut off).
+
+**How to apply:** see [[power-failure-runbook]] for full post-outage recovery steps.
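
The manual fix in step 4 of the runbook uses `iptables -I`, which adds a second copy of the rule if it is run while the rule already exists. A hedged alternative is to test for the rule first with `iptables -C`. The sketch below assumes it runs as root on Jupiter and reuses the DNAT target recorded above. `NPM_TARGET` and `ensure_dnat_443` are hypothetical names, and the actual contents of `/boot/config/go` are not shown in this commit.

```shell
#!/bin/bash
# Sketch only: add the :443 -> NPM DNAT rule unless an identical rule is already present.
NPM_TARGET="${NPM_TARGET:-172.17.0.2:443}"   # NPM container address from the runbook

ensure_dnat_443() {
    # iptables -C exits 0 when a rule matching this exact specification exists
    if iptables -t nat -C PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$NPM_TARGET" 2>/dev/null; then
        echo "rule present"
    else
        iptables -t nat -I PREROUTING -p tcp --dport 443 \
            -j DNAT --to-destination "$NPM_TARGET" && echo "rule added"
    fi
}
```

Calling `ensure_dnat_443` in place of the bare `iptables -I` command would make the step safe to repeat during a recovery.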