diff --git a/clients/cascades-tucson/session-logs/2026-06/2026-06-17-howard-cascades-power-outage-recovery-and-5ghz.md b/clients/cascades-tucson/session-logs/2026-06/2026-06-17-howard-cascades-power-outage-recovery-and-5ghz.md
new file mode 100644
index 00000000..7444c99f
--- /dev/null
+++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-17-howard-cascades-power-outage-recovery-and-5ghz.md
@@ -0,0 +1,149 @@
+# Cascades: 5GHz overlap analysis + Option-B plan, then POWER-OUTAGE incident recovery + config vaulting
+
+- **Date:** 2026-06-17
+- **Machine:** Howard-Home
+- **Client:** Cascades of Tucson
+- **Continuation of:** `2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md` (same session)
+- **Incident report (ticket basis):** `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`
+
+## User
+- **User:** Howard Enos (howard)
+- **Machine:** Howard-Home
+- **Role:** tech
+
+## Session Summary
+
+Continued the Cascades whole-network-smoothing work from the phone-drops investigation. Built
+the AP-to-AP 5GHz overlap picture from the cached neighbor SNR matrix (`.claude/tmp/cascades-nbr.json`,
+6/16 AP collect) crossed with live channels: real co-channel overlap was MODERATE (25 genuinely
+co-channel pairs of 173 strong neighbor pairs), and width 80->40 alone only fixed 7 of 25 (28%);
+the rest are same-channel collisions needing a channel re-plan. Howard then had UniFi auto-change
+the channels; recheck showed it made overlap WORSE (25->30 co-channel, new ch36 cluster on the
+three busiest APs salon/2nd-Floor-Atrium/CC-Bridge at 36-48 dB). A `channel-plan na` dry-run got
+co-channel to 0 but only by going non-DFS, which requires 40MHz width (at 80MHz non-DFS = 2 blocks).
+Conclusion: width + channel are coupled; the real fix is **Option B = 40MHz + non-DFS optimized
+channels + min-RSSI relax**, done as a combined per-AP radio_table PUT.
+Howard chose Option B and asked to park/save the 5GHz plan and validate the 2.4 (Low->Medium) plan separately.
+
+Mid-2.4-diagnostic, a **building POWER OUTAGE** took the entire Cascades network down (all 77 APs +
+12 switches disconnected, 0 clients). Spent the bulk of the session driving the recovery read-only
+from Howard-Home (VPN up; controller via UOS). Root cause chain: pfSense was plugged into the
+**surge-only side of the UPS, not the battery side**, so it hard-powered-off uncleanly. ZFS + config
+survived (pools healthy, config.xml valid). The dirty boot left a **duplicate dhcpd** (DISCOVER->OFFER
+but no REQUEST/ACK) and a **2nd-floor switch with one-way L2 forwarding** (APs' DISCOVERs reached
+pfSense, pfSense offered correctly on igc1/192.168.2-3.x, but OFFERs never returned -> 404 DISCOVER /
+404 OFFER / 0 REQUEST / 0 ACK across the 16 second-floor APs). Confirmed pfSense was healthy (full
+health check: services up, WAN up, no interface errors, no storm, state table fine). Killed the
+duplicate dhcpd and restarted a single clean instance via `services_dhcpd_configure()`. Mike worked
+the on-box/console side: re-seated pfSense onto the battery outlets (rectified), restored config from
+the on-box auto-backup (the 12:20 version, WITH today's VLAN30 work), **reset+re-adopted the 2nd-floor
+#2 switch** (which brought floors 3/4 up), and rebooted the Cox modem (the missed post-restore step).
+Network fully restored.
+
+Closed out by writing the incident report (ticket basis), capturing reusable pfSense-25.07 lessons
+to memory + an errorlog `--friction` entry (clog-vs-plaintext-logs), and **vaulting the pfSense
+config** (verified VLAN30 present first). While vaulting, `vault check` caught a PRE-EXISTING
+plaintext credential (`synology-signin-portal.sops.yaml`) committed plaintext in vault history
+(commit 1fbc0e1); encrypted it go-forward and flagged the credential as exposed (needs rotation).
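The width/channel coupling described in the summary can be illustrated with a minimal Python sketch. It assumes each radio is described by a primary channel plus a width, and it treats two strong neighbors as co-channel when the 20MHz slots they occupy intersect. The AP names, channel assignments, and the `occupied_channels` / `co_channel_pairs` helpers are hypothetical; this is not the session's tooling or data.

```python
# Hypothetical sketch: not the session's tooling; AP names and channels are made up.
BAND_STARTS = (149, 100, 36)  # first channel of each contiguous 5GHz run (US numbering)


def occupied_channels(primary, width_mhz):
    """Return the set of 20MHz channel numbers a bonded channel covers."""
    start = next(b for b in BAND_STARTS if primary >= b)
    slots = width_mhz // 20                # 20 -> 1, 40 -> 2, 80 -> 4
    index = (primary - start) // 4         # position of the primary within its run
    first = start + 4 * slots * (index // slots)
    return {first + 4 * i for i in range(slots)}


def co_channel_pairs(pairs, radios):
    """Keep the neighbor pairs whose occupied 20MHz slots intersect."""
    return [
        (a, b)
        for a, b in pairs
        if occupied_channels(*radios[a]) & occupied_channels(*radios[b])
    ]


# Toy data: three strong neighbor pairs, every radio at 80MHz.
radios_80 = {"ap-a": (36, 80), "ap-b": (44, 80), "ap-c": (36, 80), "ap-d": (149, 80)}
pairs = [("ap-a", "ap-b"), ("ap-a", "ap-c"), ("ap-a", "ap-d")]

# Same primaries, narrowed to 40MHz.
radios_40 = {ap: (ch, 40) for ap, (ch, _) in radios_80.items()}

overlap_80 = co_channel_pairs(pairs, radios_80)
overlap_40 = co_channel_pairs(pairs, radios_40)
print(len(overlap_80), len(overlap_40))  # prints "2 1"
```

In the toy data, narrowing separates `ap-a` and `ap-b` because they share an 80MHz block but sit in different 40MHz halves, while `ap-a` and `ap-c` share a primary channel and stay co-channel until one of them moves. The sketch ignores channel 165 bonding limits and regulatory-domain differences.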
+
+## Key Decisions
+
+- **5GHz: go with Option B (40MHz + non-DFS optimized channels + min-RSSI relax)** as one combined
+  per-AP radio_table PUT, because width and channel are coupled: width alone is weak (7/25 pairs)
+  and a clean non-DFS channel plan requires narrowing to 40MHz. Parked pending execution (no change applied).
+- **Measured before acting on 5GHz**: the neighbor-matrix data overrode the theory (overlap moderate,
+  width change weak, auto-channel made it worse), exactly what Howard asked for.
+- **Recovery driven READ-ONLY** while Mike owned the on-box restore; the only pfSense change made was
+  killing the duplicate dhcpd + clean restart (the DHCP fix Mike was waiting on).
+- **Did NOT blind-reboot pfSense** (it was the up edge; problem was downstream). Diagnosed the 2nd-floor
+  switch as the L2 culprit; Mike's reset+re-adopt of switch #2 confirmed it (floors 3/4 followed).
+- **Vaulted the post-restore LIVE config** (not the incident-time copy), after verifying VLAN30 present.
+- **Encrypted the pre-existing plaintext synology entry** rather than leaving it, since vault `check` is the
+  mandated pre-commit gate; flagged the historical exposure for rotation.
+
+## Problems Encountered
+
+- **POWER OUTAGE** -> total site outage. Root cause: pfSense on the UPS surge-only side (no battery) ->
+  unclean shutdown. RECTIFIED (moved to battery side by Mike).
+- **Duplicate dhcpd** after dirty boot -> DHCP offered but never completed. Fixed: `killall dhcpd` +
+  `echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php` -> single clean instance; ACKs resumed.
+- **2nd-floor switch one-way L2 forwarding** (online to controller, but not passing OFFERs down to APs).
+  Fixed by Mike: reset + re-adopt of `Switch 2nd Floor #2` (USL24PB) -> floors 3/4 recovered.
+- **Cox modem not rebooted after the pfSense restore** -> WAN didn't fully re-establish; prolonged issues.
+  Fixed by rebooting the modem. (Now a runbook step.)
+- **clog on pfSense 25.07 returned empty** -> I wrongly concluded "DHCP log empty / dhcpd not serving."
+  25.07 logs are PLAIN TEXT; read with tail/grep. Cost a hypothesis. Logged `--friction`.
+- **`pfSsh.php` is slow to load (~20-40s)** -> several restart commands timed out mid-run (timeout 60s
+  insufficient). Use a longer timeout (well above 60s) and verify after.
+- **sops encrypt failed from outside the vault dir** ("no creation rules") leaving a 730KB PLAINTEXT
+  config on disk briefly. Fixed by replicating vault-helper's pattern: `( cd "$VAULT_DIR" && sops --encrypt --in-place )`.
+- **Pre-existing plaintext vault entry** (`synology-signin-portal.sops.yaml`) found by `vault check`;
+  encrypted it. It was committed plaintext at vault commit 1fbc0e1 -> credential in git history (exposed).
+
+## Configuration Changes
+
+- **pfSense (192.168.0.1):** killed duplicate dhcpd + clean restart via `services_dhcpd_configure()`
+  (operational, no config.xml change). All other pfSense access was read-only.
+- **pfSense hardware (Mike):** moved from UPS surge-only outlets to battery-backed outlets.
+- **pfSense config (Mike):** restored from on-box auto-backup (12:20 version, VLAN30 intact). Cox modem rebooted.
+- **UniFi (Mike):** reset + re-adopted `Switch 2nd Floor #2` (USL24PB, 192.168.2.193).
+- **UniFi channels (Howard, earlier):** auto-channel reassignment on 5GHz (made co-channel overlap worse; to be re-planned via Option B).
+- **Repo:** created `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`; created
+  memory `.claude/memory/reference_pfsense_25_07_ops.md` + MEMORY.md index line; errorlog `--friction` entry.
+- **Vault:** created `clients/cascades-tucson/pfsense-config-backup-2026-06-17.sops.yaml` (encrypted, VLAN30 verified);
+  encrypted the pre-existing-plaintext `clients/cascades-tucson/synology-signin-portal.sops.yaml`.
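The "timed out mid-run, so verify after" lesson from the `pfSsh.php` entry can be sketched as follows. This is a minimal Python illustration under stated assumptions: `run_then_verify` and `single_instance` are hypothetical helpers, the slow command is a local stand-in for the remote restart, and the PID text is invented. In the real case the verification would be the `pgrep -f "dhcpd -user" | wc -l` check run over SSH.

```python
import subprocess
import sys


def run_then_verify(cmd, verify, timeout_s):
    """Run cmd with a timeout, then report verify() regardless of how cmd ended."""
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True  # the remote side may still have completed the work
    return timed_out, verify()


def single_instance(pgrep_output):
    """True when the pgrep output lists exactly one PID."""
    return len(pgrep_output.split()) == 1


# Stand-in for a slow remote call: a child process that outlives the timeout.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]

# "4242" is an invented PID standing in for real pgrep output.
timed_out, ok = run_then_verify(slow_cmd, lambda: single_instance("4242\n"), timeout_s=0.5)
print(timed_out, ok)  # prints "True True"
```

The point of the design is that a client-side timeout says nothing about whether the restart finished, so the result that matters is the independent check, not the exit status of the command that timed out.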
+
+## Credentials & Secrets
+
+- **pfSense config backup vaulted:** `clients/cascades-tucson/pfsense-config-backup-2026-06-17` (kind note,
+  config under encrypted `content:` field; decrypts cleanly; VLAN30 present). Vault commit 3735119.
+- **[EXPOSED - ROTATE] Cascades Synology Cloud Signin Portal** (`account.synology.com`): the entry
+  `clients/cascades-tucson/synology-signin-portal.sops.yaml` was committed **plaintext** at vault commit
+  **1fbc0e1** and remains recoverable from vault git history. Now encrypted go-forward, but the credential
+  should be considered exposed and rotated. Same commit also added an "MDM service account" + "WiFi CSCNet";
+  verify those were never plaintext in history.
+- pfSense admin SSH: vault `clients/cascades-tucson/pfsense-firewall` (unchanged).
+
+## Infrastructure & Servers
+
+- **Cascades pfSense** `192.168.0.1`, Plus 25.07-RELEASE, **ZFS** (power-loss resilient; pools healthy).
+  WAN igc0 = Cox (184.191.143.x). LAN igc1 = 192.168.0.0/22 + per-room /28 VLANs + VLAN20 (10.0.20.0/24) +
+  **VLAN30 Voice opt241/igc1.30 (10.0.30.0/24, DHCP .100-.250): VERIFIED INTACT post-restore, status active.**
+  Logs PLAIN TEXT (not clog). Clean dhcpd restart = `services_dhcpd_configure()` via pfSsh.php.
+- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) 192.168.2.193. Also `Switch 2nd Floor`
+  (US24PRO) 192.168.2.4.
+- **UniFi UOS controller** 172.16.3.29, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`. 77 APs, 12 switches.
+- **5GHz overlap data:** neighbor SNR matrix `.claude/tmp/cascades-nbr.json` (6/16); live channels via
+  Network API (live-stats; read cred only is `-rw`, RW cred `infrastructure/uos-server-network-api-rw` vaulted).
+- **16 second-floor APs (DHCP-stuck during incident):** names 203/204/206/209/210/217/221/222/229/236/237/240/241/247/248 + 2nd Floor Atrium; UniFi OUI 0c:ea:14.
+
+## Commands & Outputs
+
+- DORA breakdown (2nd-floor APs, the L2 proof): `404 DISCOVER -> 404 OFFER -> 0 REQUEST -> 0 ACK`, offers via igc1 on 192.168.2-3.x.
+- DHCP fix: `killall dhcpd; echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php`; verify `pgrep -f "dhcpd -user" | wc -l` == 1.
+- Read pfSense logs DIRECTLY (NOT clog): `tail/grep /var/log/dhcpd.log`. `file /var/log/*.log` -> ASCII text.
+- ZFS health: `zpool status -x` -> "all pools are healthy".
+- VLAN30 verify: `grep "10.0.30.1" config.xml` -> opt241 "Voice" igc1.30; DHCP from 10.0.30.100 to .250; `ifconfig igc1.30` inet 10.0.30.1 status active.
+- Vault file encrypt pattern: `( cd "$VAULT_DIR" && sops --encrypt --in-place )`.
+- 5GHz overlap: 173 strong neighbor pairs; 25 co-channel @80MHz (auto-change pushed to 30); width 80->40 fixes only 7; channel-plan na -> 0 (non-DFS, needs 40MHz).
+
+## Pending / Incomplete Tasks
+
+- **Execute Option B (5GHz)** when ready: combined per-AP radio_table PUT = ng power medium (42 low radios) +
+  na ht 40 (76) + na non-DFS optimized channels (channel-plan na) + na min_rssi -82 (69). Per-zone, evening window. Dry-run staged; nothing applied.
+- **2.4 Low->Medium bump** still pending validation/execution (the diagnostic was interrupted by the outage).
+- **Incident follow-ups (for the ticket):** enable Netgate AutoConfigBackup (no off-box backup existed);
+  verify full UPS coverage/runtime/clean-shutdown (NUT installed but box was on surge-only; confirm core +
+  PoE switches are battery-backed); add "reboot Cox modem after pfSense restore" to the runbook.
+- **[SECURITY] Rotate the Cascades Synology signin credential** (exposed in vault history, commit 1fbc0e1);
+  verify the MDM service account + CSCNet entries from that commit were never plaintext.
+- Standing rule: no Cascades prod-infra changes without discussing + explicit per-change go (memory feedback_cascades #4).
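The DORA breakdown listed under Commands & Outputs can be tallied from plain-text dhcpd log lines with a sketch like the one below. It relies on the message-type tokens ISC dhcpd writes (`DHCPDISCOVER`, `DHCPOFFER`, `DHCPREQUEST`, `DHCPACK`); the sample lines, MAC suffixes, and offered addresses are invented for illustration, and `dora_counts` / `stalled_after_offer` are hypothetical helper names.

```python
import re
from collections import Counter

# Message-type tokens as ISC dhcpd logs them.
MESSAGE = re.compile(r"\bDHCP(DISCOVER|OFFER|REQUEST|ACK)\b")


def dora_counts(lines):
    """Count DISCOVER/OFFER/REQUEST/ACK messages in an iterable of log lines."""
    counts = Counter({"DISCOVER": 0, "OFFER": 0, "REQUEST": 0, "ACK": 0})
    for line in lines:
        match = MESSAGE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


def stalled_after_offer(counts):
    """True when offers went out but no client ever answered with a REQUEST."""
    return counts["OFFER"] > 0 and counts["REQUEST"] == 0 and counts["ACK"] == 0


# Invented sample lines; only the OUI 0c:ea:14 comes from the log above.
sample = [
    "dhcpd: DHCPDISCOVER from 0c:ea:14:00:00:01 via igc1",
    "dhcpd: DHCPOFFER on 192.168.2.50 to 0c:ea:14:00:00:01 via igc1",
    "dhcpd: DHCPDISCOVER from 0c:ea:14:00:00:02 via igc1",
    "dhcpd: DHCPOFFER on 192.168.2.51 to 0c:ea:14:00:00:02 via igc1",
]

counts = dora_counts(sample)
print(counts, stalled_after_offer(counts))
```

Equal DISCOVER and OFFER counts with zero REQUESTs is the same shape as the 404/404/0/0 result: the server answers every client, so the fault sits on the return path rather than in dhcpd.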
+
+## Reference Information
+
+- Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`
+- Memory: `reference_pfsense_25_07_ops.md` (new), `feedback_cascades.md` #4 (no-change-without-discussing).
+- Vault: `clients/cascades-tucson/pfsense-config-backup-2026-06-17` (new, commit 3735119);
+  `clients/cascades-tucson/synology-signin-portal` (encrypted; exposed in history commit 1fbc0e1).
+- pfSense access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run ""`.
+- Prior session log (phone drops + smoothing plan): `2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md`.