sync: auto-sync from HOWARD-HOME at 2026-06-17 22:46:27

Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-17 22:46:27
2026-06-17 22:46:37 -07:00
parent dc4560cf27
commit f36fb97eb8
4 changed files with 136 additions and 0 deletions
--- a/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md
+++ b/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md
@@ -0,0 +1,105 @@
+# Incident Report — Cascades of Tucson Site-Wide Network Outage (Power Outage)
+
+- **Date:** 2026-06-17
+- **Client:** Cascades of Tucson
+- **Severity:** High — total network outage (all APs, switches, clients) at a 24/7 HIPAA assisted-living facility
+- **Status:** RESOLVED — full network back online
+- **Logged by:** Howard Enos (HOWARD-HOME). Mike Swanson worked the on-box/console recovery.
+- **Purpose:** Basis for a power-outage ticket + corrective/preventive actions.
+
+## Summary
+
+A building power outage took the entire Cascades network down. The pfSense firewall lost power
+**uncleanly** because it was plugged into the **surge-only side of the UPS, not the battery-backed
+side**, so it hard-powered-off instead of riding through / shutting down gracefully. The unclean
+shutdown led to a messy recovery: a **duplicate `dhcpd` process** on pfSense and **broken Layer-2
+forwarding on a 2nd-floor switch** that prevented APs/clients from completing DHCP. Recovery was
+further delayed because **the Cox modem was not rebooted after the pfSense config restore**.
+Restoring service required: re-seating pfSense onto the battery side (done), a clean pfSense config
+restore from on-box auto-backups, eliminating the duplicate dhcpd, **resetting + re-adopting the
+2nd-floor #2 switch** (which brought floors 2/3/4 back), and the Cox modem reboot.
+
+## Impact
+
+- **All 77 APs + all 12 switches showed disconnected; 0 clients; no DHCP** for the duration.
+- Resident/staff WiFi, the Poly VoIP phones, and wired devices were down site-wide.
+- Duration: from power loss through full restoration (afternoon/evening 2026-06-17).
+
+## Root Causes
+
+1. **Primary trigger — building power outage.**
+2. **pfSense had NO battery protection** — the firewall was connected to the **surge-only outlets
+   of the battery backup, not the battery-backed outlets**, so it lost power instantly and shut
+   down uncleanly (the rest of the protected gear behaved differently). **RECTIFIED** — pfSense
+   moved to the battery-backed side.
+3. **Unclean shutdown → degraded recovery on pfSense:** a **duplicate `dhcpd` instance** came up.
+   Two dhcpd processes fought over the same scopes, so clients got `DISCOVER → OFFER` but the
+   handshake never completed (`REQUEST/ACK` failed) → no IPs.
+4. **2nd-floor switch L2 forwarding broke on re-adoption:** the switch came back "online" to the
+   controller (management/uplink OK) but **forwarded AP DHCP requests UP to pfSense while not
+   delivering pfSense's replies back DOWN to the APs** — an asymmetric/one-way forwarding failure
+   (stale forwarding state / port-profile not cleanly applied after the dirty power-up).
+5. **Cox modem not rebooted after the pfSense restore:** after the pfSense config was restored, the
+   upstream Cox modem was not power-cycled, so the WAN did not fully re-establish, which prolonged
+   the outage after the restore. (Modems commonly bind to the firewall's WAN MAC/DHCP state and
+   need a reboot after a firewall restore/replace.)
+
+## Detection & Diagnosis (key findings)
+
+- Controller (independent of the remote VPN) showed **77/77 APs + 12/12 switches disconnected,
+  0 clients** — confirming a real, common-path outage, not a monitoring artifact.
+- pfSense answered ICMP (`192.168.0.1` up) but its **SSH/management hung** initially and the GUI
+  briefly returned a **50x** (transient, services still starting after the unclean power-up).
+- **ZFS pool healthy** (`all pools are healthy`) and **`config.xml` intact** (684,842 bytes, valid)
+  — no filesystem/config corruption; ZFS's copy-on-write protected it.
+- Reading logs **directly** (pfSense 25.07 uses plain-text logs, not `clog`): DHCP server itself was
+  healthy (leases writing; clients that reached REQUEST got ACKed) — but APs/native-LAN clients were
+  stuck at OFFER. DORA breakdown for the 16 second-floor APs: **404 DISCOVER → 404 OFFER → 0 REQUEST
+  → 0 ACK**, offers correctly on `igc1` / `192.168.2-3.x` — proving pfSense was offering correctly
+  and the **replies were not reaching the APs** (downstream switch forwarding).
+- Found **2 `dhcpd` processes** running (should be 1); the duplicate accounts for the DHCP
+  handshakes that failed to complete.
+
+## Resolution (what fixed it)
+
+1. **pfSense re-seated onto the battery-backed UPS outlets** (corrective; prevents recurrence of
+   root cause #2).
+2. **pfSense config restored** from the on-box auto-backup history (`/cf/conf/backup/config-*.xml`);
+   ZFS/config were intact so this was clean.
+3. **Duplicate `dhcpd` eliminated** → single clean instance via `services_dhcpd_configure()`; DHCP
+   began completing (switches + Poly phones started getting IPs).
+4. **Reset + re-adopted the 2nd-floor #2 switch** (`Switch 2nd Floor #2`, USL24PB, 192.168.2.193).
+   This cleared the broken forwarding state; **floors 3 and 4 came up immediately afterward**
+   (they were downstream of it), and the site returned to fully online.
+5. **Cox modem rebooted** after the restore to re-establish the WAN.
+
+## Corrective Actions
+
+- [DONE] pfSense moved from the **surge-only** side to the **battery-backed** side of the UPS.
+
+## Preventive Actions / Follow-ups (for the ticket)
+
+1. **Enable Netgate AutoConfigBackup on pfSense** — there was **NO off-box config backup** at
+   incident time, which turned a blip into a scramble. ACB gives automatic encrypted cloud backups
+   on every change. (An off-box copy of the current + pre-outage config was captured locally during
+   the incident; it should also be vaulted.)
+2. **Verify full UPS coverage + runtime + clean shutdown:** the **NUT** package is configured on
+   pfSense, but the box still went down hard (it was on surge-only). Confirm that pfSense AND the
+   **core/aggregation switch + the AP-PoE switches** are all on **battery-backed** outlets with
+   adequate runtime, and test that NUT triggers a graceful pfSense shutdown on power loss.
+3. **Post-restore runbook step:** after any pfSense config restore/replacement, **reboot the Cox
+   modem** to re-sync the WAN (this was the missed step that prolonged recovery).
+4. **Vault the pfSense config** off-box (immediate safety net beyond ACB).
+5. **Optional monitoring:** alert on DHCP-not-completing / duplicate-dhcpd / mass device-disconnect
+   so a future event is caught in minutes.
+
+## Reference Information
+
+- **pfSense:** `192.168.0.1`, Plus 25.07-RELEASE, ZFS. WAN: Cox (`igc0`, 184.191.143.x). LAN: `igc1`
+  192.168.0.0/22 + per-room /28 VLANs (HIPAA isolation) + VLAN20 (`10.0.20.0/24`) + VLAN30 voice.
+  Admin credentials are in the vault at `clients/cascades-tucson/pfsense-firewall`. Access via
+  `unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson`.
+- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) `192.168.2.193`. Also
+  present: `Switch 2nd Floor` (US24PRO) `192.168.2.4`.
+- **UniFi controller:** UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`.
+- **Fleet:** 77 APs, 12 switches, ~573–788 clients. 22 Poly WiFi phones (OUI `48:25:67`, VLAN 20).
+- **Key technical notes:** pfSense 25.07 logs are **plain text** (use `tail`/`grep`, NOT `clog`).
+  Clean DHCP restart on this box = `services_dhcpd_configure()` via `pfSsh.php`.
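
Note on the report above: the DORA breakdown and the duplicate-`dhcpd` finding could be re-checked with a short script against the plain-text logs. The following is a minimal Python sketch, not the method used during the incident. It assumes ISC `dhcpd`-style message text (`DHCPDISCOVER`, `DHCPOFFER`, `DHCPREQUEST`, `DHCPACK`) and process names as printed by `ps -ax -o comm`. The function names, the sample log lines, the PID, and the documentation-range MAC address are all hypothetical.

```python
import re
from collections import Counter

# Message types as ISC dhcpd writes them, e.g. "DHCPDISCOVER from <mac> via igc1".
DHCP_MSG = re.compile(r"\bDHCP(DISCOVER|OFFER|REQUEST|ACK)\b")


def dora_counts(lines):
    """Count DISCOVER/OFFER/REQUEST/ACK messages in an iterable of log lines."""
    counts = Counter({"DISCOVER": 0, "OFFER": 0, "REQUEST": 0, "ACK": 0})
    for line in lines:
        match = DHCP_MSG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def stuck_at_offer(counts):
    """True when offers go out but no client ever answers with a REQUEST."""
    return counts["OFFER"] > 0 and counts["REQUEST"] == 0


def dhcpd_process_count(ps_lines):
    """Count dhcpd processes in the output lines of `ps -ax -o comm`."""
    return sum(1 for line in ps_lines if line.strip() == "dhcpd")


# Illustrative input only: hypothetical PID and documentation-range MAC address.
sample_log = [
    "dhcpd[1234]: DHCPDISCOVER from 00:00:5e:00:53:01 via igc1",
    "dhcpd[1234]: DHCPOFFER on 192.168.2.50 to 00:00:5e:00:53:01 via igc1",
]
counts = dora_counts(sample_log)
print(dict(counts), stuck_at_offer(counts))
```

Feeding the functions the real log file and `ps` output (for example, lines read from a file or from `subprocess.run`) would give the same kind of summary as the report: a nonzero OFFER count with zero REQUESTs points at replies not reaching clients, and a process count above 1 points at the duplicate `dhcpd`.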