sync: auto-sync from HOWARD-HOME at 2026-06-17 22:46:27

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-17 22:46:27
This commit is contained in:
2026-06-17 22:46:37 -07:00
parent dc4560cf27
commit f36fb97eb8
4 changed files with 136 additions and 0 deletions

View File

@@ -0,0 +1,105 @@
# Incident Report — Cascades of Tucson Site-Wide Network Outage (Power Outage)
- **Date:** 2026-06-17
- **Client:** Cascades of Tucson
- **Severity:** High — total network outage (all APs, switches, clients) at a 24/7 HIPAA assisted-living facility
- **Status:** RESOLVED — full network back online
- **Logged by:** Howard Enos (Howard-Home). Mike Swanson worked the on-box/console recovery.
- **Purpose:** Basis for a power-outage ticket + corrective/preventive actions.
## Summary
A building power outage took the entire Cascades network down. The pfSense firewall lost power
**uncleanly** because it was plugged into the **surge-only side of the UPS, not the battery-backed
side**, so it hard-powered-off instead of riding through / shutting down gracefully. The unclean
shutdown led to a messy recovery: a **duplicate `dhcpd` process** on pfSense and **broken Layer-2
forwarding on a 2nd-floor switch** that prevented APs/clients from completing DHCP. Recovery was
further delayed because **the Cox modem was not rebooted after the pfSense config restore**.
Restoring service required: re-seating pfSense onto the battery side (done), a clean pfSense config
restore from on-box auto-backups, eliminating the duplicate dhcpd, **resetting + re-adopting the
2nd-floor #2 switch** (which brought floors 2/3/4 back), and the Cox modem reboot.
## Impact
- **All 77 APs + all 12 switches showed disconnected; 0 clients; no DHCP** for the duration.
- Resident/staff WiFi, the Poly VoIP phones, and wired devices were down site-wide.
- Duration: power-loss through full restoration (afternoon/evening 2026-06-17).
## Root Causes
1. **Primary trigger — building power outage.**
2. **pfSense had NO battery protection** — the firewall was connected to the **surge-only outlets
of the battery backup, not the battery-backed outlets**, so it lost power instantly and shut
down uncleanly (the rest of the protected gear behaved differently). **RECTIFIED** — pfSense
moved to the battery-backed side.
3. **Unclean shutdown → degraded recovery on pfSense:** a **duplicate `dhcpd` instance** came up.
Two dhcpd processes fought over the same scopes, so clients got `DISCOVER → OFFER` but the
handshake never completed (`REQUEST/ACK` failed) → no IPs.
4. **2nd-floor switch L2 forwarding broke on re-adoption:** the switch came back "online" to the
controller (management/uplink OK) but **forwarded AP DHCP requests UP to pfSense while not
delivering pfSense's replies back DOWN to the APs** — an asymmetric/one-way forwarding failure
(stale forwarding state / port-profile not cleanly applied after the dirty power-up).
5. **Cox modem not rebooted after the pfSense restore** — after restoring the pfSense config, the
upstream Cox modem was not power-cycled, so the WAN did not fully re-establish, prolonging the
issues post-restore. (Modems commonly bind to the firewall's WAN MAC/DHCP state and need a
reboot after a firewall restore/replace.)
## Detection & Diagnosis (key findings)
- Controller (independent of the remote VPN) showed **77/77 APs + 12/12 switches disconnected,
0 clients** — confirming a real, common-path outage, not a monitoring artifact.
- pfSense answered ICMP (`192.168.0.1` up) but its **SSH/management hung** initially and the GUI
briefly returned a **50x** (transient, services still starting after the unclean power-up).
- **ZFS pool healthy** (`all pools are healthy`) and **`config.xml` intact** (684,842 bytes, valid)
— no filesystem/config corruption; ZFS's copy-on-write protected it.
- Reading logs **directly** (pfSense 25.07 uses plain-text logs, not `clog`): DHCP server itself was
healthy (leases writing; clients that reached REQUEST got ACKed) — but APs/native-LAN clients were
stuck at OFFER. DORA breakdown for the 16 second-floor APs: **404 DISCOVER → 404 OFFER → 0 REQUEST
→ 0 ACK**, offers correctly on `igc1` / `192.168.2-3.x` — proving pfSense was offering correctly
and the **replies were not reaching the APs** (downstream switch forwarding).
- Found **2 `dhcpd` processes** running (should be 1) — the duplicate explaining the broken completion.
## Resolution (what fixed it)
1. **pfSense re-seated onto the battery-backed UPS outlets** (corrective; prevents recurrence of #2).
2. **pfSense config restored** from the on-box auto-backup history (`/cf/conf/backup/config-*.xml`);
ZFS/config were intact so this was clean.
3. **Duplicate `dhcpd` eliminated** → single clean instance via `services_dhcpd_configure()`; DHCP
began completing (switches + Poly phones started getting IPs).
4. **Reset + re-adopted the 2nd-floor #2 switch** (`Switch 2nd Floor #2`, USL24PB, 192.168.2.193).
This cleared the broken forwarding state; **floors 3 and 4 came up immediately afterward**
(they were downstream of it), and the site returned to fully online.
5. **Cox modem rebooted** after the restore to re-establish the WAN.
## Corrective Actions
- [DONE] pfSense moved from the **surge-only** side to the **battery-backed** side of the UPS.
## Preventive Actions / Follow-ups (for the ticket)
1. **Enable Netgate AutoConfigBackup on pfSense** — there was **NO off-box config backup** at
incident time, which turned a blip into a scramble. ACB gives automatic encrypted cloud backups
on every change. (An off-box copy of the current + pre-outage config was captured locally during
the incident; should also be vaulted.)
2. **Verify full UPS coverage + runtime + clean shutdown:** the **NUT** package is configured on
pfSense, but the box still went down hard (it was on surge-only). Confirm that pfSense AND the
**core/aggregation switch + the AP-PoE switches** are all on **battery-backed** outlets with
adequate runtime, and test that NUT triggers a graceful pfSense shutdown on power loss.
3. **Post-restore runbook step:** after any pfSense config restore/replacement, **reboot the Cox
modem** to re-sync the WAN (this was the missed step that prolonged recovery).
4. **Vault the pfSense config** off-box (immediate safety net beyond ACB).
5. **Optional monitoring:** alert on DHCP-not-completing / duplicate-dhcpd / mass device-disconnect
so a future event is caught in minutes.
## Reference Information
- **pfSense:** `192.168.0.1`, Plus 25.07-RELEASE, ZFS. WAN: Cox (`igc0`, 184.191.143.x). LAN: `igc1`
192.168.0.0/22 + per-room /28 VLANs (HIPAA isolation) + VLAN20 (`10.0.20.0/24`) + VLAN30 voice.
Admin cred vault `clients/cascades-tucson/pfsense-firewall`. Access via
`unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson`.
- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) `192.168.2.193`. Also
present: `Switch 2nd Floor` (US24PRO) `192.168.2.4`.
- **UniFi controller:** UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`.
- **Fleet:** 77 APs, 12 switches, ~573788 clients. 22 Poly WiFi phones (OUI `48:25:67`, VLAN 20).
- **Key technical notes:** pfSense 25.07 logs are **plain text** (use `tail`/`grep`, NOT `clog`).
Clean DHCP restart on this box = `services_dhcpd_configure()` via `pfSsh.php`.