From f36fb97eb847cbff0de663894435456821e51b35 Mon Sep 17 00:00:00 2001 From: Howard Enos Date: Wed, 17 Jun 2026 22:46:37 -0700 Subject: [PATCH] sync: auto-sync from HOWARD-HOME at 2026-06-17 22:46:27 Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-17 22:46:27 --- .claude/memory/MEMORY.md | 1 + .claude/memory/reference_pfsense_25_07_ops.md | 28 +++++ .../2026-06-17-power-outage-incident.md | 105 ++++++++++++++++++ errorlog.md | 2 + 4 files changed, 136 insertions(+) create mode 100644 .claude/memory/reference_pfsense_25_07_ops.md create mode 100644 clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md index e20c2b3b..b8a3fb4c 100644 --- a/.claude/memory/MEMORY.md +++ b/.claude/memory/MEMORY.md @@ -101,6 +101,7 @@ ### Cascades - [Cascades operational rules](feedback_cascades.md) — Active rules: (1) folder redirection (fdeploy) needs subfolders PRE-CREATED before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1. (2) ALWAYS ask which security group(s) a new user goes into — never auto-derive from OU. (3) Do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use. (4) NEVER change Cascades production infra (pfSense/UniFi/switches/DHCP) without discussing it + explicit per-change go — read-only/dry-run until then. - [Cascades FR GPO fix](reference_cascades_fr_gpo_fix.md) — Native Folder Redirection was DOA on every machine: redirect targets were in a misnamed `fdeploy1.ini` (Windows reads `fdeploy.ini`) → empty target path → silent no-op → per-user registry workaround every time. Fixed 2026-06-08 (correct fdeploy.ini + version bump). Also: CS-SERVER live RMM agent is `c39f1de7...` (old `6766e973` stale). +- [pfSense 25.07 ops quirks](reference_pfsense_25_07_ops.md) — Cascades pfSense Plus 25.07: logs are PLAIN TEXT (use tail/grep, NOT clog → clog returns empty); clean dhcpd restart = `services_dhcpd_configure()` via slow pfSsh.php (needs 50s+ timeout); dirty boot can leave 2 dhcpd → DISCOVER/OFFER but no ACK; reboot the Cox modem after a config restore; ZFS survives power loss. From the 2026-06-17 power-outage incident. - [feedback_ascii_only_api_payloads](feedback_ascii_only_api_payloads.md) -- On Windows/Git-bash, non-ASCII chars (em-dash, arrow, smart quotes) in JSON payload TEXT passed to curl get mangled and rejected — Discord bot-alert returns 400, the coord API returns "error parsing the body". Use ASCII-only in API payload text, or a single-quoted heredoc. ## Machine diff --git a/.claude/memory/reference_pfsense_25_07_ops.md b/.claude/memory/reference_pfsense_25_07_ops.md new file mode 100644 index 00000000..3e5a45b8 --- /dev/null +++ b/.claude/memory/reference_pfsense_25_07_ops.md @@ -0,0 +1,28 @@ +--- +name: reference_pfsense_25_07_ops +description: pfSense Plus 25.07 operational quirks learned during the Cascades power-outage recovery — plain-text logs (NOT clog), clean dhcpd restart via pfSsh.php, reboot the upstream modem after a config restore, ZFS power-loss resilience +metadata: + type: reference +--- + +Learned on the Cascades pfSense (`192.168.0.1`, Plus 25.07-RELEASE, ZFS) during the 2026-06-17 +power-outage recovery. Access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run ""` +(admin SSH = real shell). Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`. + +- **Logs are PLAIN TEXT (ASCII), not clog binary.** `clog /var/log/dhcpd.log` returns EMPTY on 25.07 + → do NOT conclude "logs are empty / service not logging." Read with `tail`/`grep`/`cat` directly. + (Burned a whole hypothesis on this — the DHCP server was actually fine.) `file /var/log/*.log` → ASCII text. +- **Clean single-instance DHCP restart from shell:** `echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php` + (regenerates `/var/dhcpd/etc/dhcpd.conf` + restarts ONE dhcpd; kills duplicates). A power-loss/dirty boot + can leave **two `dhcpd` processes** fighting → clients get DISCOVER→OFFER but never REQUEST/ACK. + Verify: `pgrep -f "dhcpd -user" | wc -l` should be **1**. Test config: `dhcpd -t -cf /var/dhcpd/etc/dhcpd.conf`. +- **`pfSsh.php` is SLOW to load (~20-40s).** SSH commands that invoke it need a long timeout (50s+) or they + time out mid-run and you can't tell if the action took. +- **After a pfSense config restore/replace, REBOOT the upstream modem** (Cox at Cascades) to re-sync the WAN — + skipping this prolongs post-restore issues. Add to any restore runbook. +- **ZFS root is power-loss resistant** — `zpool status -x` → "all pools are healthy"; `config.xml` survived an + unclean power-off intact. A 50x on the GUI right after a dirty boot is usually transient (services still starting). +- **DHCP "offers but never completes" on ONE segment/switch** = asymmetric L2 forwarding (DISCOVER reaches + pfSense + OFFER sent on the right iface/subnet, but REQUEST=0/ACK=0 because the reply doesn't reach the client). + Root cause is the switch (re-adopted with stale forwarding/bad port profile), NOT pfSense — fix = reset/re-adopt + that switch. See [[reference_cascades_fr_gpo_fix]] for other Cascades infra notes. diff --git a/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md b/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md new file mode 100644 index 00000000..4678c943 --- /dev/null +++ b/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md @@ -0,0 +1,105 @@ +# Incident Report — Cascades of Tucson Site-Wide Network Outage (Power Outage) + +- **Date:** 2026-06-17 +- **Client:** Cascades of Tucson +- **Severity:** High — total network outage (all APs, switches, clients) at a 24/7 HIPAA assisted-living facility +- **Status:** RESOLVED — full network back online +- **Logged by:** Howard Enos (Howard-Home). Mike Swanson worked the on-box/console recovery. +- **Purpose:** Basis for a power-outage ticket + corrective/preventive actions. + +## Summary + +A building power outage took the entire Cascades network down. The pfSense firewall lost power +**uncleanly** because it was plugged into the **surge-only side of the UPS, not the battery-backed +side**, so it hard-powered-off instead of riding through / shutting down gracefully. The unclean +shutdown led to a messy recovery: a **duplicate `dhcpd` process** on pfSense and **broken Layer-2 +forwarding on a 2nd-floor switch** that prevented APs/clients from completing DHCP. Recovery was +further delayed because **the Cox modem was not rebooted after the pfSense config restore**. +Restoring service required: re-seating pfSense onto the battery side (done), a clean pfSense config +restore from on-box auto-backups, eliminating the duplicate dhcpd, **resetting + re-adopting the +2nd-floor #2 switch** (which brought floors 2/3/4 back), and the Cox modem reboot. + +## Impact + +- **All 77 APs + all 12 switches showed disconnected; 0 clients; no DHCP** for the duration. +- Resident/staff WiFi, the Poly VoIP phones, and wired devices were down site-wide. +- Duration: power-loss through full restoration (afternoon/evening 2026-06-17). + +## Root Causes + +1. **Primary trigger — building power outage.** +2. **pfSense had NO battery protection** — the firewall was connected to the **surge-only outlets + of the battery backup, not the battery-backed outlets**, so it lost power instantly and shut + down uncleanly (the rest of the protected gear behaved differently). **RECTIFIED** — pfSense + moved to the battery-backed side. +3. **Unclean shutdown → degraded recovery on pfSense:** a **duplicate `dhcpd` instance** came up. + Two dhcpd processes fought over the same scopes, so clients got `DISCOVER → OFFER` but the + handshake never completed (`REQUEST/ACK` failed) → no IPs. +4. **2nd-floor switch L2 forwarding broke on re-adoption:** the switch came back "online" to the + controller (management/uplink OK) but **forwarded AP DHCP requests UP to pfSense while not + delivering pfSense's replies back DOWN to the APs** — an asymmetric/one-way forwarding failure + (stale forwarding state / port-profile not cleanly applied after the dirty power-up). +5. **Cox modem not rebooted after the pfSense restore** — after restoring the pfSense config, the + upstream Cox modem was not power-cycled, so the WAN did not fully re-establish, prolonging the + issues post-restore. (Modems commonly bind to the firewall's WAN MAC/DHCP state and need a + reboot after a firewall restore/replace.) + +## Detection & Diagnosis (key findings) + +- Controller (independent of the remote VPN) showed **77/77 APs + 12/12 switches disconnected, + 0 clients** — confirming a real, common-path outage, not a monitoring artifact. +- pfSense answered ICMP (`192.168.0.1` up) but its **SSH/management hung** initially and the GUI + briefly returned a **50x** (transient, services still starting after the unclean power-up). +- **ZFS pool healthy** (`all pools are healthy`) and **`config.xml` intact** (684,842 bytes, valid) + — no filesystem/config corruption; ZFS's copy-on-write protected it. +- Reading logs **directly** (pfSense 25.07 uses plain-text logs, not `clog`): DHCP server itself was + healthy (leases writing; clients that reached REQUEST got ACKed) — but APs/native-LAN clients were + stuck at OFFER. DORA breakdown for the 16 second-floor APs: **404 DISCOVER → 404 OFFER → 0 REQUEST + → 0 ACK**, offers correctly on `igc1` / `192.168.2-3.x` — proving pfSense was offering correctly + and the **replies were not reaching the APs** (downstream switch forwarding). +- Found **2 `dhcpd` processes** running (should be 1) — the duplicate explaining the broken completion. + +## Resolution (what fixed it) + +1. **pfSense re-seated onto the battery-backed UPS outlets** (corrective; prevents recurrence of #2). +2. **pfSense config restored** from the on-box auto-backup history (`/cf/conf/backup/config-*.xml`); + ZFS/config were intact so this was clean. +3. **Duplicate `dhcpd` eliminated** → single clean instance via `services_dhcpd_configure()`; DHCP + began completing (switches + Poly phones started getting IPs). +4. **Reset + re-adopted the 2nd-floor #2 switch** (`Switch 2nd Floor #2`, USL24PB, 192.168.2.193). + This cleared the broken forwarding state; **floors 3 and 4 came up immediately afterward** + (they were downstream of it), and the site returned to fully online. +5. **Cox modem rebooted** after the restore to re-establish the WAN. + +## Corrective Actions + +- [DONE] pfSense moved from the **surge-only** side to the **battery-backed** side of the UPS. + +## Preventive Actions / Follow-ups (for the ticket) + +1. **Enable Netgate AutoConfigBackup on pfSense** — there was **NO off-box config backup** at + incident time, which turned a blip into a scramble. ACB gives automatic encrypted cloud backups + on every change. (An off-box copy of the current + pre-outage config was captured locally during + the incident; should also be vaulted.) +2. **Verify full UPS coverage + runtime + clean shutdown:** the **NUT** package is configured on + pfSense, but the box still went down hard (it was on surge-only). Confirm that pfSense AND the + **core/aggregation switch + the AP-PoE switches** are all on **battery-backed** outlets with + adequate runtime, and test that NUT triggers a graceful pfSense shutdown on power loss. +3. **Post-restore runbook step:** after any pfSense config restore/replacement, **reboot the Cox + modem** to re-sync the WAN (this was the missed step that prolonged recovery). +4. **Vault the pfSense config** off-box (immediate safety net beyond ACB). +5. **Optional monitoring:** alert on DHCP-not-completing / duplicate-dhcpd / mass device-disconnect + so a future event is caught in minutes. + +## Reference Information + +- **pfSense:** `192.168.0.1`, Plus 25.07-RELEASE, ZFS. WAN: Cox (`igc0`, 184.191.143.x). LAN: `igc1` + 192.168.0.0/22 + per-room /28 VLANs (HIPAA isolation) + VLAN20 (`10.0.20.0/24`) + VLAN30 voice. + Admin cred vault `clients/cascades-tucson/pfsense-firewall`. Access via + `unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson`. +- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) `192.168.2.193`. Also + present: `Switch 2nd Floor` (US24PRO) `192.168.2.4`. +- **UniFi controller:** UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`. +- **Fleet:** 77 APs, 12 switches, ~573–788 clients. 22 Poly WiFi phones (OUI `48:25:67`, VLAN 20). +- **Key technical notes:** pfSense 25.07 logs are **plain text** (use `tail`/`grep`, NOT `clog`). + Clean DHCP restart on this box = `services_dhcpd_configure()` via `pfSsh.php`. diff --git a/errorlog.md b/errorlog.md index cb430802..4f5c8b93 100644 --- a/errorlog.md +++ b/errorlog.md @@ -17,6 +17,8 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure · +2026-06-18 | Howard-Home | pfsense-ssh/logs | [friction] used clog on pfSense 25.07 logs (now plain-text ASCII) -> empty output -> wrongly concluded DHCP log was empty / dhcpd not serving; cost a hypothesis. Read pfSense 25.07 logs with tail/grep/cat directly, NOT clog [ctx: ref=reference_pfsense_25_07_ops client=cascades-tucson] + 2026-06-17 | GURU-5070 | mailbox/365-mail | [correction] claimed in a prior session that /mailbox skill + memories were repointed off the deleted fabb3421 to the 365-mail suite, but mailbox.md still hardwired fabb3421 (token 401 AADSTS700016). Correct app is the dedicated ComputerGuru Mailbox app 1873b1b0 via get-token.sh 'mailbox' tier (cert auth); repointed mailbox.md + feedback_365_remediation_tool.md 2026-06-17. Lesson: verify the edit actually landed before reporting it done. 2026-06-17 | Howard-Home | wiki-compile/coord | [friction] skill doc Phase 6 shows 'lock release claudetools wiki//' but coord.py takes 'lock release '; wasted a round-trip. Capture the lock id from claim output and release by id. [ctx: ref=wiki-compile-skill]