sync: auto-sync from HOWARD-HOME at 2026-06-17 22:46:27

Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-17 22:46:27
2026-06-17 22:46:37 -07:00
parent dc4560cf27
commit f36fb97eb8
4 changed files with 136 additions and 0 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -101,6 +101,7 @@
 ### Cascades
 - [Cascades operational rules](feedback_cascades.md) — Active rules: (1) folder redirection (fdeploy) needs subfolders PRE-CREATED before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1. (2) ALWAYS ask which security group(s) a new user goes into — never auto-derive from OU. (3) Do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use. (4) NEVER change Cascades production infra (pfSense/UniFi/switches/DHCP) without discussing it + explicit per-change go — read-only/dry-run until then.
 - [Cascades FR GPO fix](reference_cascades_fr_gpo_fix.md) — Native Folder Redirection was DOA on every machine: redirect targets were in a misnamed `fdeploy1.ini` (Windows reads `fdeploy.ini`) → empty target path → silent no-op → per-user registry workaround every time. Fixed 2026-06-08 (correct fdeploy.ini + version bump). Also: CS-SERVER live RMM agent is `c39f1de7...` (old `6766e973` stale).
+- [pfSense 25.07 ops quirks](reference_pfsense_25_07_ops.md) — Cascades pfSense Plus 25.07: logs are PLAIN TEXT (use tail/grep, NOT clog → clog returns empty); clean dhcpd restart = `services_dhcpd_configure()` via slow pfSsh.php (needs 50s+ timeout); dirty boot can leave 2 dhcpd → DISCOVER/OFFER but no ACK; reboot the Cox modem after a config restore; ZFS survives power loss. From the 2026-06-17 power-outage incident.
 - [feedback_ascii_only_api_payloads](feedback_ascii_only_api_payloads.md) -- On Windows/Git-bash, non-ASCII chars (em-dash, arrow, smart quotes) in JSON payload TEXT passed to curl get mangled and rejected — Discord bot-alert returns 400, the coord API returns "error parsing the body". Use ASCII-only in API payload text, or a single-quoted heredoc.

 ## Machine
--- a/.claude/memory/reference_pfsense_25_07_ops.md
+++ b/.claude/memory/reference_pfsense_25_07_ops.md
@@ -0,0 +1,28 @@
+---
+name: reference_pfsense_25_07_ops
+description: pfSense Plus 25.07 operational quirks learned during the Cascades power-outage recovery — plain-text logs (NOT clog), clean dhcpd restart via pfSsh.php, reboot the upstream modem after a config restore, ZFS power-loss resilience
+metadata:
+  type: reference
+---
+
+Learned on the Cascades pfSense (`192.168.0.1`, Plus 25.07-RELEASE, ZFS) during the 2026-06-17
+power-outage recovery. Access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run "<cmd>"`
+(admin SSH = real shell). Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`.
+
+- **Logs are PLAIN TEXT (ASCII), not clog binary.** `clog /var/log/dhcpd.log` returns EMPTY on 25.07
+  → do NOT conclude "logs are empty / service not logging." Read with `tail`/`grep`/`cat` directly.
+  (Burned a whole hypothesis on this — the DHCP server was actually fine.) `file /var/log/*.log` → ASCII text.
+- **Clean single-instance DHCP restart from shell:** `echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php`
+  (regenerates `/var/dhcpd/etc/dhcpd.conf` + restarts ONE dhcpd; kills duplicates). A power-loss/dirty boot
+  can leave **two `dhcpd` processes** fighting → clients get DISCOVER→OFFER but never REQUEST/ACK.
+  Verify: `pgrep -f "dhcpd -user" | wc -l` should be **1**. Test config: `dhcpd -t -cf /var/dhcpd/etc/dhcpd.conf`.
+- **`pfSsh.php` is SLOW to load (~20-40s).** SSH commands that invoke it need a long timeout (50s+) or they
+  time out mid-run and you can't tell if the action took.
+- **After a pfSense config restore/replace, REBOOT the upstream modem** (Cox at Cascades) to re-sync the WAN —
+  skipping this prolongs post-restore issues. Add to any restore runbook.
+- **ZFS root is power-loss resistant** — `zpool status -x` → "all pools are healthy"; `config.xml` survived an
+  unclean power-off intact. A 50x on the GUI right after a dirty boot is usually transient (services still starting).
+- **DHCP "offers but never completes" on ONE segment/switch** = asymmetric L2 forwarding (DISCOVER reaches
+  pfSense + OFFER sent on the right iface/subnet, but REQUEST=0/ACK=0 because the reply doesn't reach the client).
+  Root cause is the switch (re-adopted with stale forwarding/bad port profile), NOT pfSense — fix = reset/re-adopt
+  that switch. See [[reference_cascades_fr_gpo_fix]] for other Cascades infra notes.
--- a/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md
+++ b/clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md
@@ -0,0 +1,105 @@
+# Incident Report — Cascades of Tucson Site-Wide Network Outage (Power Outage)
+
+- **Date:** 2026-06-17
+- **Client:** Cascades of Tucson
+- **Severity:** High — total network outage (all APs, switches, clients) at a 24/7 HIPAA assisted-living facility
+- **Status:** RESOLVED — full network back online
+- **Logged by:** Howard Enos (Howard-Home). Mike Swanson worked the on-box/console recovery.
+- **Purpose:** Basis for a power-outage ticket + corrective/preventive actions.
+
+## Summary
+
+A building power outage took the entire Cascades network down. The pfSense firewall lost power
+**uncleanly** because it was plugged into the **surge-only side of the UPS, not the battery-backed
+side**, so it hard-powered-off instead of riding through / shutting down gracefully. The unclean
+shutdown led to a messy recovery: a **duplicate `dhcpd` process** on pfSense and **broken Layer-2
+forwarding on a 2nd-floor switch** that prevented APs/clients from completing DHCP. Recovery was
+further delayed because **the Cox modem was not rebooted after the pfSense config restore**.
+Restoring service required: re-seating pfSense onto the battery side (done), a clean pfSense config
+restore from on-box auto-backups, eliminating the duplicate dhcpd, **resetting + re-adopting the
+2nd-floor #2 switch** (which brought floors 2/3/4 back), and the Cox modem reboot.
+
+## Impact
+
+- **All 77 APs + all 12 switches showed disconnected; 0 clients; no DHCP** for the duration.
+- Resident/staff WiFi, the Poly VoIP phones, and wired devices were down site-wide.
+- Duration: power-loss through full restoration (afternoon/evening 2026-06-17).
+
+## Root Causes
+
+1. **Primary trigger — building power outage.**
+2. **pfSense had NO battery protection** — the firewall was connected to the **surge-only outlets
+   of the battery backup, not the battery-backed outlets**, so it lost power instantly and shut
+   down uncleanly (the rest of the protected gear behaved differently). **RECTIFIED** — pfSense
+   moved to the battery-backed side.
+3. **Unclean shutdown → degraded recovery on pfSense:** a **duplicate `dhcpd` instance** came up.
+   Two dhcpd processes fought over the same scopes, so clients got `DISCOVER → OFFER` but the
+   handshake never completed (`REQUEST/ACK` failed) → no IPs.
+4. **2nd-floor switch L2 forwarding broke on re-adoption:** the switch came back "online" to the
+   controller (management/uplink OK) but **forwarded AP DHCP requests UP to pfSense while not
+   delivering pfSense's replies back DOWN to the APs** — an asymmetric/one-way forwarding failure
+   (stale forwarding state / port-profile not cleanly applied after the dirty power-up).
+5. **Cox modem not rebooted after the pfSense restore** — after restoring the pfSense config, the
+   upstream Cox modem was not power-cycled, so the WAN did not fully re-establish, prolonging the
+   issues post-restore. (Modems commonly bind to the firewall's WAN MAC/DHCP state and need a
+   reboot after a firewall restore/replace.)
+
+## Detection & Diagnosis (key findings)
+
+- Controller (independent of the remote VPN) showed **77/77 APs + 12/12 switches disconnected,
+  0 clients** — confirming a real, common-path outage, not a monitoring artifact.
+- pfSense answered ICMP (`192.168.0.1` up) but its **SSH/management hung** initially and the GUI
+  briefly returned a **50x** (transient, services still starting after the unclean power-up).
+- **ZFS pool healthy** (`all pools are healthy`) and **`config.xml` intact** (684,842 bytes, valid)
+  — no filesystem/config corruption; ZFS's copy-on-write protected it.
+- Reading logs **directly** (pfSense 25.07 uses plain-text logs, not `clog`): DHCP server itself was
+  healthy (leases writing; clients that reached REQUEST got ACKed) — but APs/native-LAN clients were
+  stuck at OFFER. DORA breakdown for the 16 second-floor APs: **404 DISCOVER → 404 OFFER → 0 REQUEST
+  → 0 ACK**, offers correctly on `igc1` / `192.168.2-3.x` — proving pfSense was offering correctly
+  and the **replies were not reaching the APs** (downstream switch forwarding).
+- Found **2 `dhcpd` processes** running (should be 1) — the duplicate explaining the broken completion.
+
+## Resolution (what fixed it)
+
+1. **pfSense re-seated onto the battery-backed UPS outlets** (corrective; prevents recurrence of #2).
+2. **pfSense config restored** from the on-box auto-backup history (`/cf/conf/backup/config-*.xml`);
+   ZFS/config were intact so this was clean.
+3. **Duplicate `dhcpd` eliminated** → single clean instance via `services_dhcpd_configure()`; DHCP
+   began completing (switches + Poly phones started getting IPs).
+4. **Reset + re-adopted the 2nd-floor #2 switch** (`Switch 2nd Floor #2`, USL24PB, 192.168.2.193).
+   This cleared the broken forwarding state; **floors 3 and 4 came up immediately afterward**
+   (they were downstream of it), and the site returned to fully online.
+5. **Cox modem rebooted** after the restore to re-establish the WAN.
+
+## Corrective Actions
+
+- [DONE] pfSense moved from the **surge-only** side to the **battery-backed** side of the UPS.
+
+## Preventive Actions / Follow-ups (for the ticket)
+
+1. **Enable Netgate AutoConfigBackup on pfSense** — there was **NO off-box config backup** at
+   incident time, which turned a blip into a scramble. ACB gives automatic encrypted cloud backups
+   on every change. (An off-box copy of the current + pre-outage config was captured locally during
+   the incident; should also be vaulted.)
+2. **Verify full UPS coverage + runtime + clean shutdown:** the **NUT** package is configured on
+   pfSense, but the box still went down hard (it was on surge-only). Confirm that pfSense AND the
+   **core/aggregation switch + the AP-PoE switches** are all on **battery-backed** outlets with
+   adequate runtime, and test that NUT triggers a graceful pfSense shutdown on power loss.
+3. **Post-restore runbook step:** after any pfSense config restore/replacement, **reboot the Cox
+   modem** to re-sync the WAN (this was the missed step that prolonged recovery).
+4. **Vault the pfSense config** off-box (immediate safety net beyond ACB).
+5. **Optional monitoring:** alert on DHCP-not-completing / duplicate-dhcpd / mass device-disconnect
+   so a future event is caught in minutes.
+
+## Reference Information
+
+- **pfSense:** `192.168.0.1`, Plus 25.07-RELEASE, ZFS. WAN: Cox (`igc0`, 184.191.143.x). LAN: `igc1`
+  192.168.0.0/22 + per-room /28 VLANs (HIPAA isolation) + VLAN20 (`10.0.20.0/24`) + VLAN30 voice.
+  Admin cred vault `clients/cascades-tucson/pfsense-firewall`. Access via
+  `unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson`.
+- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) `192.168.2.193`. Also
+  present: `Switch 2nd Floor` (US24PRO) `192.168.2.4`.
+- **UniFi controller:** UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`.
+- **Fleet:** 77 APs, 12 switches, ~573–788 clients. 22 Poly WiFi phones (OUI `48:25:67`, VLAN 20).
+- **Key technical notes:** pfSense 25.07 logs are **plain text** (use `tail`/`grep`, NOT `clog`).
+  Clean DHCP restart on this box = `services_dhcpd_configure()` via `pfSsh.php`.
--- a/errorlog.md
+++ b/errorlog.md
@@ -17,6 +17,8 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·

 <!-- Append entries below this line -->

+2026-06-18 | Howard-Home | pfsense-ssh/logs | [friction] used clog on pfSense 25.07 logs (now plain-text ASCII) -> empty output -> wrongly concluded DHCP log was empty / dhcpd not serving; cost a hypothesis. Read pfSense 25.07 logs with tail/grep/cat directly, NOT clog [ctx: ref=reference_pfsense_25_07_ops client=cascades-tucson]
+
 2026-06-17 | GURU-5070 | mailbox/365-mail | [correction] claimed in a prior session that /mailbox skill + memories were repointed off the deleted fabb3421 to the 365-mail suite, but mailbox.md still hardwired fabb3421 (token 401 AADSTS700016). Correct app is the dedicated ComputerGuru Mailbox app 1873b1b0 via get-token.sh 'mailbox' tier (cert auth); repointed mailbox.md + feedback_365_remediation_tool.md 2026-06-17. Lesson: verify the edit actually landed before reporting it done.

 2026-06-17 | Howard-Home | wiki-compile/coord | [friction] skill doc Phase 6 shows 'lock release claudetools wiki/<type>/<slug>' but coord.py takes 'lock release <id>'; wasted a round-trip. Capture the lock id from claim output and release by id. [ctx: ref=wiki-compile-skill]