sync: auto-sync from HOWARD-HOME at 2026-06-17 22:46:27

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-17 22:46:27
2026-06-17 22:46:37 -07:00
parent dc4560cf27
commit f36fb97eb8
4 changed files with 136 additions and 0 deletions


@@ -101,6 +101,7 @@
### Cascades
- [Cascades operational rules](feedback_cascades.md) — Active rules: (1) folder redirection (fdeploy) needs subfolders PRE-CREATED before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1. (2) ALWAYS ask which security group(s) a new user goes into — never auto-derive from OU. (3) Do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use. (4) NEVER change Cascades production infra (pfSense/UniFi/switches/DHCP) without discussing it + explicit per-change go — read-only/dry-run until then.
- [Cascades FR GPO fix](reference_cascades_fr_gpo_fix.md) — Native Folder Redirection was DOA on every machine: redirect targets were in a misnamed `fdeploy1.ini` (Windows reads `fdeploy.ini`) → empty target path → silent no-op → per-user registry workaround every time. Fixed 2026-06-08 (correct fdeploy.ini + version bump). Also: CS-SERVER live RMM agent is `c39f1de7...` (old `6766e973` stale).
- [pfSense 25.07 ops quirks](reference_pfsense_25_07_ops.md) — Cascades pfSense Plus 25.07: logs are PLAIN TEXT (use tail/grep, NOT clog → clog returns empty); clean dhcpd restart = `services_dhcpd_configure()` via slow pfSsh.php (needs 50s+ timeout); dirty boot can leave 2 dhcpd → DISCOVER/OFFER but no ACK; reboot the Cox modem after a config restore; ZFS survives power loss. From the 2026-06-17 power-outage incident.
- [feedback_ascii_only_api_payloads](feedback_ascii_only_api_payloads.md) -- On Windows/Git-bash, non-ASCII chars (em-dash, arrow, smart quotes) in JSON payload TEXT passed to curl get mangled and rejected — Discord bot-alert returns 400, the coord API returns "error parsing the body". Use ASCII-only in API payload text, or a single-quoted heredoc.
## Machine


@@ -0,0 +1,28 @@
---
name: reference_pfsense_25_07_ops
description: pfSense Plus 25.07 operational quirks learned during the Cascades power-outage recovery — plain-text logs (NOT clog), clean dhcpd restart via pfSsh.php, reboot the upstream modem after a config restore, ZFS power-loss resilience
metadata:
type: reference
---
Learned on the Cascades pfSense (`192.168.0.1`, Plus 25.07-RELEASE, ZFS) during the 2026-06-17
power-outage recovery. Access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run "<cmd>"`
(admin SSH = real shell). Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`.
- **Logs are PLAIN TEXT (ASCII), not clog binary.** `clog /var/log/dhcpd.log` returns EMPTY on 25.07
→ do NOT conclude "logs are empty / service not logging." Read with `tail`/`grep`/`cat` directly.
(Burned a whole hypothesis on this — the DHCP server was actually fine.) `file /var/log/*.log` → ASCII text.
- **Clean single-instance DHCP restart from shell:** `echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php`
(regenerates `/var/dhcpd/etc/dhcpd.conf` + restarts ONE dhcpd; kills duplicates). A power-loss/dirty boot
can leave **two `dhcpd` processes** fighting → clients get DISCOVER→OFFER but never REQUEST/ACK.
Verify: `pgrep -f "dhcpd -user" | wc -l` should be **1**. Test config: `dhcpd -t -cf /var/dhcpd/etc/dhcpd.conf`.
- **`pfSsh.php` is SLOW to load (~20-40s).** SSH commands that invoke it need a long timeout (50s+) or they
time out mid-run and you can't tell whether the action took effect.
- **After a pfSense config restore/replace, REBOOT the upstream modem** (Cox at Cascades) to re-sync the WAN —
skipping this prolongs post-restore issues. Add to any restore runbook.
- **ZFS root is power-loss resistant** — `zpool status -x` → "all pools are healthy"; `config.xml` survived an
unclean power-off intact. A 50x on the GUI right after a dirty boot is usually transient (services still starting).
- **DHCP "offers but never completes" on ONE segment/switch** = asymmetric L2 forwarding (DISCOVER reaches
pfSense + OFFER sent on the right iface/subnet, but REQUEST=0/ACK=0 because the reply doesn't reach the client).
Root cause is the switch (re-adopted with stale forwarding/bad port profile), NOT pfSense — fix = reset/re-adopt
that switch. See [[reference_cascades_fr_gpo_fix]] for other Cascades infra notes.
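
The timeout note above can be sketched as a small wrapper. This is a minimal illustration, not part of the original notes: the function name `run_bounded` and the 60-second value are assumptions, and it relies on the standard `timeout` utility, which exits with status 124 when the limit is hit.

```shell
#!/bin/sh
# Hypothetical helper (not from the original notes): run a command under a
# time limit and say so explicitly when the limit is what ended it.
run_bounded() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the exit status `timeout` uses when the limit was reached.
        echo "timed out after ${limit}s: outcome unknown, verify before retrying" >&2
    fi
    return "$rc"
}

# Assumed usage from the workstation, giving pfSsh.php more than 50s:
#   run_bounded 60 bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh \
#       cascades-tucson run 'echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php'
```

If the wrapper reports a timeout, the `pgrep -f "dhcpd -user" | wc -l` check is the way to confirm whether the restart actually happened before running it again.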


@@ -0,0 +1,105 @@
# Incident Report — Cascades of Tucson Site-Wide Network Outage (Power Outage)
- **Date:** 2026-06-17
- **Client:** Cascades of Tucson
- **Severity:** High — total network outage (all APs, switches, clients) at a 24/7 HIPAA assisted-living facility
- **Status:** RESOLVED — full network back online
- **Logged by:** Howard Enos (Howard-Home). Mike Swanson worked the on-box/console recovery.
- **Purpose:** Basis for a power-outage ticket + corrective/preventive actions.
## Summary
A building power outage took the entire Cascades network down. The pfSense firewall lost power
**uncleanly** because it was plugged into the **surge-only side of the UPS, not the battery-backed
side**, so it hard-powered-off instead of riding through / shutting down gracefully. The unclean
shutdown led to a messy recovery: a **duplicate `dhcpd` process** on pfSense and **broken Layer-2
forwarding on a 2nd-floor switch** that prevented APs/clients from completing DHCP. Recovery was
further delayed because **the Cox modem was not rebooted after the pfSense config restore**.
Restoring service required: re-seating pfSense onto the battery side (done), a clean pfSense config
restore from on-box auto-backups, eliminating the duplicate dhcpd, **resetting + re-adopting the
2nd-floor #2 switch** (which brought floors 2/3/4 back), and the Cox modem reboot.
## Impact
- **All 77 APs + all 12 switches showed disconnected; 0 clients; no DHCP** for the duration.
- Resident/staff WiFi, the Poly VoIP phones, and wired devices were down site-wide.
- Duration: power-loss through full restoration (afternoon/evening 2026-06-17).
## Root Causes
1. **Primary trigger — building power outage.**
2. **pfSense had NO battery protection** — the firewall was connected to the **surge-only outlets
of the battery backup, not the battery-backed outlets**, so it lost power instantly and shut
down uncleanly (the rest of the protected gear behaved differently). **RECTIFIED** — pfSense
moved to the battery-backed side.
3. **Unclean shutdown → degraded recovery on pfSense:** a **duplicate `dhcpd` instance** came up.
Two dhcpd processes fought over the same scopes, so clients got `DISCOVER → OFFER` but the
handshake never completed (`REQUEST/ACK` failed) → no IPs.
4. **2nd-floor switch L2 forwarding broke on re-adoption:** the switch came back "online" to the
controller (management/uplink OK) but **forwarded AP DHCP requests UP to pfSense while not
delivering pfSense's replies back DOWN to the APs** — an asymmetric/one-way forwarding failure
(stale forwarding state / port-profile not cleanly applied after the dirty power-up).
5. **Cox modem not rebooted after the pfSense restore** — after restoring the pfSense config, the
upstream Cox modem was not power-cycled, so the WAN did not fully re-establish, prolonging the
issues post-restore. (Modems commonly bind to the firewall's WAN MAC/DHCP state and need a
reboot after a firewall restore/replace.)
## Detection & Diagnosis (key findings)
- Controller (independent of the remote VPN) showed **77/77 APs + 12/12 switches disconnected,
0 clients** — confirming a real, common-path outage, not a monitoring artifact.
- pfSense answered ICMP (`192.168.0.1` up) but its **SSH/management hung** initially and the GUI
briefly returned a **50x** (transient, services still starting after the unclean power-up).
- **ZFS pool healthy** (`all pools are healthy`) and **`config.xml` intact** (684,842 bytes, valid)
— no filesystem/config corruption; ZFS's copy-on-write protected it.
- Reading logs **directly** (pfSense 25.07 uses plain-text logs, not `clog`): DHCP server itself was
healthy (leases writing; clients that reached REQUEST got ACKed) — but APs/native-LAN clients were
stuck at OFFER. DORA breakdown for the 16 second-floor APs: **404 DISCOVER → 404 OFFER → 0 REQUEST
→ 0 ACK**, offers correctly on `igc1` / `192.168.2-3.x` — proving pfSense was offering correctly
and the **replies were not reaching the APs** (downstream switch forwarding).
- Found **2 `dhcpd` processes** running (should be 1); the duplicate was a separate cause of handshakes failing to complete.
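
The DORA breakdown above can be reproduced with plain `grep` against the text log. The following is a rough sketch under stated assumptions: the function name `dora_counts` is hypothetical, the optional filter is any fixed string (a MAC address or interface name), and the message names are the ones ISC `dhcpd` writes to its log.

```shell
#!/bin/sh
# Hypothetical helper (not from the incident tooling): count each DHCP
# message type in a plain-text dhcpd log, optionally restricted to lines
# containing a fixed string such as a MAC address or interface name.
dora_counts() {
    log="$1"
    filter="${2:-}"
    out=""
    for msg in DHCPDISCOVER DHCPOFFER DHCPREQUEST DHCPACK; do
        if [ -n "$filter" ]; then
            n=$(grep -F -e "$filter" "$log" | grep -c -F -e "$msg")
        else
            n=$(grep -c -F -e "$msg" "$log")
        fi
        # Normalise any padding in the count before building the summary.
        n=$(printf '%s' "$n" | tr -d '[:space:]')
        out="${out}${out:+ }${msg}=${n}"
    done
    echo "$out"
}

# Assumed usage on the firewall:
#   dora_counts /var/log/dhcpd.log igc1
```

A summary where DHCPDISCOVER and DHCPOFFER are non-zero but DHCPREQUEST and DHCPACK are both 0 matches the pattern seen for the second-floor APs.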
## Resolution (what fixed it)
1. **pfSense re-seated onto the battery-backed UPS outlets** (corrective; prevents recurrence of root cause #2).
2. **pfSense config restored** from the on-box auto-backup history (`/cf/conf/backup/config-*.xml`);
ZFS/config were intact so this was clean.
3. **Duplicate `dhcpd` eliminated** → single clean instance via `services_dhcpd_configure()`; DHCP
began completing (switches + Poly phones started getting IPs).
4. **Reset + re-adopted the 2nd-floor #2 switch** (`Switch 2nd Floor #2`, USL24PB, 192.168.2.193).
This cleared the broken forwarding state; **floors 3 and 4 came up immediately afterward**
(they were downstream of it), and the site returned to fully online.
5. **Cox modem rebooted** after the restore to re-establish the WAN.
## Corrective Actions
- [DONE] pfSense moved from the **surge-only** side to the **battery-backed** side of the UPS.
## Preventive Actions / Follow-ups (for the ticket)
1. **Enable Netgate AutoConfigBackup on pfSense** — there was **NO off-box config backup** at
incident time, which turned a blip into a scramble. ACB gives automatic encrypted cloud backups
on every change. (An off-box copy of the current + pre-outage config was captured locally during
the incident; should also be vaulted.)
2. **Verify full UPS coverage + runtime + clean shutdown:** the **NUT** package is configured on
pfSense, but the box still went down hard (it was on surge-only). Confirm that pfSense AND the
**core/aggregation switch + the AP-PoE switches** are all on **battery-backed** outlets with
adequate runtime, and test that NUT triggers a graceful pfSense shutdown on power loss.
3. **Post-restore runbook step:** after any pfSense config restore/replacement, **reboot the Cox
modem** to re-sync the WAN (this was the missed step that prolonged recovery).
4. **Vault the pfSense config** off-box (immediate safety net beyond ACB).
5. **Optional monitoring:** alert on DHCP-not-completing / duplicate-dhcpd / mass device-disconnect
so a future event is caught in minutes.
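
The optional monitoring item could start from a check like the one below. It is a sketch only: the function name `dhcp_alerts`, its argument order, and the "any OFFERs with zero ACKs" threshold are assumptions, and how the alert is delivered is left out. Mass device-disconnect detection is not covered here.

```shell
#!/bin/sh
# Hypothetical check (not an existing script): flag the two DHCP failure
# signatures from this incident. Arguments: OFFER count, ACK count, and
# the number of running dhcpd processes, all gathered by the caller.
dhcp_alerts() {
    offers="$1"
    acks="$2"
    procs="$3"
    rc=0
    if [ "$procs" -ne 1 ]; then
        echo "ALERT: expected 1 dhcpd process, found $procs"
        rc=1
    fi
    if [ "$offers" -gt 0 ] && [ "$acks" -eq 0 ]; then
        echo "ALERT: $offers OFFERs but 0 ACKs (DHCP not completing)"
        rc=1
    fi
    return "$rc"
}

# Assumed usage with the figures recorded during the incident:
#   dhcp_alerts 404 0 2
```

Keeping the counts as arguments means the same check works whether the numbers come from the firewall log, the controller, or a test fixture.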
## Reference Information
- **pfSense:** `192.168.0.1`, Plus 25.07-RELEASE, ZFS. WAN: Cox (`igc0`, 184.191.143.x). LAN: `igc1`
192.168.0.0/22 + per-room /28 VLANs (HIPAA isolation) + VLAN20 (`10.0.20.0/24`) + VLAN30 voice.
Admin cred vault `clients/cascades-tucson/pfsense-firewall`. Access via
`unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson`.
- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) `192.168.2.193`. Also
present: `Switch 2nd Floor` (US24PRO) `192.168.2.4`.
- **UniFi controller:** UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`.
- **Fleet:** 77 APs, 12 switches, ~573788 clients. 22 Poly WiFi phones (OUI `48:25:67`, VLAN 20).
- **Key technical notes:** pfSense 25.07 logs are **plain text** (use `tail`/`grep`, NOT `clog`).
Clean DHCP restart on this box = `services_dhcpd_configure()` via `pfSsh.php`.


@@ -17,6 +17,8 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·
<!-- Append entries below this line -->
2026-06-18 | Howard-Home | pfsense-ssh/logs | [friction] used clog on pfSense 25.07 logs (now plain-text ASCII) -> empty output -> wrongly concluded DHCP log was empty / dhcpd not serving; cost a hypothesis. Read pfSense 25.07 logs with tail/grep/cat directly, NOT clog [ctx: ref=reference_pfsense_25_07_ops client=cascades-tucson]
2026-06-17 | GURU-5070 | mailbox/365-mail | [correction] claimed in a prior session that /mailbox skill + memories were repointed off the deleted fabb3421 to the 365-mail suite, but mailbox.md still hardwired fabb3421 (token 401 AADSTS700016). Correct app is the dedicated ComputerGuru Mailbox app 1873b1b0 via get-token.sh 'mailbox' tier (cert auth); repointed mailbox.md + feedback_365_remediation_tool.md 2026-06-17. Lesson: verify the edit actually landed before reporting it done.
2026-06-17 | Howard-Home | wiki-compile/coord | [friction] skill doc Phase 6 shows 'lock release claudetools wiki/<type>/<slug>' but coord.py takes 'lock release <id>'; wasted a round-trip. Capture the lock id from claim output and release by id. [ctx: ref=wiki-compile-skill]