sync: auto-sync from HOWARD-HOME at 2026-06-17 23:09:29
Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-17 23:09:29
@@ -0,0 +1,149 @@
# Cascades: 5GHz overlap analysis + Option-B plan, then POWER-OUTAGE incident recovery + config vaulting

- **Date:** 2026-06-17
- **Machine:** Howard-Home
- **Client:** Cascades of Tucson
- **Continuation of:** `2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md` (same session)
- **Incident report (ticket basis):** `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`

## User

- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech

## Session Summary

Continued the Cascades whole-network-smoothing work from the phone-drops investigation. Built
the AP-to-AP 5GHz overlap picture from the cached neighbor SNR matrix (`.claude/tmp/cascades-nbr.json`,
6/16 AP collect) crossed with live channels: real co-channel overlap was MODERATE (25 genuinely
co-channel pairs out of 173 strong neighbor pairs), and width 80->40 alone only fixed 7 of 25 (28%);
the rest are same-channel collisions needing a channel re-plan. Howard then had UniFi auto-change
the channels; a recheck showed it made overlap WORSE (25->30 co-channel, with a new ch36 cluster on the
three busiest APs, salon/2nd-Floor-Atrium/CC-Bridge, at 36-48 dB). A `channel-plan na` dry-run got
co-channel to 0, but only by going non-DFS, which requires 40MHz width (at 80MHz, non-DFS = 2 blocks).
Conclusion: width + channel are coupled; the real fix is **Option B = 40MHz + non-DFS optimized
channels + min-RSSI relax**, done as a combined per-AP radio_table PUT. Howard chose Option B and
asked to park/save the 5GHz plan and validate the 2.4GHz (Low->Medium) plan separately.

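The width-versus-channel coupling described above can be illustrated with a small sketch. It assumes US 5GHz channelization and treats two neighbors as co-channel when their bonded 20MHz channel sets intersect. The AP names, the `RADIOS`/`PAIRS` sample data, and the function names are hypothetical; the real layout of `cascades-nbr.json` is not shown in this log.

```python
# First 20 MHz channel of each 80 MHz block in the 5 GHz band (US channelization).
BLOCK_STARTS_80 = (36, 52, 100, 116, 132, 149)


def occupied_channels(primary, width_mhz):
    """Return the set of 20 MHz channels covered when `primary` is bonded to `width_mhz`."""
    for start in BLOCK_STARTS_80:
        block = [start + 4 * i for i in range(4)]
        if primary in block:
            idx = block.index(primary)
            if width_mhz == 80:
                return frozenset(block)
            if width_mhz == 40:
                lo = idx - idx % 2  # lower channel of the 40 MHz pair
                return frozenset(block[lo:lo + 2])
            return frozenset({primary})
    return frozenset({primary})  # e.g. channel 165, which is 20 MHz only


def co_channel_pairs(pairs, radios, width_override=None):
    """List the strong-neighbor pairs whose occupied channels overlap."""
    hits = []
    for a, b in pairs:
        ch_a, w_a = radios[a]
        ch_b, w_b = radios[b]
        if width_override is not None:
            w_a = w_b = width_override
        if occupied_channels(ch_a, w_a) & occupied_channels(ch_b, w_b):
            hits.append((a, b))
    return hits


# Hypothetical sample: AP name -> (primary channel, width in MHz).
RADIOS = {"ap-a": (36, 80), "ap-b": (44, 80), "ap-c": (36, 80), "ap-d": (149, 80)}
PAIRS = [("ap-a", "ap-b"), ("ap-a", "ap-c"), ("ap-a", "ap-d")]

print(len(co_channel_pairs(PAIRS, RADIOS)))              # overlap at 80 MHz
print(co_channel_pairs(PAIRS, RADIOS, width_override=40))  # what narrowing leaves behind
```

In this toy data, narrowing to 40MHz separates `ap-a` from `ap-b` (different halves of the same 80MHz block) but cannot help `ap-a` and `ap-c`, which share a primary channel. That is the same-channel case that needs a re-plan rather than a width change.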
Mid-2.4-diagnostic, a **building POWER OUTAGE** took the entire Cascades network down (all 77 APs +
12 switches disconnected, 0 clients). Spent the bulk of the session driving the recovery read-only
from Howard-Home (VPN up; controller via UOS). Root cause chain: pfSense was plugged into the
**surge-only side of the UPS, not the battery side**, so it lost power and went down uncleanly. ZFS + config
survived (pools healthy, config.xml valid). The dirty boot left a **duplicate dhcpd** (DISCOVER->OFFER
but no REQUEST/ACK) and a **2nd-floor switch with one-way L2 forwarding** (APs' DISCOVERs reached
pfSense, pfSense offered correctly on igc1/192.168.2-3.x, but the OFFERs never made it back to the APs -> 404 DISCOVER /
404 OFFER / 0 REQUEST / 0 ACK across the 16 second-floor APs). Confirmed pfSense was healthy (full
health check: services up, WAN up, no interface errors, no storm, state table fine). Killed the
duplicate dhcpd and restarted a single clean instance via `services_dhcpd_configure()`. Mike worked
the on-box/console side: moved pfSense onto the battery outlets (rectified), restored config from
the on-box auto-backup (the 12:20 version, WITH today's VLAN30 work), **reset+re-adopted the 2nd-floor
#2 switch** (which brought floors 3/4 up), and rebooted the Cox modem (the missed post-restore step).
Network fully restored.

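The DISCOVER/OFFER/REQUEST/ACK tally used as the L2 proof can be reproduced from a plain-text dhcpd log with a short script. This is a minimal sketch, assuming ISC-dhcpd-style message lines; the sample lines, MAC addresses, and function names below are made up for illustration and are not the incident's actual log output.

```python
import re
from collections import Counter

DORA_TYPES = ("DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK")
DORA_RE = re.compile(r"\b(" + "|".join(DORA_TYPES) + r")\b")
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def tally_dora(lines, oui="0c:ea:14"):
    """Count DORA message types for clients whose MAC starts with `oui`."""
    counts = Counter(dict.fromkeys(DORA_TYPES, 0))
    for line in lines:
        msg = DORA_RE.search(line)
        mac = MAC_RE.search(line)
        if msg and mac and mac.group(0).lower().startswith(oui):
            counts[msg.group(1)] += 1
    return counts


def offers_not_completing(counts):
    """True when the server offers but no client ever follows up with a REQUEST."""
    return (counts["DHCPDISCOVER"] > 0 and counts["DHCPOFFER"] > 0
            and counts["DHCPREQUEST"] == 0 and counts["DHCPACK"] == 0)


# Illustrative sample only (hypothetical MACs and addresses).
SAMPLE_LOG = [
    "dhcpd[1234]: DHCPDISCOVER from 0c:ea:14:00:00:01 via igc1",
    "dhcpd[1234]: DHCPOFFER on 192.168.2.50 to 0c:ea:14:00:00:01 via igc1",
    "dhcpd[1234]: DHCPDISCOVER from 0c:ea:14:00:00:02 via igc1",
    "dhcpd[1234]: DHCPOFFER on 192.168.3.51 to 0c:ea:14:00:00:02 via igc1",
    "dhcpd[1234]: DHCPREQUEST for 192.168.1.20 from aa:bb:cc:00:00:01 via igc1",
    "dhcpd[1234]: DHCPACK on 192.168.1.20 to aa:bb:cc:00:00:01 via igc1",
]

counts = tally_dora(SAMPLE_LOG)
print(dict(counts), offers_not_completing(counts))
```

Equal DISCOVER and OFFER counts with zero REQUESTs point away from the DHCP server and toward the return path, which is how the second-floor switch was singled out.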
Closed out by writing the incident report (ticket basis), capturing reusable pfSense-25.07 lessons
to memory + an errorlog `--friction` entry (clog-vs-plaintext-logs), and **vaulting the pfSense
config** (verified VLAN30 present first). While vaulting, `vault check` caught a PRE-EXISTING
plaintext credential (`synology-signin-portal.sops.yaml`) committed plaintext in vault history
(commit 1fbc0e1); encrypted it go-forward and flagged the credential as exposed (needs rotation).

## Key Decisions

- **5GHz: go with Option B (40MHz + non-DFS optimized channels + min-RSSI relax)** as one combined
  per-AP radio_table PUT, because width and channel are coupled: width alone is weak (7/25 pairs)
  and a clean non-DFS channel plan requires narrowing to 40MHz. Parked pending execution (no change applied).
- **Measured before acting on 5GHz:** the neighbor-matrix data overrode the theory (overlap moderate,
  width change weak, auto-channel made it worse), exactly what Howard asked for.
- **Recovery driven READ-ONLY** while Mike owned the on-box restore; the only pfSense change made was
  killing the duplicate dhcpd + clean restart (the DHCP fix Mike was waiting on).
- **Did NOT blind-reboot pfSense** (it was the edge device and it was up; the problem was downstream). Diagnosed the 2nd-floor
  switch as the L2 culprit; Mike's reset+re-adopt of switch #2 confirmed it (floors 3/4 followed).
- **Vaulted the post-restore LIVE config** (not the incident-time copy), after verifying VLAN30 present.
- **Encrypted the pre-existing plaintext synology entry** rather than leaving it: vault `check` is the
  mandated pre-commit gate; flagged the historical exposure for rotation.

## Problems Encountered

- **POWER OUTAGE** -> total site outage. Root cause: pfSense on the UPS surge-only side (no battery) ->
  unclean shutdown. RECTIFIED (moved to battery side by Mike).
- **Duplicate dhcpd** after dirty boot -> DHCP offered but never completed. Fixed: `killall dhcpd` +
  `echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php` -> single clean instance; ACKs resumed.
- **2nd-floor switch one-way L2 forwarding** (online to controller, but not passing OFFERs down to APs).
  Fixed by Mike: reset + re-adopt of `Switch 2nd Floor #2` (USL24PB) -> floors 3/4 recovered.
- **Cox modem not rebooted after the pfSense restore** -> WAN didn't fully re-establish, which prolonged the outage.
  Fixed by rebooting the modem. (Now a runbook step.)
- **clog on pfSense 25.07 returned empty** -> I wrongly concluded "DHCP log empty / dhcpd not serving."
  25.07 logs are PLAIN TEXT; read with tail/grep. Cost time chasing a wrong hypothesis. Logged `--friction`.
- **`pfSsh.php` is slow to load (~20-40s)** -> several restart commands timed out mid-run (the command
  timeout was too short for the load time). Use 50s+ and verify after.
- **sops encrypt failed from outside the vault dir** ("no creation rules"), leaving a 730KB PLAINTEXT
  config on disk briefly. Fixed by replicating vault-helper's pattern: `( cd "$VAULT_DIR" && sops --encrypt --in-place <relpath> )`.
- **Pre-existing plaintext vault entry** (`synology-signin-portal.sops.yaml`) found by `vault check`;
  encrypted it. It was committed plaintext at vault commit 1fbc0e1 -> credential in git history (exposed).

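The "use a generous timeout, then verify" lesson from the `pfSsh.php` item can be sketched as follows. This is an illustrative pattern only, not the tooling used in the session: `run_with_timeout` and `count_pids` are hypothetical helper names, and the demo commands are harmless stand-ins rather than the real pfSense invocation.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run a command; return (finished, stdout). finished is False if the timeout hit."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A timeout says nothing about whether the remote action took effect,
        # so the caller still has to verify the end state separately.
        return False, ""
    return True, done.stdout


def count_pids(pgrep_output):
    """Count PIDs in pgrep-style output (one numeric PID per line)."""
    return sum(1 for line in pgrep_output.splitlines() if line.strip().isdigit())


# Stand-in for a slow command: it outlives a too-short timeout.
finished, _ = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
print("finished:", finished)

# Verification step on captured pgrep output: exactly one dhcpd is the healthy state.
print("single instance:", count_pids("4242\n") == 1)
print("duplicate:", count_pids("4242\n4310\n") == 2)
```

Separating "did the command return" from "is the end state correct" keeps a timed-out restart from being read as either a success or a failure.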
## Configuration Changes

- **pfSense (192.168.0.1):** killed duplicate dhcpd + clean restart via `services_dhcpd_configure()`
  (operational, no config.xml change). All other pfSense access was read-only.
- **pfSense hardware (Mike):** moved from UPS surge-only outlets to battery-backed outlets.
- **pfSense config (Mike):** restored from on-box auto-backup (12:20 version, VLAN30 intact). Cox modem rebooted.
- **UniFi (Mike):** reset + re-adopted `Switch 2nd Floor #2` (USL24PB, 192.168.2.193).
- **UniFi channels (Howard, earlier):** auto-channel reassignment on 5GHz (made co-channel overlap worse; to be re-planned via Option B).
- **Repo:** created `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`; created
  memory `.claude/memory/reference_pfsense_25_07_ops.md` + MEMORY.md index line; errorlog `--friction` entry.
- **Vault:** created `clients/cascades-tucson/pfsense-config-backup-2026-06-17.sops.yaml` (encrypted, VLAN30 verified);
  encrypted the pre-existing-plaintext `clients/cascades-tucson/synology-signin-portal.sops.yaml`.

## Credentials & Secrets

- **pfSense config backup vaulted:** `clients/cascades-tucson/pfsense-config-backup-2026-06-17` (kind: note,
  config under encrypted `content:` field; decrypts cleanly; VLAN30 present). Vault commit 3735119.
- **[EXPOSED, ROTATE] Cascades Synology Cloud Signin Portal** (`account.synology.com`): the entry
  `clients/cascades-tucson/synology-signin-portal.sops.yaml` was committed **plaintext** at vault commit
  **1fbc0e1** and remains recoverable from vault git history. Now encrypted go-forward, but the credential
  should be considered exposed and rotated. The same commit also added an "MDM service account" + "WiFi CSCNet";
  verify those were never plaintext in history.
- pfSense admin SSH: vault `clients/cascades-tucson/pfsense-firewall` (unchanged).

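A plaintext file hiding behind a `.sops.yaml` name can be spotted with a simple heuristic, sketched below. This is not how `vault check` works (its internals are not shown in this log). It only assumes that a sops-encrypted YAML file carries a top-level `sops:` metadata key and `ENC[AES256_GCM,...]` values. The function names are hypothetical.

```python
from pathlib import Path


def looks_sops_encrypted(text):
    """Heuristic: sops metadata block present AND at least one encrypted value."""
    has_metadata = any(line.startswith("sops:") for line in text.splitlines())
    has_ciphertext = "ENC[AES256_GCM," in text
    return has_metadata and has_ciphertext


def find_plaintext_entries(vault_dir):
    """Return relative paths of *.sops.yaml files that do not look encrypted."""
    root = Path(vault_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*.sops.yaml")
        if not looks_sops_encrypted(path.read_text(encoding="utf-8"))
    )


if __name__ == "__main__":
    import sys
    # Usage: python check_plaintext.py /path/to/vault
    for rel in find_plaintext_entries(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("NOT ENCRYPTED:", rel)
```

A check like this only covers the working tree. A file that was ever committed in plaintext stays recoverable from git history, which is why the Synology credential is flagged for rotation and not just re-encrypted.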
## Infrastructure & Servers

- **Cascades pfSense** `192.168.0.1`, Plus 25.07-RELEASE, **ZFS** (power-loss resilient; pools healthy).
  WAN igc0 = Cox (184.191.143.x). LAN igc1 = 192.168.0.0/22 + per-room /28 VLANs + VLAN20 (10.0.20.0/24) +
  **VLAN30 Voice opt241/igc1.30 (10.0.30.0/24, DHCP .100-.250): VERIFIED INTACT post-restore, status active.**
  Logs PLAIN TEXT (not clog). Clean dhcpd restart = `services_dhcpd_configure()` via pfSsh.php.
- **Switch that needed reset+re-adopt:** `Switch 2nd Floor #2` (USL24PB) 192.168.2.193. Also `Switch 2nd Floor`
  (US24PRO) 192.168.2.4.
- **UniFi UOS controller** 172.16.3.29, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`. 77 APs, 12 switches.
- **5GHz overlap data:** neighbor SNR matrix `.claude/tmp/cascades-nbr.json` (6/16); live channels via
  Network API (live-stats; read cred only is `-rw`, RW cred `infrastructure/uos-server-network-api-rw` vaulted).
- **16 second-floor APs (DHCP-stuck during incident):** names 203/204/206/209/210/217/221/222/229/236/237/240/241/247/248 + 2nd Floor Atrium; UniFi OUI 0c:ea:14.

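As a quick sanity check on the addressing above, the sketch below uses Python's standard `ipaddress` module to confirm which listed hosts fall inside the LAN /22 and the Voice /24. The subnets and addresses are taken from this section; the helper name `in_net` is just for illustration.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.0.0/22")    # igc1 LAN
VOICE = ipaddress.ip_network("10.0.30.0/24")    # VLAN30 Voice


def in_net(ip, net):
    """True if the dotted-quad address belongs to the given network."""
    return ipaddress.ip_address(ip) in net


# The /22 spans 192.168.0.0 through 192.168.3.255, so the 192.168.2-3.x offers
# seen during the incident were inside the LAN scope.
print(LAN.network_address, LAN.broadcast_address, LAN.num_addresses)

for host in ("192.168.0.1", "192.168.2.193", "192.168.2.4", "172.16.3.29"):
    print(host, "in LAN:", in_net(host, LAN))

print("10.0.30.100 in Voice:", in_net("10.0.30.100", VOICE))
```

The controller at 172.16.3.29 sits outside the site LAN, consistent with it being reached over the VPN and not on-site.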
## Commands & Outputs

- DORA breakdown (2nd-floor APs, the L2 proof): `404 DISCOVER -> 404 OFFER -> 0 REQUEST -> 0 ACK`, offers via igc1 on 192.168.2-3.x.
- DHCP fix: `killall dhcpd; echo "services_dhcpd_configure();" | /usr/local/sbin/pfSsh.php`; verify `pgrep -f "dhcpd -user" | wc -l` == 1.
- Read pfSense logs DIRECTLY (NOT clog): `tail`/`grep` on `/var/log/dhcpd.log`. `file /var/log/*.log` -> ASCII text.
- ZFS health: `zpool status -x` -> "all pools are healthy".
- VLAN30 verify: `grep "<ipaddr>10.0.30.1" config.xml` -> opt241 "Voice" igc1.30; DHCP from 10.0.30.100 to .250; `ifconfig igc1.30` inet 10.0.30.1 status active.
- Vault file encrypt pattern: `( cd "$VAULT_DIR" && sops --encrypt --in-place <relpath> )`.
- 5GHz overlap: 173 strong neighbor pairs; 25 co-channel @80MHz (auto-change pushed to 30); width 80->40 fixes only 7; channel-plan na -> 0 (non-DFS, needs 40MHz).

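The VLAN30 `grep` check can also be done structurally by parsing `config.xml`. The sketch below assumes the usual pfSense layout (`<interfaces>/<optN>` with `<if>`, `<descr>`, `<ipaddr>`, and a matching `<dhcpd>/<optN>/<range>`). The embedded XML is a trimmed, hand-written sample built from the values in this log, not the real config, and `find_interface` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed pfSense config.xml shape (not the real file).
SAMPLE_CONFIG = """<pfsense>
  <interfaces>
    <lan><if>igc1</if><descr>LAN</descr><ipaddr>192.168.0.1</ipaddr><subnet>22</subnet></lan>
    <opt241><if>igc1.30</if><descr>Voice</descr><ipaddr>10.0.30.1</ipaddr><subnet>24</subnet></opt241>
  </interfaces>
  <dhcpd>
    <opt241><range><from>10.0.30.100</from><to>10.0.30.250</to></range></opt241>
  </dhcpd>
</pfsense>"""


def find_interface(root, ipaddr):
    """Return name, device, description and DHCP range of the interface with `ipaddr`."""
    interfaces = root.find("interfaces")
    if interfaces is None:
        return None
    for iface in interfaces:
        if iface.findtext("ipaddr") == ipaddr:
            rng = root.find(f"dhcpd/{iface.tag}/range")
            return {
                "name": iface.tag,
                "if": iface.findtext("if"),
                "descr": iface.findtext("descr"),
                "dhcp_from": rng.findtext("from") if rng is not None else None,
                "dhcp_to": rng.findtext("to") if rng is not None else None,
            }
    return None


print(find_interface(ET.fromstring(SAMPLE_CONFIG), "10.0.30.1"))
```

Parsing the file ties the interface and its DHCP range together in one lookup, whereas a bare `grep` on the address only proves the string exists somewhere in the file.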
## Pending / Incomplete Tasks

- **Execute Option B (5GHz)** when ready: combined per-AP radio_table PUT = ng power medium (42 low radios) +
  na ht 40 (76) + na non-DFS optimized channels (channel-plan na) + na min_rssi -82 (69). Per-zone, evening window. Dry-run staged; nothing applied.
- **2.4 Low->Medium bump** still pending validation/execution (the diagnostic was interrupted by the outage).
- **Incident follow-ups (for the ticket):** enable Netgate AutoConfigBackup (no off-box backup existed);
  verify full UPS coverage/runtime/clean-shutdown (NUT installed but box was on surge-only; confirm core +
  PoE switches are battery-backed); add "reboot Cox modem after pfSense restore" to the runbook.
- **[SECURITY] Rotate the Cascades Synology signin credential** (exposed in vault history, commit 1fbc0e1);
  verify the MDM service account + CSCNet entries from that commit were never plaintext.
- Standing rule: no Cascades prod-infra changes without discussing + explicit per-change go (memory feedback_cascades #4).

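The combined per-AP change in the Option B item can be pictured as one merge over a device's `radio_table`. This is a sketch under stated assumptions: the field names (`radio`, `ht`, `channel`, `tx_power_mode`, `min_rssi_enabled`, `min_rssi`) are assumed from commonly seen UniFi device JSON and are not confirmed by this log. `apply_option_b` is a hypothetical helper that only builds the new table and sends nothing to the controller.

```python
import copy

# Non-DFS 20 MHz channels in the US 5 GHz band (assumption: US regulatory domain).
NON_DFS_5GHZ = {36, 40, 44, 48, 149, 153, 157, 161}


def apply_option_b(radio_table, na_channel, min_rssi=-82):
    """Return a modified copy of radio_table; the input is left untouched."""
    if na_channel not in NON_DFS_5GHZ:
        raise ValueError(f"channel {na_channel} is not a non-DFS 5 GHz channel")
    updated = copy.deepcopy(radio_table)
    for radio in updated:
        if radio.get("radio") == "ng" and radio.get("tx_power_mode") == "low":
            radio["tx_power_mode"] = "medium"   # 2.4 GHz: Low -> Medium
        elif radio.get("radio") == "na":
            radio["ht"] = 40                    # 5 GHz: narrow to 40 MHz
            radio["channel"] = na_channel       # planned non-DFS channel
            radio["min_rssi_enabled"] = True
            radio["min_rssi"] = min_rssi        # relaxed min-RSSI
    return updated


# Hypothetical current state for one AP.
current = [
    {"radio": "ng", "channel": 6, "ht": 20, "tx_power_mode": "low"},
    {"radio": "na", "channel": 100, "ht": 80, "tx_power_mode": "auto"},
]
print(apply_option_b(current, na_channel=149))
```

Building the whole table in one pass matches the reason given for Option B: width and channel are coupled, so changing them in separate steps would leave an AP briefly in a worse intermediate state.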
## Reference Information

- Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`
- Memory: `reference_pfsense_25_07_ops.md` (new), `feedback_cascades.md` #4 (no-change-without-discussing).
- Vault: `clients/cascades-tucson/pfsense-config-backup-2026-06-17` (new, commit 3735119);
  `clients/cascades-tucson/synology-signin-portal` (encrypted; exposed in history commit 1fbc0e1).
- pfSense access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run "<cmd>"`.
- Prior session log (phone drops + smoothing plan): `2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md`.