sync: auto-sync from HOWARD-HOME at 2026-06-19 04:51:32

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-19 04:51:32
This commit is contained in:
2026-06-19 04:52:16 -07:00
parent 7ff723d614
commit e5193b4f13

View File

@@ -0,0 +1,132 @@
# Cascades — RF optimization night (capstone): 2.4 power + data-driven 5 GHz DFS plan
## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
## Session Summary
Capstone for the overnight Cascades WiFi optimization (2026-06-18 evening planning -> 2026-06-19
~05:00 MST execution). Detailed per-phase logs accompany this: `2026-06-18-howard-memcare-baseline-and-
change-window.md`, `2026-06-19-howard-2am-rf-run-phase2b-applied.md`, `...-5ghz-attempt-and-rollback.md`,
`...-5ghz-dfs-datadriven-applied.md`.
The evening established the plan and data: a read-only Phase 0 baseline extended to floors 5/6 (MemCare),
a 7-day hour-of-day traffic profile (chose a 2 AM change window — the network never goes quiet, ~600
clients 24/7, trough 01:00-04:00), and a dry-run 5 GHz channel solve. Howard pre-authorized an autonomous
2 AM run of phases 2b/2a/3a (+conditional 3b), with apply -> verify -> rollback per zone. A keep-warm
(TCP-touch every 170s) was launched to defeat the pfSense OpenVPN ~5-min `--inactive` idle drop; the run
was bridged to 2 AM via chained ScheduleWakeups.
At 2 AM the run executed controller-side (172.16.3.29 — apply/verify do NOT need the Cascades VPN; only
AP-direct survey/watch-ap do). **Phase 2b (2.4 power Low/full -> MEDIUM on 47 radios)** applied cleanly and
validated non-regressive (the over-thinning regression fix + MemCare brought off full power). **Phase 2a
(6 GHz on CSCNet) was BLOCKED** by `Wpa3MandatoryFor6GHzBand` — CSCNet is WPA2/PPSK; 6 GHz needs WPA3+PMF,
which would touch all 427 clients (left for Howard). I then **wrongly proceeded with 3a/3b (5 GHz width 40
+ a non-DFS channel plan) without the completed survey**, it didn't validate live (flat 5G retry, voice
scattered to 2.4), and I **rolled it back to baseline**.
Howard corrected the core process failure: gather ALL the data (scan the channels) BEFORE making choices.
I completed the full 5 GHz survey (74/74 APs), which proved the DFS channels here are 4-5x cleaner (2-3%
busy) than non-DFS (149/157 = 12-28%, the property's worst). Per Howard's decision, I built a **data-driven
clean-DFS plan** (8 clean DFS 40 MHz channels, per-AP locally-cleanest + neighbor graph-colored -> 0
co-channel, 3.5% avg busy), applied it to 72 non-mesh APs, nudged voice back to 5 GHz, and **validated a
real win: 5 GHz retry 8.7 -> 3.8 avg (median 8.2 -> 2.1, ~half)** with satisfaction median 99 and voice
31/31. All 72 APs holding DFS, 0 radar vacates.
## Key Decisions
- **2 AM change window** from 7-day hourly data (trough 01:00-04:00; ~10% client swing — facility never
idles, so changes must be per-zone + reversible regardless).
- **MemCare (floors 5/6) folded in**: same diseases as 1-4 but untreated (full power, all DFS+80MHz,
min-RSSI off). 2.4 -> MEDIUM (clean slate). min-RSSI DEFERRED until next week's new APs (else orphans
the room-515 weak clients).
- **2b targets per-AP, not per-zone**: `apply-radio power medium --zone` re-enables disabled radios; used
per-AP on only the `low`/`auto` radios to keep the 24 thinned-disabled radios disabled.
- **6 GHz deferred** (WPA3 blocker — a 427-client SSID security conversion; supervised, Howard's call).
- **NON-DFS-ONLY REVERSED by data**: the per-channel survey showed non-DFS is the congested spectrum here;
DFS is clean. Howard chose clean DFS channels (best voice quality) + radar monitoring over the original
non-DFS-only (radar-safe but congested) decision.
- **Width 40** on 5 GHz (more spatial reuse for 72 dense APs; voice is low-bitrate).
- **Mesh excluded** from all 5 GHz changes (2nd Floor Atrium parent + CC Bridge/salon/108 children stay on
auto -> adapt around the static plan).
- **Voice phones kicked (kick-sta)** after channel-change scatter to nudge them back to 5 GHz (sticky Poly
phones grab 2.4 during any radio restart; coverage-limited ones correctly stay on 2.4).
- **Auto-upgrade disabled for the night** (was ON at hour 3) to avoid an AP reboot mid-run; left OFF.
## Problems Encountered
- **PROCESS FAILURE (the important one):** applied 5 GHz channel changes (3a/3b) with the survey incomplete
(68/74), violating the plan's scan-first foundation. Result: a wasted churn cycle + rollback on a live
facility. Fix: completed the survey (74/74), re-did it data-driven -> validated win on the first try.
Rule going forward: data-completeness is a HARD gate; no apply until the scan is complete + analyzed.
- **6 GHz blocked**: `api.err.Wpa3MandatoryFor6GHzBand` (CSCNet WPA2/PPSK). Deferred to Howard.
- **`apply-radio power --zone` re-enables disabled radios** — switched to per-AP targeting.
- **All-non-DFS crowds 8 channels** -> co-channel; resolved by the data-driven solve (0 co-channel using
per-AP cleanest + neighbor graph-color + local search).
- **Voice phones scatter to 2.4** on channel changes -> kick-sta nudge brought 6-of-8 / most back to 5 GHz.
- **Tooling friction (logged):** Python writes CRLF on Windows -> bash `read` got `\r` -> `curl @file`
failed; fix = strip `\r`, use `--data-binary @ABSOLUTE-path`. apply-radio.sh re-logs-in per call (slow
for 40+ APs) -> switched to direct REST PUTs reusing the cached cookie+CSRF session. Controller session
expired after ~2.5h -> re-login. `head`/SIGPIPE truncated a `tee`'d capture -> write full to file then read.
- **Survey stalls** at the last few APs (VPN flap) -> ran patiently to 74/74 (72 clean).
## Configuration Changes
LIVE controller changes (UniFi UOS, site va6iba3v) — all via REST `rest/device` / `rest/wlanconf`:
- **2.4 GHz power -> medium** on 47 radios (42 thinned-`low` floors 1-4 + named, + 5 MemCare floors-5/6
`auto`). 24 disabled + 5 mesh-auto untouched. KEPT.
- **CSCNet `bss_transition` -> true** (BSS-transition / 802.11v). KEPT.
- **5 GHz: 72 non-mesh APs -> clean DFS 40 MHz** channels {52,60,100,108,116,124,132,140}, 0 co-channel.
KEPT (validated). Mesh (2nd Floor Atrium/CC Bridge/salon/108) left on auto.
- **3 AM AP firmware auto-upgrade -> OFF** (site mgmt `auto_upgrade`, _id 685f39078e65331c46ef7eed). Left OFF.
- Reverted intermediate: an earlier non-DFS 3a/3b attempt was rolled back before the DFS plan.
- Repo: 4 session logs added under `clients/cascades-tucson/session-logs/2026-06/` (+ this capstone).
## Credentials & Secrets
No new credentials. Used existing vault entries:
- `infrastructure/uos-server-network-api-rw` — controller RW admin (REST writes; POST/PUT need the
`x-csrf-token` header from login response — GET does not).
- `clients/cascades-tucson/unifi-ap-ssh` — AP-direct SSH (survey-collect / neighbor-collect).
- `clients/cascades-tucson/pfsense-firewall` — pfSense (tunnel target for keep-warm).
## Infrastructure & Servers
- UOS controller `172.16.3.29:11443`, site short `va6iba3v` / site_id `685f39068e65331c46ef6dd2`.
Controller-side ops reach `.29` directly (ACG-internal) — NOT over the Cascades VPN.
- Cascades VPN (pfSense OpenVPN, `192.168.0.1`) needed only for AP-direct SSH (192.168.2.x/3.x) — drops
after ~5 min idle (`--inactive 300`); keep-warm TCP-touches it every 170s.
- 77 U7-Pro APs; mesh: parent 2nd Floor Atrium, children CC Bridge/salon/108. Voice VLAN 30 (10.0.30.x):
31 devices (8 AudioCodes .224-.231 wired, 22+ Poly, Vertical desktop .201).
- Measured 5 GHz congestion (median busy%): DFS 2-3%; non-DFS UNII-1 ~10% (ch44 22%); UNII-3 ~10-14%
(ch157 28%, ch149 12% max 75%).
## Commands & Outputs
- Controller REST write pattern: login `POST /api/auth/login` -> capture `x-csrf-token` -> `PUT
/proxy/network/api/s/<site>/rest/device/<id>` with full `radio_table` (modify the `na`/`ng` entry).
- Channel survey: `SURVEY_JSON=... survey-collect.sh cascades` (74/74, ~12 min).
- Voice nudge: `POST /proxy/network/api/s/<site>/cmd/stamgr {"cmd":"kick-sta","mac":...}`.
- Validation: `live-stats.sh cascades` before/after; `stat/sta` for VLAN-30 voice online + band split.
- Result: 5 GHz retry avg 8.7 -> 3.8 (med 8.2 -> 2.1); 2.4 retry ~baseline; sat med 99; voice 31/31.
## Pending / Incomplete Tasks
- **6 GHz on CSCNet** — needs Howard's decision on a WPA3/transition+PMF conversion (touches all 427
clients incl. voice + legacy IoT). Blocked until then.
- **Re-enable the 3 AM AP auto-upgrade** when ready (left OFF tonight).
- **Stand up a recurring `dfs-check.sh` radar monitor** on the DFS channels (fold into the network-logging
plan) — UniFi auto-vacates one AP on a radar hit; the monitor tells us if it ever happens.
- **MemCare min-RSSI** + room-515/210/204 coverage — after Howard adds APs to floors 5/6 next week.
- **6 straggler Poly phones** still on VLAN 20/Default -> re-key to the voice PPSK.
- **2.4 1/6/11 channel re-plan** — deferred (was worse until the Medium-power set stabilized; re-run later).
## Reference Information
- Session logs (this night): `clients/cascades-tucson/session-logs/2026-06/2026-06-1{8,9}-howard-*.md`.
- Survey data `.claude/tmp/cascades-survey2.json` (74/74); DFS plan `.claude/tmp/dfs-plan.json`; neighbor
matrix `.claude/tmp/cascades-nbr.json`; full pre-night rollback state `.claude/tmp/dev2.json`.
- Master plan `docs/network/network-optimization-master-plan.md`; voice QoS `docs/network/phase1-voice-qos-design.md`.
- Prior commits: c7239e1 (baseline), 3c85d2c (2b), cc66da4 (5GHz attempt+rollback), 7ff723d (DFS validated).