# Cascades: Network Optimization Master Plan (all devices, holistic)

- **Created:** 2026-06-18 (Howard-Home / claude-main)
- **Status:** PLAN, for execution tonight (floors 1–4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round.
- **Goal:** Fix the *system*, not one device at a time. Improve quality for **every** client (~587), not just the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another.
- **Builds on:** `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md` (RF mechanics + gated apply commands), `reports/2026-06-18-voice-quality-diagnostic.md`, and the live 2026-06-18 fleet sample. All RF changes use the gated `unifi-wifi` scripts (per-zone, dry-run, rollback JSON).

---

## 1. Current state (what's actually true right now)

| Layer | State | Verdict |
|---|---|---|
| **2.4 GHz** | **OVER-THINNED.** Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32–48%) BUT **retry rose 17→23.4%, satisfaction fell 39→30** (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | **Regressed: must correct power floor** |
| **5 GHz** | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean, 0 radar). Channels biased to busy upper (149/157). **AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients** (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). | **Constrain width + spread channels + relieve hotspots** |
| **6 GHz** | 75 radios live, **~1 client.** Root cause: **CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`). Cleanest untapped capacity. | **Open it: the relief valve** |
| **QoS** | **NONE.** Voice now isolated on VLAN 30 but not prioritized: voice packets compete with data under load → jitter/breaks. | **Add: guaranteed win, now possible** |
| **pfSense/WAN/DHCP/DNS** | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | **Fine: add voice QoS shaping only** |
| **Switching / physical** | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | **Physical work: not tonight, but tracked** |

---

## 2. Root-cause model (why "some devices" are bad)

Three compounding RF causes, plus a missing QoS layer:

1. **2.4 GHz contention**: extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz suffers. Made *worse* by the over-thinning (weaker signal → more retransmits).
2. **5 GHz over-width + hotspots**: 80 MHz halves the usable channel count → co-channel overlap → retries; a few APs (103) are simply overloaded.
3. **6 GHz unused**: the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz.
4. **No voice prioritization**: even with perfect RF, voice breaks under data bursts without QoS.

**The trap we must avoid (the "whack-a-mole"):** narrowing 5 GHz to 40 MHz *without* first opening 6 GHz pushes more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's already too low) starves edge clients. **Sequence matters.**

---

## 3. The holistic sequence (open relief valves BEFORE constraining)

> Principle: **(A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize
> → (D) fine-tune → validate at every gate.** Each step is reversible; gate on live metrics before the next.

### PHASE 0: Pre-flight + baseline (always)

- VPN up; `live-stats.sh cascades | head -3` (expect 77 APs).
- Baseline (compare after, same time-of-day): `live-stats.sh cascades > .claude/tmp/opt-pre.txt`; `radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt`.
- Pick a watch AP per floor (`watch-ap.sh `).
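
Because the baseline is only meaningful against a capture taken at the same time of day, it may help to record the clock time next to each capture and check it before comparing. The following is a minimal sketch, not part of the `unifi-wifi` scripts: the helper names `minutes_of_day` and `same_time_of_day` and the 30-minute default tolerance are assumptions made for illustration.

```shell
# Hypothetical helpers (assumption: not part of the unifi-wifi scripts).

# minutes_of_day HH:MM -> prints minutes since midnight
minutes_of_day() {
  mod_hh=${1%%:*}; mod_mm=${1##*:}
  # Strip one leading zero so "08"/"09" are not read as octal
  mod_hh=${mod_hh#0}; mod_mm=${mod_mm#0}
  echo $(( ${mod_hh:-0} * 60 + ${mod_mm:-0} ))
}

# same_time_of_day PRE POST [TOLERANCE_MIN]
# Exit 0 if the two clock times are within the tolerance (default 30 min,
# an assumed value), measured on a 24h clock so 23:50 vs 00:10 counts as 20 min.
same_time_of_day() {
  std_a=$(minutes_of_day "$1")
  std_b=$(minutes_of_day "$2")
  std_tol=${3:-30}
  if [ "$std_a" -ge "$std_b" ]; then
    std_diff=$(( std_a - std_b ))
  else
    std_diff=$(( std_b - std_a ))
  fi
  if [ "$std_diff" -gt 720 ]; then std_diff=$(( 1440 - std_diff )); fi
  [ "$std_diff" -le "$std_tol" ]
}

# Usage: pre-change capture at 21:40, post-change capture at 21:55
same_time_of_day 21:40 21:55 && echo "captures are comparable"
```

A post-change capture that fails this check is better retaken the next day at the matching hour than compared as-is, since load varies hourly.
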
### PHASE 1: QoS for voice (orthogonal, lowest risk, but INSURANCE, not the everyday fix)

Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load.

> **Reframe (measured 2026-06-18):** WAN1 fiber upload is **~522 Mbps** vs ~98 Mbps peak usage: huge
> headroom, so the WAN is **not** the day-to-day voice bottleneck (that's RF, Phases 2–4). QoS still earns its
> place as insurance for **WAN2 (coax) failover** and **rare WAN1 saturation** (you hit 680 Mbps down). Build
> it (cheap, correct), but don't expect it to fix the complaints; the RF work does. Full design:
> `docs/network/phase1-voice-qos-design.md`.

**Phones confirmed marking DSCP EF** → rely on DSCP; the subnet match is the safety net.

- **UniFi (WLAN/switch):** ensure WMM/QoS is on; the AudioCodes/Poly phones tag voice with DSCP, so trust/honor it. On the USW, voice VLAN traffic should hit the high-priority queue.
- **pfSense:** add a traffic-shaper/limiter or floating QoS rule that puts `VOICE net (10.0.30.0/24)` DSCP EF (46) / RTP UDP into a priority queue on the WAN(s). Low risk: additive, voice-only.
- **Validate:** place test calls during a data-heavy moment; confirm no breakup. (No RF change here.)
- *Skill gap:* the `unifi-wifi` skill has no QoS verb, so this is a pfSense + UniFi config task; consider a small `voice-qos` helper later.

### PHASE 2: Open the relief valves (capacity + correct the regression)

**2a. Enable 6 GHz on CSCNet + steering** (creates the offload path BEFORE we narrow 5 GHz):

```
apply-wlan.sh cascades bands all --wlan CSCNet --apply     # -> [2g,5g,6g]
apply-wlan.sh cascades bsstm on --wlan CSCNet --apply      # 802.11v BSS-transition (assists up-band + roam)
```

Band-steering (`no2ghz_oui`) is already ON. Wi-Fi 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz.
Validate: client mix shifts toward 6g; no SSID-visibility loss for legacy clients (2.4/5 stay on).

**2b. Correct the 2.4 over-thinning: Low → MEDIUM on kept radios** (restores edge signal; keeps cells smaller than full power).
Per floor, dry-run then apply; regenerate the kept-radio list live:

```
for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \
  apply-radio.sh cascades ng power medium --zone "$z" --apply; done   # ~12–15 dBm
```

Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio (`ng enable --ap ""`).

**Gate:** re-measure retry%/satisfaction at the same time of day vs `opt-pre.txt`; expect retry back down from ~23% and satisfaction recovering.

### PHASE 3: Constrain + optimize 5 GHz (now that 6 GHz absorbs load)

**3a. Width 80 → 40 MHz** (doubles non-overlapping channels → spatial reuse):

```
for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \
  apply-radio.sh cascades na width 40 --zone "$z" --apply; done       # rollback: na width 80
```

**3b. Channel plan: NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification).**
Use **UNII-1 (36–48) + UNII-3 (149–165) only**; do NOT use DFS channels (52–144) on this voice-critical network. A precise radar-detection sweep (real `radar found`/`NOL` signatures, CAC/control housekeeping excluded) found **ZERO genuine hits across all 53 DFS APs**, BUT the window was only ~21–23h (APs rebooted ~23h ago, the 6/17 outage). Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single hit forces a 30-min channel vacate = **dropped calls**, which is unacceptable for voice. **Resilience > diversity.** The lost 5 GHz channel count is covered by **6 GHz (Phase 2a) absorbing capacity**; this is WHY 6 GHz comes first.

```
SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades
SURVEY_JSON=$SURVEY channel-plan.sh cascades na     # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone
```

**Periodic DFS monitoring:** the ~1-day window isn't conclusive, so add a recurring precise `dfs-check.sh` (fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just confirms the choice stays right.

**3c. Relieve AP 103 specifically** (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to a clean channel from the plan, 40 MHz. Verify Lauren `.202` retry drops after.

**Gate:** 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded.

### PHASE 4: Fine-tune (after 1–3 settle)

- **2.4 channel plan 1/6/11** (graph-color; co-channel pairs 92→35) + **pin the 4 off-plan APs** (128/108/108U7/salon) to 1/6/11.
- **2.4 min-RSSI ON** for the 6 APs where it's OFF (615/608/505/517/622/salon). *Note 505/517/615/608/622 are Floors 5/6 → DEFER with the rest of 5/6*; do `salon` only this round.
- **Roaming for voice continuity:** confirm 802.11k/v on CSCNet (r optional; test first, some phones dislike 802.11r). Keeps calls alive when staff walk between APs.
- **min-RSSI tuning:** only tighten where sticky-client far-AP behavior is proven; too aggressive a threshold blocks association.

### PHASE 5: Physical (separate visit, not tonight, but it caps results)

- Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput).
- Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run.
- Revisit the p38 (1st Floor USW) 4% tx-drop after the above.

---

## 4. Interdependency map (read before changing anything)

- **6 GHz BEFORE 5 GHz 40 MHz**: else 5 GHz congestion just relocates. (Phase 2a before 3a.)
- **2.4 power MEDIUM not LOW**: Low already over-thinned; going lower starves edge clients. (Phase 2b.)
- **AP-lock needs AP capacity**: Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion.
- **QoS is independent**: do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.)
- **Disables + power-down compound**: never do both aggressively in the same area; we already saw the satisfaction hit.
- **min-RSSI + power interact**: raising min-RSSI while lowering power can orphan clients; tune one lever at a time.
- **Mesh-protected APs** (`2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108`): never disable; power changes only with watch.

## 5. Data-driven decision framework: improve quality for ALL devices (measure → decide → adjust)

**Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions.** Each change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is *all* devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones.

### 5.1 How each change affects the OTHER (non-voice) devices

Almost every change targets the **shared RF environment**, so it helps everyone; voice is just the most sensitive canary:

| Change | Effect on non-voice devices |
|---|---|
| QoS (voice VLAN priority) | **Negligible**: voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load |
| Enable 6 GHz on CSCNet | **Positive**: 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left |
| 2.4 Low→Medium power | **Positive for ALL 2.4 devices**: undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) |
| 5 GHz 80→40 MHz | **Net positive (reliability), small peak-speed cost**: density win; a lone heavy transfer sees lower peak |
| AP 103 relief | **Positive for all 12 clients on 103**, not just Lauren |
| 2.4 1/6/11 channel plan | **Positive for all 2.4 devices** (less co-channel) |
| Phone-side (Vertical) | Phones only, **no effect on others** |

### 5.2 Trade-offs to WATCH in the data (don't help voice by hurting others)

1. **Non-DFS 5 GHz + 5 GHz-only devices**: the DirecTV fleet + older laptops **can't use 6 GHz**, so they stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; **watch non-DFS 5 GHz cu_total**, and if it climbs, that's the signal to rebalance.
2. **min-RSSI** affects every client on the AP; too aggressive a setting orphans weak IoT/resident devices. Tune gently.
3. **40 MHz** trades single-user peak for fleet reliability: right in density, but it is a trade.

### 5.3 Fleet-wide metrics we pull at every gate (the data we decide on)

Same time-of-day comparison (load varies hourly). Capture before each change and ~15 min after:

- `live-stats.sh cascades` → per-band **avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts**
- `radio-usage.sh cascades ` → per-AP outliers (saturated/high-retry APs)
- `/stat/sta` band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count
- Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole)

### 5.4 The GATE decision rule (per change, per zone)

- **KEEP + proceed** only if: the target metric improved **AND** fleet-wide **satisfaction did not fall**, **retry% did not rise**, band split moved the intended way, and **no AP lost its clients** (no hole).
- **HOLD** (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more.
- **ROLLBACK that step** if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint.
- Do **one lever per zone at a time** so cause/effect is attributable (the over-thinning happened because power-down + disables were stacked).

### 5.5 Rollback

Every `apply-radio`/`apply-wlan` writes a rollback JSON to `.claude/tmp/`; use `device-control poe-cycle` for a hung AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals.

## 6. Out of scope tonight (explicit)

- **Floors 5 & 6 (MemCare)**: all RF + the MemCare voice phones (`.217/.218/.219/.220`) DEFERRED per Howard.
- Physical cabling / offline switches (Phase 5, separate visit).
- The 6 straggler phones: Howard is re-keying them separately; they'll benefit from the RF work regardless.

## 7. Open decisions for Howard

1. ~~**5 GHz channel plan:** clean-DFS vs non-DFS-only~~ **RESOLVED 2026-06-18: NON-DFS ONLY** (UNII-1 36–48 + UNII-3 149–165).
   DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.)
2. **QoS depth:** UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive).
3. **802.11r** on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first.
4. Tonight's stopping point: Phases 1–2 alone are a legitimate, lower-risk night; 3–4 can be a second night.
5. ~~**Dedicated voice SSID?**~~ **RESOLVED 2026-06-18: NO, voice stays on the shared CSCNet PPSK.** UniFi 3-SSID cap (sound RF hygiene: each SSID = beacon airtime overhead at 77 APs). The only retirement candidate, CSC ENT, still has **131 active clients** (staff PCs, printers, DirecTV fleet) → not retireable soon. And it's not needed: **QoS is VLAN/DSCP-based (SSID-independent)**, band preference is best done **phone-side** (Vertical), and roaming/power-save are phone+AP settings, so all of it works on the shared SSID. A dedicated voice SSID would only add voice-specific WiFi *policy* (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only if/when CSC ENT's 131 clients migrate off it.
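
Whichever stopping point is chosen, every step is kept or reverted by the gate rule in section 5.4. A minimal sketch of that rule as a shell function follows. It is an illustration only: the function name `gate_decision`, its argument order, and the tie-breaks are assumptions. In particular it assumes that any fleet-wide retry rise or satisfaction fall means ROLLBACK even when the target metric improved, and that a step with no improvement and no regression is a HOLD.

```shell
# Hypothetical sketch of the section 5.4 gate (assumption: not an existing script).

# lt A B -> exit 0 if A < B; awk is used so decimals such as 23.4 compare correctly
lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'; }

# gate_decision TARGET_OK SAT_PRE SAT_POST RETRY_PRE RETRY_POST SPLIT_OK HOLES COMPLAINTS
#   TARGET_OK / SPLIT_OK : 1 if the target metric improved / band split moved as intended
#   SAT_* / RETRY_*      : fleet-wide satisfaction and retry% before and after the change
#   HOLES / COMPLAINTS   : count of APs that lost their clients / user complaints
gate_decision() {
  gd_target=$1; gd_sat_pre=$2; gd_sat_post=$3
  gd_retry_pre=$4; gd_retry_post=$5
  gd_split=$6; gd_holes=$7; gd_complaints=$8

  if lt "$gd_sat_post" "$gd_sat_pre" || lt "$gd_retry_pre" "$gd_retry_post" \
     || [ "$gd_holes" -gt 0 ] || [ "$gd_complaints" -gt 0 ]; then
    echo ROLLBACK            # fleet-wide regression, coverage hole, or complaint
  elif [ "$gd_target" -eq 1 ] && [ "$gd_split" -eq 1 ]; then
    echo KEEP                # target improved and nothing else got worse
  else
    echo HOLD                # stop expanding and investigate first
  fi
}

# The 6/17 over-thinning: interference (the target) improved, but
# satisfaction fell 39 -> 30 and retry rose 17 -> 23.4
gate_decision 1 39 30 17 23.4 1 0 0
```

Fed the 6/17 numbers from section 1, the function prints `ROLLBACK`, which matches the plan's verdict that the over-thinning must be corrected.
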