# Cascades — Network Optimization Master Plan (all devices, holistic)

- **Created:** 2026-06-18 (Howard-Home / claude-main)
- **Status:** PLAN — for execution tonight (floors 1–4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round.
- **Goal:** Fix the *system*, not one device at a time. Improve quality for **every** client (~587), not just
  the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another.
- **Builds on:** `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`
  (RF mechanics + gated apply commands), `reports/2026-06-18-voice-quality-diagnostic.md`, and the live
  2026-06-18 fleet sample. All RF changes use the gated `unifi-wifi` scripts (per-zone, dry-run, rollback JSON).

---

## 1. Current state (what's actually true right now)

| Layer | State | Verdict |
|---|---|---|
| **2.4 GHz** | **OVER-THINNED.** Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32–48%) BUT **retry rose 17→23.4%, satisfaction fell 39→30** (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | **Regressed — must correct power floor** |
| **5 GHz** | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean — 0 radar). Channels biased to busy upper (149/157). **AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients** (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). | **Constrain width + spread channels + relieve hotspots** |
| **6 GHz** | 75 radios live, **~1 client.** Root cause: **CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`). Cleanest untapped capacity. | **Open it — the relief valve** |
| **QoS** | **NONE.** Voice now isolated on VLAN 30 but not prioritized — voice packets compete with data under load → jitter/breaks. | **Add — guaranteed win, now possible** |
| **pfSense/WAN/DHCP/DNS** | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | **Fine — add voice QoS shaping only** |
| **Switching / physical** | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | **Physical work — not tonight, but tracked** |

---

## 2. Root-cause model (why "some devices" are bad)

Three compounding RF causes, plus a missing QoS layer:

1. **2.4 GHz contention** — extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz
   suffers. Made *worse* by the over-thinning (weaker signal → more retransmits).
2. **5 GHz over-width + hotspots** — 80 MHz halves the usable channel count → co-channel overlap → retries;
   a few APs (103) are simply overloaded.
3. **6 GHz unused** — the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz.
4. **No voice prioritization** — even with perfect RF, voice breaks under data bursts without QoS.

**The trap we must avoid (the "whack-a-mole"):** narrowing 5 GHz to 40 MHz *without* first opening 6 GHz pushes
more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's
already too low) starves edge clients. **Sequence matters.**

---

## 3. The holistic sequence (open relief valves BEFORE constraining)

> Principle: **(A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize
> → (D) fine-tune → validate at every gate.** Each step is reversible; gate on live metrics before the next.

### PHASE 0 — Pre-flight + baseline (always)

- VPN up; `live-stats.sh cascades | head -3` (expect 77 APs).
- Baseline (compare after, same time-of-day): `live-stats.sh cascades > .claude/tmp/opt-pre.txt`;
  `radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt`.
- Pick a watch AP per floor (`watch-ap.sh <ip>`).

### PHASE 1 — QoS for voice (orthogonal, lowest risk — but INSURANCE, not the everyday fix)

Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load.

> **Reframe (measured 2026-06-18):** WAN1 fiber upload is **~522 Mbps** vs ~98 Mbps peak usage — huge
> headroom, so the WAN is **not** the day-to-day voice bottleneck (that's RF, Phases 2–4). QoS still earns its
> place as insurance for **WAN2 (coax) failover** and **rare WAN1 saturation** (you hit 680 Mbps down). Build
> it (cheap, correct), but don't expect it to fix the complaints — the RF work does. Full design:
> `docs/network/phase1-voice-qos-design.md`. **Phones confirmed marking DSCP EF** → rely on DSCP; the subnet match is the safety net.

- **UniFi (WLAN/switch):** ensure WMM/QoS is on; the AudioCodes/Poly phones tag voice DSCP — trust/honor it. On the
  USW, voice VLAN traffic should hit the high-priority queue.
- **pfSense:** add a traffic-shaper/limiter or floating QoS rule that puts `VOICE net (10.0.30.0/24)` DSCP EF
  (46) / RTP UDP into a priority queue on the WAN(s). Low risk — additive, voice-only.
- **Validate:** place test calls during a data-heavy moment; confirm no breakup. (No RF change here.)
- *Skill gap:* the `unifi-wifi` skill has no QoS verb — this is a pfSense + UniFi config task; consider a small
  `voice-qos` helper later.

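
When validating that EF marking survives end to end, it can help to know what DSCP 46 looks like on the wire. The sketch below is only arithmetic: DSCP sits in the upper six bits of the IP ToS byte, so EF (46) appears as 184 (0xb8). The function names are hypothetical and not part of the `unifi-wifi` skill; the filter string follows pcap-filter syntax and is meant as a starting point for a packet capture, not a tested capture recipe for this site.

```shell
#!/bin/sh
# DSCP occupies the upper 6 bits of the IPv4 ToS byte,
# so the byte value is the DSCP shifted left by 2 (ECN bits assumed 0).
dscp_to_tos() {
    echo $(( $1 << 2 ))
}

# Build a pcap filter string that matches IPv4 packets carrying the given DSCP.
# The 0xfc mask drops the two ECN bits before the comparison.
dscp_pcap_filter() {
    printf 'ip[1] & 0xfc == 0x%02x\n' "$(dscp_to_tos "$1")"
}

dscp_to_tos 46         # EF as a ToS byte value
dscp_pcap_filter 46    # filter string for a capture tool
```

A capture on the voice VLAN or WAN side using that filter should show the phones' RTP if the marking is honored along the path; if nothing matches, the mark is being stripped somewhere upstream.
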
### PHASE 2 — Open the relief valves (capacity + correct the regression)

**2a. Enable 6 GHz on CSCNet + steering** (creates the offload path BEFORE we narrow 5 GHz):

```
apply-wlan.sh cascades bands all --wlan CSCNet --apply   # -> [2g,5g,6g]
apply-wlan.sh cascades bsstm on --wlan CSCNet --apply    # 802.11v BSS-transition (assists up-band + roam)
```

Band-steering (`no2ghz_oui`) already ON. 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz. Validate:
client mix shifts toward 6g; no SSID-visibility loss for legacy (2.4/5 stay on).

**2b. Correct the 2.4 over-thinning — Low → MEDIUM on kept radios** (restores edge signal; keeps cells smaller
than full power). Per floor, dry-run then apply; regenerate the kept-radio list live:

```
for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \
  apply-radio.sh cascades ng power medium --zone "$z" --apply; done   # ~12–15 dBm
```

Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio
(`ng enable --ap "<name>"`). **Gate:** re-measure retry%/satisfaction same time-of-day vs `opt-pre.txt` —
expect retry back down from ~23% and satisfaction recovering.

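
To put the Low → Medium step in perspective, the dBm figures quoted in this plan (Low ~6, Medium ~12–15, full 23) can be converted to milliwatts. This is a minimal sketch of the standard conversion, mW = 10^(dBm/10); the function names are hypothetical and nothing here talks to the controller.

```shell
#!/bin/sh
# Convert a dBm value to milliwatts: mW = 10^(dBm/10).
dbm_to_mw() {
    awk -v d="$1" 'BEGIN { printf "%.1f\n", exp(d / 10 * log(10)) }'
}

# Linear power ratio when moving from one dBm setting to another.
power_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", exp((b - a) / 10 * log(10)) }'
}

for p in 6 12 15 23; do
    printf '%s dBm = %s mW\n' "$p" "$(dbm_to_mw "$p")"
done
printf 'Low (6) -> Medium (12): x%s transmit power\n' "$(power_ratio 6 12)"
```

Under those figures, Medium is roughly four to eight times the transmit power of Low while still well under the full 23 dBm setting, which is the point of 2b: restore edge signal without going back to full-size cells.
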
### PHASE 3 — Constrain + optimize 5 GHz (now that 6 GHz absorbs load)

**3a. Width 80 → 40 MHz** (doubles non-overlapping channels → spatial reuse):

```
for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \
  apply-radio.sh cascades na width 40 --zone "$z" --apply; done   # rollback: na width 80
```

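
The "doubles non-overlapping channels" claim can be checked with simple counting. The sketch below takes each contiguous block of 20 MHz channels and counts how many bonded channels of a given width fit in it; the block sizes used are the non-DFS ranges this plan settles on (UNII-1 36–48 is four 20 MHz channels, UNII-3 149–165 is five). `count_channels` is a hypothetical helper, not one of the `unifi-wifi` scripts.

```shell
#!/bin/sh
# Count non-overlapping channels of a given width (MHz) that fit into a list of
# contiguous blocks, each block given as its number of 20 MHz channels.
# A bonded channel cannot span the gap between blocks, so each block is floored separately.
count_channels() {
    width=$1; shift
    per=$(( width / 20 ))
    total=0
    for block in "$@"; do
        total=$(( total + block / per ))
    done
    echo "$total"
}

# Non-DFS only: UNII-1 (36-48) = 4 x 20 MHz, UNII-3 (149-165) = 5 x 20 MHz.
count_channels 80 4 5   # 2
count_channels 40 4 5   # 4
count_channels 20 4 5   # 9
```

With only two 80 MHz channels available in the non-DFS set, 77 APs cannot avoid heavy co-channel overlap; at 40 MHz there are four, which is what makes spatial reuse workable.
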
**3b. Channel plan — NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification).**
Use **UNII-1 (36–48) + UNII-3 (149–165) only**; do NOT use DFS channels (52–144) on this voice-critical
network. A precise radar-detection sweep (real `radar found`/`NOL` signatures, CAC/control housekeeping
excluded) found **ZERO genuine hits across all 53 DFS APs** — BUT the window was only ~21–23h (APs rebooted
~23h ago, the 6/17 outage). Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single
hit forces a 30-min channel vacate = **dropped calls** — unacceptable for voice. **Resilience > diversity.**
The lost 5 GHz channel count is covered by **6 GHz (Phase 2a) absorbing capacity** — this is WHY 6 GHz comes first.

```
SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades
SURVEY_JSON=$SURVEY channel-plan.sh cascades na   # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone
```

**Periodic DFS monitoring:** the ~1-day window isn't conclusive, so add a recurring precise `dfs-check.sh`
(fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just
confirms the choice stays right.

**3c. Relieve AP 103 specifically** (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to
a clean channel from the plan, 40 MHz. Verify Lauren `.202` retry drops after.

**Gate:** 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded.

### PHASE 4 — Fine-tune (after 1–3 settle)

- **2.4 channel plan 1/6/11** (graph-color; co-channel pairs 92→35) + **pin the 4 off-plan APs** (128/108/108U7/salon)
  to 1/6/11.
- **2.4 min-RSSI ON** for the 6 APs where it's OFF (615/608/505/517/622/salon) — *note 505/517/615/608/622 are
  Floors 5/6 → DEFER with the rest of 5/6*; do `salon` only this round.
- **Roaming for voice continuity:** confirm 802.11k/v on CSCNet (r optional — test; some phones dislike 802.11r).
  Keeps calls alive when staff walk between APs.
- **min-RSSI tuning:** only tighten where sticky-client far-AP behavior is proven; too aggressive blocks association.

### PHASE 5 — Physical (separate visit, not tonight — but it caps results)

- Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput).
- Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run.
- Investigate the p38 (1st Floor USW) 4% tx-drop once the above is done.

---

## 4. Interdependency map (read before changing anything)

- **6 GHz BEFORE 5 GHz 40 MHz** — else 5 GHz congestion just relocates. (Phase 2a before 3a.)
- **2.4 power MEDIUM not LOW** — Low already over-thinned; going lower starves edge clients. (Phase 2b.)
- **AP-lock needs AP capacity** — Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion.
- **QoS is independent** — do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.)
- **Disables + power-down compound** — never do both aggressively in the same area; we already saw the satisfaction hit.
- **min-RSSI + power interact** — raising min-RSSI while lowering power can orphan clients; tune one lever at a time.
- **Mesh-protected APs** (`2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108`) — never disable; power changes only with watch.

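
The one hard ordering rule in the map above (Phase 2a before 3a) is easy to check mechanically before a change window. The following is a small sketch, assuming the night's run order is written as a space-separated list of the step labels used in this plan; `check_order` is a hypothetical helper and does not exist in the `unifi-wifi` skill.

```shell
#!/bin/sh
# check_order "<space-separated steps>" BEFORE AFTER
# Exit 0 when AFTER is not scheduled at all, or when BEFORE is scheduled
# and appears earlier than AFTER. Exit non-zero otherwise.
check_order() {
    steps=$1; before=$2; after=$3
    i=0; pos_before=0; pos_after=0
    for step in $steps; do
        i=$(( i + 1 ))
        if [ "$step" = "$before" ]; then pos_before=$i; fi
        if [ "$step" = "$after" ]; then pos_after=$i; fi
    done
    if [ "$pos_after" -eq 0 ]; then return 0; fi
    if [ "$pos_before" -eq 0 ]; then return 1; fi
    [ "$pos_before" -lt "$pos_after" ]
}

if check_order "1 2a 2b 3a 3b 3c" 2a 3a; then
    echo "ok: 6 GHz opens before 5 GHz narrows"
fi
if ! check_order "1 3a 2a" 2a 3a; then
    echo "violation: 3a scheduled before 2a"
fi
```

The check treats a night that stops after Phase 2 as valid, since narrowing 5 GHz is simply not scheduled. It assumes 2a has not already been applied on an earlier night; if it has, the rule is satisfied regardless of what the list says.
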
## 5. Data-driven decision framework — improve quality for ALL devices (measure → decide → adjust)

**Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions.** Each
change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is *all*
devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones.

### 5.1 How each change affects the OTHER (non-voice) devices

Almost every change targets the **shared RF environment**, so it helps everyone — voice is just the most
sensitive canary:

| Change | Effect on non-voice devices |
|---|---|
| QoS (voice VLAN priority) | **Negligible** — voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load |
| Enable 6 GHz on CSCNet | **Positive** — 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left |
| 2.4 Low→Medium power | **Positive for ALL 2.4 devices** — undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) |
| 5 GHz 80→40 MHz | **Net positive (reliability), small peak-speed cost** — density win; lone heavy transfer sees lower peak |
| AP 103 relief | **Positive for all 12 clients on 103**, not just Lauren |
| 2.4 1/6/11 channel plan | **Positive for all 2.4 devices** (less co-channel) |
| Phone-side (Vertical) | Phones only — **no effect on others** |

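
The "~3 Mbps of 522" figure in the QoS row can be sanity-checked with a back-of-envelope estimate. The sketch below is built on assumptions that are not stated in this plan: G.711 with 20 ms packetization, wired Ethernet framing, and the worst case of all 31 phones on a call at once. Other codecs or WiFi framing would shift the number, so treat it as an order-of-magnitude check only. Function names are hypothetical.

```shell
#!/bin/sh
# Estimate aggregate one-way voice bandwidth in Mbps for N simultaneous calls.
# ASSUMPTIONS (not from the plan): G.711, 20 ms packetization,
# 160 B payload + 12 B RTP + 8 B UDP + 20 B IPv4 + 18 B Ethernet per packet.
voice_mbps() {
    awk -v n="$1" 'BEGIN {
        bytes = 160 + 12 + 8 + 20 + 18   # bytes on the wire per packet
        pps   = 50                        # 1000 ms / 20 ms
        printf "%.1f\n", n * bytes * pps * 8 / 1000000
    }'
}

# Share of a link, in percent.
link_share_pct() {
    awk -v v="$1" -v l="$2" 'BEGIN { printf "%.2f\n", v / l * 100 }'
}

v=$(voice_mbps 31)
echo "voice: ${v} Mbps one-way"
echo "share of 522 Mbps upload: $(link_share_pct "$v" 522)%"
```

Under those assumptions the whole voice fleet stays near 3 Mbps, well under one percent of the measured WAN1 upload, which is consistent with treating a voice priority queue as negligible for everyone else.
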
### 5.2 Trade-offs to WATCH in the data (don't help voice, hurt others)

1. **Non-DFS 5 GHz + 5 GHz-only devices** — the DirecTV fleet + older laptops **can't use 6 GHz**, so they
   stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; **watch
   non-DFS 5 GHz cu_total** — if it climbs, that's the signal to rebalance.
2. **min-RSSI** affects every client on the AP — too aggressive orphans weak IoT/resident devices. Tune gently.
3. **40 MHz** trades single-user peak for fleet reliability — right in density, but it is a trade.

### 5.3 Fleet-wide metrics we pull at every gate (the data we decide on)

Same time-of-day comparison (load varies hourly). Capture before each change and ~15 min after:

- `live-stats.sh cascades` → per-band **avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts**
- `radio-usage.sh cascades <band>` → per-AP outliers (saturated/high-retry APs)
- `/stat/sta` band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count
- Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole)

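
The per-band counts in the third bullet (band split, clients with retry > 15%, clients with satisfaction < 70) reduce to a short aggregation once the client data is flattened. The sketch below assumes a hypothetical intermediate format of one client per line, `<band> <retry_pct> <satisfaction>`; how to extract those three fields from `/stat/sta` is deliberately left out because the field names are not given in this plan.

```shell
#!/bin/sh
# Summarise per-client samples read from stdin.
# HYPOTHETICAL input format, one client per line: <band> <retry_pct> <satisfaction>
# Prints per band: client count, clients with retry > 15, clients with satisfaction < 70.
summarise_clients() {
    awk '
        NF >= 3 {
            n[$1]++
            if ($2 > 15) hi[$1]++
            if ($3 < 70) low[$1]++
        }
        END {
            for (b in n)
                printf "%s clients=%d retry_gt15=%d sat_lt70=%d\n", b, n[b], hi[b], low[b]
        }
    ' | sort
}

# Made-up sample rows, for illustration only.
printf '%s\n' \
    '2g 23.4 30' \
    '2g 9.0 88' \
    '5g 16.1 72' \
    '6g 2.0 97' | summarise_clients
```

Running the same summary on the pre-change and post-change captures gives a like-for-like view of whether the band split moved the intended way and whether the high-retry population shrank.
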
### 5.4 The GATE decision rule (per change, per zone)

- **KEEP + proceed** only if: the target metric improved **AND** fleet-wide **satisfaction did not fall**,
  **retry% did not rise**, band split moved the intended way, and **no AP lost its clients** (no hole).
- **HOLD** (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more.
- **ROLLBACK that step** if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint.
- Do **one lever per zone at a time** so cause/effect is attributable (the over-thinning happened because power-down
  + disables were stacked).

### 5.5 Rollback

Every `apply-radio`/`apply-wlan` writes a rollback JSON to `.claude/tmp/`; `device-control poe-cycle` for a hung
AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals.

## 6. Out of scope tonight (explicit)

- **Floors 5 & 6 (MemCare)** — all RF + the MemCare voice phones (`.217/.218/.219/.220`) DEFERRED per Howard.
- Physical cabling / offline switches (Phase 5 — separate visit).
- The 6 straggler phones — Howard re-keying separately; they'll benefit from the RF work regardless.

## 7. Open decisions for Howard

1. ~~**5 GHz channel plan:** clean-DFS vs non-DFS-only~~ — **RESOLVED 2026-06-18: NON-DFS ONLY** (UNII-1 36–48 + UNII-3 149–165). DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.)
2. **QoS depth:** UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive).
3. **802.11r** on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first.
4. Tonight's stopping point: Phases 1–2 alone are a legitimate, lower-risk night; 3–4 can be a second night.
5. ~~**Dedicated voice SSID?**~~ — **RESOLVED 2026-06-18: NO — voice stays on the shared CSCNet PPSK.** UniFi
   3-SSID cap (sound RF hygiene — each SSID = beacon airtime overhead at 77 APs). The only retirement candidate,
   CSC ENT, still has **131 active clients** (staff PCs, printers, DirecTV fleet) → not retirable soon. And it's
   not needed: **QoS is VLAN/DSCP-based (SSID-independent)**, band preference is best done **phone-side** (Vertical),
   and roaming/power-save are phone+AP settings — all work on the shared SSID. A dedicated voice SSID would only
   add voice-specific WiFi *policy* (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only
   if/when CSC ENT's 131 clients migrate off it.