claudetools/clients/cascades-tucson/docs/network/network-optimization-master-plan.md
Howard Enos c2e5f4faeb cascades: recover 4 docs dropped by the history-rewrite/repo-split
The 2026-06-18 repo restructure (history rewrite + project->submodule split)
dropped these 4 Cascades files from the new clone. Copied byte-identical from
the pre-cutover claudetools.old clone (md5-verified):
- docs/network/network-optimization-master-plan.md
- docs/network/phase1-voice-qos-design.md
- reports/2026-06-18-voice-quality-diagnostic.md
- session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 20:21:27 -07:00

Cascades — Network Optimization Master Plan (all devices, holistic)

  • Created: 2026-06-18 (Howard-Home / claude-main)
  • Status: PLAN — for execution tonight (floors 1-4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round.
  • Goal: Fix the system, not one device at a time. Improve quality for every client (~587), not just the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another.
  • Builds on: reports/2026-06-16-unifi-full-audit.md, reports/2026-06-16-2.4ghz-remediation-runbook.md (RF mechanics + gated apply commands), reports/2026-06-18-voice-quality-diagnostic.md, and the live 2026-06-18 fleet sample. All RF changes use the gated unifi-wifi scripts (per-zone, dry-run, rollback JSON).

1. Current state (what's actually true right now)

| Layer | State | Verdict |
|---|---|---|
| 2.4 GHz | OVER-THINNED. Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32-48%) BUT retry rose 17→23.4%, satisfaction fell 39→30 (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | Regressed — must correct power floor |
| 5 GHz | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean — 0 radar). Channels biased to busy upper (149/157). AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). | Constrain width + spread channels + relieve hotspots |
| 6 GHz | 75 radios live, ~1 client. Root cause: CSCNet not broadcasting 6 GHz (wlan_bands=[2g,5g]). Cleanest untapped capacity. | Open it — the relief valve |
| QoS | NONE. Voice now isolated on VLAN 30 but not prioritized — voice packets compete with data under load → jitter/breaks. | Add — guaranteed win, now possible |
| pfSense/WAN/DHCP/DNS | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | Fine — add voice QoS shaping only |
| Switching / physical | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | Physical work — not tonight, but tracked |

2. Root-cause model (why "some devices" are bad)

Three compounding RF causes, plus a missing QoS layer:

  1. 2.4 GHz contention — extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz suffers. Made worse by the over-thinning (weaker signal → more retransmits).
  2. 5 GHz over-width + hotspots — 80 MHz halves the usable channel count → co-channel overlap → retries; a few APs (103) are simply overloaded.
  3. 6 GHz unused — the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz.
  4. No voice prioritization — even with perfect RF, voice breaks under data bursts without QoS.

The trap we must avoid (the "whack-a-mole"): narrowing 5 GHz to 40 MHz without first opening 6 GHz pushes more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's already too low) starves edge clients. Sequence matters.


3. The holistic sequence (open relief valves BEFORE constraining)

Principle: (A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize → (D) fine-tune → validate at every gate. Each step is reversible; gate on live metrics before the next.

PHASE 0 — Pre-flight + baseline (always)

  • VPN up; live-stats.sh cascades | head -3 (expect 77 APs).
  • Baseline (compare after, same time-of-day): live-stats.sh cascades > .claude/tmp/opt-pre.txt; radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt.
  • Pick a watch AP per floor (watch-ap.sh <ip>).

PHASE 1 — QoS for voice (orthogonal, lowest risk — but INSURANCE, not the everyday fix)

Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load.

Reframe (measured 2026-06-18): WAN1 fiber upload is ~522 Mbps vs ~98 Mbps peak usage — huge headroom, so the WAN is not the day-to-day voice bottleneck (that's RF, Phases 2-4). QoS still earns its place as insurance for WAN2 (coax) failover and rare WAN1 saturation (you hit 680 Mbps down). Build it (cheap, correct), but don't expect it to fix the complaints — the RF work does. Full design: docs/network/phase1-voice-qos-design.md. Phones confirmed marking DSCP EF → rely on DSCP; subnet match is the net.
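The headroom claim can be checked with a line of arithmetic. The sketch below only restates the figures quoted in this plan (98 Mbps peak against 522 Mbps of WAN1 upload, and the ~3 Mbps voice estimate from section 5.1); the helper name `util_pct` is made up for illustration.

```shell
#!/bin/sh
# Utilization as a percentage of capacity, one decimal place.
# Arguments are Mbps figures: util_pct <used> <capacity>
util_pct() {
  awk -v used="$1" -v cap="$2" 'BEGIN { printf "%.1f\n", used / cap * 100 }'
}

echo "peak upload: $(util_pct 98 522)% of WAN1 upload"
echo "voice load:  $(util_pct 3 522)% of WAN1 upload"
```

Under 20% at peak is why WAN QoS is framed here as insurance rather than the everyday fix.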

  • UniFi (WLAN/switch): ensure WMM/QoS on; the AudioCodes/Poly tag voice DSCP — trust/honor it. On the USW, voice VLAN traffic should hit the high-priority queue.
  • pfSense: add a traffic-shaper/limiter or floating QoS rule that puts VOICE net (10.0.30.0/24) DSCP EF (46) / RTP UDP into a priority queue on the WAN(s). Low risk — additive, voice-only.
  • Validate: place test calls during a data-heavy moment; confirm no breakup. (No RF change here.)
  • Skill gap: the unifi-wifi skill has no QoS verb — this is a pfSense + UniFi config task; consider a small voice-qos helper later.
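When validating the marking during test calls, it helps to know what DSCP EF looks like on the wire. DSCP occupies the upper six bits of the IP TOS / Traffic Class byte, so the byte value is the DSCP shifted left by two. A minimal sketch (the function name `dscp_to_tos` is hypothetical, not part of any existing skill):

```shell
#!/bin/sh
# Convert a DSCP value (0-63) to the TOS / Traffic Class byte in hex.
# The two low bits of that byte are ECN and are left at zero here.
dscp_to_tos() {
  printf '0x%02x\n' $(( $1 << 2 ))
}

dscp_to_tos 46   # EF, the marking the phones are confirmed to use
```

The resulting byte can be used in a capture filter, for example `ip[1] & 0xfc == 0xb8` in tcpdump for IPv4, to confirm voice packets still carry EF after crossing the switch and pfSense. Where to run the capture is a judgment call this plan does not specify.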

PHASE 2 — Open the relief valves (capacity + correct the regression)

2a. Enable 6 GHz on CSCNet + steering (creates the offload path BEFORE we narrow 5 GHz):

apply-wlan.sh cascades bands all --wlan CSCNet --apply     # -> [2g,5g,6g]
apply-wlan.sh cascades bsstm on --wlan CSCNet --apply       # 802.11v BSS-transition (assists up-band + roam)

Band-steering (no2ghz_oui) already ON. 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz. Validate: client mix shifts toward 6g; no SSID-visibility loss for legacy (2.4/5 stay on).

2b. Correct the 2.4 over-thinning — Low → MEDIUM on kept radios (restores edge signal; keeps cells smaller than full power). Per floor, dry-run then apply; regenerate the kept-radio list live:

for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \
  apply-radio.sh cascades ng power medium --zone "$z" --apply; done   # ~12-15 dBm

Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio (ng enable --ap "<name>"). Gate: re-measure retry%/satisfaction same time-of-day vs opt-pre.txt — expect retry back down from ~23% and satisfaction recovering.
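One way to make the "preview first, then apply" habit explicit is a small wrapper that prints each per-zone command and only executes it when asked. This is a sketch: the `run_or_print` helper and the `APPLY` switch are assumptions of this example and are separate from whatever dry-run behaviour apply-radio.sh has on its own; the command arguments are copied from the loop above.

```shell
#!/bin/sh
# Print the command; execute it only when APPLY=1 is set in the environment.
APPLY=${APPLY:-0}

run_or_print() {
  echo "+ $*"
  if [ "$APPLY" = "1" ]; then
    "$@"
  fi
}

for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do
  run_or_print apply-radio.sh cascades ng power medium --zone "$z" --apply
done
```

Run it once as-is to review the four commands, then again with `APPLY=1` to execute them. The printed form drops the quotes around the zone name, so it is for review rather than copy-paste.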

PHASE 3 — Constrain + optimize 5 GHz (now that 6 GHz absorbs load)

3a. Width 80 → 40 MHz (doubles non-overlapping channels → spatial reuse):

for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \
  apply-radio.sh cascades na width 40 --zone "$z" --apply; done   # rollback: na width 80

3b. Channel plan — NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification). Use UNII-1 (36-48) + UNII-3 (149-165) only; do NOT use DFS channels (52-144) on this voice-critical network. A precise radar-detection sweep (real radar found/NOL signatures, CAC/control housekeeping excluded) found ZERO genuine hits across all 53 DFS APs — BUT the window was only ~21-23h (APs rebooted ~23h ago, the 6/17 outage). Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single hit forces a 30-min channel vacate = dropped calls — unacceptable for voice. Resilience > diversity. The lost 5 GHz channel count is covered by 6 GHz (Phase 2a) absorbing capacity — this is WHY 6 GHz comes first.

SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades
SURVEY_JSON=$SURVEY channel-plan.sh cascades na    # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone
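The width and non-DFS decisions interact, and the arithmetic is worth seeing. Under the US channel plan, UNII-1 holds four 20 MHz channels (36, 40, 44, 48) and UNII-3 holds five (149 through 165). The sketch below counts how many non-overlapping channels that leaves at each width; the function name is hypothetical.

```shell
#!/bin/sh
# Count non-overlapping non-DFS 5 GHz channels at a given width (US plan).
# Bonded channels need a full aligned group of 20 MHz channels, so a
# leftover such as 165 is usable at 20 MHz only.
non_dfs_channels() {
  width=$1                  # 20, 40 or 80 (MHz)
  per=$(( width / 20 ))     # 20 MHz channels consumed per bonded channel
  echo $(( 4 / per + 5 / per ))
}

for w in 20 40 80; do
  echo "$w MHz: $(non_dfs_channels "$w") channels"
done
```

At 80 MHz the non-DFS set offers only two channels for 77 APs; at 40 MHz it offers four. That is the doubling Phase 3a refers to, and it is still a small pool, which is why 6 GHz has to absorb load first.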

Periodic DFS monitoring: the ~1-day window isn't conclusive, so add a recurring precise dfs-check.sh (fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just confirms the choice stays right.

3c. Relieve AP 103 specifically (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to a clean channel from the plan, 40 MHz. Verify Lauren .202 retry drops after.

Gate: 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded.

PHASE 4 — Fine-tune (after 1-3 settle)

  • 2.4 channel plan 1/6/11 (graph-color; co-channel pairs 92→35) + pin the 4 off-plan APs (128/108/108U7/salon) to 1/6/11.
  • 2.4 min-RSSI ON for the 6 APs where it's OFF (615/608/505/517/622/salon) — note 505/517/615/608/622 are Floors 5/6 → DEFER with the rest of 5/6; do salon only this round.
  • Roaming for voice continuity: confirm 802.11k/v on CSCNet (r optional — test; some phones dislike 802.11r). Keeps calls alive when staff walk between APs.
  • min-RSSI tuning: only tighten where sticky-client far-AP behavior is proven; too aggressive blocks association.

PHASE 5 — Physical (separate visit, not tonight — but it caps results)

  • Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput).
  • Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run.
  • Re-check the p38 (1st Floor USW) 4% tx-drop after the above.

4. Interdependency map (read before changing anything)

  • 6 GHz BEFORE 5 GHz 40 MHz — else 5 GHz congestion just relocates. (Phase 2a before 3a.)
  • 2.4 power MEDIUM not LOW — Low already over-thinned; going lower starves edge clients. (Phase 2b.)
  • AP-lock needs AP capacity — Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion.
  • QoS is independent — do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.)
  • Disables + power-down compound — never do both aggressively in the same area; we already saw the satisfaction hit.
  • min-RSSI + power interact — raising min-RSSI while lowering power can orphan clients; tune one lever at a time.
  • Mesh-protected APs (2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108) — never disable; power changes only with watch.
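The one hard ordering rule in this map (2a before 3a) is easy to check mechanically before a change window. A minimal sketch, assuming the night's plan is written down as a space-separated list of step labels; `comes_before` is a hypothetical helper, not an existing script.

```shell
#!/bin/sh
# Succeed if step A is listed before step B in the planned order.
# Usage: comes_before A B step1 step2 ...
comes_before() {
  first=$1; second=$2; shift 2
  seen_first=0
  for step in "$@"; do
    if [ "$step" = "$first" ]; then seen_first=1; fi
    if [ "$step" = "$second" ]; then
      [ "$seen_first" -eq 1 ]
      return
    fi
  done
  return 0   # second step is not planned, so nothing is violated
}

plan="1 2a 2b 3a 3b 3c 4"
# $plan is intentionally unquoted so it splits into one argument per step.
if comes_before 2a 3a $plan; then
  echo "order ok: 6 GHz opens before 5 GHz narrows"
else
  echo "order violation: 3a is scheduled before 2a"
fi
```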

5. Data-driven decision framework — improve quality for ALL devices (measure → decide → adjust)

Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions. Each change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is all devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones.

5.1 How each change affects the OTHER (non-voice) devices

Almost every change targets the shared RF environment, so it helps everyone — voice is just the most sensitive canary:

| Change | Effect on non-voice devices |
|---|---|
| QoS (voice VLAN priority) | Negligible — voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load |
| Enable 6 GHz on CSCNet | Positive — 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left |
| 2.4 Low→Medium power | Positive for ALL 2.4 devices — undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) |
| 5 GHz 80→40 MHz | Net positive (reliability), small peak-speed cost — density win; lone heavy transfer sees lower peak |
| AP 103 relief | Positive for all 16 clients on 103, not just Lauren |
| 2.4 1/6/11 channel plan | Positive for all 2.4 devices (less co-channel) |
| Phone-side (Vertical) | Phones only — no effect on others |

5.2 Trade-offs to WATCH in the data (don't help voice, hurt others)

  1. Non-DFS 5 GHz + 5 GHz-only devices — the DirecTV fleet + older laptops can't use 6 GHz, so they stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; watch non-DFS 5 GHz cu_total — if it climbs, that's the signal to rebalance.
  2. min-RSSI affects every client on the AP — too aggressive orphans weak IoT/resident devices. Tune gently.
  3. 40 MHz trades single-user peak for fleet reliability — right in density, but it is a trade.

5.3 Fleet-wide metrics we pull at every gate (the data we decide on)

Same time-of-day comparison (load varies hourly). Capture before each change and ~15 min after:

  • live-stats.sh cascades → per-band avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts
  • radio-usage.sh cascades <band> → per-AP outliers (saturated/high-retry APs)
  • /stat/sta band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count
  • Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole)

5.4 The GATE decision rule (per change, per zone)

  • KEEP + proceed only if: the target metric improved AND fleet-wide satisfaction did not fall, retry% did not rise, band split moved the intended way, and no AP lost its clients (no hole).
  • HOLD (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more.
  • ROLLBACK that step if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint.
  • Do one lever per zone at a time so cause/effect is attributable (the over-thinning happened because power-down + disables were stacked).
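The rule can be written down as a small decision function so the same logic is applied at every gate. This is one possible reading, not a definitive encoding: it assumes any fleet-wide regression (retry up, satisfaction down, coverage hole, complaint) means ROLLBACK, that KEEP needs both the target metric and the band split to have moved the right way, and that everything else is HOLD. The function name and argument order are hypothetical.

```shell
#!/bin/sh
# Gate decision for one change in one zone.
# Args: target_improved(0/1) retry_before retry_after sat_before sat_after
#       band_split_ok(0/1) coverage_hole(0/1) complaint(0/1)
gate_decision() {
  awk -v ti="$1" -v rb="$2" -v ra="$3" -v sb="$4" -v sa="$5" \
      -v bs="$6" -v hole="$7" -v comp="$8" 'BEGIN {
    if (ra > rb || sa < sb || hole == 1 || comp == 1) { print "ROLLBACK"; exit }
    if (ti == 1 && bs == 1)                           { print "KEEP"; exit }
    print "HOLD"
  }'
}

# The 6/17 over-thinning numbers from section 1: interference improved,
# but retry went 17 -> 23.4 and satisfaction 39 -> 30.
gate_decision 1 17 23.4 39 30 1 0 0
```

Fed the 6/17 figures, the function returns ROLLBACK, which matches the correction planned in Phase 2b.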

5.5 Rollback

Every apply-radio/apply-wlan writes a rollback JSON to .claude/tmp/; device-control poe-cycle for a hung AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals.

6. Out of scope tonight (explicit)

  • Floors 5 & 6 (MemCare) — all RF + the MemCare voice phones (.217/.218/.219/.220) DEFERRED per Howard.
  • Physical cabling / offline switches (Phase 5 — separate visit).
  • The 6 straggler phones — Howard re-keying separately; they'll benefit from the RF work regardless.

7. Open decisions for Howard

  1. 5 GHz channel plan: clean-DFS vs non-DFS-only. RESOLVED 2026-06-18: NON-DFS ONLY (UNII-1 36-48 + UNII-3 149-165). DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.)
  2. QoS depth: UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive).
  3. 802.11r on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first.
  4. Tonight's stopping point: Phases 1-2 alone are a legitimate, lower-risk night; 3-4 can be a second night.
  5. Dedicated voice SSID? RESOLVED 2026-06-18: NO — voice stays on the shared CSCNet PPSK. UniFi 3-SSID cap (sound RF hygiene — each SSID = beacon airtime overhead at 77 APs). The only retirement candidate, CSC ENT, still has 131 active clients (staff PCs, printers, DirecTV fleet) → not retireable soon. And it's not needed: QoS is VLAN/DSCP-based (SSID-independent), band preference is best done phone-side (Vertical), and roaming/power-save are phone+AP settings — all work on the shared SSID. A dedicated voice SSID would only add voice-specific WiFi policy (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only if/when CSC ENT's 131 clients migrate off it.