diff --git a/clients/cascades-tucson/docs/network/network-optimization-master-plan.md b/clients/cascades-tucson/docs/network/network-optimization-master-plan.md new file mode 100644 index 00000000..3232e0bb --- /dev/null +++ b/clients/cascades-tucson/docs/network/network-optimization-master-plan.md @@ -0,0 +1,197 @@ +# Cascades — Network Optimization Master Plan (all devices, holistic) + +- **Created:** 2026-06-18 (Howard-Home / claude-main) +- **Status:** PLAN — for execution tonight (floors 1–4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round. +- **Goal:** Fix the *system*, not one device at a time. Improve quality for **every** client (~587), not just + the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another. +- **Builds on:** `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md` + (RF mechanics + gated apply commands), `reports/2026-06-18-voice-quality-diagnostic.md`, and the live + 2026-06-18 fleet sample. All RF changes use the gated `unifi-wifi` scripts (per-zone, dry-run, rollback JSON). + +--- + +## 1. Current state (what's actually true right now) + +| Layer | State | Verdict | +|---|---|---| +| **2.4 GHz** | **OVER-THINNED.** Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32–48%) BUT **retry rose 17→23.4%, satisfaction fell 39→30** (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | **Regressed — must correct power floor** | +| **5 GHz** | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean — 0 radar). Channels biased to busy upper (149/157). **AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients** (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). 
| **Constrain width + spread channels + relieve hotspots** | +| **6 GHz** | 75 radios live, **~1 client.** Root cause: **CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`). Cleanest untapped capacity. | **Open it — the relief valve** | +| **QoS** | **NONE.** Voice now isolated on VLAN 30 but not prioritized — voice packets compete with data under load → jitter/breaks. | **Add — guaranteed win, now possible** | +| **pfSense/WAN/DHCP/DNS** | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | **Fine — add voice QoS shaping only** | +| **Switching / physical** | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | **Physical work — not tonight, but tracked** | + +--- + +## 2. Root-cause model (why "some devices" are bad) + +Three compounding RF causes, plus a missing QoS layer: +1. **2.4 GHz contention** — extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz + suffers. Made *worse* by the over-thinning (weaker signal → more retransmits). +2. **5 GHz over-width + hotspots** — 80 MHz halves the usable channel count → co-channel overlap → retries; + a few APs (103) are simply overloaded. +3. **6 GHz unused** — the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz. +4. **No voice prioritization** — even with perfect RF, voice breaks under data bursts without QoS. + +**The trap we must avoid (the "whack-a-mole"):** narrowing 5 GHz to 40 MHz *without* first opening 6 GHz pushes +more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's +already too low) starves edge clients. **Sequence matters.** + +--- + +## 3. 
The holistic sequence (open relief valves BEFORE constraining) + +> Principle: **(A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize +> → (D) fine-tune → validate at every gate.** Each step is reversible; gate on live metrics before the next. + +### PHASE 0 — Pre-flight + baseline (always) +- VPN up; `live-stats.sh cascades | head -3` (expect 77 APs). +- Baseline (compare after, same time-of-day): `live-stats.sh cascades > .claude/tmp/opt-pre.txt`; + `radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt`. +- Pick a watch AP per floor (`watch-ap.sh `). + +### PHASE 1 — QoS for voice (orthogonal, lowest risk — but INSURANCE, not the everyday fix) +Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load. +> **Reframe (measured 2026-06-18):** WAN1 fiber upload is **~522 Mbps** vs ~98 Mbps peak usage — huge +> headroom, so the WAN is **not** the day-to-day voice bottleneck (that's RF, Phases 2–4). QoS still earns its +> place as insurance for **WAN2 (coax) failover** and **rare WAN1 saturation** (you hit 680 Mbps down). Build +> it (cheap, correct), but don't expect it to fix the complaints — the RF work does. Full design: +> `docs/network/phase1-voice-qos-design.md`. **Phones confirmed marking DSCP EF** → rely on DSCP; subnet match is the net. +- **UniFi (WLAN/switch):** ensure WMM/QoS on; the AudioCodes/Poly tag voice DSCP — trust/honor it. On the + USW, voice VLAN traffic should hit the high-priority queue. +- **pfSense:** add a traffic-shaper/limiter or floating QoS rule that puts `VOICE net (10.0.30.0/24)` DSCP EF + (46) / RTP UDP into a priority queue on the WAN(s). Low risk — additive, voice-only. +- **Validate:** place test calls during a data-heavy moment; confirm no breakup. (No RF change here.) +- *Skill gap:* the `unifi-wifi` skill has no QoS verb — this is a pfSense + UniFi config task; consider a small + `voice-qos` helper later. 
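When checking a capture or writing the pfSense match, it helps to see where DSCP EF (46) sits in the IP header. The following is a minimal Python sketch, assuming the standard layout in which DSCP occupies the upper six bits of the TOS byte and ECN the lower two. The function names are illustrative only and are not part of any tool referenced in this plan.

```python
# DSCP occupies the upper 6 bits of the IPv4 TOS byte; the low 2 bits are ECN.
DSCP_EF = 46  # Expedited Forwarding, the value the phones are confirmed to mark


def dscp_to_tos(dscp: int, ecn: int = 0) -> int:
    """Return the TOS byte that carries the given DSCP (and optional ECN bits)."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("dscp must be 0-63 and ecn 0-3")
    return (dscp << 2) | ecn


def tos_to_dscp(tos: int) -> int:
    """Extract the DSCP value from a TOS byte seen in a packet capture."""
    return (tos & 0xFF) >> 2


print(hex(dscp_to_tos(DSCP_EF)))  # TOS byte to look for in a capture
print(tos_to_dscp(0xB8))          # DSCP carried by a TOS byte of 0xb8
```

A voice RTP packet marked EF with the ECN bits clear should therefore show a TOS byte of `0xb8` in a capture taken on VLAN 30.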
+ +### PHASE 2 — Open the relief valves (capacity + correct the regression) +**2a. Enable 6 GHz on CSCNet + steering** (creates the offload path BEFORE we narrow 5 GHz): +``` +apply-wlan.sh cascades bands all --wlan CSCNet --apply # -> [2g,5g,6g] +apply-wlan.sh cascades bsstm on --wlan CSCNet --apply # 802.11v BSS-transition (assists up-band + roam) +``` +Band-steering (`no2ghz_oui`) already ON. 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz. Validate: +client mix shifts toward 6g; no SSID-visibility loss for legacy (2.4/5 stay on). + +**2b. Correct the 2.4 over-thinning — Low → MEDIUM on kept radios** (restores edge signal; keeps cells smaller +than full power). Per floor, dry-run then apply; regenerate the kept-radio list live: +``` +for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \ + apply-radio.sh cascades ng power medium --zone "$z" --apply; done # ~12–15 dBm +``` +Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio +(`ng enable --ap ""`). **Gate:** re-measure retry%/satisfaction same time-of-day vs `opt-pre.txt` — +expect retry back down from ~23% and satisfaction recovering. + +### PHASE 3 — Constrain + optimize 5 GHz (now that 6 GHz absorbs load) +**3a. Width 80 → 40 MHz** (doubles non-overlapping channels → spatial reuse): +``` +for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \ + apply-radio.sh cascades na width 40 --zone "$z" --apply; done # rollback: na width 80 +``` +**3b. Channel plan — NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification).** +Use **UNII-1 (36–48) + UNII-3 (149–165) only**; do NOT use DFS channels (52–144) on this voice-critical +network. A precise radar-detection sweep (real `radar found`/`NOL` signatures, CAC/control housekeeping +excluded) found **ZERO genuine hits across all 53 DFS APs** — BUT the window was only ~21–23h (APs rebooted +~23h ago, the 6/17 outage). 
Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single +hit forces a 30-min channel vacate = **dropped calls** — unacceptable for voice. **Resilience > diversity.** +The lost 5 GHz channel count is covered by **6 GHz (Phase 2a) absorbing capacity** — this is WHY 6 GHz comes first. +``` +SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades +SURVEY_JSON=$SURVEY channel-plan.sh cascades na # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone +``` +**Periodic DFS monitoring:** the ~1-day window isn't conclusive, so add a recurring precise `dfs-check.sh` +(fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just +confirms the choice stays right. +**3c. Relieve AP 103 specifically** (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to +a clean channel from the plan, 40 MHz. Verify Lauren `.202` retry drops after. +**Gate:** 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded. + +### PHASE 4 — Fine-tune (after 1–3 settle) +- **2.4 channel plan 1/6/11** (graph-color; co-channel pairs 92→35) + **pin the 4 off-plan APs** (128/108/108U7/salon) + to 1/6/11. +- **2.4 min-RSSI ON** for the 6 APs where it's OFF (615/608/505/517/622/salon) — *note 505/517/615/608/622 are + Floors 5/6 → DEFER with the rest of 5/6*; do `salon` only this round. +- **Roaming for voice continuity:** confirm 802.11k/v on CSCNet (r optional — test; some phones dislike 802.11r). + Keeps calls alive when staff walk between APs. +- **min-RSSI tuning:** only tighten where sticky-client far-AP behavior is proven; too aggressive blocks association. + +### PHASE 5 — Physical (separate visit, not tonight — but it caps results) +- Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput). +- Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run. 
+- p38 (1st Floor USW) 4% tx-drop after the above. + +--- + +## 4. Interdependency map (read before changing anything) +- **6 GHz BEFORE 5 GHz 40 MHz** — else 5 GHz congestion just relocates. (Phase 2a before 3a.) +- **2.4 power MEDIUM not LOW** — Low already over-thinned; going lower starves edge clients. (Phase 2b.) +- **AP-lock needs AP capacity** — Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion. +- **QoS is independent** — do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.) +- **Disables + power-down compound** — never do both aggressively in the same area; we already saw the satisfaction hit. +- **min-RSSI + power interact** — raising min-RSSI while lowering power can orphan clients; tune one lever at a time. +- **Mesh-protected APs** (`2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108`) — never disable; power changes only with watch. + +## 5. Data-driven decision framework — improve quality for ALL devices (measure → decide → adjust) + +**Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions.** Each +change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is *all* +devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones. 
+ +### 5.1 How each change affects the OTHER (non-voice) devices +Almost every change targets the **shared RF environment**, so it helps everyone — voice is just the most +sensitive canary: +| Change | Effect on non-voice devices | +|---|---| +| QoS (voice VLAN priority) | **Negligible** — voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load | +| Enable 6 GHz on CSCNet | **Positive** — 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left | +| 2.4 Low→Medium power | **Positive for ALL 2.4 devices** — undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) | +| 5 GHz 80→40 MHz | **Net positive (reliability), small peak-speed cost** — density win; lone heavy transfer sees lower peak | +| AP 103 relief | **Positive for all 12 clients on 103**, not just Lauren | +| 2.4 1/6/11 channel plan | **Positive for all 2.4 devices** (less co-channel) | +| Phone-side (Vertical) | Phones only — **no effect on others** | + +### 5.2 Trade-offs to WATCH in the data (don't help voice, hurt others) +1. **Non-DFS 5 GHz + 5 GHz-only devices** — the DirecTV fleet + older laptops **can't use 6 GHz**, so they + stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; **watch + non-DFS 5 GHz cu_total** — if it climbs, that's the signal to rebalance. +2. **min-RSSI** affects every client on the AP — too aggressive orphans weak IoT/resident devices. Tune gently. +3. **40 MHz** trades single-user peak for fleet reliability — right in density, but it is a trade. + +### 5.3 Fleet-wide metrics we pull at every gate (the data we decide on) +Same time-of-day comparison (load varies hourly). 
Capture before each change and ~15 min after: +- `live-stats.sh cascades` → per-band **avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts** +- `radio-usage.sh cascades ` → per-AP outliers (saturated/high-retry APs) +- `/stat/sta` band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count +- Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole) + +### 5.4 The GATE decision rule (per change, per zone) +- **KEEP + proceed** only if: the target metric improved **AND** fleet-wide **satisfaction did not fall**, + **retry% did not rise**, band split moved the intended way, and **no AP lost its clients** (no hole). +- **HOLD** (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more. +- **ROLLBACK that step** if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint. +- Do **one lever per zone at a time** so cause/effect is attributable (the over-thinning happened because power-down + + disables were stacked). + +### 5.5 Rollback +Every `apply-radio`/`apply-wlan` writes a rollback JSON to `.claude/tmp/`; `device-control poe-cycle` for a hung +AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals. + +## 6. Out of scope tonight (explicit) +- **Floors 5 & 6 (MemCare)** — all RF + the MemCare voice phones (`.217/.218/.219/.220`) DEFERRED per Howard. +- Physical cabling / offline switches (Phase 5 — separate visit). +- The 6 straggler phones — Howard re-keying separately; they'll benefit from the RF work regardless. + +## 7. Open decisions for Howard +1. ~~**5 GHz channel plan:** clean-DFS vs non-DFS-only~~ — **RESOLVED 2026-06-18: NON-DFS ONLY** (UNII-1 36–48 + UNII-3 149–165). DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.) +2. 
**QoS depth:** UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive). +3. **802.11r** on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first. +4. Tonight's stopping point: Phases 1–2 alone are a legitimate, lower-risk night; 3–4 can be a second night. +5. ~~**Dedicated voice SSID?**~~ — **RESOLVED 2026-06-18: NO — voice stays on the shared CSCNet PPSK.** UniFi + 3-SSID cap (sound RF hygiene — each SSID = beacon airtime overhead at 77 APs). The only retirement candidate, + CSC ENT, still has **131 active clients** (staff PCs, printers, DirecTV fleet) → not retireable soon. And it's + not needed: **QoS is VLAN/DSCP-based (SSID-independent)**, band preference is best done **phone-side** (Vertical), + and roaming/power-save are phone+AP settings — all work on the shared SSID. A dedicated voice SSID would only + add voice-specific WiFi *policy* (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only + if/when CSC ENT's 131 clients migrate off it. diff --git a/clients/cascades-tucson/docs/network/phase1-voice-qos-design.md b/clients/cascades-tucson/docs/network/phase1-voice-qos-design.md new file mode 100644 index 00000000..26473df4 --- /dev/null +++ b/clients/cascades-tucson/docs/network/phase1-voice-qos-design.md @@ -0,0 +1,111 @@ +# Cascades — Phase 1: Voice QoS Design (VLAN 30) + +- **Created:** 2026-06-18 (Howard-Home / claude-main). Part of `network-optimization-master-plan.md` Phase 1. +- **Status:** DESIGN — for review, then build (Howard drives pfSense GUI). Nothing applied. +- **Risk:** LOW — additive, voice-only prioritization; rollback = disable the shaper. Main caution: size the + shaper bandwidth correctly (a wrong value can throttle throughput) → test before/after. + +## Objective +Guarantee voice quality under load by prioritizing VLAN 30 traffic end-to-end. 
**The phones register to a +CLOUD PBX (Vertical) over the internet**, so the bottleneck that breaks calls is **WAN upload saturation** +(someone uploading / cloud backup / OneDrive sync fills the uplink → voice RTP queues → jitter, dropped +audio). QoS keeps voice ahead of bulk data on the WAN. + +## The big advantage of the VLAN move +**All voice is now one subnet: `10.0.30.0/24`.** So QoS can match *all* voice by **source subnet** — no +need to guess SIP/RTP port ranges per PBX. This is the cleanest, most robust match criterion and it only +became possible because we isolated voice onto VLAN 30. + +## Current state (verified 2026-06-18) +- **No traffic shaper / limiter configured** on pfSense (clean build). +- **Dual-WAN:** WAN1 `igc0` (Cox Fiber, primary, 1G link), WAN2 `igc3` (Cox Coax, 2.5G link); `WAN_Group` + failover (`downlosslatency`). Shaping must be applied on **both** WAN interfaces. +- pfSense Plus 25.07 (ALTQ shaper + dummynet limiters available). +- **Phones mark DSCP EF — CONFIRMED (Howard 2026-06-18).** So we can rely on DSCP for WMM (Layer 2) + switch + QoS (Layer 3); the `10.0.30.0/24` subnet match (Layer 1) is the safety net. **No pfSense set-DSCP rule needed.** + +## Measured WAN bandwidth (2026-06-18) — REFRAMES QoS priority +- **WAN1 (fiber, primary): upload ~522 Mbps** (Cloudflare single-stream from pfSense). RRD 3-day peaks: + **680 Mbps down / 98 Mbps up** (actual usage). +- **WAN2 (coax): not measurable remotely** (source-route bind to `72.211.21.217` failed; needs a WAN2-routed + host or the Cox bill). Coax is typically asymmetric ~20–50 Mbps up — **size its shaper conservatively**. +- **Implication:** 30 calls ≈ ~3 Mbps. WAN1 upload (~522 Mbps) vs peak usage (98 Mbps) = **huge headroom → + the WAN is NOT the everyday voice bottleneck.** Everyday dropped-calls = **RF** (Phases 2–4 of the master + plan). 
**QoS here is INSURANCE, not the day-to-day fix** — it earns its keep in two cases: (1) **WAN2 + failover** (small coax upload + a big upload → real congestion), (2) **rare WAN1 saturation** (backup / + large upload; you do hit 680 Mbps down). Build it (cheap, correct), but set expectations: RF is the substance. + +## Four layers (priority order; Layer 1 = insurance, see reframe above) + +### Layer 1 — pfSense WAN shaper (PRIMARY for WAN congestion; insurance per the reframe above) +**Type: HFSC** (hierarchical, lets us guarantee voice a floor while letting it borrow idle bandwidth). +Per WAN interface, three queues: +| Queue | Role | HFSC settings (starting point) | +|---|---|---| +| `qVoice` | voice (VLAN 30 / DSCP EF) | **priority 7**, realtime ~30% of WAN-up, link-share 30%, NOT default | +| `qACK` | TCP ACKs (keeps downloads snappy) | priority 6, ~10% | +| `qDefault` | everything else | **default**, link-share ~60% | + +**Match rule (floating, WAN, direction out):** source `10.0.30.0/24` → `qVoice`. (Optionally also match +DSCP EF, which the phones are confirmed to mark; see Layer 4.) One floating rule per WAN, or interface = WAN_Group. + +**Download side:** RTP from the PBX *to* the phones is shaped on the **LAN-side** queues. The wizard builds +both directions; if hand-building, mirror a `qVoice` on the internal interfaces too. Upload is the more +critical direction for cloud-PBX voice, but do both. + +**Build path (GUI — Howard drives):** +- Easiest: **Firewall → Traffic Shaper → Wizard → "Multiple Lan/Wan"** — set #WAN=2, #LAN as needed, + enter each WAN's bandwidth (below), on the VoIP page choose **"prioritize by address" = `10.0.30.0/24`** + with a guaranteed %; the wizard generates HFSC queues + the float rules. Then tune. +- Or manual: Firewall → Traffic Shaper → By Interface → add HFSC on WAN1 + WAN2, create the 3 queues, + then Firewall → Rules → Floating → match `10.0.30.0/24` out → Ackqueue/Queue = qACK/qVoice. 
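To turn the queue table into concrete numbers for the GUI, the arithmetic can be sketched in Python. This is only an illustration: `size_queues` is a hypothetical helper, the shares are the starting-point percentages from the table above, and the default `shape_fraction` of 0.92 is an assumption picked from inside the 90-95% range this design recommends for the parent shaper.

```python
# Queue shares from the table above (fractions of the shaped parent bandwidth).
QUEUE_SHARES = {"qVoice": 0.30, "qACK": 0.10, "qDefault": 0.60}


def size_queues(measured_up_mbps: float, shape_fraction: float = 0.92) -> dict:
    """Return the parent shaper rate and per-queue rates in Mbps."""
    if measured_up_mbps <= 0:
        raise ValueError("measured_up_mbps must be positive")
    if not 0 < shape_fraction <= 1:
        raise ValueError("shape_fraction must be in (0, 1]")
    parent = measured_up_mbps * shape_fraction
    sizes = {"parent": round(parent, 1)}
    for name, share in QUEUE_SHARES.items():
        sizes[name] = round(parent * share, 1)
    return sizes


print(size_queues(522))  # WAN1: measured upload
print(size_queues(35))   # WAN2: placeholder until the coax upload is measured
```

For WAN1 this puts the parent near 480 Mbps, the low end of the 480-500 Mbps target. The WAN2 figures are only as good as the 35 Mbps assumption and should be recomputed once the coax upload is measured.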
+ +> **Sizing inputs:** WAN1 upload **~522 Mbps (measured 2026-06-18)** → shape `qVoice`'s parent to ~480–500 +> Mbps. **WAN2 (coax) upload still UNKNOWN** (remote source-route test failed) — get from the Cox bill or a +> speedtest from a host routed via WAN2; size conservatively (assume ~35 Mbps up until measured). Shaping to +> ~90–95% of actual upload keeps the queue in pfSense (where we control priority), not at the ISP. WAN2 is the +> one that actually constrains voice (on failover), so its number matters most. + +### Layer 2 — UniFi WMM (the WiFi phones — Poly) +Over the air, **WMM** maps DSCP → WiFi access categories; voice (DSCP EF/46) → **WMM Voice AC** (gets TXOP +priority over data). WMM is ON by default on UniFi — **verify it's enabled on CSCNet** and that the U7 APs +honor DSCP→WMM. This is what protects the 22 Poly phones over the air during WiFi congestion. (Ties into the +RF work — a clean 5/6 GHz + WMM = good wireless voice.) + +### Layer 3 — UniFi switch QoS (the wired AudioCodes) +UniFi switches honor 802.1p/DSCP and queue tagged voice to a high-priority egress queue — mostly automatic +once the phones mark DSCP. LAN links are gig and rarely congested, so this is the least critical layer, but +confirm the USW isn't stripping DSCP and that voice VLAN 30 frames get the priority queue. + +### Layer 4 — DSCP marking (make the above reliable) +- **Verify the phones mark voice:** AudioCodes + Poly typically tag RTP **EF (46)** and signaling **CS3 (24)** + by default, often set via the PBX/provisioning. Confirm with Vertical (Richard) or capture a packet. +- **If they DON'T mark (or inconsistently):** add a pfSense floating rule that **SETS DSCP EF** on + `10.0.30.0/24` traffic (Advanced → "Match/Set DSCP"). Then Layer 1/2/3 can all match on EF too. +- **Match-by-subnet (Layer 1) works regardless of DSCP** — it's the safety net. DSCP makes WMM (Layer 2) + and switch QoS (Layer 3) automatic. + +## Implementation order +1. 
Get the WAN2 (coax) upload number (blocker for WAN2 Layer 1 sizing; WAN1 is already measured at ~522 Mbps). +2. DSCP EF marking is confirmed (Howard 2026-06-18), so skip the pfSense set-DSCP rule unless a capture shows inconsistent marking. +3. Build Layer 1 (pfSense HFSC + float rule) — dry-run mindset: set it, then validate. +4. Verify Layer 2 (WMM on CSCNet) + Layer 3 (switch honoring DSCP). +5. Validate (below). Tune `qVoice` % if needed. + +## Validation (prove it works) +- **Baseline:** from a LAN host, saturate the WAN upload (big upload / `iperf3 -u` / speedtest) WHILE on a + call from a voice phone — note the breakup *without* QoS. +- **After:** repeat the same saturation; call stays clean. Check Firewall → Traffic Shaper → Queues: `qVoice` + carrying voice with ~0 drops while `qDefault` absorbs the saturation + drops. +- Confirm both WANs (test on primary; fail to WAN2 and re-test). + +## Rollback +Firewall → Traffic Shaper → disable/remove the shaper; delete the floating rule. Zero residual effect +(QoS only orders packets under congestion; removing it reverts to FIFO). The set-DSCP rule (if added) can stay +or go independently. + +## Notes / interplay with the rest of the plan +- QoS is **independent of the RF work** — it helps wired + WiFi voice immediately and can be built tonight + regardless of the 2.4/5/6 GHz changes. +- It does NOT fix RF problems (a phone on a 50%-retry 2.4 GHz radio still suffers) — QoS handles *congestion/ + contention for bandwidth*, RF tuning handles *the air*. Both are needed; they're complementary. 
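As a sanity check on the "30 calls ≈ ~3 Mbps" figure used in the bandwidth section, the per-call load can be worked out in a few lines of Python. The codec is not stated in this design, so G.711 with 20 ms packetization (160-byte payload, 50 packets per second) is an assumption here, and `rtp_call_kbps` is an illustrative helper, not an existing tool.

```python
def rtp_call_kbps(payload_bytes: int = 160, packets_per_sec: int = 50,
                  header_bytes: int = 40) -> float:
    """IP-layer bandwidth of one RTP stream in kbps, one direction.

    header_bytes = 20 (IPv4) + 8 (UDP) + 12 (RTP).
    """
    return (payload_bytes + header_bytes) * 8 * packets_per_sec / 1000


per_call = rtp_call_kbps()         # assumed G.711 at 20 ms packetization
fleet_mbps = 30 * per_call / 1000  # 30 simultaneous calls, one direction
print(per_call, fleet_mbps)
```

Under those assumptions each call is 80 kbps at the IP layer and 30 calls come to 2.4 Mbps per direction. Layer-2 framing adds a little on top, which is consistent with the roughly 3 Mbps used for sizing.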
diff --git a/clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md b/clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md new file mode 100644 index 00000000..ee49f550 --- /dev/null +++ b/clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md @@ -0,0 +1,71 @@ +# Cascades — Voice Quality Diagnostic (post VLAN 30 cutover) + +- **Date:** 2026-06-18 (Howard-Home / claude-main) +- **Trigger:** All phones migrated to isolated VOICE VLAN 30 to improve call quality; users report + **dropped calls, breaks in voice, reception issues.** This is the RF/quality assessment. +- **Data source:** live UniFi controller `/stat/sta` + USW-16-PoE `port_table`, 2026-06-18. + +## Cutover status — COMPLETE +31 devices on VOICE (`10.0.30.0/24`): 8 AudioCodes (`.224-.231`), 22 Poly (`.202-.223`), Vertical +desktop (`.201`). AudioCodes required a full power-off/on (external-powered; not PoE -> UniFi +power-cycle is a no-op) before they re-DHCP'd. + +## Headline finding +**The VLAN move gives separation + sets up QoS, but it does NOT by itself fix call quality.** The +dropped calls / voice breaks are an **RF problem on the WiFi (Poly) phones.** The wired AudioCodes +are clean. Quality fixes are RF + QoS, below. + +## Wired AudioCodes (8) — HEALTHY +All USW-16-PoE ports 1-8: up, 100M full-duplex, **rx_err=0 tx_err=0 rx_drop=0.** No network-layer +problem. 100M is fine for voice. With VLAN isolation + QoS these desk phones should be solid. + +## WiFi Poly phones — RF problems (retry% = the call-quality killer) +Thresholds: retry >10% = audible breaks; RSSI <-67 marginal, <-75 bad; voice wants 5 GHz. 
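The thresholds above can be applied mechanically to each phone sample. The following is a minimal Python sketch of that classification, assuming one row per phone with band, RSSI, and retry percentage; `flag_phone` is a hypothetical helper and the flag wording is illustrative.

```python
def flag_phone(band_ghz: float, rssi_dbm: int, retry_pct: float) -> list[str]:
    """Return the threshold violations for one phone sample."""
    flags = []
    if retry_pct > 10:
        flags.append("retry>10% (audible breaks)")
    if rssi_dbm < -75:
        flags.append("RSSI bad")
    elif rssi_dbm < -67:
        flags.append("RSSI marginal")
    if band_ghz < 5:
        flags.append("on 2.4 GHz")
    return flags


print(flag_phone(2.4, -56, 50))  # values reported for 10.0.30.202
print(flag_phone(5, -82, 7))     # values reported for 10.0.30.220
```

Run against the reported values, `.202` is flagged for retry and band despite a strong signal, while `.220` is flagged on RSSI alone, which matches how the tables that follow separate stuck-on-2.4 phones from coverage gaps.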
+ +**SEVERE — fix first:** +| Phone (IP) | User/Loc | AP | Band | RSSI | Retry | Issue | +|---|---|---|---|---|---|---| +| 10.0.30.202 | Lauren / Accounting | CC Bridge | **2.4** | -56 | **50%** | stuck on 2.4 GHz, half packets retransmit | +| 10.0.30.218 | Shelby / MemCare Dir | MemCare Nurse Stn | **2.4** | -56 | **53%** | stuck on 2.4 GHz | +| 10.0.30.220 | Christine / rm 515 | 517 | 5 | **-82** | 7% | coverage gap (signal near-unusable) | +| 10.0.30.219 | Karen Rossini / rm 515 | 517 | 5 | -75 | 16% | weak + high retry | + +**MODERATE:** +| 10.0.30.212 | rm 204 | 204 | 5 | -74 | 13% | weak + retry | +| 10.0.30.213 | Medtech rm 206 | 206 U7 Pro | 5 | -66 | 13% | 5 GHz congestion | +| 10.0.30.214 / .215 | rm 210 | 210 | 5 | -72 | 7-9% | weak | +| 10.0.30.206 | Dining Room | Dining Room | 5 | -70 | 9% | borderline | + +**Healthy (reference):** .207/.209/.221/.222/.223 etc. — 5 GHz, RSSI -41 to -60, retry <3%. + +### Three root causes +1. **2.4 GHz with ~50% retry** (Lauren .202, Shelby .218) — the single worst issue; matches the + documented Cascades 2.4 GHz saturation. **Must force these to 5 GHz.** +2. **Coverage gaps** — rooms 515 (-82/-75), 210/204 (-72/-74): too far from the serving AP; weak + signal drops calls when RF varies or people move. +3. **5 GHz congestion** — several at 13-16% retry on 5 GHz (80 MHz width + channel overlap, per the + 2026-06-16 audit). + +## Stragglers — 6 Poly phones NOT on VOICE +Five on VLAN 20 (`10.0.20.64/.65/.66/.67/.195`) + one on `192.168.1.126`. `10.0.20.66` (Dining +Room) is at **35% retry.** Missed in cutover or still on the old PPSK key -> migrate to the voice +PPSK so all phones are isolated + benefit from voice QoS. + +## Recommended fixes (prioritized; NONE applied — Cascades requires explicit per-change go) +1. 
**QoS for the voice VLAN (NEW capability the move enables) — highest ROI, lowest risk.** Mark + VLAN 30 voice traffic DSCP EF / priority on pfSense + UniFi so voice gets priority under load -> + reduces jitter/breaks network-wide. +2. **Force voice phones off 2.4 GHz** — on the CSCNet voice PPSK / the APs serving .202 & .218, + disable 2.4 GHz association for voice (or band-steer to 5/6 GHz). Fixes Lauren + Shelby (the two + worst) immediately. + - **DONE 2026-06-18: Lauren `.202` locked to AP 103** (off the CC Bridge wireless-mesh AP -> wired AP). **INTERDEPENDENCY:** AP 103's 5 GHz is saturated (ch149, 75% airtime, ~25,900 retries, 12 clients) -> tonight's 5 GHz plan MUST relieve AP 103 (channel off 149 / 80->40 MHz / load-balance) or she trades a mesh problem for a congestion problem. + - Shelby `.218` is floor 5/6 (MemCare) -> **out of scope tonight** per Howard. +3. **Coverage** — rooms 515, 210, 204: check AP placement/power; consider a closer AP or raising the + nearest AP's power; min-RSSI to push phones off far APs. (Ties into the staged coverage-thin / + 2.4 remediation runbooks.) +4. **Migrate the 6 straggler phones** to the voice PPSK (VLAN 30). +5. **5 GHz width/channel** — apply the staged audit recommendation (40 MHz width, non-DFS plan) to + cut co-channel retry. + +## Next +Discuss + pick changes. QoS (#1) + 2.4 GHz force-off (#2) are the fastest wins for the complaints. 
diff --git a/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md b/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md new file mode 100644 index 00000000..e1de089a --- /dev/null +++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md @@ -0,0 +1,135 @@ +# Cascades — voice-quality diagnostic + holistic RF/QoS optimization master plan + +## User +- **User:** Howard Enos (howard) +- **Machine:** Howard-Home +- **Role:** tech + +> NOTE: written in the OLD clone after the 2026-06-19 claudetools restructure coord message arrived. +> NOT synced from here. Recover into the fresh re-clone (see Pending tasks). + +## Session Summary + +Continuation of the VLAN 30 voice cutover. First, completed the AudioCodes migration: the 8 wired +AudioCodes would not pick up VLAN 30 addresses via port re-VLAN + UniFi PoE power-cycle (PoE is OFF on +those ports — they run on external power bricks, so a UniFi power-cycle is a no-op; a UI port disable/enable +didn't reset their uptime either). Root cause confirmed: they held their old main-LAN DHCP leases and never +re-DHCP'd. Howard fully powered them off/on, after which all 8 pulled VOICE leases (10.0.30.224-231). Final +state: 31 devices on VOICE (8 AudioCodes + 22 Poly + Vertical desktop). + +Second, diagnosed voice quality (dropped calls / voice breaks). Wired AudioCodes: all 8 ports clean (100M +full-duplex, zero errors). The problem is RF on the WiFi Poly phones: 14 flagged, worst = Lauren/.202 (2.4 +GHz, 50% retry, on the CC Bridge wireless-MESH AP) and Shelby/.218 (2.4 GHz, 53% retry, MemCare). Coverage +gaps in rooms 515/210/204 (RSSI -72 to -82). AP 103 5 GHz saturated (75% airtime, ~25,900 retries). Also +found 6 Poly phones NOT migrated (still on VLAN 20/Default) — fleet is 28 Poly, not 22; verified they are +distinct active MACs, not ghosts. 
Howard locked Lauren's phone to AP 103 (off the mesh AP). + +Third, built a holistic, all-device network optimization master plan grounded in the existing 2026-06-16 +audit + 2.4 GHz runbook + the over-thinning re-check. Key current-state fact: the network is OVER-THINNED on +2.4 GHz (overnight 6/17: 24 radios disabled + 42 at Low/6 dBm -> interference down but retry 17->23%, +satisfaction 39->30). The plan's central principle: open relief valves (6 GHz + correct 2.4 power) BEFORE +constraining (5 GHz 40 MHz), to avoid relocating congestion. Sequenced phases: (1) QoS, (2a) enable 6 GHz on +CSCNet + 2b correct 2.4 Low->Medium, (3) 5 GHz 80->40 MHz + non-DFS channel plan + relieve AP 103, (4) +fine-tune, (5) physical. With an interdependency map and per-phase gates. + +Fourth, verified DFS rigorously (Howard's concern re: Davis-Monthan AFB + TUS ~10 mi). The skill's dfs-check +flagged 3 APs, but on inspection all were benign (CAC timers + DFS-control toggles on a non-DFS channel, not +radar). A precise radar-detection-only sweep found ZERO genuine hits across all 53 DFS APs — but only over a +~21-23h window (APs rebooted in the 6/17 outage). DECISION: go NON-DFS only (UNII-1 36-48 + UNII-3 149-165) — +a radar vacate = dropped calls; resilience > diversity; 6 GHz covers the capacity gap. + +Fifth, designed Phase 1 QoS (pfSense + UniFi). Measured WAN: WAN1 fiber upload ~522 Mbps (vs ~98 Mbps peak +usage) -> the WAN is NOT the everyday voice bottleneck, so QoS is INSURANCE (WAN2 coax failover + rare +saturation), not the everyday fix — RF is the substance. Match voice by source subnet 10.0.30.0/24 (the VLAN +move's payoff). Phones confirmed to support DSCP EF. Determined a dedicated voice SSID is NOT viable (UniFi +3-SSID cap; CSC ENT still has 131 clients, not retireable) and NOT needed (QoS is VLAN/DSCP-based, +SSID-independent; band preference is phone-side). 
Added an all-devices impact + data-driven decision +framework: every change gated on fleet-wide metrics (measure -> decide -> adjust), with the trade-offs to +watch (non-DFS + 5GHz-only DirecTV fleet; min-RSSI orphaning; 40 MHz peak). + +## Key Decisions + +- **AudioCodes need a full power-cycle (off/on), not a UniFi PoE cycle** — they're externally powered (PoE off + on the ports); a UI port bounce doesn't reset them. +- **5 GHz: NON-DFS ONLY.** DFS sweep clean but only ~1-day window near a military base/airport; a radar vacate + drops calls. Resilience over channel diversity; lean on 6 GHz for capacity. +- **QoS reframed to INSURANCE, not the everyday fix** — WAN1 fiber has ~522 Mbps up vs 98 Mbps peak use; the + everyday dropped-calls cause is RF. QoS matters on WAN2 (coax) failover + rare WAN1 saturation. +- **No dedicated voice SSID** — 3-SSID cap is sound RF hygiene; CSC ENT (the only retirement candidate) still + has 131 clients; and voice doesn't need a dedicated SSID (QoS is SSID-independent, band-pref is phone-side). +- **Open relief valves before constraining** — 6 GHz + 2.4 Low->Medium BEFORE 5 GHz 40 MHz, or congestion just + relocates. +- **2.4 power Low->Medium, not lower** — Low already over-thinned (retry up, satisfaction down). +- **Data-driven gates** (Howard) — base every choice on measured fleet-wide metrics; one lever per zone; + keep/hold/rollback per the gate rule; validation measures ALL devices, not just voice. +- **Phones are 5 GHz (not 6E)** — 6 GHz helps voice indirectly by clearing 5 GHz of resident devices. + +## Problems Encountered + +- **AudioCodes wouldn't move to VLAN 30** — root-caused to held DHCP leases + PoE-off ports (power-cycle no-op); + resolved by full power-off/on. +- **UniFi controller PUT 403s** — CSRF token extraction flaky; fixed by reading `x-updated-csrf-token` (with a + TOKEN-cookie JWT fallback). 
+- **pfSense SSH rate-limiting + controller throttling** after many rapid queries — switched between controller + API and pfSense SSH as needed; one fleet pull hung in the background. +- **Temp-file/sync friction (RECURRED 3x)** — controller-scratch files (.sta.json, .fleet325.dev) written + CWD-relative got swept into commits by `git add -A` and blocked rebases (stray locked curl.exe held them). + Fixed: killed the procs, untracked, broadened .gitignore (.fleet*, .ap[0-9]*, .vq[0-9]*, .q[0-9]*). Real fix: + write API scratch OUTSIDE the repo (used mktemp -d for the DFS sweep). +- **Cloudflare __down / WAN2-bound speedtests returned 0.0** — only WAN1 upload (522 Mbps) measured cleanly; + WAN2 (coax) upload still unknown (needs a WAN2-routed host or Cox bill). + +## Configuration Changes (all in clients/cascades-tucson/, committed up through 2b2d094 BEFORE the restructure) + +- **Created** `docs/network/network-optimization-master-plan.md` — holistic all-device plan (sequencing, + interdependency map, data-driven decision framework, DFS non-DFS decision, SSID decision). +- **Created** `docs/network/phase1-voice-qos-design.md` — pfSense HFSC + UniFi WMM/switch QoS design. +- **Created** `reports/2026-06-18-voice-quality-diagnostic.md` — per-phone RF findings + fixes. +- **Updated** `reports/2026-06-16-voice-quality-diagnostic.md`? (no) — voice-quality report Lauren->103 note. +- **No live network changes applied** (Cascades rule: explicit per-change go). UniFi port bounces were + temporary (restored). DFS/WAN tests were read-only/bounded. + +## Credentials & Secrets +- No new credentials. Used existing: `infrastructure/uos-server-network-api-rw` (controller), + `clients/cascades-tucson/unifi-ap-ssh` (AP SSH for DFS sweep), `clients/cascades-tucson/pfsense-firewall`, + `clients/cascades-tucson/wifi-voice-ppsk` (key `V0!c38863171`). + +## Infrastructure & Servers +- VOICE VLAN 30 `10.0.30.0/24`: 8 AudioCodes `.224-.231`, 22 Poly `.202-.223`, desktop `.201`. 
+- WAN1 fiber igc0 (522 Mbps up measured; RRD peaks 680 down/98 up). WAN2 coax igc3 (72.211.21.217, upload + unmeasured). pfSense `192.168.0.1` Plus 25.07, no existing shaper. +- UniFi UOS `172.16.3.29:11443` site `va6iba3v`. USW-16-PoE mac `d8:b3:70:21:94:5f` dev_id `685f39078e65331c46ef7e90`. +- SSIDs: CSCNet (427 clients, PPSK, 2g+5g), CSC ENT (131 clients, legacy, 2g+5g), Guest (13, 2g+5g+6g). +- DFS: 53 APs on DFS, 0 genuine radar over ~21-23h. + +## Pending / Incomplete Tasks + +- **RE-CLONE claudetools** (coord message 2026-06-19): old clone incompatible after history rewrite. Steps below. +- **Verify** this session's Cascades docs (master plan, QoS design, voice-quality + diagnostic reports, voice + inventory, logging plan) survived the rewrite into the new repo; if missing, recover from this .old working tree. +- **Recover this session log** into the new clone (it's uncommitted here). +- WAN2 (coax) upload number — measure from a WAN2-routed host / Cox bill (sizes the failover shaper). +- 6 straggler Poly phones (10.0.20.64/65/66/67/195, 192.168.1.126) — re-key to voice PPSK. +- Floors 5/6 (MemCare) RF + phones — deferred. +- Execute the optimization plan (start Phase 2b 2.4 Low->Medium with baseline capture) — pending Howard's go. +- Hand Vertical the phone-side config list (band 5GHz lock, DSCP-on, k/v roaming, U-APSD, firmware). + +## Re-clone steps (Windows / C:\claudetools) +``` +# from C:\ +mv claudetools claudetools.old # or rename in Explorer +git clone https://git.azcomputerguru.com/azcomputerguru/claudetools.git claudetools +cp claudetools.old/.claude/identity.json claudetools/.claude/ +cd claudetools && git submodule update --init --recursive +# then: diff clients/cascades-tucson against claudetools.old; cp any missing files (esp. this session's docs) +# recover this session log from claudetools.old/clients/cascades-tucson/session-logs/2026-06/ +# verify, then delete claudetools.old +``` +See RECLONE.md in the new repo. 
Pre-split backup bundle: Jupiter share Backups/Gitea-Storage. + +## Reference Information +- Last pushed commit (old history): `2b2d094` (2026-06-18 19:16). Restructure force-push: ~2026-06-19 02:41 UTC. +- Master plan: `clients/cascades-tucson/docs/network/network-optimization-master-plan.md` +- QoS design: `clients/cascades-tucson/docs/network/phase1-voice-qos-design.md` +- Voice-quality diagnostic: `clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md` +- Existing RF audit + 2.4 runbook: `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`
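
The scratch-file fix noted under Problems Encountered (write controller API scratch outside the repo) could be sketched roughly as below. This is a minimal illustration, not the commands actually run in the session: only `mktemp -d` comes from the log, while the file name `sta.json` and its placeholder contents are hypothetical stand-ins for a real controller response.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create the scratch directory under the system temp location, not the CWD,
# so `git add -A` inside the repo cannot sweep these files into a commit.
scratch="$(mktemp -d)"

# Delete the scratch directory when the shell exits, even after an error.
trap 'rm -rf "$scratch"' EXIT

# Hypothetical placeholder for a controller API response (assumption).
printf '{"placeholder": true}\n' > "$scratch/sta.json"

echo "scratch dir: $scratch"
```

Because the directory sits outside the working tree, these files need no `.gitignore` pattern, and the `trap` cleans them up when the shell exits.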