cascades: recover 4 docs dropped by the history-rewrite/repo-split

The 2026-06-18 repo restructure (history rewrite + project->submodule split)
dropped these 4 Cascades files from the new clone. Copied byte-identical from
the pre-cutover claudetools.old clone (md5-verified):
- docs/network/network-optimization-master-plan.md
- docs/network/phase1-voice-qos-design.md
- reports/2026-06-18-voice-quality-diagnostic.md
- session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 20:21:27 -07:00
parent b66b80a95b
commit c2e5f4faeb
4 changed files with 514 additions and 0 deletions

@@ -0,0 +1,197 @@
# Cascades — Network Optimization Master Plan (all devices, holistic)
- **Created:** 2026-06-18 (Howard-Home / claude-main)
- **Status:** PLAN — for execution tonight (floors 1–4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round.
- **Goal:** Fix the *system*, not one device at a time. Improve quality for **every** client (~587), not just
the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another.
- **Builds on:** `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`
(RF mechanics + gated apply commands), `reports/2026-06-18-voice-quality-diagnostic.md`, and the live
2026-06-18 fleet sample. All RF changes use the gated `unifi-wifi` scripts (per-zone, dry-run, rollback JSON).
---
## 1. Current state (what's actually true right now)
| Layer | State | Verdict |
|---|---|---|
| **2.4 GHz** | **OVER-THINNED.** Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32–48%) BUT **retry rose 17→23.4%, satisfaction fell 39→30** (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | **Regressed — must correct power floor** |
| **5 GHz** | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean — 0 radar). Channels biased to busy upper (149/157). **AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients** (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). | **Constrain width + spread channels + relieve hotspots** |
| **6 GHz** | 75 radios live, **~1 client.** Root cause: **CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`). Cleanest untapped capacity. | **Open it — the relief valve** |
| **QoS** | **NONE.** Voice now isolated on VLAN 30 but not prioritized — voice packets compete with data under load → jitter/breaks. | **Add — guaranteed win, now possible** |
| **pfSense/WAN/DHCP/DNS** | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | **Fine — add voice QoS shaping only** |
| **Switching / physical** | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | **Physical work — not tonight, but tracked** |
---
## 2. Root-cause model (why "some devices" are bad)
Three compounding RF causes, plus a missing QoS layer:
1. **2.4 GHz contention** — extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz
suffers. Made *worse* by the over-thinning (weaker signal → more retransmits).
2. **5 GHz over-width + hotspots** — 80 MHz halves the usable channel count → co-channel overlap → retries;
a few APs (103) are simply overloaded.
3. **6 GHz unused** — the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz.
4. **No voice prioritization** — even with perfect RF, voice breaks under data bursts without QoS.
**The trap we must avoid (the "whack-a-mole"):** narrowing 5 GHz to 40 MHz *without* first opening 6 GHz pushes
more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's
already too low) starves edge clients. **Sequence matters.**
---
## 3. The holistic sequence (open relief valves BEFORE constraining)
> Principle: **(A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize
> → (D) fine-tune → validate at every gate.** Each step is reversible; gate on live metrics before the next.
### PHASE 0 — Pre-flight + baseline (always)
- VPN up; `live-stats.sh cascades | head -3` (expect 77 APs).
- Baseline (compare after, same time-of-day): `live-stats.sh cascades > .claude/tmp/opt-pre.txt`;
`radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt`.
- Pick a watch AP per floor (`watch-ap.sh <ip>`).
### PHASE 1 — QoS for voice (orthogonal, lowest risk — but INSURANCE, not the everyday fix)
Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load.
> **Reframe (measured 2026-06-18):** WAN1 fiber upload is **~522 Mbps** vs ~98 Mbps peak usage — huge
> headroom, so the WAN is **not** the day-to-day voice bottleneck (that's RF, Phases 2–4). QoS still earns its
> place as insurance for **WAN2 (coax) failover** and **rare WAN1 saturation** (you hit 680 Mbps down). Build
> it (cheap, correct), but don't expect it to fix the complaints — the RF work does. Full design:
> `docs/network/phase1-voice-qos-design.md`. **Phones confirmed marking DSCP EF** → rely on DSCP; subnet match is the net.
- **UniFi (WLAN/switch):** ensure WMM/QoS on; the AudioCodes/Poly tag voice DSCP — trust/honor it. On the
USW, voice VLAN traffic should hit the high-priority queue.
- **pfSense:** add a traffic-shaper/limiter or floating QoS rule that puts `VOICE net (10.0.30.0/24)` DSCP EF
(46) / RTP UDP into a priority queue on the WAN(s). Low risk — additive, voice-only.
- **Validate:** place test calls during a data-heavy moment; confirm no breakup. (No RF change here.)
- *Skill gap:* the `unifi-wifi` skill has no QoS verb — this is a pfSense + UniFi config task; consider a small
`voice-qos` helper later.
### PHASE 2 — Open the relief valves (capacity + correct the regression)
**2a. Enable 6 GHz on CSCNet + steering** (creates the offload path BEFORE we narrow 5 GHz):
```
apply-wlan.sh cascades bands all --wlan CSCNet --apply # -> [2g,5g,6g]
apply-wlan.sh cascades bsstm on --wlan CSCNet --apply # 802.11v BSS-transition (assists up-band + roam)
```
Band-steering (`no2ghz_oui`) already ON. 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz. Validate:
client mix shifts toward 6g; no SSID-visibility loss for legacy (2.4/5 stay on).
**2b. Correct the 2.4 over-thinning — Low → MEDIUM on kept radios** (restores edge signal; keeps cells smaller
than full power). Per floor, dry-run then apply; regenerate the kept-radio list live:
```
for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \
apply-radio.sh cascades ng power medium --zone "$z" --apply; done # ~12–15 dBm
```
Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio
(`ng enable --ap "<name>"`). **Gate:** re-measure retry%/satisfaction same time-of-day vs `opt-pre.txt`;
expect retry back down from ~23% and satisfaction recovering.
### PHASE 3 — Constrain + optimize 5 GHz (now that 6 GHz absorbs load)
**3a. Width 80 → 40 MHz** (doubles non-overlapping channels → spatial reuse):
```
for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \
apply-radio.sh cascades na width 40 --zone "$z" --apply; done # rollback: na width 80
```
**3b. Channel plan — NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification).**
Use **UNII-1 (36–48) + UNII-3 (149–165) only**; do NOT use DFS channels (52–144) on this voice-critical
network. A precise radar-detection sweep (real `radar found`/`NOL` signatures, CAC/control housekeeping
excluded) found **ZERO genuine hits across all 53 DFS APs** — BUT the window was only ~21–23h (APs rebooted
~23h ago, the 6/17 outage). Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single
hit forces a 30-min channel vacate = **dropped calls** — unacceptable for voice. **Resilience > diversity.**
The lost 5 GHz channel count is covered by **6 GHz (Phase 2a) absorbing capacity** — this is WHY 6 GHz comes first.
```
SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades
SURVEY_JSON=$SURVEY channel-plan.sh cascades na # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone
```
**Periodic DFS monitoring:** the ~1-day window isn't conclusive, so add a recurring precise `dfs-check.sh`
(fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just
confirms the choice stays right.
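
To make the capacity trade in 3a/3b concrete, here is a minimal sketch that counts the non-DFS bonded blocks available at each width. It assumes the standard US 20 MHz channel numbering and simply groups adjacent channels by position; the helper name `bonded_blocks` is illustrative and is not part of the `unifi-wifi` scripts.

```python
# Non-DFS 20 MHz channels named in the plan (US numbering assumed).
UNII_1 = [36, 40, 44, 48]
UNII_3 = [149, 153, 157, 161, 165]


def bonded_blocks(channels, width_mhz):
    """Group adjacent 20 MHz channels into bonded blocks of the given width.

    Channels left over that cannot fill a whole block (165 here) are dropped.
    """
    per_block = width_mhz // 20
    blocks = []
    for start in range(0, len(channels) - per_block + 1, per_block):
        blocks.append(tuple(channels[start:start + per_block]))
    return blocks


non_dfs_40 = bonded_blocks(UNII_1, 40) + bonded_blocks(UNII_3, 40)
non_dfs_80 = bonded_blocks(UNII_1, 80) + bonded_blocks(UNII_3, 80)

print("40 MHz blocks:", non_dfs_40)
print("80 MHz blocks:", non_dfs_80)
```

Under these assumptions the non-DFS plan offers four 40 MHz blocks against two at 80 MHz, which is why the width change and the 6 GHz offload are sequenced together.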
**3c. Relieve AP 103 specifically** (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to
a clean channel from the plan, 40 MHz. Verify Lauren `.202` retry drops after.
**Gate:** 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded.
### PHASE 4 — Fine-tune (after 1–3 settle)
- **2.4 channel plan 1/6/11** (graph-color; co-channel pairs 92→35) + **pin the 4 off-plan APs** (128/108/108U7/salon)
to 1/6/11.
- **2.4 min-RSSI ON** for the 6 APs where it's OFF (615/608/505/517/622/salon) — *note 505/517/615/608/622 are
Floors 5/6 → DEFER with the rest of 5/6*; do `salon` only this round.
- **Roaming for voice continuity:** confirm 802.11k/v on CSCNet (r optional — test; some phones dislike 802.11r).
Keeps calls alive when staff walk between APs.
- **min-RSSI tuning:** only tighten where sticky-client far-AP behavior is proven; too aggressive blocks association.
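
The graph-colouring idea behind the 1/6/11 plan can be sketched as a greedy assignment. This is an illustration only, not what `channel-plan.sh` actually does: the neighbour map, the AP names in the usage note, and both function names are hypothetical, and the map is assumed to be symmetric with every AP present as a key.

```python
def plan_channels(neighbors, channels=(1, 6, 11)):
    """Greedy colouring: each AP takes the channel least used by its planned neighbours."""
    plan = {}
    # Plan the most-connected APs first; ties are broken by name for determinism.
    order = sorted(neighbors, key=lambda ap: (-len(neighbors[ap]), ap))
    for ap in order:
        def clashes(channel):
            return sum(1 for n in neighbors[ap] if plan.get(n) == channel)
        # min() keeps the first (lowest) channel when clash counts tie.
        plan[ap] = min(channels, key=clashes)
    return plan


def co_channel_pairs(neighbors, plan):
    """Count neighbouring AP pairs that share a channel."""
    pairs = {
        frozenset((a, b))
        for a in neighbors
        for b in neighbors[a]
        if a != b and plan[a] == plan[b]
    }
    return len(pairs)
```

For three mutually audible APs (say `"A"`, `"B"`, `"C"`) the sketch returns three different channels and zero co-channel pairs; adding a fourth AP that hears all three forces exactly one shared channel, which is the residual overlap the plan tries to minimise.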
### PHASE 5 — Physical (separate visit, not tonight — but it caps results)
- Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput).
- Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run.
- p38 (1st Floor USW) 4% tx-drop after the above.
---
## 4. Interdependency map (read before changing anything)
- **6 GHz BEFORE 5 GHz 40 MHz** — else 5 GHz congestion just relocates. (Phase 2a before 3a.)
- **2.4 power MEDIUM not LOW** — Low already over-thinned; going lower starves edge clients. (Phase 2b.)
- **AP-lock needs AP capacity** — Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion.
- **QoS is independent** — do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.)
- **Disables + power-down compound** — never do both aggressively in the same area; we already saw the satisfaction hit.
- **min-RSSI + power interact** — raising min-RSSI while lowering power can orphan clients; tune one lever at a time.
- **Mesh-protected APs** (`2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108`) — never disable; power changes only with watch.
## 5. Data-driven decision framework — improve quality for ALL devices (measure → decide → adjust)
**Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions.** Each
change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is *all*
devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones.
### 5.1 How each change affects the OTHER (non-voice) devices
Almost every change targets the **shared RF environment**, so it helps everyone — voice is just the most
sensitive canary:
| Change | Effect on non-voice devices |
|---|---|
| QoS (voice VLAN priority) | **Negligible** — voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load |
| Enable 6 GHz on CSCNet | **Positive** — 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left |
| 2.4 Low→Medium power | **Positive for ALL 2.4 devices** — undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) |
| 5 GHz 80→40 MHz | **Net positive (reliability), small peak-speed cost** — density win; lone heavy transfer sees lower peak |
| AP 103 relief | **Positive for all 12 clients on 103**, not just Lauren |
| 2.4 1/6/11 channel plan | **Positive for all 2.4 devices** (less co-channel) |
| Phone-side (Vertical) | Phones only — **no effect on others** |
### 5.2 Trade-offs to WATCH in the data (don't help voice, hurt others)
1. **Non-DFS 5 GHz + 5 GHz-only devices** — the DirecTV fleet + older laptops **can't use 6 GHz**, so they
stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; **watch
non-DFS 5 GHz cu_total** — if it climbs, that's the signal to rebalance.
2. **min-RSSI** affects every client on the AP — too aggressive orphans weak IoT/resident devices. Tune gently.
3. **40 MHz** trades single-user peak for fleet reliability — right in density, but it is a trade.
### 5.3 Fleet-wide metrics we pull at every gate (the data we decide on)
Same time-of-day comparison (load varies hourly). Capture before each change and ~15 min after:
- `live-stats.sh cascades` → per-band **avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts**
- `radio-usage.sh cascades <band>` → per-AP outliers (saturated/high-retry APs)
- `/stat/sta` band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count
- Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole)
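
As a rough illustration of the per-gate numbers listed above, the sketch below summarises a list of client records. The record keys (`band`, `retry_pct`, `satisfaction`) are assumed, pre-normalised names rather than the controller's actual `/stat/sta` field names, and `fleet_summary` is a hypothetical helper.

```python
from collections import Counter


def fleet_summary(clients, retry_limit=15.0, satisfaction_floor=70):
    """Summarise band split, high-retry clients per band, and low-satisfaction count."""
    band_split = Counter(c["band"] for c in clients)
    high_retry = Counter(c["band"] for c in clients if c["retry_pct"] > retry_limit)
    low_satisfaction = sum(1 for c in clients if c["satisfaction"] < satisfaction_floor)
    return {
        "band_split": dict(band_split),
        "high_retry_by_band": dict(high_retry),
        "low_satisfaction": low_satisfaction,
    }


sample = [
    {"band": "2.4", "retry_pct": 50.0, "satisfaction": 30},
    {"band": "5", "retry_pct": 7.0, "satisfaction": 85},
    {"band": "5", "retry_pct": 16.0, "satisfaction": 65},
    {"band": "6", "retry_pct": 1.0, "satisfaction": 98},
]
summary = fleet_summary(sample)
print(summary)
```

Running the same summary before and about 15 minutes after a change, at the same time of day, gives the two snapshots the gate compares.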
### 5.4 The GATE decision rule (per change, per zone)
- **KEEP + proceed** only if: the target metric improved **AND** fleet-wide **satisfaction did not fall**,
**retry% did not rise**, band split moved the intended way, and **no AP lost its clients** (no hole).
- **HOLD** (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more.
- **ROLLBACK that step** if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint.
- Do **one lever per zone at a time** so cause/effect is attributable (the over-thinning happened because power-down
+ disables were stacked).
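
One possible reading of the gate rule, written out as code. The field names and the `GateSample` type are illustrative, and the tie-break is an interpretation: any fleet-wide regression, hole, or complaint returns ROLLBACK, and anything short of a clean KEEP returns HOLD.

```python
from dataclasses import dataclass


@dataclass
class GateSample:
    """Fleet-wide snapshot taken at the same time of day (illustrative fields)."""
    target_metric: float         # metric the change should lower, e.g. retry % on one band
    retry_pct: float             # fleet-wide average retry %
    satisfaction: float          # fleet-wide satisfaction score
    coverage_holes: int = 0      # APs whose client count collapsed in a covered area
    complaints: int = 0          # user complaints since the change
    band_split_ok: bool = True   # did clients move toward the intended band?


def gate_decision(before: GateSample, after: GateSample) -> str:
    """Return KEEP, HOLD, or ROLLBACK for one change in one zone."""
    fleet_regressed = (
        after.retry_pct > before.retry_pct
        or after.satisfaction < before.satisfaction
        or after.coverage_holes > 0
        or after.complaints > 0
    )
    if fleet_regressed:
        return "ROLLBACK"
    target_improved = after.target_metric < before.target_metric
    if target_improved and after.band_split_ok:
        return "KEEP"
    return "HOLD"
```

Fed the over-thinning numbers from section 1 (interference down, but retry 17 → 23.4 and satisfaction 39 → 30), this sketch returns ROLLBACK even though the target metric improved, which is the behaviour the rule is meant to enforce.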
### 5.5 Rollback
Every `apply-radio`/`apply-wlan` writes a rollback JSON to `.claude/tmp/`; `device-control poe-cycle` for a hung
AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals.
## 6. Out of scope tonight (explicit)
- **Floors 5 & 6 (MemCare)** — all RF + the MemCare voice phones (`.217/.218/.219/.220`) DEFERRED per Howard.
- Physical cabling / offline switches (Phase 5 — separate visit).
- The 6 straggler phones — Howard re-keying separately; they'll benefit from the RF work regardless.
## 7. Open decisions for Howard
1. ~~**5 GHz channel plan:** clean-DFS vs non-DFS-only~~ → **RESOLVED 2026-06-18: NON-DFS ONLY** (UNII-1 36–48 + UNII-3 149–165). DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.)
2. **QoS depth:** UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive).
3. **802.11r** on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first.
4. Tonight's stopping point: Phases 1–2 alone are a legitimate, lower-risk night; 3–4 can be a second night.
5. ~~**Dedicated voice SSID?**~~ → **RESOLVED 2026-06-18: NO — voice stays on the shared CSCNet PPSK.** UniFi
3-SSID cap (sound RF hygiene — each SSID = beacon airtime overhead at 77 APs). The only retirement candidate,
CSC ENT, still has **131 active clients** (staff PCs, printers, DirecTV fleet) → not retireable soon. And it's
not needed: **QoS is VLAN/DSCP-based (SSID-independent)**, band preference is best done **phone-side** (Vertical),
and roaming/power-save are phone+AP settings — all work on the shared SSID. A dedicated voice SSID would only
add voice-specific WiFi *policy* (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only
if/when CSC ENT's 131 clients migrate off it.

@@ -0,0 +1,111 @@
# Cascades — Phase 1: Voice QoS Design (VLAN 30)
- **Created:** 2026-06-18 (Howard-Home / claude-main). Part of `network-optimization-master-plan.md` Phase 1.
- **Status:** DESIGN — for review, then build (Howard drives pfSense GUI). Nothing applied.
- **Risk:** LOW — additive, voice-only prioritization; rollback = disable the shaper. Main caution: size the
shaper bandwidth correctly (a wrong value can throttle throughput) → test before/after.
## Objective
Guarantee voice quality under load by prioritizing VLAN 30 traffic end-to-end. **The phones register to a
CLOUD PBX (Vertical) over the internet**, so the bottleneck that breaks calls is **WAN upload saturation**
(someone uploading / cloud backup / OneDrive sync fills the uplink → voice RTP queues → jitter, dropped
audio). QoS keeps voice ahead of bulk data on the WAN.
## The big advantage of the VLAN move
**All voice is now one subnet: `10.0.30.0/24`.** So QoS can match *all* voice by **source subnet** — no
need to guess SIP/RTP port ranges per PBX. This is the cleanest, most robust match criterion and it only
became possible because we isolated voice onto VLAN 30.
## Current state (verified 2026-06-18)
- **No traffic shaper / limiter configured** on pfSense (clean build).
- **Dual-WAN:** WAN1 `igc0` (Cox Fiber, primary, 1G link), WAN2 `igc3` (Cox Coax, 2.5G link); `WAN_Group`
failover (`downlosslatency`). Shaping must be applied on **both** WAN interfaces.
- pfSense Plus 25.07 (ALTQ shaper + dummynet limiters available).
- **Phones mark DSCP EF — CONFIRMED (Howard 2026-06-18).** So we can rely on DSCP for WMM (Layer 2) + switch
QoS (Layer 3); the `10.0.30.0/24` subnet match (Layer 1) is the safety net. **No pfSense set-DSCP rule needed.**
## Measured WAN bandwidth (2026-06-18) — REFRAMES QoS priority
- **WAN1 (fiber, primary): upload ~522 Mbps** (Cloudflare single-stream from pfSense). RRD 3-day peaks:
**680 Mbps down / 98 Mbps up** (actual usage).
- **WAN2 (coax): not measurable remotely** (source-route bind to `72.211.21.217` failed; needs a WAN2-routed
host or the Cox bill). Coax is typically asymmetric ~20–50 Mbps up — **size its shaper conservatively**.
- **Implication:** 30 calls ≈ ~3 Mbps. WAN1 upload (~522 Mbps) vs peak usage (98 Mbps) = **huge headroom →
the WAN is NOT the everyday voice bottleneck.** Everyday dropped-calls = **RF** (Phases 2–4 of the master
plan). **QoS here is INSURANCE, not the day-to-day fix** — it earns its keep in two cases: (1) **WAN2
failover** (small coax upload + a big upload → real congestion), (2) **rare WAN1 saturation** (backup /
large upload; you do hit 680 Mbps down). Build it (cheap, correct), but set expectations: RF is the substance.
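
The "30 calls ≈ ~3 Mbps" figure can be sanity-checked with a short calculation. The codec is an assumption (G.711 at 20 ms packetisation; the phones' actual codec is not stated here), and the function name is illustrative.

```python
def call_kbps(payload_bytes=160, packets_per_second=50, overhead_bytes=40 + 18):
    """One-way bandwidth of a single RTP stream in kbit/s.

    Defaults assume G.711 at 20 ms: 160 B payload, 40 B of IP/UDP/RTP headers,
    and 18 B of Ethernet header plus FCS per packet.
    """
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second / 1000


calls = 30
wan1_up_mbps = 522                               # measured WAN1 upload from this section
voice_mbps = calls * call_kbps() / 1000          # total one-way voice load
share_pct = voice_mbps / wan1_up_mbps * 100      # voice as a share of WAN1 upload

print(f"{call_kbps():.1f} kbit/s per call, {voice_mbps:.2f} Mbit/s for {calls} calls "
      f"({share_pct:.2f}% of WAN1 upload)")
```

With these assumptions the load lands a little under 3 Mbps and well under 1% of WAN1 upload, consistent with treating the WAN shaper as insurance rather than the everyday fix.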
## Four layers (priority order; Layer 1 = insurance, see reframe above)
### Layer 1 — pfSense WAN shaper (PRIMARY — this is where calls break)
**Type: HFSC** (hierarchical, lets us guarantee voice a floor while letting it borrow idle bandwidth).
Per WAN interface, three queues:
| Queue | Role | HFSC settings (starting point) |
|---|---|---|
| `qVoice` | voice (VLAN 30 / DSCP EF) | **priority 7**, realtime ~30% of WAN-up, link-share 30%, NOT default |
| `qACK` | TCP ACKs (keeps downloads snappy) | priority 6, ~10% |
| `qDefault` | everything else | **default**, link-share ~60% |
**Match rule (floating, WAN, direction out):** source `10.0.30.0/24` → `qVoice`. (Optionally also match
DSCP EF if phones mark it — see Layer 4.) One floating rule per WAN, or interface = WAN_Group.
**Download side:** RTP from the PBX *to* the phones is shaped on the **LAN-side** queues. The wizard builds
both directions; if hand-building, mirror a `qVoice` on the internal interfaces too. Upload is the more
critical direction for cloud-PBX voice, but do both.
**Build path (GUI — Howard drives):**
- Easiest: **Firewall → Traffic Shaper → Wizard → "Multiple Lan/Wan"** — set #WAN=2, #LAN as needed,
enter each WAN's bandwidth (below), on the VoIP page choose **"prioritize by address" = `10.0.30.0/24`**
with a guaranteed %; the wizard generates HFSC queues + the float rules. Then tune.
- Or manual: Firewall → Traffic Shaper → By Interface → add HFSC on WAN1 + WAN2, create the 3 queues,
then Firewall → Rules → Floating → match `10.0.30.0/24` out → Ackqueue/Queue = qACK/qVoice.
> **Sizing inputs:** WAN1 upload **~522 Mbps (measured 2026-06-18)** → shape `qVoice`'s parent to ~480–500
> Mbps. **WAN2 (coax) upload still UNKNOWN** (remote source-route test failed) — get from the Cox bill or a
> speedtest from a host routed via WAN2; size conservatively (assume ~35 Mbps up until measured). Shaping to
> ~90–95% of actual upload keeps the queue in pfSense (where we control priority), not at the ISP. WAN2 is the
> one that actually constrains voice (on failover), so its number matters most.
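
A small sketch of the sizing arithmetic, useful for turning a measured upload figure into starting numbers for the queues. The queue shares come from the table above; the 0.92 default headroom is an arbitrary point inside the 90–95% band, and `shaper_plan` is a hypothetical helper, not a pfSense tool.

```python
def shaper_plan(measured_up_mbps, headroom=0.92, shares=None):
    """Return the parent shaper rate and per-queue rates in Mbit/s."""
    shares = shares or {"qVoice": 0.30, "qACK": 0.10, "qDefault": 0.60}
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("queue shares must add up to 100%")
    if not 0.90 <= headroom <= 0.95:
        raise ValueError("headroom should stay within the 90-95% band")
    parent = measured_up_mbps * headroom
    plan = {"parent": round(parent, 1)}
    for queue, share in shares.items():
        plan[queue] = round(parent * share, 1)
    return plan


wan1 = shaper_plan(522)   # measured WAN1 upload
print(wan1)
```

The same call can be repeated for WAN2 once its upload has actually been measured; until then any WAN2 figure fed in is a placeholder.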
### Layer 2 — UniFi WMM (the WiFi phones — Poly)
Over the air, **WMM** maps DSCP → WiFi access categories; voice (DSCP EF/46) → **WMM Voice AC** (gets TXOP
priority over data). WMM is ON by default on UniFi — **verify it's enabled on CSCNet** and that the U7 APs
honor DSCP→WMM. This is what protects the 22 Poly phones over the air during WiFi congestion. (Ties into the
RF work — a clean 5/6 GHz + WMM = good wireless voice.)
### Layer 3 — UniFi switch QoS (the wired AudioCodes)
UniFi switches honor 802.1p/DSCP and queue tagged voice to a high-priority egress queue — mostly automatic
once the phones mark DSCP. LAN links are gig and rarely congested, so this is the least critical layer, but
confirm the USW isn't stripping DSCP and that voice VLAN 30 frames get the priority queue.
### Layer 4 — DSCP marking (make the above reliable)
- **Verify the phones mark voice:** AudioCodes + Poly typically tag RTP **EF (46)** and signaling **CS3 (24)**
by default, often set via the PBX/provisioning. Confirm with Vertical (Richard) or capture a packet.
- **If they DON'T mark (or inconsistently):** add a pfSense floating rule that **SETS DSCP EF** on
`10.0.30.0/24` traffic (Advanced → "Match/Set DSCP"). Then Layer 1/2/3 can all match on EF too.
- **Match-by-subnet (Layer 1) works regardless of DSCP** — it's the safety net. DSCP makes WMM (Layer 2)
and switch QoS (Layer 3) automatic.
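
When checking a packet capture, the DSCP value is not shown directly in every tool; it sits in the top six bits of the IP TOS / Traffic Class byte. A minimal sketch of the conversion (function names are illustrative):

```python
DSCP_EF = 46    # Expedited Forwarding, expected on RTP media
DSCP_CS3 = 24   # Class Selector 3, expected on signalling


def dscp_to_tos(dscp: int) -> int:
    """Shift a DSCP value into the top six bits of the TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits")
    return dscp << 2


def tos_to_dscp(tos: int) -> int:
    """Drop the two low ECN bits to recover the DSCP value."""
    return (tos & 0xFF) >> 2


print(hex(dscp_to_tos(DSCP_EF)), hex(dscp_to_tos(DSCP_CS3)))
```

So a correctly marked RTP packet shows a TOS byte of `0xb8` (or `0xb9` to `0xbb` if ECN bits are set), and signalling marked CS3 shows `0x60`.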
## Implementation order
1. Get the Cox WAN upload numbers (blocker for Layer 1 sizing).
2. Confirm phones mark DSCP EF (Vertical) — decides whether we add the pfSense set-DSCP rule.
3. Build Layer 1 (pfSense HFSC + float rule) — dry-run mindset: set it, then validate.
4. Verify Layer 2 (WMM on CSCNet) + Layer 3 (switch honoring DSCP).
5. Validate (below). Tune `qVoice` % if needed.
## Validation (prove it works)
- **Baseline:** from a LAN host, saturate the WAN upload (big upload / `iperf3 -u` / speedtest) WHILE on a
call from a voice phone — note the breakup *without* QoS.
- **After:** repeat the same saturation; call stays clean. Check Firewall → Traffic Shaper → Queues: `qVoice`
carrying voice with ~0 drops while `qDefault` absorbs the saturation + drops.
- Confirm both WANs (test on primary; fail to WAN2 and re-test).
## Rollback
Firewall → Traffic Shaper → disable/remove the shaper; delete the floating rule. Zero residual effect
(QoS only orders packets under congestion; removing it reverts to FIFO). The set-DSCP rule (if added) can stay
or go independently.
## Notes / interplay with the rest of the plan
- QoS is **independent of the RF work** — it helps wired + WiFi voice immediately and can be built tonight
regardless of the 2.4/5/6 GHz changes.
- It does NOT fix RF problems (a phone on a 50%-retry 2.4 GHz radio still suffers) — QoS handles *congestion/
contention for bandwidth*, RF tuning handles *the air*. Both are needed; they're complementary.

@@ -0,0 +1,71 @@
# Cascades — Voice Quality Diagnostic (post VLAN 30 cutover)
- **Date:** 2026-06-18 (Howard-Home / claude-main)
- **Trigger:** All phones migrated to isolated VOICE VLAN 30 to improve call quality; users report
**dropped calls, breaks in voice, reception issues.** This is the RF/quality assessment.
- **Data source:** live UniFi controller `/stat/sta` + USW-16-PoE `port_table`, 2026-06-18.
## Cutover status — COMPLETE
31 devices on VOICE (`10.0.30.0/24`): 8 AudioCodes (`.224-.231`), 22 Poly (`.202-.223`), Vertical
desktop (`.201`). AudioCodes required a full power-off/on (external-powered; not PoE -> UniFi
power-cycle is a no-op) before they re-DHCP'd.
## Headline finding
**The VLAN move gives separation + sets up QoS, but it does NOT by itself fix call quality.** The
dropped calls / voice breaks are an **RF problem on the WiFi (Poly) phones.** The wired AudioCodes
are clean. Quality fixes are RF + QoS, below.
## Wired AudioCodes (8) — HEALTHY
All USW-16-PoE ports 1-8: up, 100M full-duplex, **rx_err=0 tx_err=0 rx_drop=0.** No network-layer
problem. 100M is fine for voice. With VLAN isolation + QoS these desk phones should be solid.
## WiFi Poly phones — RF problems (retry% = the call-quality killer)
Thresholds: retry >10% = audible breaks; RSSI <-67 marginal, <-75 bad; voice wants 5 GHz.
**SEVERE — fix first:**
| Phone (IP) | User/Loc | AP | Band | RSSI | Retry | Issue |
|---|---|---|---|---|---|---|
| 10.0.30.202 | Lauren / Accounting | CC Bridge | **2.4** | -56 | **50%** | stuck on 2.4 GHz, half packets retransmit |
| 10.0.30.218 | Shelby / MemCare Dir | MemCare Nurse Stn | **2.4** | -56 | **53%** | stuck on 2.4 GHz |
| 10.0.30.220 | Christine / rm 515 | 517 | 5 | **-82** | 7% | coverage gap (signal near-unusable) |
| 10.0.30.219 | Karen Rossini / rm 515 | 517 | 5 | -75 | 16% | weak + high retry |
**MODERATE:**
| Phone (IP) | User/Loc | AP | Band | RSSI | Retry | Issue |
|---|---|---|---|---|---|---|
| 10.0.30.212 | rm 204 | 204 | 5 | -74 | 13% | weak + retry |
| 10.0.30.213 | Medtech rm 206 | 206 U7 Pro | 5 | -66 | 13% | 5 GHz congestion |
| 10.0.30.214 / .215 | rm 210 | 210 | 5 | -72 | 7-9% | weak |
| 10.0.30.206 | Dining Room | Dining Room | 5 | -70 | 9% | borderline |
**Healthy (reference):** .207/.209/.221/.222/.223 etc. — 5 GHz, RSSI -41 to -60, retry <3%.
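
The thresholds stated above can be applied mechanically to each phone. This is a sketch of the flagging only; the SEVERE/MODERATE grouping in the tables was a judgment call and is not reproduced here, and `voice_flags` is a hypothetical helper.

```python
def voice_flags(band_ghz: float, rssi_dbm: int, retry_pct: float) -> list:
    """Return the threshold violations for one phone, in a fixed order."""
    flags = []
    if band_ghz < 5:
        flags.append("on 2.4 GHz")        # voice wants 5 GHz
    if rssi_dbm < -75:
        flags.append("RSSI bad")
    elif rssi_dbm < -67:
        flags.append("RSSI marginal")
    if retry_pct > 10:
        flags.append("retry >10%")        # audible breaks
    return flags


print(voice_flags(2.4, -56, 50))   # Lauren .202
print(voice_flags(5, -82, 7))      # Christine .220
```

Note that a phone can be flagged on signal alone (`.220`, retry only 7%) or on retry alone with strong signal (`.202`, RSSI -56), which is why both columns are needed.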
### Three root causes
1. **2.4 GHz with ~50% retry** (Lauren .202, Shelby .218) — the single worst issue; matches the
documented Cascades 2.4 GHz saturation. **Must force these to 5 GHz.**
2. **Coverage gaps** — rooms 515 (-82/-75), 210/204 (-72/-74): too far from the serving AP; weak
signal drops calls when RF varies or people move.
3. **5 GHz congestion** — several at 13-16% retry on 5 GHz (80 MHz width + channel overlap, per the
2026-06-16 audit).
## Stragglers — 6 Poly phones NOT on VOICE
Five on VLAN 20 (`10.0.20.64/.65/.66/.67/.195`) + one on `192.168.1.126`. `10.0.20.66` (Dining
Room) is at **35% retry.** Missed in cutover or still on the old PPSK key -> migrate to the voice
PPSK so all phones are isolated + benefit from voice QoS.
## Recommended fixes (prioritized; NONE applied — Cascades requires explicit per-change go)
1. **QoS for the voice VLAN (NEW capability the move enables) — highest ROI, lowest risk.** Mark
VLAN 30 voice traffic DSCP EF / priority on pfSense + UniFi so voice gets priority under load ->
reduces jitter/breaks network-wide.
2. **Force voice phones off 2.4 GHz** — on the CSCNet voice PPSK / the APs serving .202 & .218,
disable 2.4 GHz association for voice (or band-steer to 5/6 GHz). Fixes Lauren + Shelby (the two
worst) immediately.
- **DONE 2026-06-18: Lauren `.202` locked to AP 103** (off the CC Bridge wireless-mesh AP -> wired AP). **INTERDEPENDENCY:** AP 103's 5 GHz is saturated (ch149, 75% airtime, ~25,900 retries, 12 clients) -> tonight's 5 GHz plan MUST relieve AP 103 (channel off 149 / 80->40 MHz / load-balance) or she trades a mesh problem for a congestion problem.
- Shelby `.218` is floor 5/6 (MemCare) -> **out of scope tonight** per Howard.
3. **Coverage** — rooms 515, 210, 204: check AP placement/power; consider a closer AP or raising the
nearest AP's power; min-RSSI to push phones off far APs. (Ties into the staged coverage-thin /
2.4 remediation runbooks.)
4. **Migrate the 6 straggler phones** to the voice PPSK (VLAN 30).
5. **5 GHz width/channel** — apply the staged audit recommendation (40 MHz width, non-DFS plan) to
cut co-channel retry.
## Next
Discuss + pick changes. QoS (#1) + 2.4 GHz force-off (#2) are the fastest wins for the complaints.

@@ -0,0 +1,135 @@
# Cascades — voice-quality diagnostic + holistic RF/QoS optimization master plan
## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
> NOTE: written in the OLD clone after the 2026-06-19 claudetools restructure coord message arrived.
> NOT synced from here. Recover into the fresh re-clone (see Pending tasks).
## Session Summary
Continuation of the VLAN 30 voice cutover. First, completed the AudioCodes migration: the 8 wired
AudioCodes would not pick up VLAN 30 addresses via port re-VLAN + UniFi PoE power-cycle (PoE is OFF on
those ports — they run on external power bricks, so a UniFi power-cycle is a no-op; a UI port disable/enable
didn't reset their uptime either). Root cause confirmed: they held their old main-LAN DHCP leases and never
re-DHCP'd. Howard fully powered them off/on, after which all 8 pulled VOICE leases (10.0.30.224-231). Final
state: 31 devices on VOICE (8 AudioCodes + 22 Poly + Vertical desktop).
Second, diagnosed voice quality (dropped calls / voice breaks). Wired AudioCodes: all 8 ports clean (100M
full-duplex, zero errors). The problem is RF on the WiFi Poly phones: 14 flagged, worst = Lauren/.202 (2.4
GHz, 50% retry, on the CC Bridge wireless-MESH AP) and Shelby/.218 (2.4 GHz, 53% retry, MemCare). Coverage
gaps in rooms 515/210/204 (RSSI -72 to -82). AP 103 5 GHz saturated (75% airtime, ~25,900 retries). Also
found 6 Poly phones NOT migrated (still on VLAN 20/Default) — fleet is 28 Poly, not 22; verified they are
distinct active MACs, not ghosts. Howard locked Lauren's phone to AP 103 (off the mesh AP).
Third, built a holistic, all-device network optimization master plan grounded in the existing 2026-06-16
audit + 2.4 GHz runbook + the over-thinning re-check. Key current-state fact: the network is OVER-THINNED on
2.4 GHz (overnight 6/17: 24 radios disabled + 42 at Low/6 dBm -> interference down but retry 17->23%,
satisfaction 39->30). The plan's central principle: open relief valves (6 GHz + correct 2.4 power) BEFORE
constraining (5 GHz 40 MHz), to avoid relocating congestion. Sequenced phases: (1) QoS, (2a) enable 6 GHz on
CSCNet + 2b correct 2.4 Low->Medium, (3) 5 GHz 80->40 MHz + non-DFS channel plan + relieve AP 103, (4)
fine-tune, (5) physical. With an interdependency map and per-phase gates.
Fourth, verified DFS rigorously (Howard's concern re: Davis-Monthan AFB + TUS ~10 mi). The skill's dfs-check
flagged 3 APs, but on inspection all were benign (CAC timers + DFS-control toggles on a non-DFS channel, not
radar). A precise radar-detection-only sweep found ZERO genuine hits across all 53 DFS APs — but only over a
~21-23h window (APs rebooted in the 6/17 outage). DECISION: go NON-DFS only (UNII-1 36-48 + UNII-3 149-165) —
a radar vacate = dropped calls; resilience > diversity; 6 GHz covers the capacity gap.
Fifth, designed Phase 1 QoS (pfSense + UniFi). Measured WAN: WAN1 fiber upload ~522 Mbps (vs ~98 Mbps peak
usage) -> the WAN is NOT the everyday voice bottleneck, so QoS is INSURANCE (WAN2 coax failover + rare
saturation), not the everyday fix — RF is the substance. Match voice by source subnet 10.0.30.0/24 (the VLAN
move's payoff). Phones confirmed marking DSCP EF. Determined a dedicated voice SSID is NOT viable (UniFi
3-SSID cap; CSC ENT still has 131 clients, not retireable) and NOT needed (QoS is VLAN/DSCP-based,
SSID-independent; band preference is phone-side). Added an all-devices impact + data-driven decision
framework: every change gated on fleet-wide metrics (measure -> decide -> adjust), with the trade-offs to
watch (non-DFS + 5GHz-only DirecTV fleet; min-RSSI orphaning; 40 MHz peak).
## Key Decisions
- **AudioCodes need a full power-cycle (off/on), not a UniFi PoE cycle** — they're externally powered (PoE off
on the ports); a UI port bounce doesn't reset them.
- **5 GHz: NON-DFS ONLY.** DFS sweep clean but only ~1-day window near a military base/airport; a radar vacate
drops calls. Resilience over channel diversity; lean on 6 GHz for capacity.
- **QoS reframed to INSURANCE, not the everyday fix** — WAN1 fiber has ~522 Mbps up vs 98 Mbps peak use; the
everyday dropped-calls cause is RF. QoS matters on WAN2 (coax) failover + rare WAN1 saturation.
- **No dedicated voice SSID** — 3-SSID cap is sound RF hygiene; CSC ENT (the only retirement candidate) still
has 131 clients; and voice doesn't need a dedicated SSID (QoS is SSID-independent, band-pref is phone-side).
- **Open relief valves before constraining** — 6 GHz + 2.4 Low->Medium BEFORE 5 GHz 40 MHz, or congestion just
relocates.
- **2.4 power Low->Medium, not lower** — Low already over-thinned (retry up, satisfaction down).
- **Data-driven gates** (Howard) — base every choice on measured fleet-wide metrics; one lever per zone;
keep/hold/rollback per the gate rule; validation measures ALL devices, not just voice.
- **Phones are 5 GHz (not 6E)** — 6 GHz helps voice indirectly by clearing 5 GHz of resident devices.
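
The keep/hold/rollback gate can be pictured as a comparison of baseline and post-change fleet metrics. The sketch below is illustrative only: the actual gate rule is defined in the master plan, and the decision logic, function name, and tolerances here are assumptions chosen for the example.

```bash
#!/usr/bin/env bash
# Sketch of a keep/hold/rollback gate comparing baseline vs post-change metrics.
# ASSUMPTION: the real gate rule lives in the master plan; the logic and the
# tolerances below are placeholders, not the plan's values.
# Retry is passed in tenths of a percent (23.4% -> 234) to stay in integer math.

TOLERANCE_RETRY=10   # hypothetical: 1.0 percentage point of noise
TOLERANCE_SAT=2      # hypothetical: 2 satisfaction points of noise

gate_decision() {
  local retry_before=$1 retry_after=$2 sat_before=$3 sat_after=$4
  local d_retry=$(( retry_after - retry_before ))
  local d_sat=$(( sat_after - sat_before ))

  if [ "$d_retry" -gt "$TOLERANCE_RETRY" ] || [ "$d_sat" -lt "-$TOLERANCE_SAT" ]; then
    echo rollback   # at least one fleet-wide metric clearly regressed
  elif [ "$d_retry" -lt "-$TOLERANCE_RETRY" ] || [ "$d_sat" -gt "$TOLERANCE_SAT" ]; then
    echo keep       # a metric clearly improved and none regressed
  else
    echo hold       # within noise: keep measuring before deciding
  fi
}

# The 6/17 2.4 GHz thinning figures from the master plan
# (retry 17 -> 23.4%, satisfaction 39 -> 30):
gate_decision 170 234 39 30
```

With these placeholder tolerances the 6/17 thinning evaluates to `rollback`, which is consistent with the decision to raise 2.4 GHz power from Low to Medium.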
## Problems Encountered
- **AudioCodes wouldn't move to VLAN 30** — root-caused to held DHCP leases + PoE-off ports (so a UniFi PoE
cycle was a no-op); resolved by full power-off/on.
- **UniFi controller PUT 403s** — CSRF token extraction flaky; fixed by reading `x-updated-csrf-token` (with a
TOKEN-cookie JWT fallback).
- **pfSense SSH rate-limiting + controller throttling** after many rapid queries — switched between controller
API and pfSense SSH as needed; one fleet pull hung in the background.
- **Temp-file/sync friction (RECURRED 3x)** — controller-scratch files (.sta.json, .fleet325.dev) written
CWD-relative got swept into commits by `git add -A` and blocked rebases (stray locked curl.exe held them).
Fixed: killed the procs, untracked, broadened .gitignore (.fleet*, .ap[0-9]*, .vq[0-9]*, .q[0-9]*). Real fix:
write API scratch OUTSIDE the repo (used mktemp -d for the DFS sweep).
- **Cloudflare __down / WAN2-bound speedtests returned 0.0** — only WAN1 upload (522 Mbps) measured cleanly;
WAN2 (coax) upload still unknown (needs a WAN2-routed host or Cox bill).
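
The "write API scratch outside the repo" fix can be sketched as follows. This is a minimal illustration, assuming a POSIX-style shell such as Git Bash; the file name `fleet.json` and the placeholder payload are hypothetical stand-ins for a real controller pull.

```bash
#!/usr/bin/env bash
# Sketch: keep controller/API scratch files out of the repo working tree so
# `git add -A` can never sweep them into a commit.
# fleet.json and its contents are illustrative placeholders.

scratch_dir=$(mktemp -d)                # created under the system temp dir, not CWD
trap 'rm -rf "$scratch_dir"' EXIT       # remove scratch when the script exits

# Stand-in for an API pull; a real run would redirect curl output here instead.
printf '{"devices": []}\n' > "$scratch_dir/fleet.json"

echo "scratch written to: $scratch_dir/fleet.json"
```

Because nothing is written relative to the current directory, the broadened `.gitignore` patterns become a second line of defense rather than the only one.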
## Configuration Changes (all in clients/cascades-tucson/, committed up through 2b2d094 BEFORE the restructure)
- **Created** `docs/network/network-optimization-master-plan.md` — holistic all-device plan (sequencing,
interdependency map, data-driven decision framework, DFS non-DFS decision, SSID decision).
- **Created** `docs/network/phase1-voice-qos-design.md` — pfSense HFSC + UniFi WMM/switch QoS design.
- **Created** `reports/2026-06-18-voice-quality-diagnostic.md` — per-phone RF findings + fixes.
- **Updated** the voice-quality report with the Lauren->103 note (the `2026-06-18` report above; there is no
separate `reports/2026-06-16-voice-quality-diagnostic.md`).
- **No live network changes applied** (Cascades rule: explicit per-change go). UniFi port bounces were
temporary (restored). DFS/WAN tests were read-only/bounded.
## Credentials & Secrets
- No new credentials. Used existing: `infrastructure/uos-server-network-api-rw` (controller),
`clients/cascades-tucson/unifi-ap-ssh` (AP SSH for DFS sweep), `clients/cascades-tucson/pfsense-firewall`,
`clients/cascades-tucson/wifi-voice-ppsk` (key `V0!c38863171`).
## Infrastructure & Servers
- VOICE VLAN 30 `10.0.30.0/24`: 8 AudioCodes `.224-.231`, 22 Poly `.202-.223`, desktop `.201`.
- WAN1 fiber igc0 (522 Mbps up measured; RRD peaks 680 Mbps down / 98 Mbps up). WAN2 coax igc3 (72.211.21.217,
upload unmeasured). pfSense Plus 25.07 at `192.168.0.1`, no existing shaper.
- UniFi UOS `172.16.3.29:11443` site `va6iba3v`. USW-16-PoE mac `d8:b3:70:21:94:5f` dev_id `685f39078e65331c46ef7e90`.
- SSIDs: CSCNet (427 clients, PPSK, 2g+5g), CSC ENT (131 clients, legacy, 2g+5g), Guest (13, 2g+5g+6g).
- DFS: 53 APs on DFS, 0 genuine radar over ~21-23h.
## Pending / Incomplete Tasks
- **RE-CLONE claudetools** (coord message 2026-06-19): old clone incompatible after history rewrite. Steps below.
- **Verify** this session's Cascades docs (master plan, QoS design, voice-quality + diagnostic reports, voice
inventory, logging plan) survived the rewrite into the new repo; if missing, recover from this .old working tree.
- **Recover this session log** into the new clone (it's uncommitted here).
- WAN2 (coax) upload number — measure from a WAN2-routed host / Cox bill (sizes the failover shaper).
- 6 straggler Poly phones (10.0.20.64/65/66/67/195, 192.168.1.126) — re-key to voice PPSK.
- Floors 5/6 (MemCare) RF + phones — deferred.
- Execute the optimization plan (start Phase 2b 2.4 Low->Medium with baseline capture) — pending Howard's go.
- Hand Vertical the phone-side config list (5 GHz band lock, DSCP marking on, 802.11k/v roaming, U-APSD, firmware).
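
The straggler item above matters because the QoS design matches voice by source subnet `10.0.30.0/24`: a phone still addressed elsewhere is not classified as voice. A minimal sketch of that check, using addresses from this log; `in_voice_subnet` is a hypothetical helper and does no address validation:

```bash
#!/usr/bin/env bash
# Sketch: decide whether a source address would match a voice rule keyed on
# 10.0.30.0/24. For a /24 this reduces to comparing the first three octets.
# in_voice_subnet is a hypothetical helper; it does not validate the address.

in_voice_subnet() {
  case "$1" in
    10.0.30.*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Two phones already on VLAN 30, and two of the stragglers listed above.
for ip in 10.0.30.202 10.0.30.224 10.0.20.64 192.168.1.126; do
  if in_voice_subnet "$ip"; then
    echo "$ip: matches the voice rule"
  else
    echo "$ip: NOT matched (still outside the voice subnet)"
  fi
done
```

Until the stragglers land in `10.0.30.0/24`, a subnet-keyed rule will treat their traffic as ordinary data.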
## Re-clone steps (Windows / C:\claudetools)
```
# from C:\
mv claudetools claudetools.old # or rename in Explorer
git clone https://git.azcomputerguru.com/azcomputerguru/claudetools.git claudetools
cp claudetools.old/.claude/identity.json claudetools/.claude/
cd claudetools && git submodule update --init --recursive
# then: diff clients/cascades-tucson against claudetools.old; cp any missing files (esp. this session's docs)
# recover this session log from claudetools.old/clients/cascades-tucson/session-logs/2026-06/
# verify, then delete claudetools.old
```
See RECLONE.md in the new repo. Pre-split backup bundle: Jupiter share Backups/Gitea-Storage.
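
The "diff against claudetools.old and copy any missing files" step can be sketched as below. This is an illustration under stated assumptions, not the procedure in RECLONE.md: `list_missing` is a hypothetical helper, and the demo runs on throwaway directories rather than the real clones.

```bash
#!/usr/bin/env bash
# Sketch: list files present in the old working tree but absent from the new
# clone, so they can be copied across and then verified byte-for-byte.
# list_missing is a hypothetical helper; both roots are supplied by the caller.

list_missing() {
  local old_root=$1 new_root=$2
  # Walk the old tree (skipping .git) and print relative paths missing from the new one.
  ( cd "$old_root" && find . -type f ! -path './.git/*' ) | sort |
  while IFS= read -r rel; do
    [ -e "$new_root/$rel" ] || printf '%s\n' "${rel#./}"
  done
}

# Self-contained demo on throwaway directories (not the real clones).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/old/docs" "$demo/new/docs"
echo plan   > "$demo/old/docs/plan.md"
echo design > "$demo/old/docs/design.md"
echo plan   > "$demo/new/docs/plan.md"

list_missing "$demo/old" "$demo/new"     # prints: docs/design.md

# Recover the missing file, then confirm the copy is byte-identical.
cp "$demo/old/docs/design.md" "$demo/new/docs/design.md"
cmp -s "$demo/old/docs/design.md" "$demo/new/docs/design.md" && echo "verified identical"
```

Against the real trees, the two arguments would be the old and new `clients/cascades-tucson` directories. A checksum tool such as `md5sum` can replace `cmp` when a recorded hash is wanted.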
## Reference Information
- Last pushed commit (old history): `2b2d094` (2026-06-18 19:16). Restructure force-push: ~2026-06-19 02:41 UTC.
- Master plan: `clients/cascades-tucson/docs/network/network-optimization-master-plan.md`
- QoS design: `clients/cascades-tucson/docs/network/phase1-voice-qos-design.md`
- Voice-quality diagnostic: `clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md`
- Existing RF audit + 2.4 runbook: `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`