cascades: recover 4 docs dropped by the history-rewrite/repo-split
The 2026-06-18 repo restructure (history rewrite + project->submodule split) dropped these 4 Cascades files from the new clone. Copied byte-identical from the pre-cutover claudetools.old clone (md5-verified):

- docs/network/network-optimization-master-plan.md
- docs/network/phase1-voice-qos-design.md
- reports/2026-06-18-voice-quality-diagnostic.md
- session-logs/2026-06/2026-06-18-howard-cascades-rf-voice-optimization-plan.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,197 @@
# Cascades — Network Optimization Master Plan (all devices, holistic)

- **Created:** 2026-06-18 (Howard-Home / claude-main)
- **Status:** PLAN — for execution tonight (floors 1–4) per Howard. Floors 5 & 6 (MemCare) EXCLUDED this round.
- **Goal:** Fix the *system*, not one device at a time. Improve quality for **every** client (~587), not just
  the 31 voice devices, by sequencing AP + WLAN + QoS + firewall changes so we don't trade one problem for another.
- **Builds on:** `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`
  (RF mechanics + gated apply commands), `reports/2026-06-18-voice-quality-diagnostic.md`, and the live
  2026-06-18 fleet sample. All RF changes use the gated `unifi-wifi` scripts (per-zone, dry-run, rollback JSON).

---

## 1. Current state (what's actually true right now)

| Layer | State | Verdict |
|---|---|---|
| **2.4 GHz** | **OVER-THINNED.** Overnight 6/17: 24 radios disabled + 42 set Low (~6 dBm). Interference dropped (cu_interf 64→32–48%) BUT **retry rose 17→23.4%, satisfaction fell 39→30** (time-of-day-controlled). Edge clients now reach farther/weaker APs. Mesh + Floors 5/6 untouched (full 23 dBm). | **Regressed — must correct power floor** |
| **5 GHz** | 80 MHz width on ~76/77 (too wide for the density). 55/77 on DFS (empirically clean — 0 radar). Channels biased to busy upper (149/157). **AP 103 saturated: ch149, 75% airtime, ~25,900 retries, 12 clients** (and Lauren's phone now locked there). Dining/Rec Room high retry (810/1083). | **Constrain width + spread channels + relieve hotspots** |
| **6 GHz** | 75 radios live, **~1 client.** Root cause: **CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`). Cleanest untapped capacity. | **Open it — the relief valve** |
| **QoS** | **NONE.** Voice now isolated on VLAN 30 but not prioritized — voice packets compete with data under load → jitter/breaks. | **Add — guaranteed win, now possible** |
| **pfSense/WAN/DHCP/DNS** | Healthy; ruled out as a WiFi factor (2026-06-16). Dual-WAN stable, DHCP 53% pool, unbound up. | **Fine — add voice QoS shaping only** |
| **Switching / physical** | ~25 ports linked 100 M but gig-capable (caps some AP uplinks); 3 offline switches; AP 108 cable pending; p38 4% tx-drop. | **Physical work — not tonight, but tracked** |

---

## 2. Root-cause model (why "some devices" are bad)

Three compounding RF causes, plus a missing QoS layer:
1. **2.4 GHz contention** — extreme neighbor density (ch6 ~33k BSSIDs). Any client that lands/sticks on 2.4 GHz
   suffers. Made *worse* by the over-thinning (weaker signal → more retransmits).
2. **5 GHz over-width + hotspots** — 80 MHz halves the usable channel count → co-channel overlap → retries;
   a few APs (103) are simply overloaded.
3. **6 GHz unused** — the clean band that should absorb modern clients is dark, so everything piles onto 5 GHz.
4. **No voice prioritization** — even with perfect RF, voice breaks under data bursts without QoS.

**The trap we must avoid (the "whack-a-mole"):** narrowing 5 GHz to 40 MHz *without* first opening 6 GHz pushes
more clients onto fewer 5 GHz channels → congestion moves, not improves. And dropping 2.4 power further (it's
already too low) starves edge clients. **Sequence matters.**

---

## 3. The holistic sequence (open relief valves BEFORE constraining)

> Principle: **(A) add capacity/priority that can't hurt → (B) fix the regression → (C) then constrain/optimize
> → (D) fine-tune → validate at every gate.** Each step is reversible; gate on live metrics before the next.

### PHASE 0 — Pre-flight + baseline (always)
- VPN up; `live-stats.sh cascades | head -3` (expect 77 APs).
- Baseline (compare after, same time-of-day): `live-stats.sh cascades > .claude/tmp/opt-pre.txt`;
  `radio-usage.sh cascades ng 77 > .claude/tmp/usage-pre.txt`.
- Pick a watch AP per floor (`watch-ap.sh <ip>`).

### PHASE 1 — QoS for voice (orthogonal, lowest risk — but INSURANCE, not the everyday fix)
Voice VLAN 30 is isolated → mark + prioritize it end-to-end so calls beat data under load.
> **Reframe (measured 2026-06-18):** WAN1 fiber upload is **~522 Mbps** vs ~98 Mbps peak usage — huge
> headroom, so the WAN is **not** the day-to-day voice bottleneck (that's RF, Phases 2–4). QoS still earns its
> place as insurance for **WAN2 (coax) failover** and **rare WAN1 saturation** (you hit 680 Mbps down). Build
> it (cheap, correct), but don't expect it to fix the complaints — the RF work does. Full design:
> `docs/network/phase1-voice-qos-design.md`. **Phones confirmed marking DSCP EF** → rely on DSCP; subnet match is the safety net.
- **UniFi (WLAN/switch):** ensure WMM/QoS on; the AudioCodes/Poly tag voice DSCP — trust/honor it. On the
  USW, voice VLAN traffic should hit the high-priority queue.
- **pfSense:** add a traffic-shaper/limiter or floating QoS rule that puts `VOICE net (10.0.30.0/24)` DSCP EF
  (46) / RTP UDP into a priority queue on the WAN(s). Low risk — additive, voice-only.
- **Validate:** place test calls during a data-heavy moment; confirm no breakup. (No RF change here.)
- *Skill gap:* the `unifi-wifi` skill has no QoS verb — this is a pfSense + UniFi config task; consider a small
  `voice-qos` helper later.

### PHASE 2 — Open the relief valves (capacity + correct the regression)
**2a. Enable 6 GHz on CSCNet + steering** (creates the offload path BEFORE we narrow 5 GHz):
```
apply-wlan.sh cascades bands all --wlan CSCNet --apply      # -> [2g,5g,6g]
apply-wlan.sh cascades bsstm on --wlan CSCNet --apply       # 802.11v BSS-transition (assists up-band + roam)
```
Band-steering (`no2ghz_oui`) already ON. 6E/7 clients gravitate to clean 6 GHz, offloading 5 GHz. Validate:
client mix shifts toward 6g; no SSID-visibility loss for legacy (2.4/5 stay on).

**2b. Correct the 2.4 over-thinning — Low → MEDIUM on kept radios** (restores edge signal; keeps cells smaller
than full power). Per floor, dry-run then apply; regenerate the kept-radio list live:
```
for z in "Floor 1" "Floor 2" "Floor 3" "Floor 4"; do \
  apply-radio.sh cascades ng power medium --zone "$z" --apply; done   # ~12–15 dBm
```
Do NOT expand disables. If a specific area shows a dead zone/complaint, re-enable that one radio
(`ng enable --ap "<name>"`). **Gate:** re-measure retry%/satisfaction same time-of-day vs `opt-pre.txt` —
expect retry back down from ~23% and satisfaction recovering.

### PHASE 3 — Constrain + optimize 5 GHz (now that 6 GHz absorbs load)
**3a. Width 80 → 40 MHz** (doubles non-overlapping channels → spatial reuse):
```
for z in "Floor 3" "Floor 1" "Floor 2" "Floor 4"; do \
  apply-radio.sh cascades na width 40 --zone "$z" --apply; done   # rollback: na width 80
```
**3b. Channel plan — NON-DFS ONLY (decided 2026-06-18 after rigorous DFS verification).**
Use **UNII-1 (36–48) + UNII-3 (149–165) only**; do NOT use DFS channels (52–144) on this voice-critical
network. A precise radar-detection sweep (real `radar found`/`NOL` signatures, CAC/control housekeeping
excluded) found **ZERO genuine hits across all 53 DFS APs** — BUT the window was only ~21–23h (APs rebooted
~23h ago, the 6/17 outage). Near Davis-Monthan AFB + TUS (~10 mi), military radar is sporadic and a single
hit forces a 30-min channel vacate = **dropped calls** — unacceptable for voice. **Resilience > diversity.**
The lost 5 GHz channel count is covered by **6 GHz (Phase 2a) absorbing capacity** — this is WHY 6 GHz comes first.
```
SURVEY=.claude/tmp/cascades-survey.json; SURVEY_JSON=$SURVEY survey-collect.sh cascades
SURVEY_JSON=$SURVEY channel-plan.sh cascades na    # dry-run; CONSTRAIN to non-DFS (36-48,149-165); review; apply per zone
```
**Periodic DFS monitoring:** the ~1-day window isn't conclusive, so add a recurring precise `dfs-check.sh`
(fold into the network-logging plan). Staying on non-DFS means a future hit can't affect us; the monitor just
confirms the choice stays right.
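
To make the channel-count cost of the non-DFS decision concrete, here is a small sketch that counts non-overlapping 5 GHz channels per width. It assumes the usual channelization (four contiguous 20 MHz channels in UNII-1, five in UNII-3, no bonding across the gap between the two blocks). `nondfs_channels` is a hypothetical helper for illustration, not part of the `unifi-wifi` scripts.

```shell
#!/bin/sh
# Count non-overlapping non-DFS 5 GHz channels at a given width.
# Assumption: UNII-1 = 4 contiguous 20 MHz channels (36-48),
#             UNII-3 = 5 contiguous 20 MHz channels (149-165).
nondfs_channels() {
  width_mhz=$1
  per_bond=$(( width_mhz / 20 ))   # 20 MHz channels consumed by one bonded channel
  unii1=4
  unii3=5
  # Integer division drops leftovers that cannot form a full bonded channel (e.g. ch165).
  echo $(( unii1 / per_bond + unii3 / per_bond ))
}

for w in 20 40 80; do
  echo "${w} MHz: $(nondfs_channels "$w") non-DFS channels"
done
```

Under these assumptions 40 MHz yields four non-DFS channels against two at 80 MHz, which is the doubling Phase 3a relies on; channel 165 remains usable only as a 20 MHz channel.
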
**3c. Relieve AP 103 specifically** (it now carries Lauren + 11 others on a 75%-busy ch149): move it off 149 to
a clean channel from the plan, 40 MHz. Verify Lauren `.202` retry drops after.
**Gate:** 5 GHz retry down on the busy APs; AP 103 cu_total well under 50%; no client stranded.

### PHASE 4 — Fine-tune (after 1–3 settle)
- **2.4 channel plan 1/6/11** (graph-color; co-channel pairs 92→35) + **pin the 4 off-plan APs** (128/108/108U7/salon)
  to 1/6/11.
- **2.4 min-RSSI ON** for the 6 APs where it's OFF (615/608/505/517/622/salon) — *note 505/517/615/608/622 are
  Floors 5/6 → DEFER with the rest of 5/6*; do `salon` only this round.
- **Roaming for voice continuity:** confirm 802.11k/v on CSCNet (r optional — test; some phones dislike 802.11r).
  Keeps calls alive when staff walk between APs.
- **min-RSSI tuning:** only tighten where sticky-client far-AP behavior is proven; too aggressive blocks association.

### PHASE 5 — Physical (separate visit, not tonight — but it caps results)
- Re-terminate/replace the ~25 cables on ports stuck at 100 M (limits those APs' uplink throughput).
- Chase the 3 offline switches (2nd Floor #2, 4th Floor #2, USW Pro Max 16); finish AP 108 cable run.
- p38 (1st Floor USW) 4% tx-drop after the above.

---

## 4. Interdependency map (read before changing anything)
- **6 GHz BEFORE 5 GHz 40 MHz** — else 5 GHz congestion just relocates. (Phase 2a before 3a.)
- **2.4 power MEDIUM not LOW** — Low already over-thinned; going lower starves edge clients. (Phase 2b.)
- **AP-lock needs AP capacity** — Lauren locked to 103 ⇒ 103 must be relieved (Phase 3c) or she trades mesh for congestion.
- **QoS is independent** — do it first; it can't hurt RF and guarantees a voice win even before RF settles. (Phase 1.)
- **Disables + power-down compound** — never do both aggressively in the same area; we already saw the satisfaction hit.
- **min-RSSI + power interact** — raising min-RSSI while lowering power can orphan clients; tune one lever at a time.
- **Mesh-protected APs** (`2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108`) — never disable; power changes only with watch.

## 5. Data-driven decision framework — improve quality for ALL devices (measure → decide → adjust)

**Principle (Howard 2026-06-18): every choice is made FROM measured network data, not assumptions.** Each
change is a hypothesis; we gate it on fleet-wide metrics before keeping it or moving on. The goal is *all*
devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones.

### 5.1 How each change affects the OTHER (non-voice) devices
Almost every change targets the **shared RF environment**, so it helps everyone — voice is just the most
sensitive canary:
| Change | Effect on non-voice devices |
|---|---|
| QoS (voice VLAN priority) | **Negligible** — voice is ~3 Mbps of 522; normally zero effect; ACK queue can make others snappier under load |
| Enable 6 GHz on CSCNet | **Positive** — 6E devices move to clean 6 GHz → faster for them + clears 5 GHz for everyone left |
| 2.4 Low→Medium power | **Positive for ALL 2.4 devices** — undoes the over-thinning regression (IoT/printers/2.4 DirecTV get signal back) |
| 5 GHz 80→40 MHz | **Net positive (reliability), small peak-speed cost** — density win; lone heavy transfer sees lower peak |
| AP 103 relief | **Positive for all 12 clients on 103**, not just Lauren |
| 2.4 1/6/11 channel plan | **Positive for all 2.4 devices** (less co-channel) |
| Phone-side (Vertical) | Phones only — **no effect on others** |

### 5.2 Trade-offs to WATCH in the data (don't help voice, hurt others)
1. **Non-DFS 5 GHz + 5 GHz-only devices** — the DirecTV fleet + older laptops **can't use 6 GHz**, so they
   stay on the fewer non-DFS channels. 6 GHz offloading the newer devices is what keeps this OK; **watch
   non-DFS 5 GHz cu_total** — if it climbs, that's the signal to rebalance.
2. **min-RSSI** affects every client on the AP — too aggressive orphans weak IoT/resident devices. Tune gently.
3. **40 MHz** trades single-user peak for fleet reliability — right in density, but it is a trade.

### 5.3 Fleet-wide metrics we pull at every gate (the data we decide on)
Same time-of-day comparison (load varies hourly). Capture before each change and ~15 min after:
- `live-stats.sh cascades` → per-band **avg retry%, cu_total, cu_interf, satisfaction (min/median), client counts**
- `radio-usage.sh cascades <band>` → per-AP outliers (saturated/high-retry APs)
- `/stat/sta` band split (2.4 / 5 / 6 distribution) + count of clients retry>15% (by band) + satisfaction<70 count
- Per-AP: any AP whose client count drops toward ~0 in a covered area (= coverage hole)

### 5.4 The GATE decision rule (per change, per zone)
- **KEEP + proceed** only if: the target metric improved **AND** fleet-wide **satisfaction did not fall**,
  **retry% did not rise**, band split moved the intended way, and **no AP lost its clients** (no hole).
- **HOLD** (stop, don't expand) if: target improved but a secondary metric regressed → investigate before more.
- **ROLLBACK that step** if: fleet-wide retry up / satisfaction down / a coverage hole / user complaint.
- Do **one lever per zone at a time** so cause/effect is attributable (the over-thinning happened because
  power-down + disables were stacked).
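
The gate rule can be written down as a small decision function so that every zone is judged the same way. This is only a sketch of one reading of the rule: it treats any fleet-wide retry rise, satisfaction drop, or coverage hole as ROLLBACK, a met target with the band split moving as intended as KEEP, and everything else as HOLD. `gate_decision` and its argument order are hypothetical, and user complaints remain a manual override.

```shell
#!/bin/sh
# gate_decision TARGET_IMPROVED RETRY_PRE RETRY_POST SAT_PRE SAT_POST HOLE BAND_OK
#   TARGET_IMPROVED, HOLE, BAND_OK are 0/1 flags; retry and satisfaction are
#   fleet-wide numbers taken at the same time of day before and after the change.
gate_decision() {
  awk -v t="$1" -v rpre="$2" -v rpost="$3" -v spre="$4" -v spost="$5" \
      -v hole="$6" -v band="$7" 'BEGIN {
    if (hole == 1 || rpost > rpre || spost < spre) print "ROLLBACK"  # fleet-wide regression
    else if (t == 1 && band == 1)                  print "KEEP"      # target met, nothing regressed
    else                                           print "HOLD"      # stop and investigate
  }'
}

# The 6/17 over-thinning numbers (retry 17 -> 23.4, satisfaction 39 -> 30):
gate_decision 1 17 23.4 39 30 0 1
```

Fed the over-thinning figures from section 1, this reading of the rule returns ROLLBACK even though the targeted interference metric improved, which matches the plan's verdict on that change.
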

### 5.5 Rollback
Every `apply-radio`/`apply-wlan` writes a rollback JSON to `.claude/tmp/`; `device-control poe-cycle` for a hung
AP (NOT force-provision). Power-up / width-80 / re-enable / channel-revert are all safe reversals.

## 6. Out of scope tonight (explicit)
- **Floors 5 & 6 (MemCare)** — all RF + the MemCare voice phones (`.217/.218/.219/.220`) DEFERRED per Howard.
- Physical cabling / offline switches (Phase 5 — separate visit).
- The 6 straggler phones — Howard re-keying separately; they'll benefit from the RF work regardless.

## 7. Open decisions for Howard
1. ~~**5 GHz channel plan:** clean-DFS vs non-DFS-only~~ — **RESOLVED 2026-06-18: NON-DFS ONLY** (UNII-1 36–48 + UNII-3 149–165). DFS sweep was clean but only a ~1-day window near Davis-Monthan/TUS; a radar vacate = dropped calls, so resilience wins. 6 GHz covers the capacity gap. (See Phase 3b.)
2. **QoS depth:** UniFi WMM + DSCP-honor only, or also a pfSense WAN priority queue/limiter for RTP? Recommendation: both (additive).
3. **802.11r** on CSCNet: enable for seamless voice roaming, or k/v only (safer for mixed phones)? Recommendation: k/v now, test r on one phone first.
4. Tonight's stopping point: Phases 1–2 alone are a legitimate, lower-risk night; 3–4 can be a second night.
5. ~~**Dedicated voice SSID?**~~ — **RESOLVED 2026-06-18: NO — voice stays on the shared CSCNet PPSK.** UniFi
   3-SSID cap (sound RF hygiene — each SSID = beacon airtime overhead at 77 APs). The only retirement candidate,
   CSC ENT, still has **131 active clients** (staff PCs, printers, DirecTV fleet) → not retireable soon. And it's
   not needed: **QoS is VLAN/DSCP-based (SSID-independent)**, band preference is best done **phone-side** (Vertical),
   and roaming/power-save are phone+AP settings — all work on the shared SSID. A dedicated voice SSID would only
   add voice-specific WiFi *policy* (per-SSID DTIM/min-RSSI/airtime), a marginal gain not worth a slot. Revisit only
   if/when CSC ENT's 131 clients migrate off it.
111
clients/cascades-tucson/docs/network/phase1-voice-qos-design.md
Normal file
@@ -0,0 +1,111 @@
# Cascades — Phase 1: Voice QoS Design (VLAN 30)

- **Created:** 2026-06-18 (Howard-Home / claude-main). Part of `network-optimization-master-plan.md` Phase 1.
- **Status:** DESIGN — for review, then build (Howard drives pfSense GUI). Nothing applied.
- **Risk:** LOW — additive, voice-only prioritization; rollback = disable the shaper. Main caution: size the
  shaper bandwidth correctly (a wrong value can throttle throughput) → test before/after.

## Objective
Guarantee voice quality under load by prioritizing VLAN 30 traffic end-to-end. **The phones register to a
CLOUD PBX (Vertical) over the internet**, so the bottleneck that breaks calls is **WAN upload saturation**
(someone uploading / cloud backup / OneDrive sync fills the uplink → voice RTP queues → jitter, dropped
audio). QoS keeps voice ahead of bulk data on the WAN.

## The big advantage of the VLAN move
**All voice is now one subnet: `10.0.30.0/24`.** So QoS can match *all* voice by **source subnet** — no
need to guess SIP/RTP port ranges per PBX. This is the cleanest, most robust match criterion and it only
became possible because we isolated voice onto VLAN 30.

## Current state (verified 2026-06-18)
- **No traffic shaper / limiter configured** on pfSense (clean build).
- **Dual-WAN:** WAN1 `igc0` (Cox Fiber, primary, 1G link), WAN2 `igc3` (Cox Coax, 2.5G link); `WAN_Group`
  failover (`downlosslatency`). Shaping must be applied on **both** WAN interfaces.
- pfSense Plus 25.07 (ALTQ shaper + dummynet limiters available).
- **Phones mark DSCP EF — CONFIRMED (Howard 2026-06-18).** So we can rely on DSCP for WMM (Layer 2) + switch
  QoS (Layer 3); the `10.0.30.0/24` subnet match (Layer 1) is the safety net. **No pfSense set-DSCP rule needed.**

## Measured WAN bandwidth (2026-06-18) — REFRAMES QoS priority
- **WAN1 (fiber, primary): upload ~522 Mbps** (Cloudflare single-stream from pfSense). RRD 3-day peaks:
  **680 Mbps down / 98 Mbps up** (actual usage).
- **WAN2 (coax): not measurable remotely** (source-route bind to `72.211.21.217` failed; needs a WAN2-routed
  host or the Cox bill). Coax is typically asymmetric ~20–50 Mbps up — **size its shaper conservatively**.
- **Implication:** 30 calls ≈ ~3 Mbps. WAN1 upload (~522 Mbps) vs peak usage (98 Mbps) = **huge headroom →
  the WAN is NOT the everyday voice bottleneck.** Everyday dropped-calls = **RF** (Phases 2–4 of the master
  plan). **QoS here is INSURANCE, not the day-to-day fix** — it earns its keep in two cases: (1) **WAN2
  failover** (small coax upload + a big upload → real congestion), (2) **rare WAN1 saturation** (backup /
  large upload; you do hit 680 Mbps down). Build it (cheap, correct), but set expectations: RF is the substance.
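
The headroom argument is simple arithmetic, sketched here. The per-call figure of about 100 kbps is an assumption (G.711 with 20 ms packetization comes to roughly 87 kbps per direction on Ethernet, rounded up here); the 522 and 98 Mbps values are the measurements listed above. The function names are illustrative only.

```shell
#!/bin/sh
# Aggregate voice load in kbps for N concurrent calls.
# Assumption: ~100 kbps per call per direction unless overridden.
voice_kbps() {
  calls=$1
  per_call_kbps=${2:-100}
  echo $(( calls * per_call_kbps ))
}

# Remaining upload headroom in Mbps: measured capacity minus observed peak usage.
headroom_mbps() {
  echo $(( $1 - $2 ))
}

echo "30 calls: $(voice_kbps 30) kbps"
echo "WAN1 upload headroom: $(headroom_mbps 522 98) Mbps"
```

With these inputs the voice load is a few Mbps against several hundred Mbps of spare WAN1 upload, which is why the shaper is framed as insurance for WAN2 failover and rare saturation.
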

## Three layers (priority order; Layer 1 = insurance, see reframe above)

### Layer 1 — pfSense WAN shaper (PRIMARY — this is where calls break)
**Type: HFSC** (hierarchical, lets us guarantee voice a floor while letting it borrow idle bandwidth).
Per WAN interface, three queues:
| Queue | Role | HFSC settings (starting point) |
|---|---|---|
| `qVoice` | voice (VLAN 30 / DSCP EF) | **priority 7**, realtime ~30% of WAN-up, link-share 30%, NOT default |
| `qACK` | TCP ACKs (keeps downloads snappy) | priority 6, ~10% |
| `qDefault` | everything else | **default**, link-share ~60% |

**Match rule (floating, WAN, direction out):** source `10.0.30.0/24` → `qVoice`. (Optionally also match
DSCP EF if phones mark it — see Layer 4.) One floating rule per WAN, or interface = WAN_Group.

**Download side:** RTP from the PBX *to* the phones is shaped on the **LAN-side** queues. The wizard builds
both directions; if hand-building, mirror a `qVoice` on the internal interfaces too. Upload is the more
critical direction for cloud-PBX voice, but do both.

**Build path (GUI — Howard drives):**
- Easiest: **Firewall → Traffic Shaper → Wizard → "Multiple Lan/Wan"** — set #WAN=2, #LAN as needed,
  enter each WAN's bandwidth (below), on the VoIP page choose **"prioritize by address" = `10.0.30.0/24`**
  with a guaranteed %; the wizard generates HFSC queues + the float rules. Then tune.
- Or manual: Firewall → Traffic Shaper → By Interface → add HFSC on WAN1 + WAN2, create the 3 queues,
  then Firewall → Rules → Floating → match `10.0.30.0/24` out → Ackqueue/Queue = qACK/qVoice.

> **Sizing inputs:** WAN1 upload **~522 Mbps (measured 2026-06-18)** → shape `qVoice`'s parent to ~480–500
> Mbps. **WAN2 (coax) upload still UNKNOWN** (remote source-route test failed) — get from the Cox bill or a
> speedtest from a host routed via WAN2; size conservatively (assume ~35 Mbps up until measured). Shaping to
> ~90–95% of actual upload keeps the queue in pfSense (where we control priority), not at the ISP. WAN2 is the
> one that actually constrains voice (on failover), so its number matters most.
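
As a worked example of the sizing rule, the sketch below shapes each WAN to a percentage of its upload and splits the result using the starting-point shares from the queue table (30/10/60). The 35 Mbps WAN2 figure is the placeholder assumed above, not a measurement, and the helper names are hypothetical. pfSense itself is configured through the GUI, so this only produces numbers to enter there.

```shell
#!/bin/sh
# Shaped parent bandwidth in whole Mbps: measured upload times a percentage.
shape_mbps() {
  echo $(( $1 * $2 / 100 ))
}

# Split a parent bandwidth (Mbps) into the three starting-point queue shares.
queue_split() {
  parent=$1
  echo "qVoice=$(( parent * 30 / 100 )) qACK=$(( parent * 10 / 100 )) qDefault=$(( parent * 60 / 100 ))"
}

wan1=$(shape_mbps 522 92)   # measured WAN1 upload
wan2=$(shape_mbps 35 92)    # ASSUMED WAN2 upload until it is measured
echo "WAN1 parent ${wan1} Mbps: $(queue_split "$wan1")"
echo "WAN2 parent ${wan2} Mbps: $(queue_split "$wan2")"
```

The WAN2 line shows why that link matters most: at the assumed upload, the voice share is only a few Mbps above the estimated load of 30 concurrent calls, so the real number should replace the placeholder before the shaper is built.
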

### Layer 2 — UniFi WMM (the WiFi phones — Poly)
Over the air, **WMM** maps DSCP → WiFi access categories; voice (DSCP EF/46) → **WMM Voice AC** (gets TXOP
priority over data). WMM is ON by default on UniFi — **verify it's enabled on CSCNet** and that the U7 APs
honor DSCP→WMM. This is what protects the 22 Poly phones over the air during WiFi congestion. (Ties into the
RF work — a clean 5/6 GHz + WMM = good wireless voice.)

### Layer 3 — UniFi switch QoS (the wired AudioCodes)
UniFi switches honor 802.1p/DSCP and queue tagged voice to a high-priority egress queue — mostly automatic
once the phones mark DSCP. LAN links are gig and rarely congested, so this is the least critical layer, but
confirm the USW isn't stripping DSCP and that voice VLAN 30 frames get the priority queue.

### Layer 4 — DSCP marking (make the above reliable)
- **Verify the phones mark voice:** AudioCodes + Poly typically tag RTP **EF (46)** and signaling **CS3 (24)**
  by default, often set via the PBX/provisioning. Confirm with Vertical (Richard) or capture a packet.
- **If they DON'T mark (or inconsistently):** add a pfSense floating rule that **SETS DSCP EF** on
  `10.0.30.0/24` traffic (Advanced → "Match/Set DSCP"). Then Layer 1/2/3 can all match on EF too.
- **Match-by-subnet (Layer 1) works regardless of DSCP** — it's the safety net. DSCP makes WMM (Layer 2)
  and switch QoS (Layer 3) automatic.
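
When checking marking in a packet capture, note that DSCP occupies the upper six bits of the IP TOS byte, so the byte value is the DSCP value shifted left by two. The sketch below derives that byte for EF and CS3 and prints a capture filter in pcap syntax. Running the filter through a capture tool on a suitable interface is left to the operator, and the helper names are hypothetical.

```shell
#!/bin/sh
# Convert a DSCP value (0-63) to the IP TOS byte it appears as on the wire.
dscp_to_tos() {
  echo $(( $1 << 2 ))
}

# Print a pcap filter matching IPv4 packets from a source net with a given DSCP.
# 0xfc masks off the two low (ECN) bits of the TOS byte before comparing.
dscp_filter() {
  printf 'src net %s and ip[1] & 0xfc == %d\n' "$2" "$(dscp_to_tos "$1")"
}

echo "EF  (46) -> TOS $(dscp_to_tos 46)"
echo "CS3 (24) -> TOS $(dscp_to_tos 24)"
dscp_filter 46 10.0.30.0/24
```

If packets from the voice subnet match the EF filter during a call, the phones are marking RTP as expected and the set-DSCP rule is unnecessary.
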

## Implementation order
1. Get the Cox WAN upload numbers (blocker for Layer 1 sizing). WAN1 is measured (~522 Mbps); WAN2 is still unknown.
2. Confirm phones mark DSCP EF (Vertical) — decides whether we add the pfSense set-DSCP rule. Confirmed 2026-06-18 (Howard): the phones mark EF, so the rule is not needed.
3. Build Layer 1 (pfSense HFSC + float rule) — dry-run mindset: set it, then validate.
4. Verify Layer 2 (WMM on CSCNet) + Layer 3 (switch honoring DSCP).
5. Validate (below). Tune `qVoice` % if needed.

## Validation (prove it works)
- **Baseline:** from a LAN host, saturate the WAN upload (big upload / `iperf3 -u` / speedtest) WHILE on a
  call from a voice phone — note the breakup *without* QoS.
- **After:** repeat the same saturation; call stays clean. Check Firewall → Traffic Shaper → Queues: `qVoice`
  carrying voice with ~0 drops while `qDefault` absorbs the saturation + drops.
- Confirm both WANs (test on primary; fail to WAN2 and re-test).

## Rollback
Firewall → Traffic Shaper → disable/remove the shaper; delete the floating rule. Zero residual effect
(QoS only orders packets under congestion; removing it reverts to FIFO). The set-DSCP rule (if added) can stay
or go independently.

## Notes / interplay with the rest of the plan
- QoS is **independent of the RF work** — it helps wired + WiFi voice immediately and can be built tonight
  regardless of the 2.4/5/6 GHz changes.
- It does NOT fix RF problems (a phone on a 50%-retry 2.4 GHz radio still suffers) — QoS handles
  *congestion/contention for bandwidth*, RF tuning handles *the air*. Both are needed; they're complementary.
@@ -0,0 +1,71 @@
# Cascades — Voice Quality Diagnostic (post VLAN 30 cutover)

- **Date:** 2026-06-18 (Howard-Home / claude-main)
- **Trigger:** All phones migrated to isolated VOICE VLAN 30 to improve call quality; users report
  **dropped calls, breaks in voice, reception issues.** This is the RF/quality assessment.
- **Data source:** live UniFi controller `/stat/sta` + USW-16-PoE `port_table`, 2026-06-18.

## Cutover status — COMPLETE
31 devices on VOICE (`10.0.30.0/24`): 8 AudioCodes (`.224-.231`), 22 Poly (`.202-.223`), Vertical
desktop (`.201`). AudioCodes required a full power-off/on (external-powered; not PoE -> UniFi
power-cycle is a no-op) before they re-DHCP'd.

## Headline finding
**The VLAN move gives separation + sets up QoS, but it does NOT by itself fix call quality.** The
dropped calls / voice breaks are an **RF problem on the WiFi (Poly) phones.** The wired AudioCodes
are clean. Quality fixes are RF + QoS, below.

## Wired AudioCodes (8) — HEALTHY
All USW-16-PoE ports 1-8: up, 100M full-duplex, **rx_err=0 tx_err=0 rx_drop=0.** No network-layer
problem. 100M is fine for voice. With VLAN isolation + QoS these desk phones should be solid.

## WiFi Poly phones — RF problems (retry% = the call-quality killer)
Thresholds: retry >10% = audible breaks; RSSI <-67 marginal, <-75 bad; voice wants 5 GHz.

**SEVERE — fix first:**

| Phone (IP) | User/Loc | AP | Band | RSSI | Retry | Issue |
|---|---|---|---|---|---|---|
| 10.0.30.202 | Lauren / Accounting | CC Bridge | **2.4** | -56 | **50%** | stuck on 2.4 GHz, half packets retransmit |
| 10.0.30.218 | Shelby / MemCare Dir | MemCare Nurse Stn | **2.4** | -56 | **53%** | stuck on 2.4 GHz |
| 10.0.30.220 | Christine / rm 515 | 517 | 5 | **-82** | 7% | coverage gap (signal near-unusable) |
| 10.0.30.219 | Karen Rossini / rm 515 | 517 | 5 | -75 | 16% | weak + high retry |

**MODERATE:**

| Phone (IP) | User/Loc | AP | Band | RSSI | Retry | Issue |
|---|---|---|---|---|---|---|
| 10.0.30.212 | rm 204 | 204 | 5 | -74 | 13% | weak + retry |
| 10.0.30.213 | Medtech rm 206 | 206 U7 Pro | 5 | -66 | 13% | 5 GHz congestion |
| 10.0.30.214 / .215 | rm 210 | 210 | 5 | -72 | 7-9% | weak |
| 10.0.30.206 | Dining Room | Dining Room | 5 | -70 | 9% | borderline |

**Healthy (reference):** .207/.209/.221/.222/.223 etc. — 5 GHz, RSSI -41 to -60, retry <3%.
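
The thresholds above can be applied mechanically to a `/stat/sta` row. The following is a minimal sketch that flags a phone from its band, RSSI, and retry%. `classify_phone` is a hypothetical helper, it expects integer RSSI and retry values, and it reports threshold flags only (the SEVERE/MODERATE grouping in the tables also reflects judgment).

```shell
#!/bin/sh
# classify_phone BAND RSSI_DBM RETRY_PCT
#   BAND is "2.4", "5" or "6"; RSSI and retry are integers.
# Thresholds from this report: retry >10%, RSSI < -67 marginal, < -75 bad,
# and any voice phone on 2.4 GHz is flagged.
classify_phone() {
  band=$1; rssi=$2; retry=$3; flags=""
  if [ "$band" = "2.4" ]; then flags="$flags band-2.4"; fi
  if [ "$rssi" -lt -75 ]; then
    flags="$flags rssi-bad"
  elif [ "$rssi" -lt -67 ]; then
    flags="$flags rssi-marginal"
  fi
  if [ "$retry" -gt 10 ]; then flags="$flags retry-high"; fi
  flags=${flags# }            # drop the leading separator
  echo "${flags:-ok}"
}

classify_phone 2.4 -56 50    # Lauren .202
classify_phone 5 -82 7       # Christine .220
classify_phone 5 -41 2       # a healthy reference phone
```

Because the cut-offs are strict inequalities, a phone sitting exactly at -75 dBm is reported as marginal rather than bad; adjust the comparisons if the intent is inclusive.
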

### Three root causes
1. **2.4 GHz with ~50% retry** (Lauren .202, Shelby .218) — the single worst issue; matches the
   documented Cascades 2.4 GHz saturation. **Must force these to 5 GHz.**
2. **Coverage gaps** — rooms 515 (-82/-75), 210/204 (-72/-74): too far from the serving AP; weak
   signal drops calls when RF varies or people move.
3. **5 GHz congestion** — several at 13-16% retry on 5 GHz (80 MHz width + channel overlap, per the
   2026-06-16 audit).

## Stragglers — 6 Poly phones NOT on VOICE
Five on VLAN 20 (`10.0.20.64/.65/.66/.67/.195`) + one on `192.168.1.126`. `10.0.20.66` (Dining
Room) is at **35% retry.** Missed in cutover or still on the old PPSK key -> migrate to the voice
PPSK so all phones are isolated + benefit from voice QoS.

## Recommended fixes (prioritized; NONE applied — Cascades requires explicit per-change go)
1. **QoS for the voice VLAN (NEW capability the move enables) — highest ROI, lowest risk.** Mark
   VLAN 30 voice traffic DSCP EF / priority on pfSense + UniFi so voice gets priority under load ->
   reduces jitter/breaks network-wide.
2. **Force voice phones off 2.4 GHz** — on the CSCNet voice PPSK / the APs serving .202 & .218,
   disable 2.4 GHz association for voice (or band-steer to 5/6 GHz). Fixes Lauren + Shelby (the two
   worst) immediately.
   - **DONE 2026-06-18: Lauren `.202` locked to AP 103** (off the CC Bridge wireless-mesh AP -> wired AP). **INTERDEPENDENCY:** AP 103's 5 GHz is saturated (ch149, 75% airtime, ~25,900 retries, 12 clients) -> tonight's 5 GHz plan MUST relieve AP 103 (channel off 149 / 80->40 MHz / load-balance) or she trades a mesh problem for a congestion problem.
   - Shelby `.218` is floor 5/6 (MemCare) -> **out of scope tonight** per Howard.
3. **Coverage** — rooms 515, 210, 204: check AP placement/power; consider a closer AP or raising the
   nearest AP's power; min-RSSI to push phones off far APs. (Ties into the staged coverage-thin /
   2.4 remediation runbooks.)
4. **Migrate the 6 straggler phones** to the voice PPSK (VLAN 30).
5. **5 GHz width/channel** — apply the staged audit recommendation (40 MHz width, non-DFS plan) to
   cut co-channel retry.

## Next
Discuss + pick changes. QoS (#1) + 2.4 GHz force-off (#2) are the fastest wins for the complaints.
@@ -0,0 +1,135 @@
# Cascades — voice-quality diagnostic + holistic RF/QoS optimization master plan

## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech

> NOTE: written in the OLD clone after the 2026-06-19 claudetools restructure coord message arrived.
> NOT synced from here. Recover into the fresh re-clone (see Pending tasks).

## Session Summary

Continuation of the VLAN 30 voice cutover. First, completed the AudioCodes migration: the 8 wired
AudioCodes would not pick up VLAN 30 addresses via port re-VLAN + UniFi PoE power-cycle (PoE is OFF on
those ports — they run on external power bricks, so a UniFi power-cycle is a no-op; a UI port disable/enable
didn't reset their uptime either). Root cause confirmed: they held their old main-LAN DHCP leases and never
re-DHCP'd. Howard fully powered them off/on, after which all 8 pulled VOICE leases (10.0.30.224-231). Final
state: 31 devices on VOICE (8 AudioCodes + 22 Poly + Vertical desktop).

Second, diagnosed voice quality (dropped calls / voice breaks). Wired AudioCodes: all 8 ports clean (100M
full-duplex, zero errors). The problem is RF on the WiFi Poly phones: 14 flagged, worst = Lauren/.202 (2.4
GHz, 50% retry, on the CC Bridge wireless-MESH AP) and Shelby/.218 (2.4 GHz, 53% retry, MemCare). Coverage
gaps in rooms 515/210/204 (RSSI -72 to -82). AP 103 5 GHz saturated (75% airtime, ~25,900 retries). Also
found 6 Poly phones NOT migrated (still on VLAN 20/Default) — fleet is 28 Poly, not 22; verified they are
distinct active MACs, not ghosts. Howard locked Lauren's phone to AP 103 (off the mesh AP).
|
||||
|
||||
Third, built a holistic, all-device network optimization master plan grounded in the existing 2026-06-16
|
||||
audit + 2.4 GHz runbook + the over-thinning re-check. Key current-state fact: the network is OVER-THINNED on
|
||||
2.4 GHz (overnight 6/17: 24 radios disabled + 42 at Low/6 dBm -> interference down but retry 17->23%,
|
||||
satisfaction 39->30). The plan's central principle: open relief valves (6 GHz + correct 2.4 power) BEFORE
|
||||
constraining (5 GHz 40 MHz), to avoid relocating congestion. Sequenced phases: (1) QoS, (2a) enable 6 GHz on
|
||||
CSCNet + (2b) correct 2.4 Low->Medium, (3) 5 GHz 80->40 MHz + non-DFS channel plan + relieve AP 103, (4)
|
||||
fine-tune, (5) physical, with an interdependency map and per-phase gates.
|
||||
|
||||
Fourth, verified DFS rigorously (Howard's concern re: Davis-Monthan AFB + TUS ~10 mi). The skill's dfs-check
|
||||
flagged 3 APs, but on inspection all were benign (CAC timers + DFS-control toggles on a non-DFS channel, not
|
||||
radar). A precise radar-detection-only sweep found ZERO genuine hits across all 53 DFS APs — but only over a
|
||||
~21-23h window (APs rebooted in the 6/17 outage). DECISION: go NON-DFS only (UNII-1 36-48 + UNII-3 149-165) —
|
||||
a radar vacate = dropped calls; resilience > diversity; 6 GHz covers the capacity gap.
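
The non-DFS decision narrows the 5 GHz plan considerably. A minimal sketch of what remains at 40 MHz width, assuming the standard US 20 MHz channel numbering and adjacent-pair bonding (the helper name `bonded_40mhz_pairs` is illustrative and not part of the `unifi-wifi` scripts):

```python
# Sketch only: enumerate the 40 MHz bonded pairs available in the non-DFS
# bands named above (UNII-1 36-48, UNII-3 149-165).

UNII_1 = [36, 40, 44, 48]
UNII_3 = [149, 153, 157, 161, 165]


def bonded_40mhz_pairs(channels):
    """Pair adjacent 20 MHz channels; an odd channel out stays 20 MHz-only."""
    pairs = [(channels[i], channels[i + 1]) for i in range(0, len(channels) - 1, 2)]
    leftover = channels[-1:] if len(channels) % 2 else []
    return pairs, leftover


pairs_1, left_1 = bonded_40mhz_pairs(UNII_1)
pairs_3, left_3 = bonded_40mhz_pairs(UNII_3)
non_dfs_40 = pairs_1 + pairs_3

print(non_dfs_40)
print("20 MHz only:", left_1 + left_3)
```

Under these assumptions only four 40 MHz channels are left to reuse across the APs, which is consistent with leaning on 6 GHz to cover the capacity gap.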
|
||||
|
||||
Fifth, designed Phase 1 QoS (pfSense + UniFi). Measured WAN: WAN1 fiber upload ~522 Mbps (vs ~98 Mbps peak
|
||||
usage) -> the WAN is NOT the everyday voice bottleneck, so QoS is INSURANCE (WAN2 coax failover + rare
|
||||
saturation), not the everyday fix — RF is the substance. Match voice by source subnet 10.0.30.0/24 (the VLAN
|
||||
move's payoff). Phones confirmed to support DSCP EF. Determined a dedicated voice SSID is NOT viable (UniFi
|
||||
3-SSID cap; CSC ENT still has 131 clients, not retireable) and NOT needed (QoS is VLAN/DSCP-based,
|
||||
SSID-independent; band preference is phone-side). Added an all-devices impact + data-driven decision
|
||||
framework: every change gated on fleet-wide metrics (measure -> decide -> adjust), with the trade-offs to
|
||||
watch (non-DFS + 5GHz-only DirecTV fleet; min-RSSI orphaning; 40 MHz peak).
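
The subnet match and the headroom argument can be checked with a few lines. This is a sketch only: it does not reproduce pfSense rule syntax, and the function names are hypothetical. The subnet, addresses, and throughput figures are the ones recorded in this log; the EF code point value (46) is the standard one.

```python
import ipaddress

VOICE_NET = ipaddress.ip_network("10.0.30.0/24")  # VOICE VLAN 30, from this log
DSCP_EF = 46  # Expedited Forwarding code point


def is_voice_source(ip: str) -> bool:
    """True when a source address falls inside the voice subnet."""
    return ipaddress.ip_address(ip) in VOICE_NET


def tos_byte(dscp: int) -> int:
    """DSCP occupies the upper six bits of the IP TOS / traffic-class byte."""
    return dscp << 2


# Measured figures from this session: WAN1 upload capacity vs. peak usage.
wan1_up_mbps, peak_up_mbps = 522, 98
utilization = peak_up_mbps / wan1_up_mbps

print(is_voice_source("10.0.30.224"))  # an AudioCodes lease
print(is_voice_source("10.0.20.64"))   # one of the straggler phones
print(hex(tos_byte(DSCP_EF)))
print(f"WAN1 peak upload utilization: {utilization:.0%}")
```

Peak upload sits at roughly a fifth of measured WAN1 capacity, which is the basis for treating QoS as insurance. The straggler address failing the subnet check also shows why un-migrated phones would miss a source-subnet rule.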
|
||||
|
||||
## Key Decisions
|
||||
|
||||
- **AudioCodes need a full power-cycle (off/on), not a UniFi PoE cycle** — they're externally powered (PoE off
|
||||
on the ports); a UI port bounce doesn't reset them.
|
||||
- **5 GHz: NON-DFS ONLY.** DFS sweep clean but only ~1-day window near a military base/airport; a radar vacate
|
||||
drops calls. Resilience over channel diversity; lean on 6 GHz for capacity.
|
||||
- **QoS reframed to INSURANCE, not the everyday fix** — WAN1 fiber has ~522 Mbps up vs 98 Mbps peak use; the
|
||||
everyday dropped-calls cause is RF. QoS matters on WAN2 (coax) failover + rare WAN1 saturation.
|
||||
- **No dedicated voice SSID** — 3-SSID cap is sound RF hygiene; CSC ENT (the only retirement candidate) still
|
||||
has 131 clients; and voice doesn't need a dedicated SSID (QoS is SSID-independent, band-pref is phone-side).
|
||||
- **Open relief valves before constraining** — 6 GHz + 2.4 Low->Medium BEFORE 5 GHz 40 MHz, or congestion just
|
||||
relocates.
|
||||
- **2.4 power Low->Medium, not lower** — Low already over-thinned (retry up, satisfaction down).
|
||||
- **Data-driven gates** (Howard) — base every choice on measured fleet-wide metrics; one lever per zone;
|
||||
keep/hold/rollback per the gate rule; validation measures ALL devices, not just voice.
|
||||
- **Phones are 5 GHz (not 6E)** — 6 GHz helps voice indirectly by clearing 5 GHz of resident devices.
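
The keep/hold/rollback gate can be sketched as a small comparison of before and after metrics for one lever in one zone. The tolerances and names below are hypothetical and chosen only for illustration; the actual gate rule is whatever the master plan defines.

```python
from dataclasses import dataclass


@dataclass
class ZoneMetrics:
    retry_pct: float      # fleet-wide retry rate for the zone
    satisfaction: float   # controller satisfaction score


# Hypothetical tolerances, for illustration only.
RETRY_TOLERANCE = 1.0
SATISFACTION_TOLERANCE = 2.0


def gate(before: ZoneMetrics, after: ZoneMetrics) -> str:
    """Return keep / hold / rollback for one lever applied to one zone."""
    retry_delta = after.retry_pct - before.retry_pct
    sat_delta = after.satisfaction - before.satisfaction
    worse = retry_delta > RETRY_TOLERANCE or sat_delta < -SATISFACTION_TOLERANCE
    better = retry_delta < -RETRY_TOLERANCE or sat_delta > SATISFACTION_TOLERANCE
    if worse:
        return "rollback"
    if better:
        return "keep"
    return "hold"


# The 6/17 over-thinning numbers from this log: retry 17 -> 23, satisfaction 39 -> 30.
print(gate(ZoneMetrics(17.0, 39.0), ZoneMetrics(23.0, 30.0)))
```

Checking for regressions before improvements means a change that helps one metric while hurting another is still rolled back, which matches the intent of not trading one problem for another.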
|
||||
|
||||
## Problems Encountered
|
||||
|
||||
- **AudioCodes wouldn't move to VLAN 30** — root-caused to held DHCP leases + PoE-off ports (power-cycle no-op);
|
||||
resolved by full power-off/on.
|
||||
- **UniFi controller PUT 403s** — CSRF token extraction flaky; fixed by reading `x-updated-csrf-token` (with a
|
||||
TOKEN-cookie JWT fallback).
|
||||
- **pfSense SSH rate-limiting + controller throttling** after many rapid queries — switched between controller
|
||||
API and pfSense SSH as needed; one fleet pull hung in the background.
|
||||
- **Temp-file/sync friction (RECURRED 3x)** — controller-scratch files (.sta.json, .fleet325.dev) written
|
||||
CWD-relative got swept into commits by `git add -A` and blocked rebases (stray locked curl.exe held them).
|
||||
Fixed: killed the procs, untracked, broadened .gitignore (.fleet*, .ap[0-9]*, .vq[0-9]*, .q[0-9]*). Real fix:
|
||||
write API scratch OUTSIDE the repo (used mktemp -d for the DFS sweep).
|
||||
- **Cloudflare __down / WAN2-bound speedtests returned 0.0** — only WAN1 upload (522 Mbps) measured cleanly;
|
||||
WAN2 (coax) upload still unknown (needs a WAN2-routed host or Cox bill).
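
For the temp-file friction, a minimal sketch of the "scratch outside the repo" fix, using Python's `tempfile.mkdtemp` as the analogue of `mktemp -d`. The helper names and the directory prefix are hypothetical, and `fnmatch` only approximates gitignore matching for simple basename patterns like the ones listed above.

```python
import fnmatch
import tempfile
from pathlib import Path

# Patterns added to .gitignore in this session (simple basename globs).
IGNORE_PATTERNS = [".fleet*", ".ap[0-9]*", ".vq[0-9]*", ".q[0-9]*"]


def is_ignored(name: str) -> bool:
    """Approximate gitignore matching for plain basename patterns."""
    return any(fnmatch.fnmatchcase(name, pat) for pat in IGNORE_PATTERNS)


def scratch_dir(repo_root: Path) -> Path:
    """Create a scratch directory and refuse one that lands inside the repo."""
    path = Path(tempfile.mkdtemp(prefix="unifi-scratch-")).resolve()
    if repo_root.resolve() in path.parents:
        raise RuntimeError(f"scratch dir {path} is inside the repo")
    return path


# Stand-in repo root so the sketch runs anywhere; substitute the real clone path.
repo = Path(tempfile.mkdtemp(prefix="fake-repo-")).resolve()
scratch = scratch_dir(repo)
(scratch / ".sta.json").write_text("{}")  # placeholder payload

print(is_ignored(".fleet325.dev"))
print(is_ignored(".sta.json"))
```

Note that `.sta.json` is not matched by the four patterns listed in this log, so ignore rules alone may not catch every scratch file; writing outside the working tree avoids the problem regardless of file name.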
|
||||
|
||||
## Configuration Changes (all in clients/cascades-tucson/, committed up through 2b2d094 BEFORE the restructure)
|
||||
|
||||
- **Created** `docs/network/network-optimization-master-plan.md` — holistic all-device plan (sequencing,
|
||||
interdependency map, data-driven decision framework, DFS non-DFS decision, SSID decision).
|
||||
- **Created** `docs/network/phase1-voice-qos-design.md` — pfSense HFSC + UniFi WMM/switch QoS design.
|
||||
- **Created** `reports/2026-06-18-voice-quality-diagnostic.md` — per-phone RF findings + fixes.
|
||||
- **Updated** the voice-quality report (`reports/2026-06-18-voice-quality-diagnostic.md`, not a 2026-06-16 file) with the Lauren->103 note.
|
||||
- **No live network changes applied** (Cascades rule: explicit per-change go). UniFi port bounces were
|
||||
temporary (restored). DFS/WAN tests were read-only/bounded.
|
||||
|
||||
## Credentials & Secrets
|
||||
- No new credentials. Used existing: `infrastructure/uos-server-network-api-rw` (controller),
|
||||
`clients/cascades-tucson/unifi-ap-ssh` (AP SSH for DFS sweep), `clients/cascades-tucson/pfsense-firewall`,
|
||||
`clients/cascades-tucson/wifi-voice-ppsk` (key `V0!c38863171`).
|
||||
|
||||
## Infrastructure & Servers
|
||||
- VOICE VLAN 30 `10.0.30.0/24`: 8 AudioCodes `.224-.231`, 22 Poly `.202-.223`, desktop `.201`.
|
||||
- WAN1 fiber igc0 (522 Mbps up measured; RRD peaks 680 down/98 up). WAN2 coax igc3 (72.211.21.217, upload
|
||||
unmeasured). pfSense `192.168.0.1` Plus 25.07, no existing shaper.
|
||||
- UniFi UOS `172.16.3.29:11443` site `va6iba3v`. USW-16-PoE mac `d8:b3:70:21:94:5f` dev_id `685f39078e65331c46ef7e90`.
|
||||
- SSIDs: CSCNet (427 clients, PPSK, 2g+5g), CSC ENT (131 clients, legacy, 2g+5g), Guest (13, 2g+5g+6g).
|
||||
- DFS: 53 APs on DFS, 0 genuine radar over ~21-23h.
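
The VOICE VLAN address plan above can be sanity-checked against the 31-device total. A small sketch, assuming the ranges are inclusive as written (the `host_range` helper is illustrative):

```python
import ipaddress

VOICE_NET = ipaddress.ip_network("10.0.30.0/24")


def host_range(first: int, last: int):
    """Addresses 10.0.30.<first> .. 10.0.30.<last>, inclusive."""
    base = int(VOICE_NET.network_address)
    return [ipaddress.ip_address(base + n) for n in range(first, last + 1)]


plan = {
    "AudioCodes": host_range(224, 231),
    "Poly": host_range(202, 223),
    "Vertical desktop": host_range(201, 201),
}

all_addrs = [ip for addrs in plan.values() for ip in addrs]
assert len(set(all_addrs)) == len(all_addrs), "ranges overlap"
assert all(ip in VOICE_NET for ip in all_addrs)

for role, addrs in plan.items():
    print(f"{role}: {len(addrs)}")
print("total:", len(all_addrs))
```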
|
||||
|
||||
## Pending / Incomplete Tasks
|
||||
|
||||
- **RE-CLONE claudetools** (coord message 2026-06-19): old clone incompatible after history rewrite. Steps below.
|
||||
- **Verify** this session's Cascades docs (master plan, QoS design, voice-quality + diagnostic reports, voice
|
||||
inventory, logging plan) survived the rewrite into the new repo; if missing, recover from this .old working tree.
|
||||
- **Recover this session log** into the new clone (it's uncommitted here).
|
||||
- WAN2 (coax) upload number — measure from a WAN2-routed host / Cox bill (sizes the failover shaper).
|
||||
- 6 straggler Poly phones (10.0.20.64/65/66/67/195, 192.168.1.126) — re-key to voice PPSK.
|
||||
- Floors 5/6 (MemCare) RF + phones — deferred.
|
||||
- Execute the optimization plan (start Phase 2b 2.4 Low->Medium with baseline capture) — pending Howard's go.
|
||||
- Hand Vertical the phone-side config list (band 5GHz lock, DSCP-on, k/v roaming, U-APSD, firmware).
|
||||
|
||||
## Re-clone steps (Windows / C:\claudetools)
|
||||
```
# from C:\
mv claudetools claudetools.old   # or rename in Explorer
git clone https://git.azcomputerguru.com/azcomputerguru/claudetools.git claudetools
cp claudetools.old/.claude/identity.json claudetools/.claude/
cd claudetools && git submodule update --init --recursive
# then: diff clients/cascades-tucson against claudetools.old; cp any missing files (esp. this session's docs)
# recover this session log from claudetools.old/clients/cascades-tucson/session-logs/2026-06/
# verify, then delete claudetools.old
```
|
||||
See RECLONE.md in the new repo. Pre-split backup bundle: Jupiter share Backups/Gitea-Storage.
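
The "diff against claudetools.old, copy any missing files" step can be sketched as a tree comparison with an MD5 check. This is an illustration, not the procedure in RECLONE.md: the function names are hypothetical, and the demo runs on throwaway directories rather than the real clones, whose layout after the submodule split may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """MD5 hex digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def compare_trees(old_root: Path, new_root: Path):
    """Return (missing, changed) relative paths, comparing old against new."""
    missing, changed = [], []
    for old_file in sorted(p for p in old_root.rglob("*") if p.is_file()):
        rel = old_file.relative_to(old_root)
        if ".git" in rel.parts:
            continue  # skip repository metadata
        new_file = new_root / rel
        if not new_file.is_file():
            missing.append(rel)
        elif md5_of(old_file) != md5_of(new_file):
            changed.append(rel)
    return missing, changed


# Demo on throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "old"), Path(tmp, "new")
    (old / "docs").mkdir(parents=True)
    (old / "reports").mkdir()
    (new / "docs").mkdir(parents=True)
    (old / "docs" / "plan.md").write_text("v1")
    (old / "reports" / "diag.md").write_text("x")
    (new / "docs" / "plan.md").write_text("v2")
    missing, changed = compare_trees(old, new)

print("missing:", [str(p) for p in missing])
print("changed:", [str(p) for p in changed])
```

Files reported as missing are candidates to copy from the old working tree; files reported as changed need a manual look before overwriting anything.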
|
||||
|
||||
## Reference Information
|
||||
- Last pushed commit (old history): `2b2d094` (2026-06-18 19:16). Restructure force-push: ~2026-06-19 02:41 UTC.
|
||||
- Master plan: `clients/cascades-tucson/docs/network/network-optimization-master-plan.md`
|
||||
- QoS design: `clients/cascades-tucson/docs/network/phase1-voice-qos-design.md`
|
||||
- Voice-quality diagnostic: `clients/cascades-tucson/reports/2026-06-18-voice-quality-diagnostic.md`
|
||||
- Existing RF audit + 2.4 runbook: `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`
|
||||