From a0e83036c51045753990f070d12c46c00c6dcb21 Mon Sep 17 00:00:00 2001
From: Howard Enos
Date: Thu, 18 Jun 2026 21:33:12 -0700
Subject: [PATCH] wiki: full recompile cascades-tucson (fold in recovered voice/RF/QoS docs)

--full recompile. Folds the 4 repo-rewrite-recovered docs into the article (master plan, voice QoS design, voice-quality diagnostic, RF/voice session log) while preserving all existing depth.

Key corrections/additions:
- Voice VLAN 30 cutover now COMPLETE: 8 AudioCodes (.224-.231) added; prior compile had them 0/8 pending. AudioCodes needed a physical power-cycle (externally powered/PoE off; controller bounce is a no-op).
- Poly fleet is 28 not 22 (6 stragglers still off VOICE).
- Voice quality is an RF problem on the Poly WiFi phones, not the VLAN move (per-phone diagnostic; Lauren .202 50% retry -> locked to AP 103; AP 103 5GHz saturated; coverage gaps rooms 515/210/204).
- 6 GHz dark root-caused (CSCNet not broadcasting 6g).
- Measured WAN1 upload ~522 Mbps -> voice QoS is insurance, RF is the fix.
- New Patterns: Voice QoS design, Network Optimization Master Plan, Decisions-2026-06-18 (non-DFS only; no dedicated voice SSID).
- Active Work / History / HIPAA reconciled to the complete cutover.

Live Syncro unchanged (55.75 hrs, 0 tickets, 29 assets). Synthesis was deliberate surgical enrichment (diff-reviewed), not a blind regenerate.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 wiki/clients/cascades-tucson.md | 67 ++++++++++++++++++++++++++-------
 1 file changed, 54 insertions(+), 13 deletions(-)

diff --git a/wiki/clients/cascades-tucson.md b/wiki/clients/cascades-tucson.md
index eed5afbb..cddc3124 100644
--- a/wiki/clients/cascades-tucson.md
+++ b/wiki/clients/cascades-tucson.md
@@ -201,7 +201,7 @@ Because per-user **Intune** never provisioned tenant-wide (`INTUNE_A = PendingIn

 ### Network

-- **ISP / WAN:** Dual-WAN Cox. WAN1 igc0 `184.191.143.62/30` (Cox Fiber, primary, gateway `184.191.143.61`) + WAN2 igc3 `72.211.21.217/27` (Cox Coax, secondary, static); `WAN_Group` gateway group; both active full-duplex, no loss events (verified 2026-06-16). Both WAN IPs added as Cascades Named Location in Entra (ID: `061c6b06-b980-40de-bff9-6a50a4071f6f`).
+- **ISP / WAN:** Dual-WAN Cox. WAN1 igc0 `184.191.143.62/30` (Cox Fiber, primary, gateway `184.191.143.61`) + WAN2 igc3 `72.211.21.217/27` (Cox Coax, secondary, static); `WAN_Group` gateway group; both active full-duplex, no loss events (verified 2026-06-16). Both WAN IPs added as Cascades Named Location in Entra (ID: `061c6b06-b980-40de-bff9-6a50a4071f6f`). **Measured bandwidth (2026-06-18):** WAN1 fiber **upload ~522 Mbps** (Cloudflare single-stream from pfSense); RRD 3-day peaks ~680 Mbps down / 98 Mbps up (actual usage). WAN2 coax upload **unmeasured** (remote source-route bind to `72.211.21.217` failed -- needs a WAN2-routed host or the Cox bill; assume asymmetric ~20-50 Mbps up). Implication: 30 calls ~= 3 Mbps vs ~522 Mbps fiber headroom -> **the WAN is NOT the everyday voice bottleneck** (RF is); voice QoS is insurance for WAN2 failover + rare WAN1 saturation. See voice QoS design.
 - **Firewall:** pfSense Plus **25.07-RELEASE** (Netgate) at `192.168.0.1`, cert CN=pfSense-685f277aa6886. Admin vault: `clients/cascades-tucson/pfsense-firewall`. SSH shell access works (no interactive menu). OpenVPN user Howard: vault `clients/cascades-tucson/pfsense-openvpn-howard` (split-tunnel; `route 192.168.0.0/22`; use OpenVPN GUI or OpenVPN Connect with DCO disabled for stability -- DCO/TAP instability seen 2026-06-16). pfSense-ssh.sh (unifi-wifi skill) provides scripted audit/dhcp/run access. **Logs are PLAIN TEXT on 25.07 -- read with tail/grep, NOT clog (clog returns empty).** pfSense has an **OpenVPN `--inactive` idle timeout (~300s)** configured on the server; it disconnects clients after ~5 min of no tunnel data (keepalive pings do NOT reset this counter). This is a config setting, not a fault -- raise/disable to fix the flapping (fix proposed 2026-06-18, not applied). **[OUTAGE 2026-06-17] pfSense was on UPS surge-only side -- moved to battery-backed outlets by Mike (rectified). On-box auto-backup (12:20 version) restored by Mike; config vaulted `clients/cascades-tucson/pfsense-config-backup-2026-06-17.sops.yaml`. Enable Netgate AutoConfigBackup to prevent off-box backup gap.**
 - **[INFO] pfSense health check (2026-06-16):** gateway ruled out as WiFi factor -- DHCP not exhausted (270/~507 active ~53% on the AP/WiFi pool), unbound DNS up, both WANs full-duplex/stable, firewall states 28-31k/790k, load 0.6. Minor: igc3/WAN2 Intel I225/226 2.5G counter quirk (1707 input-errors+collisions logged, full-duplex active, no loss) -- not a fault, no action needed.
 - **LAN / VLAN layout:** Primary staff/AP network `192.168.0.0/22` (pfSense .0.1, cascadesDS .0.120, UniFi APs + most WiFi clients on 192.168.2.x/3.x). DHCP pool 192.168.2.2-192.168.3.254 (~507 cap, ~270 active ~53%). Per-unit /28 VLANs: **199 DHCP subnets** total, mostly `10.x.y.0/28` per apartment (assisted-living L2 isolation) + Staff/Internal VLAN 20 (`10.0.20.0/24`, gw `10.0.20.1`) + Guest VLAN 50 (`10.0.50.0/24`, RFC1918 blocked) + **Voice VLAN 30** (`10.0.30.0/24`, gw `10.0.30.1`). DHCP backend: ISC (Kea config present, dormant). Unbound DNS.
@@ -214,17 +214,20 @@ Because per-user **Intune** never provisioned tenant-wide (`INTUNE_A = PendingIn
   - **2.4 GHz is the primary pain band:** avg TX-retry ~10%, cu_total 69-94% live, catastrophic neighbor BSSID density (ch6 ~33k BSSIDs, ch1 ~19k, ch11 ~17k). 27 of the 40 worst clients on 2.4 GHz (retry 11-42%), mostly IoT/legacy. Root cause: high radio density running at excessive TX power.
   - **2.4 GHz Phase A status -- OVER-THINNED (as of 2026-06-17):** Floor-4 pilot (2026-06-16) applied 14/15 radios to 6 dBm (retry 13.2->9.5%, no coverage loss). Subsequently overnight 2026-06-17, Phase A was extended: **24 of 76 2.4 radios DISABLED + 42 set to Low (~6 dBm)**; Floors 5/6 + mesh untouched. Results DEGRADED: retry 17->23.4%, satisfaction 39->30 -- over-thinned. **Current recommendation: Low->Medium for the 42 at-Low radios.** Phase 0 (ping-check off, 3AM auto-upgrade disable, pfSense logging) + Phase 1 (combined radio_table PUT: ng power medium [42 radios], na ht 40 [76], na min_rssi -82 [69]) planned, dry-run clean, NOT applied -- pending explicit go-ahead.
   - **5 GHz:** Auto-channel reassignment applied via UniFi 2026-06-17 (Howard) -- made co-channel overlap **WORSE** (25->30 co-channel pairs from 173 strong neighbor pairs). `dfs-check.sh` 2026-06-16: **ZERO real radar events fleet-wide** (DFS empirically low-risk). Plan: **Option B = combined per-AP PUT of 40MHz + non-DFS optimized channel plan + min-RSSI -82 relax.** Width + channel are coupled (width alone fixed only 7/25 pairs; non-DFS needs 40MHz). Dry-run clean; NOT applied. NOTE: an earlier mid-session claim (2026-06-15 audit) that "DFS was the #1 problem" was an artifact of tooling bugs and was withdrawn -- do not repeat it.
-  - **6 GHz:** active on 75 radios; only 1 client. Largest untapped, clean, non-DFS capacity -- band-steering 6E-capable clients to 6 GHz is the top opportunity.
+  - **6 GHz:** active on 75 radios; only ~1 client. **Root cause (found 2026-06-18): CSCNet is not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`) -- the band is dark at the SSID level, so nothing can join it. Largest untapped, clean, non-DFS capacity. Opening 6 GHz on CSCNet (+ BSS-transition `bsstm`) is the **relief valve** that must come BEFORE narrowing 5 GHz to 40 MHz (else 5 GHz congestion just relocates). The Poly phones are 5 GHz (not 6E), so 6 GHz helps voice *indirectly* by pulling resident devices off 5 GHz.
+  - **AP 103 saturated (5 GHz):** ch149, ~75% airtime, ~25,900 retries, 12 clients. Lauren's voice phone (`.202`) was locked here 2026-06-18 (off the CC Bridge mesh AP) -- so AP 103 MUST be relieved (off ch149 / 80->40 MHz / load-balance) or she trades a mesh problem for a congestion one.
   - **AP-level satisfaction 95-100 fleet-wide.** Pain is in the client tail.
+  - **Client distribution by SSID (2026-06-18):** CSCNet 427 + CSC ENT 131 (legacy, not yet retireable) + Guest 13.
   - **Config flags:** 6 APs with 2.4 min-RSSI OFF (615, 608, 505, 517, 622, salon); 4 APs off the 1/6/11 plan (128 disabled, 108 offline, 108U7 Pro auto, salon auto).
   - **Known hardware:** AP 108 (Floor 1) offline pending a new cable run (expected). Stale duplicate controller object ("108" vs "108U7 Pro") to clean up separately.
   - **Creds (vault refs only):** `infrastructure/uos-server-ssh-key` (SSH/Mongo), `infrastructure/uos-server-network-api-rw` (RW controller admin), `clients/cascades-tucson/unifi-ap-ssh` (per-AP device auth via site VPN), `clients/cascades-tucson/pfsense-firewall` (pfSense admin for pfsense-ssh.sh).
-- **VoIP (vendor: Vertical -- Richard Turner ):** Two phone fleets -- **8 AudioCodes** (OUI `00:90:8f`, WIRED on USW-16-PoE ports 1-8, Default/main LAN) and **22 Poly** (OUI `48:25:67`, WiFi via CSCNet PPSK -> VLAN 20 Internal, migrating to VLAN 30). The **Vertical-Remote management desktop** (`10.0.30.201`, MAC `e4:e7:49:52:3a:06`, WIRED USW-16-PoE port 16, VOICE VLAN 30, **DHCP** -- confirmed not static, LogMeIn remote access, no pfSense OpenVPN) is live on VLAN 30. No on-prem SIP PBX found -> phones appear to register to a **cloud/hosted PBX** (Vertical).
-- **[2026-06-17 SUBSTANTIALLY COMPLETE] Voice VLAN (VLAN 30) consolidation:** dedicated isolated **VLAN 30 VOICE (`10.0.30.0/24`, gw `10.0.30.1`, pfSense igc1.30, DHCP `.100-.250`, DNS `8.8.8.8/1.1.1.1`)** holding ALL phones + the Vertical desktop; internet/cloud-PBX egress only, firewalled off VLAN 20 / main LAN / PHI / mgmt (HIPAA). Isolation rules verified via `pfctl -sr` (clone of GUEST VLAN -- the only actually-isolated net). Voice PPSK key on CSCNet -> VOICE: vaulted `clients/cascades-tucson/wifi-voice-ppsk`. **Cutover status as of 2026-06-18 (live inventory: `docs/network/voice-phone-inventory.md`):**
+- **VoIP (vendor: Vertical -- Richard Turner ):** Two phone fleets -- **8 AudioCodes** (OUI `00:90:8f`, WIRED on USW-16-PoE ports 1-8, externally powered / PoE OFF) and **28 Poly** (OUI `48:25:67`, WiFi via CSCNet PPSK). As of 2026-06-18: all 8 AudioCodes + 22 Poly + the Vertical desktop are on VOICE VLAN 30 (31 devices); 6 Poly stragglers remain on VLAN 20/Default pending re-key. Phones confirmed marking **DSCP EF (46)** for voice (2026-06-18). The **Vertical-Remote management desktop** (`10.0.30.201`, MAC `e4:e7:49:52:3a:06`, WIRED USW-16-PoE port 16, VOICE VLAN 30, **DHCP** -- confirmed not static, LogMeIn remote access, no pfSense OpenVPN) is live on VLAN 30. No on-prem SIP PBX found -> phones appear to register to a **cloud/hosted PBX** (Vertical).
+- **[2026-06-18 CUTOVER COMPLETE] Voice VLAN (VLAN 30) consolidation:** dedicated isolated **VLAN 30 VOICE (`10.0.30.0/24`, gw `10.0.30.1`, pfSense igc1.30, DHCP `.100-.250`, DNS `8.8.8.8/1.1.1.1`)** holding ALL phones + the Vertical desktop; internet/cloud-PBX egress only, firewalled off VLAN 20 / main LAN / PHI / mgmt (HIPAA). Isolation rules verified via `pfctl -sr` (clone of GUEST VLAN -- the only actually-isolated net). Voice PPSK key on CSCNet -> VOICE: vaulted `clients/cascades-tucson/wifi-voice-ppsk`. **31 devices on VOICE as of 2026-06-18 (live inventory: `docs/network/voice-phone-inventory.md`):**
   - Vertical-Remote desktop (port 16): DONE -- `10.0.30.201`. Re-VLANing a wired port requires bouncing the link (port disable/enable via controller API using CSRF token); a UniFi client block/unblock is MAC-filter only, not a link bounce.
-  - **22 of 22 Poly WiFi phones: ALL DONE** -- re-keyed to voice PPSK, on `10.0.30.202-.223`. Dial-tone + outbound calls verified.
-  - **8 AudioCodes (wired, USW-16-PoE ports 1-8): 0/8 REMAINING.** Flip port native VLAN to VOICE + PoE power-cycle each to force re-DHCP.
-  - **Full runbook:** `clients/cascades-tucson/docs/network/voice-vlan-cutover.md`. Live inventory: `docs/network/voice-phone-inventory.md`.
+  - **22 of 22 migrated Poly WiFi phones: DONE** -- re-keyed to voice PPSK, on `10.0.30.202-.223`. Dial-tone + outbound calls verified. **NOTE: the Poly fleet is actually 28, not 22** -- **6 stragglers remain off VOICE** (5 on VLAN 20 `10.0.20.64/.65/.66/.67/.195`, one on `192.168.1.126`; `.20.66` Dining Room at 35% retry); re-key these to the voice PPSK so all phones are isolated + get voice QoS.
+  - **8 AudioCodes (wired, USW-16-PoE ports 1-8): ALL DONE** -- on `10.0.30.224-.231`. **Gotcha: AudioCodes are externally powered (PoE OFF on those ports), so a UniFi PoE power-cycle AND a controller port disable/enable are both no-ops -- they held their old main-LAN DHCP leases. Required a full physical power-off/on** before they re-DHCP'd onto VOICE.
+  - **Quality caveat:** the VLAN move gives isolation + enables QoS but does NOT by itself fix call quality -- the dropped-calls/voice-breaks complaints are an **RF problem on the WiFi (Poly) phones** (the wired AudioCodes are clean). See the Wireless / Voice QoS patterns and the 2026-06-18 voice-quality diagnostic.
+  - **Full runbook:** `clients/cascades-tucson/docs/network/voice-vlan-cutover.md`. Live inventory: `docs/network/voice-phone-inventory.md`. Voice-quality diagnostic: `reports/2026-06-18-voice-quality-diagnostic.md`. Holistic optimization plan: `docs/network/network-optimization-master-plan.md`; voice QoS design: `docs/network/phase1-voice-qos-design.md`.

 ### External Vendors & Mail Senders

@@ -356,7 +359,8 @@ Cascades' line-of-business / reporting SaaS (the systems they pull data OUT of,
 - **Fleet (full audit 2026-06-16):** 77 U7-Pro APs, **12 switches**, ~587 wireless clients. Controller: UOS at 172.16.3.29, HTTPS 11443 (see [[uos-server]]); site short name `va6iba3v`, site_id `685f39068e65331c46ef6dd2`. No UniFi gateway (pfSense is the gateway). pfSense ruled out as WiFi factor 2026-06-16 (DHCP not exhausted, DNS up, WAN stable -- see Network section).
 - **Primary pain band is 2.4 GHz.** Avg TX-retry ~10%; cu_total 69-94% live; catastrophic neighbor BSSID density (ch6 ~33k BSSIDs, ch1 ~19k, ch11 ~17k). 27 of the 40 worst clients stuck on 2.4 GHz (retry 11-42%), mostly IoT/legacy hardware (Ring cameras, robotic cleaner, smart plugs, EPSON printer, Poly phone, handheld scanners, smartwatch). Root cause: ~75 2.4 GHz radios running at auto (full) TX power in extreme density. Experience splits by band: 5/6 GHz clients are fine; clients that land or stick on 2.4 GHz suffer.
 - **5 GHz -- DFS concern is theoretical; empirically clean.** 76/77 radios on 80 MHz width (should be 40 MHz at this density). 55/77 radios on DFS channels (52-144) near Davis-Monthan AFB + TUS airport radar. `dfs-check.sh` 2026-06-16: **ZERO real radar events fleet-wide** (55 DFS APs, full `dmesg` sweep, precise pattern match) -- DFS is empirically low-risk here. Measured TX-retry DFS (8.4%) ~= non-DFS (9.0%) -- no throughput penalty. Still recommended to move to non-DFS (UNII-1 36-48 + UNII-3 149-161) for resilience. NOTE: an earlier mid-session claim (2026-06-15 audit) that "DFS was the #1 problem" was an artifact of tooling bugs (raw counter + 15-AP head cap) and was corrected before session end -- do not repeat it.
-- **6 GHz is nearly unused.** 75 radios active; only 1 client. Largest untapped, clean, non-DFS capacity. Band-steering 6E-capable clients to 6 GHz is the highest-ROI tuning opportunity.
+- **6 GHz is nearly unused -- root cause: CSCNet not broadcasting 6 GHz** (`wlan_bands=[2g,5g]`, found 2026-06-18). 75 radios active but only ~1 client because the band is dark at the SSID level. Largest untapped, clean, non-DFS capacity -- enabling 6 GHz on CSCNet (`apply-wlan bands all` + `bsstm on`) is the **relief valve** and must precede 5 GHz width-narrowing. The Poly voice phones are 5 GHz (not 6E), so 6 GHz helps voice indirectly by clearing 5 GHz of resident devices.
+- **AP 103 saturated (5 GHz):** ch149, ~75% airtime, ~25,900 retries, 12 clients. Lauren's voice phone (`.202`) locked here 2026-06-18 (off the CC Bridge mesh AP) -> AP 103 must be relieved (off ch149 / 80->40 MHz / load-balance) before/with that lock or she trades a mesh problem for congestion.
 - **Switch audit (2026-06-16):** ~25 ports linked at 100 Mbps but gig-capable (systematic cabling/NIC issue, 1st/2nd/3rd-floor switches; investigate after WiFi Phase A). PoE budgets healthy. 3 offline switches: Switch 2nd Floor #2, Switch 4th Floor #2, USW Pro Max 16. Port p38 (1st Floor USW) 4.0% tx-drop rate.
 - **AP-level satisfaction 95-100 fleet-wide.** Network is healthy on average; pain is in the client tail.
 - **Remediation status (as of 2026-06-17 -- OVER-THINNED):**
@@ -382,6 +386,31 @@ Cascades' line-of-business / reporting SaaS (the systems they pull data OUT of,
 ### VoIP / Network Device Migration

 - **Re-VLANing a wired switch port requires a link bounce to force re-DHCP.** Changing the native VLAN on a UniFi switch port does not reset the NIC link; the device holds its old DHCP lease (renewal unicast to the old DHCP server is blocked by the new VLAN's firewall rules). Fix: bounce the port (PoE power-cycle for PoE devices; disable/enable via controller API for non-PoE). A UniFi client block/unblock is a MAC-address filter only -- it does NOT bounce the link. Controller API port-bounce requires the `X-CSRF-Token` from the login response header (`x-updated-csrf-token`). Confirmed on the Vertical-Remote desktop (2026-06-17).
+- **Externally-powered devices (AudioCodes desk phones) need a PHYSICAL power-cycle, not a controller bounce.** The 8 AudioCodes sit on USW-16-PoE ports 1-8 but run on **external power bricks (PoE OFF on those ports)** -- so a UniFi PoE power-cycle is a no-op AND a controller port disable/enable did not reset their uptime either. They held their old main-LAN DHCP leases and never re-DHCP'd onto VLAN 30 until **Howard physically powered each off/on** (2026-06-18), after which all 8 pulled VOICE leases `.224-.231`. For any externally-powered wired device, plan an on-site/hands-on power-cycle for a VLAN move.
+- **UniFi controller PUT 403 / CSRF:** rapid controller writes can 403 -- read the CSRF token from the `x-updated-csrf-token` response header (TOKEN-cookie JWT as fallback). pfSense SSH and the controller API both rate-limit under many rapid queries; alternate between them.
+- **API scratch files must be written OUTSIDE the repo.** Controller-scratch (`.sta.json`, `.fleet*.dev`, etc.) written CWD-relative got swept into commits by `git add -A` and blocked rebases (a stray locked `curl.exe` held them). Use `mktemp -d` outside the repo; `.gitignore` patterns (`.fleet*`, `.ap[0-9]*`, `.vq[0-9]*`, `.q[0-9]*`) added as a backstop.
+
+### Voice QoS (VLAN 30) -- design (2026-06-18, NOT yet built)
+
+Full design: `docs/network/phase1-voice-qos-design.md`. Status DESIGN -- nothing applied (Cascades per-change-go rule).
+
+- **The VLAN move's QoS payoff: all voice is one subnet `10.0.30.0/24`,** so QoS matches *all* voice by **source subnet** -- no per-PBX SIP/RTP port guessing. This is the cleanest match criterion and only became possible by isolating voice onto VLAN 30. Phones confirmed marking **DSCP EF (46)** (2026-06-18), so DSCP drives WMM (L2) + switch QoS (L3); subnet match is the safety net (no pfSense set-DSCP rule needed).
+- **QoS is INSURANCE, not the everyday fix.** Phones register to a CLOUD PBX (Vertical) over the internet, so the theoretical bottleneck is WAN-upload saturation. But measured WAN1 fiber upload **~522 Mbps** vs ~98 Mbps peak usage = huge headroom -> the WAN is not the day-to-day constraint. QoS earns its place for (1) **WAN2 (coax) failover** (small upload + a big upload = real congestion) and (2) rare WAN1 saturation (backup/large upload). The everyday dropped-calls cause is **RF** -- build QoS (cheap, correct) but set expectations.
+- **Layered design:** **L1 pfSense HFSC shaper** on BOTH WANs -- 3 queues `qVoice` (prio 7, realtime ~30%, source `10.0.30.0/24` via floating out rule), `qACK` (~10%), `qDefault` (default ~60%); shape to ~90-95% of actual upload to keep the queue in pfSense. **L2 UniFi WMM** maps DSCP EF -> WiFi Voice AC (protects the Poly phones over the air -- verify WMM on CSCNet). **L3 UniFi switch QoS** queues tagged voice (mostly automatic; confirm USW isn't stripping DSCP). **L4 DSCP marking** (confirmed EF on the phones). Blocker for L1 sizing: the WAN2 coax upload number (remote test failed).
+- **Build path:** Firewall -> Traffic Shaper -> Wizard "Multiple Lan/Wan" (prioritize by address `10.0.30.0/24`), or hand-build HFSC + floating rule. Howard drives the pfSense GUI. Rollback = disable/remove the shaper (QoS only orders packets under congestion; removing reverts to FIFO, zero residual). Skill gap: `unifi-wifi` has no QoS verb (pfSense + UniFi config task).
+
+### Network Optimization Master Plan (all-device, 2026-06-18, NOT yet executed)
+
+Full plan: `docs/network/network-optimization-master-plan.md`. Goal: fix the *system* for all ~587 clients, not one device at a time. Floors 1-4 only this round; **Floors 5/6 (MemCare) RF + phones DEFERRED** per Howard.
+
+- **Core principle: open relief valves BEFORE constraining,** or congestion just relocates (the "whack-a-mole" trap). Sequence: **P0** baseline capture (same time-of-day) -> **P1** voice QoS (orthogonal, do first) -> **P2a** enable 6 GHz on CSCNet + BSS-transition (the offload path) + **P2b** correct the 2.4 over-thinning **Low->MEDIUM** (~12-15 dBm, not lower -- Low already starved edge clients) -> **P3** 5 GHz 80->40 MHz + **non-DFS** channel plan + relieve AP 103 -> **P4** fine-tune (2.4 1/6/11, min-RSSI, 802.11k/v roaming) -> **P5** physical cabling (separate visit).
+- **Interdependencies:** 6 GHz before 5 GHz-40MHz; 2.4 power Medium not Low; AP 103 must be relieved (Lauren locked there); never stack disables + power-down in one area (that caused the over-thinning); tune one lever per zone; never disable mesh-protected APs (2nd Floor Atrium, CC Bridge, salon, 206 U7 Pro, 108).
+- **Data-driven gate rule (Howard):** every change is a hypothesis gated on fleet-wide metrics (avg retry%, cu_total, cu_interf, satisfaction, band split, per-AP coverage holes). KEEP+proceed only if the target improved AND fleet-wide satisfaction didn't fall / retry didn't rise / no AP lost its clients; HOLD if a secondary metric regressed; ROLLBACK on any fleet regression or complaint. Validate ALL devices (CSCNet 427 + CSC ENT 131 + Guest 13), not just the 31 voice phones. Every `apply-radio`/`apply-wlan` writes a rollback JSON.
+
+### Decisions resolved 2026-06-18 (voice/RF)
+
+- **5 GHz: NON-DFS ONLY** (UNII-1 36-48 + UNII-3 149-165). A precise radar sweep found ZERO genuine hits across all 53 DFS APs -- but only over a ~21-23h window (APs rebooted in the 6/17 outage), and near Davis-Monthan AFB + TUS (~10 mi) a single sporadic military-radar hit forces a 30-min channel vacate = **dropped calls**. Resilience > channel diversity for a voice-critical net; 6 GHz (Phase 2a) covers the lost capacity. Add periodic `dfs-check.sh` monitoring. (Supersedes the earlier "move to non-DFS for resilience" as now a firm decision.)
+- **NO dedicated voice SSID** -- voice stays on the shared CSCNet PPSK. UniFi 3-SSID cap is sound RF hygiene (each SSID = beacon airtime at 77 APs); the only retirement candidate CSC ENT still has 131 active clients (staff PCs, printers, DirecTV) so a slot isn't free; and a voice SSID isn't needed (QoS is VLAN/DSCP-based and SSID-independent, band preference is best set phone-side via Vertical, roaming/power-save are phone+AP settings). Revisit only if CSC ENT's clients migrate off.

 ### pfSense Operations

@@ -418,7 +447,7 @@ Cascades' line-of-business / reporting SaaS (the systems they pull data OUT of,
 - **Backup gap closed (2026-06-15):** Mike installed ACG cloud backup (MSP360/CloudBerry -> ACG-backup server) on CS-SERVER. Verify first full backup completes and set retention; confirm image-based / bare-metal + system-state for DC recoverability.
 - **Restored 7 deleted mailboxes (2026-04-25)** for HIPAA SS164.316(b)(2) 7-year retention.
 - **Termination policy established:** Convert to shared mailbox, hide from GAL, retain 7 years.
-- **Voice VLAN 30 (HIPAA-isolated):** All voice gear (phones + Vertical desktop) being migrated to an isolated network with internet/cloud-PBX egress only; blocked from PHI/LAN/VLAN20/mgmt. 22/22 Poly done; AudioCodes pending.
+- **Voice VLAN 30 (HIPAA-isolated):** All voice gear (phones + Vertical desktop) migrated to an isolated network with internet/cloud-PBX egress only; blocked from PHI/LAN/VLAN20/mgmt. **Cutover complete 2026-06-18: 31 devices on VOICE (8 AudioCodes + 22 Poly + desktop);** 6 Poly stragglers still on VLAN 20/Default pending re-key.

 ---

@@ -430,7 +459,12 @@ Syncro live pull 2026-06-18: **0 open tickets.** No hours drawn from the 2026-06

 - **[URGENT] Order replacement workstation for Lupe Sanchez (DESKTOP-TRCIEJA).** Decision made 2026-06-18. EOL Gateway ZX6971 / i3-2120 / 8 GB / Win11-unsupported. On new machine: provision GuruRMM + Bitdefender only; do NOT carry over the Datto stack.
 - **[URGENT] Rotate exposed Synology Cloud Signin Portal credential.** Vault commit 1fbc0e1 committed it plaintext; encrypted go-forward but credential is exposed in git history. Also verify MDM service account + WiFi CSCNet from that same commit were never plaintext.
-- **[IN PROGRESS] Voice VLAN (VLAN 30) AudioCodes cutover: 0/8 remaining.** 22/22 Poly + Vertical desktop DONE. Flip USW-16-PoE ports 1-8 native VLAN to VOICE + PoE power-cycle each AudioCodes to re-DHCP. Runbook: `docs/network/voice-vlan-cutover.md`; inventory: `docs/network/voice-phone-inventory.md`.
+- **[DONE 2026-06-18] Voice VLAN (VLAN 30) cutover -- 31 devices on VOICE** (8 AudioCodes `.224-.231` + 22 Poly `.202-.223` + Vertical desktop `.201`). AudioCodes needed a physical power-off/on (externally powered; PoE/controller bounce was a no-op). **Remaining:** re-key the **6 Poly stragglers** still on VLAN 20/Default (`10.0.20.64/.65/.66/.67/.195`, `192.168.1.126`) to the voice PPSK.
+- **[PENDING - voice quality] Dropped-calls/voice-breaks are an RF problem on the WiFi (Poly) phones, not the VLAN move.** 14 phones flagged 2026-06-18; worst Lauren `.202` (was 2.4GHz/50% retry -> locked to AP 103) and Shelby `.218` (2.4GHz/53%, MemCare -- deferred). Coverage gaps rooms 515/210/204. Fixes (none applied): voice QoS (#1), force voice phones off 2.4 GHz (#2), coverage/min-RSSI (#3), migrate 6 stragglers (#4), 5 GHz width/channel (#5). Diagnostic: `reports/2026-06-18-voice-quality-diagnostic.md`.
+- **[PENDING - build] Voice QoS for VLAN 30** (pfSense HFSC 3-queue on both WANs matching `10.0.30.0/24` + UniFi WMM/switch QoS). Design done, not built (Howard drives pfSense GUI). Blocker for sizing: the WAN2 coax upload number. QoS is insurance (WAN has headroom); RF is the everyday fix. Design: `docs/network/phase1-voice-qos-design.md`.
+- **[PENDING - execute] Network optimization master plan (floors 1-4; MemCare deferred).** Sequenced P1 QoS -> P2a enable 6 GHz on CSCNet + P2b 2.4 Low->Medium -> P3 5 GHz 40 MHz + non-DFS + relieve AP 103 -> P4 fine-tune -> P5 physical. Open relief valves before constraining; per-zone, dry-run, gated on fleet metrics. Start = P2b (baseline capture + 2.4 Low->Medium). Pending Howard's go + evening window. Plan: `docs/network/network-optimization-master-plan.md`. (Supersedes the older "Wireless RF Phase 0 + Phase 1" item below -- same work, holistic framing.)
+- **[PENDING] Measure WAN2 (coax) upload** -- remote source-route test failed; get from a WAN2-routed host or the Cox bill (sizes the failover voice shaper).
+- **[PENDING] Hand Vertical (Richard Turner) the phone-side config list** -- 5 GHz band lock, DSCP-on, 802.11k/v roaming, U-APSD/power-save, firmware.
 - **[PLANNED] Network logging / observability (spec written, build later).** Diagnosis 2026-06-18: the UniFi controller retains **ZERO** client events/alarms for Cascades (7-day pull) and pfSense logs roll over in hours -> device drops/kicks/deauths are not captured, so the network is a black box after the fact. Plan: **Synology cascadesDS (DSM Log Center syslog server) as the on-site collector** (NOT CS-SERVER -- fragile EOL DC), with pfSense + UniFi-controller + AP syslog as sources and a 1-2 min `/stat/sta` client snapshotter to fill the controller's history gap. Optional later: Container Manager Graylog/Loki + Discord alerting. Spec: `docs/network/network-logging-plan.md`. Next: confirm Synology model/RAM/DSM.
 - **[PENDING] Wireless RF Phase 0 + Phase 1 (pending go-ahead + evening window):**
   - Phase 0 (safe anytime): pfSense ping-check off for 240 DHCP pools, disable 3 AM AP firmware auto-upgrade, enable full pfSense logging (DHCP/DNS/firewall/system/gateway) with rotation.
@@ -524,6 +558,7 @@ Syncro live pull 2026-06-18: **0 open tickets.** No hours drawn from the 2026-06
 | 2026-06-18 | **Synology Drive sync architecture diagnosed; Team Folder migration plan produced.** Current Drive sync is Sync-user My Drive only (not the real shared folders). Real NAS shares (Server 1.9 G, Management 5.5 G, Public ~50 G, SalesDept ~23 G) are not mirrored. Plan: Team Folder Download-only tasks into `D:\Shares\_SynMigration\` staging; pilot on `/volume1/Server`. No changes made. |
 | 2026-06-18 | **DESKTOP-TRCIEJA (Lupe Sanchez) performance diagnosed; replace-not-remediate decision.** Root causes: (a) EOL hardware -- Gateway ZX6971 AIO, Intel i3-2120 (2011, 2C/4T), 8 GB RAM, Win11 unsupported; (b) dual real-time AV -- ACG Bitdefender (keep) + leftover Datto stack (Datto RMM/CentraStage + Datto EDR/Infocyte + bundled DattoAV) both scanning every file on a 2-core CPU under memory pressure. OneDrive ruled out (desktop is local). Howard decided: no remediation; order replacement. Another instance of the fleet-wide leftover-Datto-stack cleanup. |
 | 2026-06-18 | **Voice VLAN 30: all 22 Poly phones migrated; network-logging spec written.** Completed the Poly cutover live -- all 22 WiFi phones re-keyed to the voice PPSK onto `10.0.30.202-.223` (per-phone location inventory in `docs/network/voice-phone-inventory.md`); first phone (Lauren Hasselman) dial-tone + outbound call verified. Vertical desktop fixed via port-16 bounce (controller API + CSRF) -> `10.0.30.201`. AudioCodes (8, wired) still pending (flip + PoE power-cycle). Separately, found the UniFi controller retains **ZERO** client events for Cascades (drop/kick history not captured) -> wrote a network-logging spec (`docs/network/network-logging-plan.md`): Synology Log Center on-site collector, pfSense+UniFi syslog sources, client snapshotter. Plan only -- build later. |
+| 2026-06-18 | **Voice VLAN 30 cutover COMPLETE (8 AudioCodes added); voice-quality diagnosed; holistic all-device optimization master plan built.** AudioCodes finished -- they wouldn't re-DHCP via PoE/controller bounce (externally powered, PoE off); Howard physically power-cycled all 8 -> VOICE leases `.224-.231` (31 devices total on VLAN 30). Diagnosed the dropped-calls complaints: **the VLAN move does NOT fix call quality -- it's RF on the Poly WiFi phones** (wired AudioCodes clean). 14 Poly flagged; worst Lauren `.202` (2.4GHz/50% retry -> locked to AP 103) + Shelby `.218` (2.4GHz/53%, MemCare/deferred); coverage gaps rooms 515/210/204; found 6 unmigrated Poly stragglers (fleet is 28, not 22). Built `network-optimization-master-plan.md` (open-relief-valves-before-constraining sequence: QoS -> 6 GHz on CSCNet + 2.4 Low->Medium -> 5 GHz 40 MHz/non-DFS/relieve AP 103 -> fine-tune -> physical) with interdependency map + data-driven gate framework, floors 1-4 only. Designed Phase 1 voice QoS (`phase1-voice-qos-design.md`: pfSense HFSC + UniFi WMM, match `10.0.30.0/24`, phones mark DSCP EF; measured WAN1 up ~522 Mbps -> QoS is insurance, RF is the substance). Rigorous DFS re-verification (0 genuine radar/~1-day window) -> **decision: NON-DFS only**. **Decision: no dedicated voice SSID** (3-SSID cap, CSC ENT still 131 clients, QoS is SSID-independent). 6 GHz root-caused dark: CSCNet not broadcasting 6g. NO live network changes applied (per-change-go rule). |

 ---

@@ -531,8 +566,10 @@ Syncro live pull 2026-06-18: **0 open tickets.** No hours drawn from the 2026-06

 **Session logs read:** all prior sessions + 2026-06-17/18 logs: voice VLAN 30 build + Poly cutover, Poly phone-drop root cause + wireless smoothing plan, power-outage recovery + 5GHz option analysis, CS-SERVER drive review, KPI dashboard scoping, power-outage follow-up (OpenVPN + printer), Synology Drive sync diagnosis, DESKTOP-TRCIEJA (Lupe Sanchez) perf diagnosis. Date range: 2026-03-06 through 2026-06-18.

-**New this compile (2026-06-18):**
-- Voice VLAN 30: status updated to 22/22 Poly + desktop DONE; AudioCodes 0/8 still pending. PPSK vaulted. Wired-port link-bounce pattern documented.
+**Full recompile addendum (2026-06-18 -- recovered-docs fold-in):** folded in the 4 docs restored after the repo-rewrite (master plan, voice QoS design, voice-quality diagnostic, RF/voice-optimization session log). Key corrections + additions: **Voice VLAN 30 cutover is COMPLETE** (8 AudioCodes `.224-.231` added -- prior compile had them 0/8 pending); AudioCodes physical-power-cycle gotcha; Poly fleet is 28 (6 stragglers off VOICE); voice quality is an RF problem (per-phone diagnostic); 6 GHz dark because CSCNet isn't broadcasting 6g; AP 103 5 GHz saturation; measured WAN1 upload ~522 Mbps (QoS = insurance); new Patterns subsections (Voice QoS design, Network Optimization Master Plan, Decisions-resolved-2026-06-18: non-DFS-only + no-voice-SSID); Active Work + History + HIPAA reconciled to the complete cutover.
+
+**Prior compile (2026-06-18, refresh + initial):**
+- Voice VLAN 30: status updated to 22/22 Poly + desktop DONE; AudioCodes were 0/8 pending at that point (now complete -- see addendum). PPSK vaulted. Wired-port link-bounce pattern documented.
 - Power outage (2026-06-17): full incident documented. pfSense UPS placement rectified. Duplicate dhcpd, 2nd-floor switch L2 failure, Cox modem reboot step. Post-outage straggler pattern (power-cycle) documented. pfSense config vaulted. Synology signin credential exposure flagged (vault commit 1fbc0e1).
 - Wireless: Phase A extended overnight 2026-06-17 and over-thinned (retry 17->23.4%, satisfaction 39->30). 5GHz auto-channel made co-channel overlap worse. Both corrective plans staged (Low->Medium, Option B) but not applied. Phone drop mystery closed (intentional pfSense reboot).
 - DESKTOP-TRCIEJA (Lupe Sanchez): added to key contacts and migration table; EOL hardware + dual-AV root cause; replace decision.
@@ -564,7 +601,11 @@ Syncro live pull 2026-06-18: **0 open tickets.** No hours drawn from the 2026-06

 **Resolved since last compile (2026-06-17 -> 2026-06-18):**
 - Poly phone drops: closed (intentional 2026-06-16 pfSense reboot; transient)
-- Voice VLAN 30: built, verified, Poly cutover complete (22/22); Vertical desktop done
+- Voice VLAN 30: cutover COMPLETE -- 8 AudioCodes (`.224-.231`) + 22 Poly + Vertical desktop = 31 devices on VOICE (6 Poly stragglers remain off VOICE)
+- Voice-quality root cause: identified as RF on the WiFi Poly phones (not the VLAN move); per-phone diagnostic produced
+- 6 GHz dark: root-caused (CSCNet `wlan_bands=[2g,5g]` -- not broadcasting 6g)
+- 5 GHz DFS question: RESOLVED -- non-DFS only (resilience near Davis-Monthan/TUS)
+- Dedicated voice SSID question: RESOLVED -- no (shared CSCNet; QoS is SSID-independent)
 - pfSense 25.07 log format: documented (plain text, not clog)
 - pfSense config backup: vaulted post-restore (2026-06-17)
 - pfSense on battery-backed UPS: rectified (Mike, 2026-06-17)
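
Sizing note on the voice QoS design in this patch: the figures it quotes (30 calls ~= 3 Mbps, shape to ~90-95% of upload, `qVoice`/`qACK`/`qDefault` at ~30/10/60%) can be sanity-checked with a few lines of arithmetic. The sketch below is only that, not the shaper config. The function and variable names are hypothetical, and the WAN2 figure is a placeholder at the low end of the patch's assumed 20-50 Mbps range, since WAN2 upload is still unmeasured.

```python
# Rough sizing check for the 3-queue HFSC voice shaper described in the patch.
# ASSUMPTIONS (not from the source): all names here are illustrative, and the
# WAN2 upload value is a placeholder from the patch's assumed 20-50 Mbps range.

# Source figure: "30 calls ~= 3 Mbps", i.e. roughly 0.1 Mbps per call.
PER_CALL_MBPS = 3.0 / 30

# Source figures: qVoice ~30%, qACK ~10%, qDefault ~60% of the shaped rate.
QUEUE_SHARES = {"qVoice": 0.30, "qACK": 0.10, "qDefault": 0.60}


def shaper_plan(upload_mbps, shape_fraction=0.90, calls=30):
    """Return the shaped rate, per-queue Mbps, and whether qVoice covers `calls`."""
    shaped = upload_mbps * shape_fraction  # shape below line rate (source: ~90-95%)
    queues = {name: shaped * share for name, share in QUEUE_SHARES.items()}
    voice_needed = calls * PER_CALL_MBPS
    return {
        "shaped_mbps": shaped,
        "queues_mbps": queues,
        "voice_needed_mbps": voice_needed,
        "voice_fits": queues["qVoice"] >= voice_needed,
    }


if __name__ == "__main__":
    links = (
        ("WAN1 fiber (measured ~522 Mbps up)", 522.0),
        ("WAN2 coax (ASSUMED 20 Mbps up, unmeasured)", 20.0),
    )
    for label, upload in links:
        print(label, shaper_plan(upload))
```

On these assumptions the voice queue has ample room on WAN1, and the WAN2 result depends entirely on the placeholder, which is why the patch lists the WAN2 upload measurement as the blocker for sizing.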