sync: auto-sync from HOWARD-HOME at 2026-06-17 17:49:01

Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-17 17:49:01
2026-06-17 17:49:10 -07:00
parent 7ad6202e6e
commit dc4560cf27
3 changed files with 105 additions and 2 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -99,7 +99,7 @@
 - [Dashboard beta-first deploy](feedback_dashboard_beta_first.md) — Dashboard auto-builds to rmm-beta.azcomputerguru.com on push; prod (rmm.azcomputerguru.com) is explicit promote-only via promote-dashboard.sh --confirm. Never hand-rsync prod. One artifact, nginx sub_filter BETA banner. Stood up 2026-06-02.

 ### Cascades
- [Cascades operational rules](feedback_cascades.md) — Active rules: (1) folder redirection (fdeploy) needs subfolders PRE-CREATED before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1. (2) ALWAYS ask which security group(s) a new user goes into — never auto-derive from OU. (3) Do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use.
+- [Cascades operational rules](feedback_cascades.md) — Active rules: (1) folder redirection (fdeploy) needs subfolders PRE-CREATED before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1. (2) ALWAYS ask which security group(s) a new user goes into — never auto-derive from OU. (3) Do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use. (4) NEVER change Cascades production infra (pfSense/UniFi/switches/DHCP) without discussing it + explicit per-change go — read-only/dry-run until then.
 - [Cascades FR GPO fix](reference_cascades_fr_gpo_fix.md) — Native Folder Redirection was DOA on every machine: redirect targets were in a misnamed `fdeploy1.ini` (Windows reads `fdeploy.ini`) → empty target path → silent no-op → per-user registry workaround every time. Fixed 2026-06-08 (correct fdeploy.ini + version bump). Also: CS-SERVER live RMM agent is `c39f1de7...` (old `6766e973` stale).
 - [feedback_ascii_only_api_payloads](feedback_ascii_only_api_payloads.md) -- On Windows/Git-bash, non-ASCII chars (em-dash, arrow, smart quotes) in JSON payload TEXT passed to curl get mangled and rejected — Discord bot-alert returns 400, the coord API returns "error parsing the body". Use ASCII-only in API payload text, or a single-quoted heredoc.

--- a/.claude/memory/feedback_cascades.md
+++ b/.claude/memory/feedback_cascades.md
@@ -1,6 +1,6 @@
 ---
 name: Cascades-specific operational rules (folder redirect, security groups)
-description: Active rules for Cascades work — (1) folder redirection (fdeploy) needs subfolders pre-created before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1; (2) always ASK which security group(s) a new user goes into — never auto-derive from OU; (3) do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use. Root-cause / incident detail in project_cascades_history.md.
+description: Active rules for Cascades work — (1) folder redirection (fdeploy) needs subfolders pre-created before first logon or it caches a failure forever; recovery via fix-shell-redirect.ps1; (2) always ASK which security group(s) a new user goes into — never auto-derive from OU; (3) do NOT lock down the legacy Main\Company Web Docs\Accounting (Everyone:Full) folder — still in active use; (4) NEVER apply a change to Cascades production infra (pfSense, UniFi controller, switches, DHCP) without discussing it and getting an explicit per-change go — investigate read-only / dry-run only until then. Root-cause / incident detail in project_cascades_history.md.
 type: feedback
 ---

@@ -45,3 +45,13 @@ OU placement is mechanical (controls Entra Connect sync scope); group membership
 ## 3. Do NOT lock down the legacy `Main\Company Web Docs\Accounting` folder

 The accounting folder under the Synology-Drive-synced tree (`D:\Shares\Main\Company Web Docs\Accounting`, `Everyone:FullControl`) stays as-is — Howard confirmed 2026-06-10 the team is **still actively using it**. Do not scope/tighten its ACL or "clean it up" as a HIPAA hardening step, even though the wide-open Everyone:Full looks like an obvious target. The 2026-06-09 scan-to-folder build deliberately created a *separate* clean share (`\\CS-SERVER\AcctDept` → `D:\Shares\Accounting`) rather than reusing this folder; that is the lockdown story, and the legacy folder is intentionally left untouched.
+
+---
+
+## 4. NEVER change Cascades production infra without discussing it first
+
+Do not apply ANY change to the Cascades production network — pfSense (firewall rules, DHCP, `ping-check`, service restarts, reboots), the UniFi controller (radio power/channel/min-RSSI/disable, WLAN/PPSK settings), switches, or DHCP scopes — until it has been **discussed and explicitly approved, per change**. Investigate **read-only / dry-run only** (e.g. `apply-radio` without `--apply`, `pfsense-ssh.sh audit`/read-only `run`) and present proposals; wait for an explicit go before any write.
+
+**Why:** Howard set this explicitly 2026-06-17 during the Poly-phone-drop investigation — it's a live HIPAA assisted-living network (~780 clients, residents' medical/IoT devices) where a bad change has real patient-care and compliance impact, and changes need coordination (another session was concurrently doing radio work; Mike should be looped in on pfSense changes).
+
+**How to apply:** dry-run/read-only by default; stage changes as reviewable proposals; one explicit approval per change, not a blanket one. Pair with the per-change confirmation already required for hard-to-reverse/outward-facing actions. Coordinate via the coord API when another session may touch the same gear. Note the per-room /28 segmentation is intentional HIPAA L2 isolation — do not "clean it up." See [[project_cascades]].
--- a/clients/cascades-tucson/session-logs/2026-06/2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md
+++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md
@@ -0,0 +1,93 @@
+# Cascades — Poly phone-drop root cause + whole-network smoothing plan (IN PROGRESS, no changes applied)
+
+- **Date:** 2026-06-17
+- **Machine:** Howard-Home
+- **Client:** Cascades of Tucson
+- **Status:** Investigation complete; remediation PLANNED + dry-run verified; **NOTHING APPLIED**. Resume at "Pending / Next steps".
+
+## User
+- **User:** Howard Enos (howard)
+- **Machine:** Howard-Home
+- **Role:** tech
+
+## Session Summary
+
+Started as a CS-SERVER failing-drive review (separate log: `2026-06-17-howard-cs-server-drive-review-and-spike-question.md`), pivoted to a Cascades WiFi "noon network spike" question, then to a live complaint that the **Poly WiFi phones are dropping** ("around 102"). Ran a deep, multi-angle investigation against the UniFi controller (UOS `172.16.3.29`, Cascades site `va6iba3v` / `685f39068e65331c46ef6dd2`) via `.claude/scripts/uos-mongo.sh`, and pfSense (`192.168.0.1`, Plus 25.07) via `unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson` (admin SSH; VPN already up — pfSense reachable). All read-only / dry-run.
+
+**Root cause of the phone drops = the INTENTIONAL pfSense reboot on 6/16 at 22:38:12 MST.** Built a per-phone drop-rate timeline from `ace_stat.stat_hourly` (o:'user', `duration` field is in MILLISECONDS): the 30 Poly phones run ~99.9% connected every day EXCEPT 6/16 (uptime 97.95%, 28 drop-hours) — one fleet-wide event where 28 of 30 phones each dropped exactly once, on every floor including **Floors 5/6 which the radio work left untouched**. Only a gateway-level event (the pfSense reboot, kern.boottime = Tue Jun 16 22:38:12 2026) explains all-floors-at-once; the 2.4 radio apply cannot (it only reprovisions touched APs). Today (6/17) the phones are back to ~99.77% / 1 blip. Howard confirmed the reboot was intentional → phone mystery CLOSED. The complaint came in today but today's data is healthy; goal reframed to **smooth the entire network for all devices**.
+
+Corrected several of my own earlier wrong reads during the session (logged honestly): (1) the site-wide `sta_dhcp_failures` spike since 6/14 is DECOUPLED from the phones (phones were 99.99% on 6/14–15); (2) DHCP is HEALTHY — my "empty DHCP log" was a tooling error: pfSense 25.07 uses PLAIN-TEXT logs, I was reading with `clog` (old binary format); read directly, dhcpd.log shows 1241 ACK / 1 NAK / 0 no-free-leases; (3) band steering + min-RSSI are PRE-EXISTING, not the new trigger; (4) the per-room /28 subnets are INTENTIONAL HIPAA isolation and are healthy (fullest 12/13), not the cause.
+
+Produced a prioritized whole-network smoothing plan, dry-ran the radio changes, and answered Howard's questions on client limits and the residents bandwidth cap. Howard set a hard rule: **do not change anything on Cascades prod infra without discussing it + explicit per-change go** (saved to memory `feedback_cascades.md` rule #4 + MEMORY.md). Saving now to survive context condensation, then continuing.
+
+## Key Decisions / Findings
+
+- **Phone drop root cause = intentional 6/16 22:38 pfSense reboot** (one-time, transient; resolved). NOT the 2.4 radio work, NOT DHCP, NOT the 6/14 DHCP-metric spike.
+- **DHCP server is healthy** (read logs DIRECTLY, not via clog). `sta_dhcp_failures` metric is client/WiFi-side (frames lost at 100% retry), not pfSense.
+- **No WLAN client-device limit is set** (CSCNet/CSC ENT/Guest empty). Busiest AP = Dining Room 67 clients; median busy ~18; U7 Pro handles it. No limit needed; if a safety cap is wanted, ~127/radio. Not a drop cause.
+- **40 Mbps residents cap = PER-CLIENT** (`Resident` user group 40 down/10 up), not a shared VLAN pipe. Generous; NOT a drop cause; raising it won't help drops. NOTE: CSCNet's default usergroup is actually 'Default' = 100/100 (id `685f39078e65331c46ef7edb`); `Resident` group is `685f39078e65331c46ef7edc` with enabled-flag `undefined` — VERIFY whether the 40 cap is actually active/which clients before touching.
+- **Auto-action settings verified OFF/safe**: `network_optimization`=false, `ips`=IDS-only (non-blocking), `radio_ai.auto_enabled`=false, mesh=none (0 wireless uplinks), PMF/802.11r/v/k=off, group_rekey=0. No hidden auto-deauth.
+- **Per-room /28 segmentation = intentional HIPAA L2 isolation** via CSCNet shared-PPSK (~230–242 per-key→room-VLAN mappings, rooms 101–631). Healthy occupancy (fullest 12/13). DO NOT flatten.
+
+## Approved scope (Howard, this session)
+
+- **DO Phase 0 + Phase 1.** Dry-run first to verify (done — clean).
+- **SKIP item (e):** recover 2 down APs + 3 offline switches — requires onsite, deferred.
+- **SKIP DFS channel change** for now ("if the logs don't show issues then it's fine").
+- **KEEP band steering ON** ("I don't want band steering off just yet").
+- **Turn on EVERY kind of logging** available (pfSense) to solve quickly.
+- **Client limit:** none set; advised not needed (Howard asked for a number → ~127 if wanted).
+- **Residents 40 Mbps cap:** advised per-client + not a drop cause; verify-before-change.
+- **HARD RULE:** no changes without discussing + explicit per-change go; read-only/dry-run until then.
+
+## The verified change-set (dry-run clean; NOT applied)
+
+**Phase 1 — combined per-AP `radio_table` PUT (ONE bounce per AP — Howard's ask to avoid double-bounce):**
+- 2.4 `ng` `tx_power_mode` → **medium** — ONLY the 42 radios currently at `low` (EXCLUDE the 24 disabled [would re-enable them] + ~10 Floor-5/6 `auto` radios). `apply-radio ... ng power medium` blanket = WRONG (hits all 77).
+- 5 GHz `na` `ht` → **40** (80→40 MHz) — 76 radios (1 already 40). Clean, no disabled/wrong-direction hits.
+- 5 GHz `na` `min_rssi` → **-82** — ONLY the 69 radios currently at `-77` enabled (EXCLUDE the 5 currently OFF: 615/608/505/517/622 [would enable], and 108 at -89 [would tighten]).
+- NO `na` channel/DFS change. NO band steering. NO client limit. NO cap change.
+- Tooling note: `apply-radio.sh` does ONE band+setting per reprovision = multiple bounces. To get ONE bounce/AP, BUILD A COMBINED radio_table PUT (ng power + na ht + na min_rssi in a single PUT per AP, correctly scoped per above). RW cred IS vaulted: `infrastructure/uos-server-network-api-rw` (so `--apply` is possible once the go-ahead is given). apply-radio rollback auto-saves to `.claude/tmp/apply-rollback-*.json`.
+
+**Phase 0 — non-disruptive:**
+- pfSense `ping-check` → off (all 240 DHCP pools currently have ping-check true; adds ~1s/lease, causes client-side DHCP timeouts = the failure metric). Reversible.
+- Disable 3 AM AP firmware auto-upgrade (`mgmt.auto_upgrade=true, hour=3` — nightly AP reboots/drops).
+- Turn on full pfSense logging: DHCP (already on), DNS resolver (unbound), firewall (filter, incl. default-block + resident-VLAN rules), system, gateway; BUMP log rotation/retention. CAVEAT: DNS-query + firewall-pass logging is high-volume → set sane rotation, dial back chattiest after we catch the issue.
+
+## Problems Encountered (my own corrections — for linting)
+
+- **clog vs plain-text logs (pfSense 25.07):** read DHCP/system logs with `clog` → returned empty → I wrongly concluded "DHCP log empty / phones not doing DHCP." 25.07 logs are PLAIN TEXT; use `tail`/`grep` directly. Real data: dhcpd.log healthy (1241 ACK/1 NAK). WASTED a hypothesis. (Worth an errorlog `--friction`.)
+- **stat duration units:** `ace_stat.stat_hourly` o:'user' `duration` is MILLISECONDS, not seconds — first uptime calc was inflated 1000×. Corrected.
+- Conflated two metrics (site-wide DHCP-failures vs per-phone drops) → Howard caught it; they're decoupled (different populations, different dates).
+
+## Infrastructure & Servers
+
+- **UniFi UOS controller** `172.16.3.29`; Cascades site name `va6iba3v`, id `685f39068e65331c46ef6dd2`. Access: `infrastructure/uos-server-ssh-key` + `.claude/scripts/uos-mongo.sh` (db `ace` config, `ace_stat` history). Write cred: `infrastructure/uos-server-network-api-rw` (vaulted); read: `infrastructure/uos-server-network-api`.
+- **Cascades pfSense** `192.168.0.1`, Plus 25.07-RELEASE. Admin SSH cred `clients/cascades-tucson/pfsense-firewall`. Access via `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson audit|dhcp|run "<cmd>"`. VPN (OpenVPN Connect) already up — pfSense reachable from Howard-Home. kern.boottime 2026-06-16 22:38:12 MST.
+- **WiFi:** 77 APs (mostly U7PRO), ~573–788 clients. CSCNet = shared-PPSK SSID (wlanconf `685f39078e65331c46ef7ee5`, networkconf `685f39078e65331c46ef8ac4`, ~242 PPSK keys), bands 2g+5g (no 6e), `no2ghz_oui=true` (band steering ON), usergroup `Default`(100/100). Other SSIDs: CSC ENT (`...7ee4`), Guest (`...7ee6`, VLAN 50 isolated), element (`...7ee3`).
+- **Poly phones:** OUI `48:25:67`, 30 seen (~22 active fleet), 26/30 on 5 GHz (na), 4 on 2.4 (ng), on VLAN 20 (`10.0.20.x`, network "Internal"). AudioCodes (8, wired) + Vertical-Remote desktop are the wired voice gear.
+- **pfSense DHCP:** 240 pools, 236 are /28 (13-host) per-room HIPAA VLANs; VLAN 20 = `10.0.20.0/24` range .50–.239 (190 hosts, 61 active), ping-check on ALL pools, lease 7200/86400. ISC dhcpd (Kea dormant).
+- **RF prior work (other session, in repo):** overnight 6/17 it DISABLED 24 of 76 2.4 GHz radios and set 42 to Low (~6 dBm); Floors 5/6 + mesh untouched. Over-thinned (retry 17→23.4%, satisfaction 39→30). Recommend Low→Medium. Reports: `reports/2026-06-16-unifi-full-audit.md`, `reports/2026-06-16-2.4ghz-remediation-runbook.md`. 5 GHz: 55/77 on DFS, 76/77 at 80 MHz; audit found 0 radar events (DFS low-risk). 2 APs down (108 + dup); 3 switches offline (2nd-Flr#2, 4th-Flr#2, USW Pro Max 16); ~25–34 gig ports linked at 100M.
+
+## Commands & Outputs (key)
+
+- Per-phone drop timeline: `ace_stat.stat_hourly` o:'user', oid in Poly macs, `duration` (MS) per hour → uptime%. 6/16 = 97.95% / 28 drop-hrs (the event); all other days 99.9%+.
+- pfSense boot: `sysctl -n kern.boottime` → 1781674692 = Tue Jun 16 22:38:12 2026 MST; uptime 18:13 (single reboot).
+- DHCP health (DIRECT read, NOT clog): `grep -c DHCPACK /var/log/dhcpd.log` → 1241; DHCPNAK → 1; "no free leases" → 0. `file /var/log/dhcpd.log` → ASCII text.
+- Dry-runs (all gated, no --apply): `apply-radio cascades ng power medium` (42 low correct; 24 disabled + 10 auto WRONG to include), `apply-radio cascades na width 40` (76 change clean), `apply-radio cascades na minrssi -82` (69 at -77 correct; 5 OFF + 108 WRONG).
+- Coord: posted hold-note to other radio session (msg id `5690f1b8...`).
+
+## Pending / Next steps (RESUME HERE)
+
+1. **Decide Phase 1 window** — it kicks every client off each AP as it reprovisions. Recommend per-zone (floor-by-floor) in an evening window (e.g. after 8pm), NOT all-at-once during the day. Howard to choose window (or accept brief per-AP blips now). NOT yet answered.
+2. **Get explicit go**, then:
+   - **Phase 0** (safe anytime): apply ping-check off + disable 3 AM auto-upgrade + enable full pfSense logging (with rotation).
+   - **Phase 1** (windowed): build + apply the COMBINED per-AP radio_table PUT (ng power medium [42] + na ht 40 [76] + na min_rssi -82 [69]), scoped exactly as above, per-zone, validate live with watch-ap before/after.
+3. Deferred: DFS change (revisit only if new logs show DFS issues), band steering (Howard wants kept ON for now), onsite item (e) down APs/switches, VLAN 30 voice cutover, residents 40 Mbps cap (verify active first), the 6/14 site-wide DHCP-failure metric (other clients — re-check with now-readable logs).
+4. HARD RULE: discuss + explicit per-change go before ANY apply. Read-only/dry-run only otherwise.
+
+## Reference
+
+- Memory: `feedback_cascades.md` rule #4 (no prod changes w/o discussing); `project_cascades_isolated_vlan_pattern.md` (the /28 HIPAA design + pfSense PHP API path).
+- Phone-mystery sibling log: `2026-06-17-howard-cs-server-drive-review-and-spike-question.md`.
+- Client wiki: `wiki/clients/cascades-tucson.md`.
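
Note on the Phase 1 scoping in the session log above: the log says a blanket `apply-radio ... ng power medium` would be wrong, because it would also touch the disabled and `auto` radios. The scoping rules can be sketched as a pure filter over one AP's `radio_table`. This is a minimal sketch, not the repo's tooling: the function name `plan_radio_changes` is hypothetical, and the `disabled` and `min_rssi_enabled` keys and the integer type of `ht` are assumptions about how the controller represents those states. Only `radio`, `tx_power_mode`, `ht`, and `min_rssi` appear in the log. It performs no network calls and writes nothing.

```python
from typing import Any

Radio = dict[str, Any]


def plan_radio_changes(radio_table: list[Radio]) -> list[Radio]:
    """Return a copy of one AP's radio_table with only in-scope changes applied.

    Scoping follows the session log:
      - ng: tx_power_mode low -> medium, skipping disabled and non-low radios
      - na: ht 80 -> 40
      - na: min_rssi -77 -> -82, only where min-RSSI is already enabled
    Key names `disabled` and `min_rssi_enabled` are ASSUMED, not confirmed.
    """
    planned: list[Radio] = []
    for radio in radio_table:
        updated = dict(radio)  # never mutate the input; it doubles as rollback data
        band = radio.get("radio")
        if band == "ng":
            # Leave disabled radios alone (a write would re-enable them),
            # and leave anything not currently at "low" (e.g. "auto") alone.
            if not radio.get("disabled", False) and radio.get("tx_power_mode") == "low":
                updated["tx_power_mode"] = "medium"
        elif band == "na":
            if radio.get("ht") == 80:
                updated["ht"] = 40
            # Only loosen radios already enforcing -77; do not switch min-RSSI
            # on where it is off, and do not tighten a radio sitting at -89.
            if radio.get("min_rssi_enabled", False) and radio.get("min_rssi") == -77:
                updated["min_rssi"] = -82
        planned.append(updated)
    return planned


def diff_radios(before: list[Radio], after: list[Radio]) -> list[dict[str, Any]]:
    """List per-radio field changes so a proposal can be reviewed before any write."""
    changes = []
    for old, new in zip(before, after):
        delta = {k: (old.get(k), v) for k, v in new.items() if old.get(k) != v}
        if delta:
            changes.append({"radio": old.get("radio"), "changes": delta})
    return changes


if __name__ == "__main__":
    # Illustrative data only; not taken from the controller.
    sample = [
        {"radio": "ng", "tx_power_mode": "low"},
        {"radio": "na", "ht": 80, "min_rssi_enabled": True, "min_rssi": -77},
    ]
    for change in diff_radios(sample, plan_radio_changes(sample)):
        print(change)
```

Keeping the planner free of side effects fits the hard rule in the log: the diff can be presented as a reviewable proposal, and the untouched input can serve as rollback data, before anyone issues a combined PUT.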