diff --git a/.claude/skills/unifi-wifi/SKILL.md b/.claude/skills/unifi-wifi/SKILL.md
index ba693d8..edd60ca 100644
--- a/.claude/skills/unifi-wifi/SKILL.md
+++ b/.claude/skills/unifi-wifi/SKILL.md
@@ -10,6 +10,21 @@ performance + stability for connected devices in congested environments by analy
 controller knows and making prioritized, validated changes. Built for any site; **Cascades** (77 APs, ~550 clients, brutal 2.4GHz) is the reference hard case.
+## Status (2026-06-15)
+- **[WORKING] WiFi monitoring + RF tuning** — complete data-gathering for any UOS site/client:
+  config + interference (`audit-site`), live per-AP + per-client (`live-stats`), airtime history +
+  roam graph (`model-rank`/`optimize-radios`), measured per-channel occupancy (`survey-collect`),
+  empirical DFS radar history (`dfs-check`), the **AP-to-AP SNR neighbor matrix** from
+  `/proc/ui_neighbor` (`neighbor-collect`), per-AP live stream (`watch-ap`), and gated config apply
+  (`apply-radio`). All site-parameterized → works on every UniFi site we monitor.
+- **[WIP] Switches / PoE, gateways / WAN / firewall, adoption, client DHCP/DNS** — not yet wrapped
+  as collectors. The access layer reaches them (the same `uos-mongo.sh` covers the whole `ace` DB;
+  the controller API + device SSH reach switches/gateways) — these just need dedicated scripts.
+  Use ad-hoc for now; productize as the need arises (or split into a sibling `unifi` skill).
+- **Per-client requirement:** `watch-ap`/`neighbor-collect`/`survey-collect`/`dfs-check` default the
+  AP device-auth SSH cred to `clients/cascades-tucson/unifi-ap-ssh`; for another client, vault its
+  own `clients//unifi-ap-ssh` and pass it as the script's vault-path arg.
+
 ## First, load context
 - **[references/data-access.md](references/data-access.md)** — what data the UOS exposes and how to read it (the two planes: Mongo config/interference now, live Network API later).
@@ -62,6 +77,21 @@ controller knows and making prioritized, validated changes. Built for any site;
 min data rates -> manual 1/6/11 plan -> min-RSSI + roaming -> steer to 6GHz).
 3. **Recommend** a prioritized, per-zone change plan. Roll out per zone, not site-wide at once.
+### AP-side collectors (direct AP SSH over the site VPN — non-disruptive, foreground)
+These read each AP directly (controller hides this data). All take ` [ap-ssh-vault-path]`,
+use sshpass-or-SSH_ASKPASS auth, and must run in the **foreground**:
+```bash
+# AP-to-AP SNR neighbor matrix (/proc/ui_neighbor) -> redundancy = data-backed disable candidates
+bash .claude/skills/unifi-wifi/scripts/neighbor-collect.sh cascades [vault-path] [snr_min=20]
+# measured per-channel busy%/noise per AP -> cleanest-channel plan (iw survey dump)
+bash .claude/skills/unifi-wifi/scripts/survey-collect.sh cascades [vault-path]
+# empirical DFS radar-event history per AP (dmesg) -> is DFS safe at this site?
+bash .claude/skills/unifi-wifi/scripts/dfs-check.sh cascades [vault-path]
+```
+The neighbor matrix is the breakthrough: UniFi exposes managed-AP-to-managed-AP visibility
+**nowhere** in the controller API/DB (all filter our own APs), but each AP keeps it in
+`/proc/ui_neighbor/{ess_ap_list,ssid/*}`, populated non-disruptively by background RRM scanning.
+
 Ad-hoc Mongo queries: `.claude/scripts/uos-mongo.sh` (recipes in data-access.md). Access is the vaulted dedicated key `infrastructure/uos-server-ssh-key` (works from any fleet machine).
diff --git a/.claude/skills/unifi-wifi/scripts/dfs-check.sh b/.claude/skills/unifi-wifi/scripts/dfs-check.sh
new file mode 100644
index 0000000..56b509d
--- /dev/null
+++ b/.claude/skills/unifi-wifi/scripts/dfs-check.sh
@@ -0,0 +1,88 @@
+#!/usr/bin/env bash
+# dfs-check.sh — empirical DFS radar-event history across a site's APs (the real DFS reality check).
+#
+# UniFi does NOT surface radar/DFS events in the controller DB/API. Each AP's kernel ring buffer
+# (`dmesg`) records actual radar detections + channel-availability-check (CAC) events. This sweeps
+# every AP, greps `dmesg` with PRECISE patterns (the loose `cac`/`dfs` grep false-matches "cached
+# ifindex" — excluded here), and reports which APs on DFS channels (52-144) have actually been hit.
+# Answers "is DFS safe to use at this site?" with data (matters near military/airport radar).
+# Non-disruptive read. Works for any UOS site.
+#
+# Needs: controller cred (infrastructure/uos-server-network-api-rw, for AP list+5GHz channels) +
+# per-site AP device-auth SSH cred + site VPN reach. RUN IN FOREGROUND.
+#
+# Usage: bash .claude/skills/unifi-wifi/scripts/dfs-check.sh [ap-ssh-vault-path]
+set -uo pipefail
+REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
+VAULT="$REPO/.claude/scripts/vault.sh"
+HOST="${UOS_HOST:-172.16.3.29}"; PORT="${UOS_HTTPS_PORT:-11443}"
+SITEARG="${1:?usage: dfs-check.sh [ap-ssh-vault-path]}"
+VP="${2:-clients/cascades-tucson/unifi-ap-ssh}"
+TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
+
+CU="$(bash "$VAULT" get-field infrastructure/uos-server-network-api-rw credentials.username 2>/dev/null)"
+CP="$(bash "$VAULT" get-field infrastructure/uos-server-network-api-rw credentials.password 2>/dev/null)"
+[ -n "$CU" ] && [ -n "$CP" ] || { echo "[ERROR] no controller cred (infrastructure/uos-server-network-api-rw)"; exit 1; }
+base="https://$HOST:$PORT"; CJ="$TMP/cj"
+code=$(curl -sk -c "$CJ" -o /dev/null -w '%{http_code}' -X POST "$base/api/auth/login" -H 'Content-Type: application/json' \
+  --data-binary "$(python -c 'import json,sys;print(json.dumps({"username":sys.argv[1],"password":sys.argv[2]}))' "$CU" "$CP")")
+[ "$code" = "200" ] || { echo "[ERROR] controller login HTTP $code"; exit 1; }
+SHORT="$(curl -sk -b "$CJ" "$base/proxy/network/api/self/sites" | python -c "
+import sys,json; d=json.load(sys.stdin).get('data',[]); q='''$SITEARG'''.lower()
+for s in d:
+  if s.get('_id')=='''$SITEARG''' or s.get('name')=='''$SITEARG''' or q in (s.get('desc','').lower()): print(s.get('name')); break
+")"; [ -n "$SHORT" ] || SHORT="$SITEARG"
+echo "[INFO] site=$SHORT"
+# AP list with current 5GHz channel (flag DFS APs).
+# NOTE: temp paths via ARGV (MSYS translates POSIX->Windows for python.exe); $TMP inside a
+# `python -c` string is not translated and fails on Windows.
+curl -sk -b "$CJ" "$base/proxy/network/api/s/$SHORT/stat/device" -o "$TMP/dev.json"
+python - "$TMP/dev.json" "$TMP/aps.tsv" <<'PY'
+import sys,json
+out=[]
+for a in json.load(open(sys.argv[1])).get('data',[]):
+  if a.get('type')!='uap' or a.get('state')!=1 or not a.get('ip'): continue
+  na=[r for r in a.get('radio_table_stats',[]) if r.get('radio')=='na']
+  ch=na[0].get('channel') if na else '?'
+  try: dfs='DFS' if 52<=int(ch)<=144 else 'clear'
+  except: dfs='?'
+  out.append(f"{a.get('name') or a.get('mac')}\t{a.get('ip')}\t{ch}\t{dfs}")
+open(sys.argv[2],'w',newline='\n').write('\n'.join(out))
+print(f"[INFO] {len(out)} online APs; {sum(1 for x in out if chr(9)+'DFS' in x)} currently on DFS 5GHz channels")
+PY
+
+AU="$(bash "$VAULT" get-field "$VP" credentials.username 2>/dev/null)"
+AP_PW="$(bash "$VAULT" get-field "$VP" credentials.password 2>/dev/null)"; export AP_PW
+[ -n "$AU" ] && [ -n "$AP_PW" ] || { echo "[ERROR] no AP device-auth cred at vault:$VP"; exit 1; }
+SSH_OPTS=(-o ConnectTimeout=10 -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null \
+  -o PreferredAuthentications=password -o PubkeyAuthentication=no -o NumberOfPasswordPrompts=1)
+if command -v sshpass >/dev/null 2>&1; then
+  ap_ssh() { SSHPASS="$AP_PW" sshpass -e ssh "${SSH_OPTS[@]}" "$@" "$ASKP"; chmod +x "$ASKP"
+  ap_ssh() { SSH_ASKPASS="$ASKP" SSH_ASKPASS_REQUIRE=force DISPLAY="${DISPLAY:-:0}" ssh "${SSH_OPTS[@]}" "$@" &2
+  ev=$(ap_ssh "$AU@$ip" "dmesg 2>/dev/null | grep -iE '$PAT' | grep -iv 'cached' | tail -4" 2>/dev/null)
+  cnt=$(printf '%s' "$ev" | grep -c .)
+  mark=$([ "$dfs" = "DFS" ] && echo ' *' || echo '')
+  if [ "$cnt" -gt 0 ]; then
+    hits=$((hits+1))
+    echo "[RADAR] $name (5GHz ch$ch$mark): $cnt event(s)"
+    printf '%s\n' "$ev" | sed 's/^/ /'
+  fi
+done < "$TMP/aps.tsv"; echo "" >&2
+echo ""
+if [ "$hits" -eq 0 ]; then
+  echo "[OK] No real radar/DFS events found in any AP's dmesg → DFS appears low-risk at this site."
+  echo " (dmesg is bounded; re-run periodically. Absence over long uptime = strong signal DFS is usable.)"
+else
+  echo "[WARNING] $hits AP(s) logged radar/DFS events → DFS is being hit; prefer non-DFS (UNII-1 36-48 + UNII-3 149-165)."
+fi
diff --git a/.claude/skills/unifi-wifi/scripts/survey-collect.sh b/.claude/skills/unifi-wifi/scripts/survey-collect.sh
new file mode 100644
index 0000000..9f4a382
--- /dev/null
+++ b/.claude/skills/unifi-wifi/scripts/survey-collect.sh
@@ -0,0 +1,114 @@
+#!/usr/bin/env bash
+# survey-collect.sh — measured per-channel RF occupancy (busy% + noise) for every AP in a site.
+#
+# Reads `iw dev survey dump` from each AP (NON-DISRUPTIVE — the AP's background scanning
+# already populated it). Per AP, per band, reports the in-use channel's busy% and the cleanest
+# available channels by measured airtime — the data-driven input for a manual channel plan
+# (vs. inferring from the foreign-neighbor `rogue` map). Pairs with neighbor-collect.sh (overlap)
+# and audit-site.sh (config). Works for any UOS site; /proc + iw exist on every UniFi AP.
+#
+# Non-disruptive. Needs: controller cred (infrastructure/uos-server-network-api-rw, for the AP
+# name/ip list) + per-site AP device-auth SSH cred + L3 reach to the AP mgmt VLAN (site VPN).
+# AP SSH uses sshpass if present, else SSH_ASKPASS fallback. RUN IN FOREGROUND (a detached
+# background process can't spawn the askpass helper).
+#
+# Usage: bash .claude/skills/unifi-wifi/scripts/survey-collect.sh [ap-ssh-vault-path]
+set -uo pipefail
+REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
+VAULT="$REPO/.claude/scripts/vault.sh"
+HOST="${UOS_HOST:-172.16.3.29}"; PORT="${UOS_HTTPS_PORT:-11443}"
+SITEARG="${1:?usage: survey-collect.sh [ap-ssh-vault-path]}"
+VP="${2:-clients/cascades-tucson/unifi-ap-ssh}"
+TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
+
+# --- controller login + AP (name,ip) list ---
+CU="$(bash "$VAULT" get-field infrastructure/uos-server-network-api-rw credentials.username 2>/dev/null)"
+CP="$(bash "$VAULT" get-field infrastructure/uos-server-network-api-rw credentials.password 2>/dev/null)"
+[ -n "$CU" ] && [ -n "$CP" ] || { echo "[ERROR] no controller cred (infrastructure/uos-server-network-api-rw)"; exit 1; }
+base="https://$HOST:$PORT"; CJ="$TMP/cj"
+code=$(curl -sk -c "$CJ" -o /dev/null -w '%{http_code}' -X POST "$base/api/auth/login" -H 'Content-Type: application/json' \
+  --data-binary "$(python -c 'import json,sys;print(json.dumps({"username":sys.argv[1],"password":sys.argv[2]}))' "$CU" "$CP")")
+[ "$code" = "200" ] || { echo "[ERROR] controller login HTTP $code"; exit 1; }
+SHORT="$(curl -sk -b "$CJ" "$base/proxy/network/api/self/sites" | python -c "
+import sys,json; d=json.load(sys.stdin).get('data',[]); q='''$SITEARG'''.lower()
+for s in d:
+  if s.get('_id')=='''$SITEARG''' or s.get('name')=='''$SITEARG''' or q in (s.get('desc','').lower()): print(s.get('name')); break
+")"; [ -n "$SHORT" ] || SHORT="$SITEARG"
+echo "[INFO] site=$SHORT"
+# NOTE: pass temp paths as ARGV (MSYS translates POSIX->Windows for the python.exe); embedding
+# $TMP inside a `python -c` string is NOT translated and fails on Windows.
+curl -sk -b "$CJ" "$base/proxy/network/api/s/$SHORT/stat/device" -o "$TMP/dev.json"
+python - "$TMP/dev.json" "$TMP/aps.tsv" <<'PY'
+import sys,json
+d=json.load(open(sys.argv[1]))
+aps=[(a.get('name') or a.get('mac'),a.get('ip')) for a in d.get('data',[]) if a.get('type')=='uap' and a.get('state')==1 and a.get('ip')]
+open(sys.argv[2],'w',newline='\n').write('\n'.join(f"{n}\t{i}" for n,i in aps))
+print(f"[INFO] {len(aps)} online APs")
+PY
+
+# --- AP SSH auth (sshpass or SSH_ASKPASS fallback) ---
+AU="$(bash "$VAULT" get-field "$VP" credentials.username 2>/dev/null)"
+AP_PW="$(bash "$VAULT" get-field "$VP" credentials.password 2>/dev/null)"; export AP_PW
+[ -n "$AU" ] && [ -n "$AP_PW" ] || { echo "[ERROR] no AP device-auth cred at vault:$VP"; exit 1; }
+SSH_OPTS=(-o ConnectTimeout=10 -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null \
+  -o PreferredAuthentications=password -o PubkeyAuthentication=no -o NumberOfPasswordPrompts=1)
+if command -v sshpass >/dev/null 2>&1; then
+  ap_ssh() { SSHPASS="$AP_PW" sshpass -e ssh "${SSH_OPTS[@]}" "$@" "$ASKP"; chmod +x "$ASKP"
+  ap_ssh() { SSH_ASKPASS="$ASKP" SSH_ASKPASS_REQUIRE=force DISPLAY="${DISPLAY:-:0}" ssh "${SSH_OPTS[@]}" "$@" > "$RAW"
+  if ap_ssh "$AU@$ip" 'for r in wifi0 wifi1 wifi2 ath0 ath1; do iw dev $r survey dump 2>/dev/null; done' >> "$RAW" 2>/dev/null; then ok=$((ok+1)); else echo "@@UNREACHABLE" >> "$RAW"; fi
+  printf '\r[INFO] surveyed %d/%d (ok %d) ' "$n" "$tot" "$ok" >&2
+done < "$TMP/aps.tsv"; echo "" >&2
+
+# --- parse + report cleanest channels per AP per band ---
+python - "$RAW" <<'PY'
+import sys,re
+def band(f):
+  f=int(f)
+  if 2400=5000: return (f-5000)//5
+  return f
+DFS=set(range(52,145))
+cur=None; rec={}; data={} # data[ap][band]=list of (busy%,ch,inuse,noise)
+def flush():
+  if cur and rec.get('f') and rec.get('act',0)>0:
+    b=band(rec['f']); c=ch(rec['f']); busy=round(100*rec.get('busy',0)/rec['act'])
+    data.setdefault(cur,{}).setdefault(b,[]).append((busy,c,rec.get('inuse',False),rec.get('noise','?')))
+for ln in open(sys.argv[1],encoding='utf-8',errors='replace'):
+  ln=ln.rstrip()
+  if ln.startswith('###AP'): flush(); rec={}; cur=ln.split('\t')[1] if '\t' in ln else ln[6:]; continue
+  if ln.startswith('@@UNREACHABLE'): continue
+  if 'frequency:' in ln: flush(); m=re.search(r'(\d+) MHz',ln); rec={'f':m.group(1) if m else None,'inuse':'in use' in ln}
+  elif 'noise:' in ln: m=re.search(r'(-?\d+) dBm',ln); rec['noise']=m.group(1) if m else '?'
+  elif 'channel active time' in ln: m=re.search(r'(\d+) ms',ln); rec['act']=int(m.group(1)) if m else 0
+  elif 'channel busy time' in ln: m=re.search(r'(\d+) ms',ln); rec['busy']=int(m.group(1)) if m else 0
+flush()
+print(f"\n==== MEASURED RF OCCUPANCY — cleanest channels per AP ({len(data)} APs) ====")
+print("(in-use channel busy%, then 3 lowest-busy NON-DFS channels measured; * = DFS)\n")
+for ap in sorted(data):
+  print(f"{ap}:")
+  for b in ('2.4','5','6'):
+    rows=data[ap].get(b,[])
+    if not rows: continue
+    inuse=[r for r in rows if r[2]]
+    iu=f"ch{inuse[0][1]}={inuse[0][0]}%" if inuse else "?"
+    nondfs=sorted([r for r in rows if r[1] not in DFS], key=lambda r:r[0])[:3]
+    clean=", ".join(f"ch{c}({bz}%)" for bz,c,_,_ in nondfs)
+    print(f" {b}GHz in-use {iu} | cleanest non-DFS: {clean}")
+PY
+echo ""
+echo "[next] use the cleanest-channel data for a manual 1/6/11 (2.4) + non-DFS (5GHz) plan; apply via apply-radio.sh per zone."
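The parse-and-rank step that survey-collect.sh applies to the collected survey text can be sketched as a standalone, testable Python module. This is a minimal sketch under stated assumptions, not the script itself. It assumes the standard `iw` survey field names (`frequency:`, `noise:`, `channel active time:`, `channel busy time:`) and the standard IEEE 802.11 frequency-to-channel mapping. `ChannelSurvey`, `parse_survey`, and `cleanest_non_dfs` are hypothetical names. As in the script, busy% is busy time divided by active time.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# 5 GHz DFS channels as used by the skill's scripts (52-144 inclusive).
DFS_CHANNELS = set(range(52, 145))


def band_and_channel(freq: int) -> Tuple[str, int]:
    """Map a center frequency (MHz) to (band, channel) per the standard 802.11 plan."""
    if 2400 < freq < 2500:
        return "2.4", 14 if freq == 2484 else (freq - 2407) // 5
    if 5925 <= freq <= 7125:  # check 6 GHz before 5 GHz: both are >= 5000 MHz
        return "6", 2 if freq == 5935 else (freq - 5950) // 5
    if freq >= 5000:
        return "5", (freq - 5000) // 5
    raise ValueError(f"unexpected frequency: {freq} MHz")


class ChannelSurvey(NamedTuple):
    """One parsed survey entry (hypothetical structure, not the script's tuple)."""
    freq: int
    in_use: bool
    noise_dbm: Optional[int]
    active_ms: int
    busy_ms: int

    @property
    def band(self) -> str:
        return band_and_channel(self.freq)[0]

    @property
    def channel(self) -> int:
        return band_and_channel(self.freq)[1]

    @property
    def busy_pct(self) -> int:
        # Busy time as a percentage of active time, the same metric the script reports.
        return round(100 * self.busy_ms / self.active_ms)


def _first_int(pattern: str, line: str) -> Optional[int]:
    m = re.search(pattern, line)
    return int(m.group(1)) if m else None


def parse_survey(text: str) -> List[ChannelSurvey]:
    """Parse `iw` survey dump text. Entries with no active time are dropped."""
    recs: List[dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("frequency:"):
            recs.append({"freq": _first_int(r"(\d+) MHz", line), "in_use": "in use" in line})
        elif not recs:
            continue  # e.g. a "Survey data from ..." header before the first entry
        elif line.startswith("noise:"):
            recs[-1]["noise"] = _first_int(r"(-?\d+) dBm", line)
        elif line.startswith("channel active time:"):
            recs[-1]["active"] = _first_int(r"(\d+) ms", line) or 0
        elif line.startswith("channel busy time:"):
            recs[-1]["busy"] = _first_int(r"(\d+) ms", line) or 0
    return [ChannelSurvey(r["freq"], r["in_use"], r.get("noise"), r.get("active", 0), r.get("busy", 0))
            for r in recs if r["freq"] and r.get("active", 0) > 0]


def cleanest_non_dfs(rows: List[ChannelSurvey], band: str, n: int = 3) -> List[ChannelSurvey]:
    """Return up to n lowest-busy channels in one band. DFS is filtered only on 5 GHz."""
    candidates = [r for r in rows
                  if r.band == band and not (band == "5" and r.channel in DFS_CHANNELS)]
    return sorted(candidates, key=lambda r: r.busy_pct)[:n]
```

Usage would look like `cleanest_non_dfs(parse_survey(raw_text), "5")`, which returns up to three 5 GHz non-DFS channels ordered by measured busy%. The sketch applies the DFS filter only to 5 GHz rows. 6 GHz channel numbers such as 53 through 141 fall numerically inside the 52-144 range, but 6 GHz has no DFS requirement, so filtering them would drop valid channels.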
diff --git a/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md b/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md
index 62130fe..eee4b39 100644
--- a/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md
+++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md
@@ -275,3 +275,29 @@ unifi-WIFI): switch/PoE, gateway/WAN/firewall, adoption — reachable via the sa
 - [ ] Per-client AP device-auth creds for other clients when extending beyond Cascades.
 - [ ] Floor-4 2.4 power-down pilot (still the next live change; nothing applied yet).
 - Coord this update: neighbor-collect.sh announce 3dbe2437.
+
+---
+
+## Update: 23:50 PT — gap 1 + gap 2 closed: survey-collect + dfs-check built; SKILL marked wifi-working / switches-WIP
+
+GAP 1 (fold ad-hoc recipes into reusable collectors) — DONE, both validated on Cascades:
+- **survey-collect.sh** ` [ap-ssh-vault-path]` — per-AP per-channel busy%/noise from
+  `iw survey dump` -> in-use busy% + cleanest non-DFS channels per band. 72 APs reported.
+  Result: 2.4 saturated 94-97% on every channel; 5GHz cleanest non-DFS = ch36/48/153/161; 6GHz ~0%.
+- **dfs-check.sh** ` [ap-ssh-vault-path]` — per-AP `dmesg` radar history, PRECISE patterns
+  (excludes the "cached ifindex" false-positive). Result: **55/75 APs on DFS, ZERO real radar
+  events fleet-wide** -> DFS empirically low-risk at Cascades despite Davis-Monthan proximity
+  (confirms the retry-rate finding; the DFS concern was theoretical).
+
+GAP 2 (scope) — DONE via SKILL.md Status block: WiFi monitoring+tuning = **WORKING** for every UOS
+site we monitor (audit-site, live-stats, model-rank, optimize-radios, survey-collect, dfs-check,
+neighbor-collect, watch-ap, apply-radio — all site-parameterized). Switches/PoE, gateway/WAN/
+firewall, adoption = **WIP** (access layer reaches them; need dedicated collectors). Per-client:
+vault each client's own `clients//unifi-ap-ssh`, pass as the script arg.
+
+CROSS-PLATFORM LESSON (baked into both new scripts): pass temp paths to python via ARGV — MSYS
+translates POSIX->Windows for python.exe; `$TMP` embedded in a `python -c` string is NOT translated
+and fails ("No such file or directory") on Windows. Use `python - "$TMP/x" <<'PY' ... sys.argv[1]`.
+
+Coord: collectors+status announce c3ccaa07. Next: wire neighbor-collect redundancy into
+optimize-radios.sh; Floor-4 2.4 power-down pilot (still nothing applied to live radios).
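The dfs-check.sh classification summarized in this update can be sketched in Python so it can be unit-tested offline. It covers the radar-pattern grep, the "cached ifindex" exclusion, DFS defined as 5 GHz channels 52-144, and keeping the last four matching lines per AP. This is a hedged sketch. The script's real `PAT` value does not appear in this diff, so `RADAR_PATTERN` below is a hypothetical stand-in. `ApRadarResult`, `radar_events`, and `verdict` are illustrative names, not part of the skill.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# HYPOTHETICAL stand-in for the script's PAT (its real value is not shown in the diff).
RADAR_PATTERN = re.compile(r"radar|\bcac\b|\bdfs\b", re.IGNORECASE)
# 5 GHz DFS channels 52-144 inclusive, matching the range used by dfs-check.sh.
DFS_CHANNELS = range(52, 145)


@dataclass
class ApRadarResult:
    """Hypothetical per-AP record: name, current 5 GHz channel, matched dmesg lines."""
    name: str
    channel: Optional[int]
    events: List[str] = field(default_factory=list)

    @property
    def on_dfs(self) -> bool:
        return self.channel is not None and self.channel in DFS_CHANNELS


def radar_events(dmesg_text: str, keep_last: int = 4) -> List[str]:
    """Return radar/CAC lines from one AP's dmesg, minus 'cached ...' false positives.

    Roughly mirrors: grep -iE "$PAT" | grep -iv cached | tail -4
    """
    hits = [line for line in dmesg_text.splitlines()
            if RADAR_PATTERN.search(line) and "cached" not in line.lower()]
    return hits[-keep_last:] if keep_last > 0 else []


def verdict(results: List[ApRadarResult]) -> str:
    """Site-level summary: OK if no AP logged events, otherwise a WARNING with counts."""
    hit = [r for r in results if r.events]
    if not hit:
        return "[OK] no real radar/DFS events in any AP's dmesg; DFS appears low-risk"
    on_dfs = sum(1 for r in hit if r.on_dfs)
    return (f"[WARNING] {len(hit)} AP(s) logged radar/DFS events ({on_dfs} currently on DFS); "
            "prefer non-DFS (UNII-1 36-48 + UNII-3 149-165)")
```

With word boundaries, `\bcac\b` would not match "cached" on its own. The explicit "cached" exclusion is kept to mirror the script's `grep -iv 'cached'`, which guards against looser patterns.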