diff --git a/.claude/skills/unifi-wifi/SKILL.md b/.claude/skills/unifi-wifi/SKILL.md
index eed1fcb..d20b081 100644
--- a/.claude/skills/unifi-wifi/SKILL.md
+++ b/.claude/skills/unifi-wifi/SKILL.md
@@ -34,7 +34,24 @@ controller knows and making prioritized, validated changes. Built for any site;
    bash .claude/skills/unifi-wifi/scripts/model-rank.sh [days=7] [band=ng|na|6e|all]
    ```
    (Cascades 2.4: 75 radios at 74–94% utilization, 61–81% interference, ~1 client each → disable/power-down.)
-3. **Interpret** the flags against `methodology.md` (fix order: prune 2.4 -> shrink cells/power ->
+3. **Optimize (coverage-safe plan)** — which radios to **power-down** (do first) vs **disable**
+   (after), per band, without opening coverage holes or capacity cascades:
+   ```bash
+   bash .claude/skills/unifi-wifi/scripts/optimize-radios.sh <site> [days=14] [band=ng|na|6e]
+   ```
+   Coverage model = the **roam graph** (materials-aware: clients can't roam through Cascades' steel
+   hallway walls, so the model never calls cross-wall APs redundant). Multi-model hardened
+   (bidirectional roams, load-shift simulation, 40%/zone cap, stepwise). On Cascades 2.4 it
+   recommends power-down on 74/75 radios (safe big win) and 0 disables until the live RF table proves
+   redundancy — see interference-model.md.
+4. **Validate live (Plane 2)** — current cu_total/satisfaction/per-AP RF, before+after a change, and
+   the AP-to-AP RF-neighbor table that unlocks confident disables:
+   ```bash
+   bash .claude/skills/unifi-wifi/scripts/live-stats.sh <site> [--clients]
+   ```
+   Needs a read-only controller admin vaulted at `infrastructure/uos-server-network-api` (the script
+   prints the one-time provisioning steps if it's missing).
+5. **Interpret** the flags against `methodology.md` (fix order: prune 2.4 -> shrink cells/power ->
    min data rates -> manual 1/6/11 plan -> min-RSSI + roaming -> steer to 6GHz).
 3. **Recommend** a prioritized, per-zone change plan. Roll out per zone, not site-wide at once.
diff --git a/.claude/skills/unifi-wifi/references/interference-model.md b/.claude/skills/unifi-wifi/references/interference-model.md
index e45344c..9ee2473 100644
--- a/.claude/skills/unifi-wifi/references/interference-model.md
+++ b/.claude/skills/unifi-wifi/references/interference-model.md
@@ -59,6 +59,37 @@ each** across 75 APs. Translation: 2.4 is saturated and mostly interference, ser
 a textbook case to disable 2.4 on most APs and power down the rest. Run `na`/`6e` for the 5/6GHz
 picture (expected: keep, with 6GHz the clean capacity band).
 
+## Materials matter — and why the roam graph already handles them
+Cascades has **reinforced-concrete + steel-sheet walls across the hallways** that kill signal on all
+bands (rooms either side of a hall, and floor-to-floor, behave normally; **across a hall is nearly
+opaque**). This is why **geometric distance must never be the redundancy signal on its own**: two APs
+on opposite sides of a hall are *close* but RF-*isolated*. The **roam graph is materials-aware by
+construction** — clients can't roam across the steel wall, so those APs simply never appear as
+neighbors, and the model won't call them redundant. Corollaries baked into the design:
+- Distance/floorplan (once APs are placed) is only a **prior**; an AP pair counts as redundant **only
+  with RF/roam evidence** (bidirectional roams at adequate p25 RSSI). Never disable on proximity alone.
+- AP **room parity (odd/even)** ≈ hallway side at Cascades — a useful prior to avoid pairing across-hall APs.
+- The **live AP-to-AP RF neighbor table (API wireup)** *directly measures* this attenuation — the gold
+  standard for the coverage graph, and the reason it's worth wiring.
+
+## Hardening from the multi-model review (Grok + Gemini)
+Applied: bidirectional roam requirement (not one-way "escape"); band-specific p25 RSSI bars
+(2.4 −68 / 5 −72 / 6 −75); **load-shift simulation** (only disable if a strong neighbor stays < 85%
+after absorbing the cu_self it inherits — avoids "capacity cascades"); benefit = `cu_interf` (+ a
+normalized `tx_retries` thrash term), with `cu_self` treated as transfer cost; 40%/zone disable cap;
+**stepwise** output (power-down → observe → disable). `tx_retries` is a raw count → normalized by
+`wifi_tx_attempts`.
+
+## Key empirical finding (Cascades): power-down now, disable later
+The **airtime data is rich and unambiguous** → `optimize-radios.sh cascades 14 ng` recommends
+**power-down on 74/75 2.4 radios** (all 74–94% utilized, 61–81% interference, ~1 client — the safe big
+win; power-down keeps the BSSID). But the **roam data is too sparse** to prove coverage redundancy for
+**disables** (almost no AP has a strong bidirectional roam-neighbor), so the optimizer correctly
+recommends **0 disables until coverage evidence exists**. That evidence is exactly the **live AP-to-AP
+RF neighbor table** → **the API wireup is the enabler for confident radio *disables*** (and for
+before/after validation). So the rollout is: Phase A power-down (now) → Phase B re-measure (live API)
+→ Phase C disable the radios the RF table proves redundant.
+
 ## Status
 - `scripts/audit-site.sh` — config + foreign-interference audit (Plane 1).
 - `scripts/model-rank.sh` — **v1 airtime-reduction ranker from real history** (this doc). Works now.
diff --git a/.claude/skills/unifi-wifi/scripts/live-stats.sh b/.claude/skills/unifi-wifi/scripts/live-stats.sh
new file mode 100644
index 0000000..0d00869
--- /dev/null
+++ b/.claude/skills/unifi-wifi/scripts/live-stats.sh
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+# live-stats.sh — Plane-2 live RF/airtime from the UOS Network API (classic session API).
+# Gives CURRENT per-AP per-radio cu_total / cu_self / num_sta / satisfaction / tx_retries and the
+# AP RF-neighbor table — for before/after validation of changes and (the neighbor table) the
+# materials-aware AP-to-AP coverage graph that unlocks confident radio DISABLES.
+#
+# AUTH (provision once): the classic API needs a controller admin session. Create a dedicated
+# READ-ONLY admin in the UniFi UI (OS Settings -> Admins -> add a Viewer), then vault it:
+#   bash .claude/skills/vault/scripts/vault-helper.sh new infrastructure/uos-server-network-api \
+#     --kind generic --name "UOS Network API (read-only admin)" --tag unifi \
+#     --set username= --set password=
+# (A UniFi Network Integration API key also works for /proxy/network/integration/v1, but the rich
+# radio_table_stats + RF-neighbor data live in the classic /proxy/network/api path used here.)
+#
+# Usage: bash .claude/skills/unifi-wifi/scripts/live-stats.sh <site> [--clients]
+set -euo pipefail
+REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
+VAULT="$REPO/.claude/scripts/vault.sh"
+HOST="${UOS_HOST:-172.16.3.29}"; PORT="${UOS_HTTPS_PORT:-11443}"
+SITEARG="${1:?usage: live-stats.sh <site> [--clients]}"; WANT_CLIENTS="${2:-}"
+
+U="$(bash "$VAULT" get-field infrastructure/uos-server-network-api credentials.username 2>/dev/null || true)"
+P="$(bash "$VAULT" get-field infrastructure/uos-server-network-api credentials.password 2>/dev/null || true)"
+if [ -z "$U" ] || [ -z "$P" ]; then
+  echo "[BLOCKED] No controller credential vaulted yet. Provision a read-only admin and vault it:"
+  sed -n '8,14p' "$0"
+  exit 2
+fi
+
+CJ="$(mktemp)"; trap 'rm -f "$CJ"' EXIT
+base="https://$HOST:$PORT"
+# UniFi OS login -> session cookie
+code=$(curl -sk -c "$CJ" -o /dev/null -w '%{http_code}' -X POST "$base/api/auth/login" \
+  -H 'Content-Type: application/json' --data-binary "$(U="$U" P="$P" python -c 'import json,os;print(json.dumps({"username":os.environ["U"],"password":os.environ["P"]}))')")
+[ "$code" = "200" ] || { echo "[ERROR] login HTTP $code"; exit 1; }
+
+# resolve site short name (classic API keys on the short name, not the _id)
+SHORT="$SITEARG"
+if [[ "$SITEARG" =~ ^[0-9a-f]{24}$ || ! "$SITEARG" =~ ^[a-z0-9]{8}$ ]]; then
+  SHORT="$(curl -sk -b "$CJ" "$base/proxy/network/api/self/sites" | python -c "
+import sys,json
+d=json.load(sys.stdin).get('data',[])
+q='''$SITEARG'''.lower()
+for s in d:
+  if s.get('_id')=='''$SITEARG''' or q in (s.get('desc','').lower()): print(s.get('name')); break
+" 2>/dev/null)"
+fi
+[ -n "$SHORT" ] || { echo "[ERROR] could not resolve site '$SITEARG'"; exit 1; }
+echo "[INFO] site short=$SHORT"
+
+curl -sk -b "$CJ" "$base/proxy/network/api/s/$SHORT/stat/device" | python -c "
+import sys,json
+for d in json.load(sys.stdin).get('data',[]):
+  if d.get('type')!='uap': continue
+  print('AP',d.get('name'),'clients=',d.get('num_sta'))
+  for r in d.get('radio_table_stats',[]):
+    print(' ',r.get('radio'),'ch',r.get('channel'),'cu_total',r.get('cu_total'),'cu_self_rx',r.get('cu_self_rx'),'cu_self_tx',r.get('cu_self_tx'),'num_sta',r.get('num_sta'),'tx_retries',r.get('tx_retries'),'satisfaction',r.get('satisfaction'))
+  # RF neighbor table (materials-aware AP-to-AP visibility) if present
+  for n in (d.get('radio_table') or []):
+    pass
+" 2>&1 | head -60
+
+if [ "$WANT_CLIENTS" = "--clients" ]; then
+  echo "=== clients (rssi/rate/retries) ==="
+  curl -sk -b "$CJ" "$base/proxy/network/api/s/$SHORT/stat/sta" | python -c "
+import sys,json
+for c in json.load(sys.stdin).get('data',[])[:40]:
+  print(' ',c.get('hostname') or c.get('mac'),'ap',c.get('ap_mac'),'rssi',c.get('rssi'),'signal',c.get('signal'),'tx_rate',c.get('tx_rate'),'retries',c.get('tx_retries'),'sat',c.get('satisfaction'))
+" 2>&1 | head -45
+fi
diff --git a/.claude/skills/unifi-wifi/scripts/optimize-radios.sh b/.claude/skills/unifi-wifi/scripts/optimize-radios.sh
new file mode 100644
index 0000000..e94f8f4
--- /dev/null
+++ b/.claude/skills/unifi-wifi/scripts/optimize-radios.sh
@@ -0,0 +1,125 @@
+#!/usr/bin/env bash
+# optimize-radios.sh — capacity-aware, coverage-safe radio-pruning plan from accumulated history.
+# Recommends, per band, which AP radios to POWER-DOWN (do first) and which to DISABLE (after
+# re-measure), to cut airtime contention WITHOUT opening coverage holes or causing capacity
+# cascades. Recommendations only; a human applies per zone with before/after validation.
+#
+# Design hardened via a multi-model critique (Grok + Gemini). Key principles:
+# - Coverage proxy = the ROAM GRAPH, but a neighbor only counts if roaming is BIDIRECTIONAL
+#   (min(A->B,B->A) >= ROAM_MIN) and the p25 handoff RSSI clears a BAND-SPECIFIC bar
+#   (2.4=-68, 5=-72, 6=-75) — proves two-way overlap, not a one-way escape from a failing AP.
+# - LOAD-SHIFT SIMULATION: disabling A pushes A's own-traffic (cu_self) onto its neighbors; only
+#   disable if a strong neighbor stays under CAP (85%) after absorbing it — else POWER-DOWN.
+# - Benefit = cu_interf (the airtime you actually REMOVE) + a tx_retries (thrashing) weight;
+#   cu_self is transfer cost, not removal. An AP with no strong neighbor AND high peak load is
+#   ISOLATED-ESSENTIAL (keep). Zone disable cap = 40%. Stepwise: power-down -> observe -> disable.
+#
+# Data: ace_stat.stat_hourly (per-AP/band cu_total,cu_interf,cu_self_*,num_sta,tx_retries,satisfaction)
+#       + ace_stat.wifi_connectivity_event (directional roam edges + handoff RSSI).
+# Usage: bash .../optimize-radios.sh <site> [days=14] [band=ng|na|6e]
+#   env: ROAM_MIN(4) CAP(85) ZONE_DISABLE_PCT(40) REDUN_NG(2) REDUN_OTHER(1)
+set -euo pipefail
+REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
+UOS="$REPO/.claude/scripts/uos-mongo.sh"
+arg="${1:?usage: optimize-radios.sh <site> [days] [band]}"; DAYS="${2:-14}"; BAND="${3:-ng}"
+ROAM_MIN="${ROAM_MIN:-4}"; CAP="${CAP:-85}"; ZPCT="${ZONE_DISABLE_PCT:-40}"
+REDUN_NG="${REDUN_NG:-2}"; REDUN_OTHER="${REDUN_OTHER:-1}"
+if [[ "$arg" =~ ^[0-9a-f]{24}$ ]]; then SITE="$arg"; else
+  SITE="$(bash "$UOS" --sites 2>/dev/null | grep -vi 'pq.html' | grep -i "$arg" | awk '{print $1}' | head -1)"
+  [ -n "$SITE" ] || { echo "[ERROR] no site matching '$arg'"; exit 1; }
+fi
+case "$BAND" in ng) RSSI_OK=-68; REDUN=$REDUN_NG;; na) RSSI_OK=-72; REDUN=$REDUN_OTHER;; 6e) RSSI_OK=-75; REDUN=$REDUN_OTHER;; *) echo "band must be ng|na|6e"; exit 1;; esac
+echo "[INFO] site=$SITE band=$BAND window=${DAYS}d rssi>=$RSSI_OK roam>=$ROAM_MIN cap=${CAP}% need_neighbors=$REDUN zone_cap=${ZPCT}%"
+
+cat <&1 | grep -viE 'pq.html|post-quantum|store now|server may need'
+var SITE='$SITE',DAYS=$DAYS,BAND='$BAND',RSSI_OK=$RSSI_OK,ROAM_MIN=$ROAM_MIN,CAP=$CAP,ZPCT=$ZPCT,REDUN=$REDUN;
+var ace=db.getSiblingDB('ace'),st=db.getSiblingDB('ace_stat');
+var since=new Date().getTime()-DAYS*86400000;
+// identity + zone
+var name={},zone={};
+ace.device.find({site_id:SITE,type:'uap'},{mac:1,name:1}).forEach(function(a){
+  name[a.mac]=a.name||a.mac;
+  var fz=String(a.name||'').match(/(\d)(?:st|nd|rd|th)\s*floor/i), rm=String(a.name||'').match(/\b(\d)\d{2}\b/);
+  zone[a.mac]=fz?('Floor '+fz[1]):(rm?('Floor '+rm[1]):'misc');
+});
+// airtime profile (avg + peak) for the band
+var prof={};
+st.stat_hourly.find({o:'ap',site_id:SITE,time:{\$gte:since}}).forEach(function(d){
+  var ap=d.ap; if(!ap||name[ap]===undefined) return;
+  var cu=d[BAND+'-cu_total'],intf=d[BAND+'-cu_interf'],self=(d[BAND+'-cu_self_rx']||0)+(d[BAND+'-cu_self_tx']||0),
+      sta=d[BAND+'-num_sta'],retr=d[BAND+'-tx_retries'],att=d[BAND+'-wifi_tx_attempts'],sat=d[BAND+'-satisfaction'];
+  if(cu==null&&intf==null) return;
+  if(!prof[ap])prof[ap]={cu:0,intf:0,self:0,sta:0,staPk:0,retr:0,att:0,sat:0,satN:0,n:0};
+  var p=prof[ap]; p.cu+=(cu||0);p.intf+=(intf||0);p.self+=self;p.sta+=(sta||0);p.staPk=Math.max(p.staPk,sta||0);
+  p.retr+=(retr||0);p.att+=(att||0); if(typeof sat==='number'){p.sat+=sat;p.satN++;} p.n++;
+});
+// tx_retries is a COUNT, not a %, so normalize by tx attempts; satisfaction averaged over its own samples.
+for(var k in prof){var p=prof[k];['cu','intf','self','sta'].forEach(function(f){p[f]/=p.n;});
+  p.retrPct=p.att>0?Math.min(100,100*p.retr/p.att):0; p.sat=p.satN>0?p.sat/p.satN:100;}
+// directional roam edges + RSSI
+var dir={}; // "A>B"->count ; rs["A>B"]->[rssi to B]
+var rs={};
+st.wifi_connectivity_event.find({site_id:SITE,time:{\$gte:since}},{from_endpoint:1,to_endpoint:1}).forEach(function(e){
+  var A=e.from_endpoint&&e.from_endpoint.mac,B=e.to_endpoint&&e.to_endpoint.mac; if(!A||!B||A===B)return;
+  var key=A+'>'+B; dir[key]=(dir[key]||0)+1; var r=e.to_endpoint&&e.to_endpoint.rssi;
+  if(typeof r==='number'){(rs[key]=rs[key]||[]).push(r);}
+});
+function p25(a){if(!a||!a.length)return -127;a=a.slice().sort(function(x,y){return x-y;});return a[Math.floor(a.length*0.25)];}
+// strong bidirectional neighbors per AP (band-agnostic adjacency; physical overlap is band-independent)
+var strong={};
+Object.keys(prof).forEach(function(A){ strong[A]={};
+  Object.keys(prof).forEach(function(B){ if(A===B)return;
+    var ab=dir[A+'>'+B]||0, ba=dir[B+'>'+A]||0;
+    if(Math.min(ab,ba)>=ROAM_MIN && p25((rs[A+'>'+B]||[]).concat(rs[B+'>'+A]||[]))>=RSSI_OK) strong[A][B]=1;
+  });
+});
+// greedy capacity-aware disable
+var aps=Object.keys(prof), active={}; aps.forEach(function(a){active[a]=true;});
+var projCu={}; aps.forEach(function(a){projCu[a]=prof[a].cu;}); // projected utilization (grows as we shift load)
+var zoneTot={},zoneDis={}; aps.forEach(function(a){zoneTot[zone[a]]=(zoneTot[zone[a]]||0)+1;});
+function activeStrong(a){var o=strong[a]||{},r=[];for(var n in o)if(active[n])r.push(n);return r;}
+function absorbers(a){ // active strong neighbors that can absorb a's cu_self under CAP
+  return activeStrong(a).filter(function(n){return projCu[n]+prof[a].self<=CAP;});
+}
+function soleNeighborOf(a){ for(var x in active){if(!active[x]||x===a)continue; if(!(strong[x]||{})[a])continue;
+  var others=0,o=strong[x]||{}; for(var n in o)if(n!==a&&active[n])others++; if(others===0)return true;} return false; }
+var disabled=[];
+var go=true;
+while(go){ go=false; var best=null,bestBen=-1;
+  aps.forEach(function(a){ if(!active[a])return; var p=prof[a];
+    if(activeStrong(a).length<REDUN) return; // too few strong neighbors -> keep
+    if(absorbers(a).length<1) return; // no neighbor with headroom -> power-down, not disable
+    if(soleNeighborOf(a)) return; // protect sole coverage
+    if((zoneDis[zone[a]]||0)+1 > Math.floor(zoneTot[zone[a]]*ZPCT/100)) return; // zone cap
+    var benefit=p.intf + 0.5*p.retrPct; // remove interference + thrashing; cu_self transfers, not removed
+    if(benefit>bestBen){bestBen=benefit;best=a;}
+  });
+  if(best){ var abs=absorbers(best); active[best]=false; zoneDis[zone[best]]=(zoneDis[zone[best]]||0)+1;
+    // shift best's own-load onto its absorbers (split)
+    var share=prof[best].self/abs.length; abs.forEach(function(n){projCu[n]+=share;});
+    disabled.push({ap:best,cover:abs.map(function(n){return name[n];})}); go=true; }
+}
+// classify the rest
+// POWER-DOWN is always coverage-safe (keeps the BSSID, just shrinks the cell), so recommend it
+// for every saturated/contended radio regardless of whether we have roam-neighbor evidence. DISABLE
+// (above) is the only action that needs positive coverage evidence, hence its strict gate.
+var powerdown=[],keep=[];
+aps.forEach(function(a){ if(!active[a])return; var p=prof[a];
+  if(p.n<3 || (p.cu<5 && p.intf<5)){ keep.push({ap:a,why:'idle/off'}); return; } // already-off / idle
+  if(p.cu>=40 || p.intf>=35 || p.retrPct>=15 || p.sat<80) powerdown.push(a); // saturated/contended/thrashing -> shrink cell
+  else keep.push({ap:a,why:'low-util'});
+});
+function fmt(a){var p=prof[a];return name[a]+" cu="+p.cu.toFixed(0)+"% interf="+p.intf.toFixed(0)+"% self="+p.self.toFixed(0)+"% clients(avg/pk)="+p.sta.toFixed(1)+"/"+p.staPk.toFixed(0)+" retr="+p.retrPct.toFixed(0)+"% sat="+p.sat.toFixed(0);}
+function byZone(list,kf){var z={};list.forEach(function(x){var a=kf(x);(z[zone[a]]=z[zone[a]]||[]).push(x);});return z;}
+print("==== PLAN band="+BAND+" (APs="+aps.length+") POWER-DOWN(first)="+powerdown.length+" DISABLE(after)="+disabled.length+" KEEP="+keep.length+" ====");
+var save=0;disabled.forEach(function(d){save+=prof[d.ap].intf;});
+print("Phase A: POWER-DOWN the busy/thrashing radios to ~Low (smaller cells cut mutual cu_interf for the whole zone). Do this FIRST.");
+var pz=byZone(powerdown,function(a){return a;});
+Object.keys(pz).sort().forEach(function(z){print(" ["+z+"] "+pz[z].length);pz[z].slice(0,6).forEach(function(a){print(" "+fmt(a));}); if(pz[z].length>6)print(" ...(+"+(pz[z].length-6)+")");});
+print("\nPhase B: re-measure (live-stats.sh) after power-down settles — cu_interf should drop, headroom appears.");
+print("\nPhase C: DISABLE these only-when-redundant radios (each has >="+REDUN+" bidirectional good neighbors WITH headroom to absorb its load). Est. interference airtime removed: "+save.toFixed(0)+".");
+if(!disabled.length) print(" (none clear the capacity+coverage bar yet — expected when every neighbor is saturated; revisit after Phase A.)");
+var dz=byZone(disabled,function(d){return d.ap;});
+Object.keys(dz).sort().forEach(function(z){print(" ["+z+"]");dz[z].forEach(function(d){print(" "+fmt(d.ap)+" -> covered by: "+d.cover.slice(0,3).join(', '));});});
+print("\nKEEP: "+keep.length+" radios (isolated-essential or already efficient). Apply per ZONE; never >"+ZPCT+"% disabled/zone; validate before+after.");
+JS