unifi-wifi: data-driven channel selection — add survey-report, kill non-DFS bias

Codifies the scan-first/data-driven workflow proven on Cascades (where the baked-in non-DFS bias picked the congested channels and a data-driven DFS plan halved 5GHz retry):

- NEW survey-report.py: rolls survey-collect JSON into the fleet per-channel/per-band-group measured busy% table + cleanest/dirtiest ranking + a suggested clean 40MHz palette. The decision-driver that was missing (we built it by hand).
- channel-plan.sh: na palette is now DATA-DRIVEN, not hardcoded non-DFS. Adds --channels (explicit palette) + --dfs ok|avoid|only; default considers ALL 40MHz primaries and lets measured busy% choose. Adds load-balancing + a local-search pass -> strong co-channel to 0.
- survey-collect.sh: per-AP "cleanest" report no longer pre-filters out DFS (DFS is usually cleanest here); marks DFS with *, points at survey-report.
- SKILL.md: documents the mandatory scan -> survey-report -> channel-plan --channels -> apply -> validate order + the Cascades lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 05:00:47 -07:00
parent e5193b4f13
commit fb835fe756
4 changed files with 153 additions and 18 deletions
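The cost function and local-search pass that this commit adds to channel-plan.sh can be tried in isolation. The following is a minimal sketch, not the script itself: the three AP names, the `adj` neighbor map and the `busy` table are hypothetical toy data, the helper names `collisions` and `plan_channels` are illustrative, and the greedy ordering (most strong neighbors first) is an assumption, because the script's own `order` is built outside the hunks shown in this diff. The weights (1000 per co-channel neighbor, 10 per AP already on the channel, plus measured busy%) and the stop rule mirror the diff.

```python
from collections import Counter

def collisions(ap, ch, plan, adj):
    """Number of strong neighbors of `ap` currently planned onto channel `ch`."""
    return sum(1 for n in adj[ap] if plan.get(n) == ch)

def cost(ap, ch, plan, adj, busy):
    """Collision term dominates, then load balance, then measured busy%."""
    load = Counter(plan.values()).get(ch, 0)      # APs already on this channel
    bz = busy.get(ap, {}).get(ch, 50)             # unmeasured channels default to 50
    return collisions(ap, ch, plan, adj) * 1000 + load * 10 + bz

def plan_channels(aps, adj, palette, busy, max_iter=8000):
    plan = {}
    # Greedy pass. Assumption: "most constrained" = most strong neighbors.
    for ap in sorted(aps, key=lambda a: -len(adj[a])):
        plan[ap] = min(palette, key=lambda ch: cost(ap, ch, plan, adj, busy))
    # Local search: move the worst-colliding AP until it cannot improve.
    for _ in range(max_iter):
        worst = max(aps, key=lambda a: collisions(a, plan[a], plan, adj))
        cur = collisions(worst, plan[worst], plan, adj)
        if cur == 0:
            break
        alt = min(palette, key=lambda ch: cost(worst, ch, plan, adj, busy))
        if collisions(worst, alt, plan, adj) < cur:
            plan[worst] = alt
        else:
            break
    return plan

# Hypothetical toy fleet: three APs that all hear each other strongly.
aps = ["ap-a", "ap-b", "ap-c"]
adj = {"ap-a": ["ap-b", "ap-c"], "ap-b": ["ap-a", "ap-c"], "ap-c": ["ap-a", "ap-b"]}
busy = {ap: {52: 2, 60: 3, 100: 1} for ap in aps}

plan = plan_channels(aps, adj, [52, 60, 100], busy)
print(plan)
```

With as many palette channels as mutually audible APs, the collision term forces every AP onto its own channel and busy% only decides which AP gets the cleanest one. Note that the search stops as soon as the worst AP cannot improve, so on denser graphs it can end with co-channel pairs remaining.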
--- a/.claude/skills/unifi-wifi/SKILL.md
+++ b/.claude/skills/unifi-wifi/SKILL.md
@@ -255,14 +255,34 @@ Closing an internet-facing PPTP usually = `pf-set-ports VPN 80,443` (drop tcp 17
 admin REST (`rest/portforward|firewallrule|firewallgroup`). `block-ips` clones an existing WAN_IN rule's
 schema for firmware compatibility — verify the new rule's precedence in the UI. Dry-run validated 2026-06-16
 on Grabb & Durando (USG-3P): identified the live `VPN` forward (80,443,1723→.200) + `GRE` WAN_IN accept.
+**Channel data-gathering — the MANDATORY scan-first workflow (do this BEFORE any channel change).**
+Choosing channels without the measured scan is how you pick the *congested* channels. Order:
+```bash
+# 1. AP-to-AP SNR neighbor matrix (who-hears-who = the co-channel graph)
+NBR_JSON=.claude/tmp/<site>-nbr.json neighbor-collect.sh <site>
+# 2. measured per-channel busy%/noise for EVERY AP (iw survey) -> SURVEY_JSON
+SURVEY_JSON=.claude/tmp/<site>-survey.json survey-collect.sh <site>     # be patient: let it reach every AP (e.g. 74/74), not a partial run
+# 3. FLEET congestion analysis -> the data that makes the channel choice (incl. DFS-vs-non-DFS)
+python scripts/survey-report.py .claude/tmp/<site>-survey.json na       # per-channel busy% + suggested clean palette
+```
+`survey-report.py` is the **decision-driver**: it rolls the survey up into a per-channel / per-band-group
+measured busy% table, ranks cleanest↔dirtiest, and prints a suggested clean 40MHz palette to feed
+channel-plan via `--channels`. **DFS channels are usually the cleanest** (consumer gear avoids them) — do
+NOT assume non-DFS; let this report decide (then weigh the radar-vacate tradeoff via `dfs-check.sh`).
+
 **Channel plan — `scripts/channel-plan.sh`** (computes + applies a co-channel-minimizing plan):
 ```bash
 NEIGHBOR_JSON=...nbr.json SURVEY_JSON=...survey.json \
-  channel-plan.sh <site> ng|na [--apply]    # ng: 1/6/11 graph-color; na: cleanest NON-DFS + separation
+  channel-plan.sh <site> ng|na [--channels 52,60,100,108,116,124,132,140] [--dfs ok|avoid|only] [--apply]
 ```
-ng uses the neighbor matrix to graph-color 1/6/11; na picks each AP's lowest-cost non-DFS channel
-(measured busy% + neighbor-collision penalty). Reports co-channel pairs before/after. (Cascades dry-run:
-ng 92→35 pairs; na 20→0 + all off DFS.) `survey-collect.sh` emits its JSON via `SURVEY_JSON=<path>`.
+ng graph-colors 1/6/11. **na palette is DATA-DRIVEN, not a hardcoded non-DFS list** (that bias picked the
+congested 149/157 at Cascades): pass `--channels` from survey-report, or `--dfs only|avoid`, else the
+default considers ALL 40MHz primaries and the measured busy% chooses. Cost = 1000×co-channel collisions (dominant) +
+10×APs already on the channel (load balance) + measured busy% (na only); a local-search pass drives strong co-channel pairs toward **0**. Reports
+before/after. `survey-collect.sh` emits its JSON via `SURVEY_JSON=<path>`.
+**LESSON (Cascades 2026-06-19):** a blind non-DFS reshuffle moved APs onto the 2 dirtiest channels (149=12%,
+157=28% busy) and did nothing; the data-driven clean-DFS plan (52/60/100/108/116/124/132/140, all ≤3% busy,
+0 co-channel) **halved 5GHz retry (8.7→3.8)**. Always: scan → survey-report → channel-plan --channels → apply → validate.
 GOTCHA (handled): a manual min rate is only honored when `minrate_setting_preference=manual` — the
 script sets it; `minrate ... auto` hands rate management back to the controller. Write path validated
 2026-06-16 on a 0-client WLAN (Green Valley Computer Club) — apply->verify->restore.
--- a/.claude/skills/unifi-wifi/scripts/channel-plan.sh
+++ b/.claude/skills/unifi-wifi/scripts/channel-plan.sh
@@ -15,7 +15,12 @@ REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
 UOS="$REPO/.claude/scripts/uos-mongo.sh"; VAULT="$REPO/.claude/scripts/vault.sh"
 HOST="${UOS_HOST:-172.16.3.29}"; PORT="${UOS_HTTPS_PORT:-11443}"
 SITEARG="${1:?usage: channel-plan.sh <site> <ng|na> [--apply]}"; BAND="${2:?band ng|na}"; APPLY=0
-shift 2; while [ $# -gt 0 ]; do case "$1" in --apply) APPLY=1; shift;; *) shift;; esac; done
+CHANS=""; DFSPOL=""
+shift 2; while [ $# -gt 0 ]; do case "$1" in
+  --apply) APPLY=1; shift;;
+  --channels) CHANS="${2:?--channels needs a value}"; shift 2;;   # explicit palette, e.g. 52,60,100,108,116,124,132,140; fail fast if the value is missing
+  --dfs) DFSPOL="${2:?--dfs needs ok|avoid|only}"; shift 2;;      # ok|avoid|only (default: ok = all); fail fast if the value is missing
+  *) shift;; esac; done
 case "$BAND" in ng|na) ;; *) echo "band must be ng|na (channel planning is for 2.4/5GHz)"; exit 1;; esac
 NEIGHBOR_JSON="${NEIGHBOR_JSON:-}"; SURVEY_JSON="${SURVEY_JSON:-}"; SNR_MIN="${NBR_SNR_MIN:-20}"
 [ -n "$NEIGHBOR_JSON" ] && [ -f "$NEIGHBOR_JSON" ] || { echo "[ERROR] NEIGHBOR_JSON required (run neighbor-collect.sh with NBR_JSON=...)"; exit 1; }
@@ -26,14 +31,26 @@ if [[ "$SITEARG" =~ ^[0-9a-f]{24}$ ]]; then SITE="$SITEARG"; else
  SITE="$(bash "$UOS" --sites 2>/dev/null | grep -vi 'pq.html' | grep -i "$SITEARG" | awk '{print $1}' | head -1)"; fi
 [ -n "$SITE" ] || { echo "[ERROR] site not found"; exit 1; }
 echo "[INFO] channel-plan site=$SITE band=$BAND  mode=$([ $APPLY = 1 ] && echo APPLY || echo DRY-RUN)  (neighbor=$NEIGHBOR_JSON survey=${SURVEY_JSON:-none})"
-export CP_SITE="$SITE" CP_BAND="$BAND" CP_APPLY="$APPLY" CP_NBR="$NEIGHBOR_JSON" CP_SURVEY="$SURVEY_JSON" CP_SNR="$SNR_MIN" RW_U="$U" RW_P="$P" REPO
+export CP_SITE="$SITE" CP_BAND="$BAND" CP_APPLY="$APPLY" CP_NBR="$NEIGHBOR_JSON" CP_SURVEY="$SURVEY_JSON" CP_SNR="$SNR_MIN" CP_CHANS="$CHANS" CP_DFS="$DFSPOL" RW_U="$U" RW_P="$P" REPO
 python - <<'PY'
 import os,sys,json,ssl,urllib.request,http.cookiejar
 band=os.environ['CP_BAND']; apply=os.environ['CP_APPLY']=='1'; SNR=int(os.environ['CP_SNR'])
 nbr=json.load(open(os.environ['CP_NBR']))
 survey=json.load(open(os.environ['CP_SURVEY'])) if os.environ.get('CP_SURVEY') and os.path.exists(os.environ['CP_SURVEY']) else {}
 sb={'ng':'2.4','na':'5'}[band]
-CH = [1,6,11] if band=='ng' else [36,40,44,48,149,153,157,161]   # non-DFS only (radar-safe)
+# Palette: data-driven. ng is always 1/6/11. na: --channels overrides; else --dfs policy decides.
+# Default (na) is NOT a hardcoded non-DFS list anymore -- that baked-in bias picked the congested
+# channels at Cascades. Run survey-report.py first and pass its suggested palette via --channels.
+_chans=os.environ.get('CP_CHANS','').strip()
+_dfs=os.environ.get('CP_DFS','').strip().lower()
+NONDFS_NA=[36,40,44,48,149,153,157,161]; DFS_NA=[52,60,100,108,116,124,132,140]; ALL_NA=sorted(NONDFS_NA+DFS_NA)
+if band=='ng':
+    CH=[1,6,11]
+elif _chans:
+    CH=[int(x) for x in _chans.replace(' ','').split(',') if x]
+elif _dfs=='only': CH=DFS_NA
+elif _dfs=='avoid': CH=NONDFS_NA
+else: CH=ALL_NA          # 'ok'/default: consider ALL 40MHz primaries; measured busy% chooses
 H="172.16.3.29";PORT=11443;base=f"https://{H}:{PORT}"
 ctx=ssl.create_default_context();ctx.check_hostname=False;ctx.verify_mode=ssl.CERT_NONE
 cj=http.cookiejar.CookieJar();op=urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj),urllib.request.HTTPSHandler(context=ctx))
@@ -70,16 +87,29 @@ plan={}
 def busy(ap,ch):
    try: return survey.get(ap,{}).get(sb,{}).get(str(ch),50)
    except: return 50
+from collections import Counter
+def cost(a,ch,pl):
+    coll=sum(1 for n in adj[a] if pl.get(n)==ch)             # strong-neighbor co-channel (dominant)
+    load=Counter(pl.values()).get(ch,0)                      # balance APs across the palette
+    bz=busy(a,ch) if band=='na' else 0                       # measured congestion (na)
+    return coll*1000 + load*10 + bz
+# greedy: most-constrained first
 for a in order:
-    best=None;bestcost=1e9
-    for ch in CH:
-        coll=sum(1 for n in adj[a] if plan.get(n)==ch)   # strong neighbors already on this channel
-        cost = coll*1000 + (busy(a,ch) if band=='na' else 0)   # ng: pure separation; na: separation + measured busy
-        if cost<bestcost: bestcost=cost; best=ch
-    plan[a]=best
+    plan[a]=min(CH,key=lambda ch:cost(a,ch,plan))
+# local-search: repeatedly move the worst-collision AP to its best channel until stable
+for _ in range(8000):
+    worst=max(aps,key=lambda a:sum(1 for n in adj[a] if plan.get(n)==plan.get(a)))
+    cur_c=sum(1 for n in adj[worst] if plan.get(n)==plan.get(worst))
+    if cur_c==0: break
+    alt=min(CH,key=lambda ch:cost(worst,ch,plan))
+    if sum(1 for n in adj[worst] if plan.get(n)==alt) < cur_c: plan[worst]=alt
+    else: break
 changes=[(a,cur[a],plan[a]) for a in sorted(plan) if cur.get(a)!=plan[a]]
 print(f"\n==== CHANNEL PLAN band={band}  ({len(aps)} APs, {len(changes)} would change) ====")
-print(f"allowed channels: {CH}  (non-DFS only)")
+_pal='explicit --channels' if _chans else (_dfs+' policy' if _dfs else 'data-driven (all 40MHz primaries; busy% chooses)')
+print(f"allowed channels: {CH}  [{_pal}]")
+if band=='na' and not _chans:
+    print("  TIP: run survey-report.py on the SURVEY_JSON first; pass its suggested clean palette via --channels.")
 for a,c0,c1 in changes[:60]:
    print(f"  {a:<22} {str(c0):>4} -> {c1}   (strong-neighbor adj={len(adj[a])})")
 if len(changes)>60: print(f"  ...(+{len(changes)-60} more)")
--- a/.claude/skills/unifi-wifi/scripts/survey-collect.sh
+++ b/.claude/skills/unifi-wifi/scripts/survey-collect.sh
@@ -101,7 +101,9 @@ for ln in open(sys.argv[1],encoding='utf-8',errors='replace'):
    elif 'channel busy time' in ln: m=re.search(r'(\d+) ms',ln); rec['busy']=int(m.group(1)) if m else 0
 flush()
 print(f"\n==== MEASURED RF OCCUPANCY - cleanest channels per AP ({len(data)} APs) ====")
-print("(in-use channel busy%, then 3 lowest-busy NON-DFS channels measured; * = DFS)\n")
+print("(in-use channel busy%, then 3 lowest-busy channels measured -- ALL channels; * = DFS)")
+print("NOTE: DFS channels are often the cleanest (consumer gear avoids them). Do NOT pre-filter them out;")
+print("      run survey-report.py for the fleet rollup + a data-driven palette, then channel-plan --channels.\n")
 for ap in sorted(data):
    print(f"{ap}:")
    for b in ('2.4','5','6'):
@@ -109,9 +111,9 @@ for ap in sorted(data):
        if not rows: continue
        inuse=[r for r in rows if r[2]]
        iu=f"ch{inuse[0][1]}={inuse[0][0]}%" if inuse else "?"
-        nondfs=sorted([r for r in rows if r[1] not in DFS], key=lambda r:r[0])[:3]
-        clean=", ".join(f"ch{c}({bz}%)" for bz,c,_,_ in nondfs)
-        print(f"   {b}GHz in-use {iu}  | cleanest non-DFS: {clean}")
+        cleanest=sorted(rows, key=lambda r:r[0])[:3]   # ALL channels (DFS + non-DFS), * marks DFS
+        clean=", ".join(f"ch{c}{'*' if c in DFS else ''}({bz}%)" for bz,c,_,_ in cleanest)
+        print(f"   {b}GHz in-use {iu}  | cleanest (all): {clean}")
 OUT=sys.argv[2] if len(sys.argv)>2 else 'NONE'
 if OUT!='NONE':
    j={}
--- a/.claude/skills/unifi-wifi/scripts/survey-report.py
+++ b/.claude/skills/unifi-wifi/scripts/survey-report.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python3
+# survey-report.py -- fleet-wide channel-congestion analysis from a survey-collect JSON.
+#
+# THE DECISION-DRIVER: turns the raw per-AP survey (SURVEY_JSON from survey-collect.sh) into the
+# fleet-wide per-channel measured busy% table + per-band-group rollup + cleanest/dirtiest ranking,
+# so the channel plan is chosen from MEASURED FACTS, not policy. This is what makes the
+# DFS-vs-non-DFS (and any channel) call obvious instead of assumed.
+#
+# WHY THIS EXISTS (Cascades 2026-06-19): the skill collected the survey but never aggregated it, and
+# both survey-collect's report and channel-plan's palette were hardcoded to NON-DFS. On this site the
+# non-DFS channels (149/157) measured 12-28% busy while DFS measured 2-3% -- the opposite of the
+# baked-in assumption. A blind non-DFS plan made things worse; the data-driven DFS plan halved 5GHz
+# retry. Always run THIS before channel-plan, and feed channel-plan a palette derived from it.
+#
+# Usage:
+#   python survey-report.py <SURVEY_JSON> [na|ng]      (band defaults to na)
+#   SURVEY_JSON=.claude/tmp/<site>-survey.json python survey-report.py - na
+#
+# Output: per-channel median/mean/max busy% (n APs), band-group rollup (UNII-1/UNII-2a-DFS/
+# UNII-2c-DFS/UNII-3), the cleanest + dirtiest channels, and a suggested clean 40MHz palette to
+# hand to channel-plan via --channels.
+
+import json, os, sys, statistics as st
+from collections import defaultdict
+
+path = sys.argv[1] if len(sys.argv) > 1 and sys.argv[1] != '-' else os.environ.get('SURVEY_JSON', '')
+band = sys.argv[2] if len(sys.argv) > 2 else 'na'
+if not path or not os.path.exists(path):
+    sys.exit("usage: survey-report.py <SURVEY_JSON> [na|ng]   (or SURVEY_JSON env). file not found.")
+d = json.load(open(path))
+sb = {'na': '5', 'ng': '2.4'}[band]
+
+def band_of(c):
+    if band == 'ng':
+        return '2.4 GHz'
+    if 36 <= c <= 48:  return 'UNII-1 (non-DFS)'
+    if 52 <= c <= 64:  return 'UNII-2a (DFS)'
+    if 100 <= c <= 144: return 'UNII-2c (DFS)'
+    if 149 <= c <= 165: return 'UNII-3 (non-DFS)'
+    return '?'
+
+ch = defaultdict(list)
+for ap, bands in d.items():
+    for c, busy in bands.get(sb, {}).items():
+        try: ch[int(c)].append(busy)
+        except: pass
+if not ch:
+    sys.exit(f"[survey-report] no {band} ({sb}GHz) data in {path}")
+
+print(f"==== FLEET CHANNEL CONGESTION ({band}, {len(d)} APs scanned) -- measured busy% (lower=cleaner) ====")
+print(f"{'ch':>4} {'band':<18} {'median':>7} {'mean':>6} {'max':>5}  n")
+rows = []
+for c in sorted(ch):
+    v = ch[c]; rows.append((st.median(v), c, st.mean(v), max(v), len(v)))
+    print(f"{c:>4} {band_of(c):<18} {st.median(v):>6.0f}% {st.mean(v):>5.0f}% {max(v):>4.0f}% {len(v):>3}")
+
+grp = defaultdict(list)
+for ap, bands in d.items():
+    for c, busy in bands.get(sb, {}).items():
+        try: grp[band_of(int(c))].append(busy)
+        except: pass
+print("\nBY BAND GROUP (median busy% across all AP-channel samples):")
+for g in sorted(grp):
+    v = grp[g]
+    print(f"  {g:<18} median={st.median(v):>4.0f}%  mean={st.mean(v):>4.0f}%  (n={len(v)})")
+
+clean = sorted(rows)            # by median busy asc
+print("\nCLEANEST channels:", ", ".join(f"ch{c}({m:.0f}%)" for m, c, *_ in clean[:8]))
+print("DIRTIEST channels:", ", ".join(f"ch{c}({m:.0f}%)" for m, c, *_ in clean[::-1][:5]))
+
+if band == 'na':
+    # suggest a clean 40MHz palette: lower-primary channels whose 40MHz pair (c, c+4) are both clean
+    medbusy = {c: st.median(v) for c, v in ch.items()}
+    THRESH = max(8, st.median([m for m in medbusy.values()]))  # "clean" = <= this
+    pairs = [(52, 56), (60, 64), (100, 104), (108, 112), (116, 120), (124, 128), (132, 136), (140, 144),
+             (36, 40), (44, 48), (149, 153), (157, 161)]
+    palette = [lo for lo, hi in pairs
+               if medbusy.get(lo, 99) <= THRESH and medbusy.get(hi, 99) <= THRESH]
+    dfs_clean = [c for c in palette if 52 <= c <= 144]
+    print(f"\nSUGGESTED clean 40MHz palette (both halves <= {THRESH:.0f}% busy): {palette}")
+    print(f"  of those, DFS (cleaner here but radar-vacate risk): {dfs_clean}")
+    print(f"  -> feed channel-plan:  channel-plan.sh <site> na --channels {','.join(map(str,palette)) or '<none-clean>'} [--apply]")
+    print("  NOTE: choose DFS vs non-DFS from THIS data + the radar tradeoff, not a hardcoded policy.")
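The palette suggestion at the end of survey-report.py can be reproduced on a small input to see how the threshold behaves. This is a sketch under stated assumptions: the survey shape `{ap: {band: {channel: busy%}}}` is inferred from how survey-report.py and channel-plan.sh read SURVEY_JSON, the two APs and their busy% values are made up for illustration, and `suggest_palette` is a hypothetical helper name that does not exist in the script.

```python
import statistics as st
from collections import defaultdict

# 40MHz (lower, upper) primary pairs, in the order survey-report.py lists them.
PAIRS = [(52, 56), (60, 64), (100, 104), (108, 112), (116, 120), (124, 128),
         (132, 136), (140, 144), (36, 40), (44, 48), (149, 153), (157, 161)]

def suggest_palette(survey, band_key="5", floor=8):
    """Return (palette, threshold): lower primaries whose 40MHz halves are both clean."""
    per_ch = defaultdict(list)
    for bands in survey.values():
        for ch, busy in bands.get(band_key, {}).items():
            per_ch[int(ch)].append(busy)
    med = {c: st.median(v) for c, v in per_ch.items()}
    # "Clean" means at or below the fleet-median busy%, but never stricter than `floor`.
    thresh = max(floor, st.median(list(med.values())))
    palette = [lo for lo, hi in PAIRS
               if med.get(lo, 99) <= thresh and med.get(hi, 99) <= thresh]
    return palette, thresh

# Hypothetical two-AP survey: UNII-1 and UNII-3 busy, UNII-2a (DFS) quiet.
survey = {
    "ap-1": {"5": {"36": 10, "40": 12, "52": 2, "56": 3, "60": 2, "64": 3,
                   "149": 12, "153": 20}},
    "ap-2": {"5": {"36": 14, "40": 9, "52": 1, "56": 2, "60": 3, "64": 2,
                   "149": 28, "153": 25}},
}

palette, thresh = suggest_palette(survey)
print(f"threshold={thresh}% palette={palette}")
```

A pair is dropped when either half is above the threshold or was never measured (the default of 99 fails the test), which is why a partial survey run shrinks the suggested palette.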