diff --git a/.claude/skills/unifi-wifi/SKILL.md b/.claude/skills/unifi-wifi/SKILL.md index 4178a19..a49df7d 100644 --- a/.claude/skills/unifi-wifi/SKILL.md +++ b/.claude/skills/unifi-wifi/SKILL.md @@ -37,6 +37,10 @@ path is Cascades — override with the script's vault-path arg per client. list/disable/enable/delete/re-scope port-forwards (`pf-*`), toggle WAN firewall rules (`fw-disable`/ `fw-enable`), and drop attacker IPs at the edge (`block-ips`). The write companion to gw-audit; closes an internet-facing exposure (e.g. a brute-forced PPTP). Gated/DRY-RUN, rollback saved. Controller-side. +- **[WORKING] Scheduled fleet monitoring** — `scripts/monitor-run.sh `: controller-side + read-only health digest per site (gateway/WAN flags + switch/PoE flags + WiFi config flag count), + cron-friendly for ongoing monitoring of every client. AP-side collectors now **retry per-AP** (3x) + to ride out transient VPN flaps without wasting a sweep. - **[WIP] Client DHCP/DNS policy, deeper VPN (server) config, adoption *remediation* depth** — port-forward + WAN firewall is now covered (gw-control); remaining gateway config (VPN server stand-up, DHCP/DNS) is future. - **[PROPOSED] pfSense gateway compatibility layer** — the gateway verbs (gw-audit / gw-control / VPN) speak diff --git a/.claude/skills/unifi-wifi/references/ROADMAP.md b/.claude/skills/unifi-wifi/references/ROADMAP.md index c9f387a..03c8a43 100644 --- a/.claude/skills/unifi-wifi/references/ROADMAP.md +++ b/.claude/skills/unifi-wifi/references/ROADMAP.md @@ -69,9 +69,10 @@ side, multi-client enablement, and non-WiFi scope. Build/validate new apply acti client DHCP/DNS policy. Config beyond port-forward/WAN-firewall; future. Access layer reaches it. ## D. Robustness / ops -- [ ] **VPN-flap resilience** in the AP-side loops (resume/retry so a mid-run tunnel drop doesn't waste - a 4-min sweep). Background runs can't spawn the SSH_ASKPASS helper — must run foreground. 
-- [ ] **Scheduling** — periodic `dfs-check` + neighbor/survey refresh (DFS is time-varying). +- [x] **VPN-flap resilience** — AP-side loops now retry per AP (3x, capture-to-var so a failed try never + appends partial data); dfs-check distinguishes unreachable from no-events. Validated (74/74). Still foreground. +- [x] **Scheduling / fleet monitoring** — `monitor-run.sh ` = cron-friendly read-only health + digest (gateway + switch/PoE flags + WiFi flag count) per site. Validated. (Cron the `all` sweep nightly.) - [ ] Vault read-only `infrastructure/uos-server-network-api` (least-privilege; RW does double duty now). ## E. pfSense gateway support — gateway "compatibility layer" (NEW, proposed 2026-06-16) diff --git a/.claude/skills/unifi-wifi/scripts/dfs-check.sh b/.claude/skills/unifi-wifi/scripts/dfs-check.sh index 9192cdb..6706639 100644 --- a/.claude/skills/unifi-wifi/scripts/dfs-check.sh +++ b/.claude/skills/unifi-wifi/scripts/dfs-check.sh @@ -70,7 +70,9 @@ n=0; tot=$(wc -l < "$TMP/aps.tsv"); hits=0 while IFS=$'\t' read -r name ip ch dfs; do ip="${ip%$'\r'}"; dfs="${dfs%$'\r'}"; [ -z "$ip" ] && continue; n=$((n+1)) printf '\r[INFO] checking %d/%d ' "$n" "$tot" >&2 - ev=$(ap_ssh "$AU@$ip" "dmesg 2>/dev/null | grep -iE '$PAT' | grep -iv 'cached' | tail -4" 2>/dev/null) + ev=""; t=0; reach=0 # retry per AP (transient VPN flaps); ap_ssh rc distinguishes unreachable from no-events + while [ $t -lt 3 ]; do if ev="$(ap_ssh "$AU@$ip" "dmesg 2>/dev/null | grep -iE '$PAT' | grep -iv 'cached' | tail -4" 2>/dev/null)"; then reach=1; break; fi; t=$((t+1)); sleep 2; done + [ "$reach" = 1 ] || { echo "[UNREACHABLE] $name ($ip) - skipped"; continue; } cnt=$(printf '%s' "$ev" | grep -c . 
) mark=$([ "$dfs" = "DFS" ] && echo ' *' || echo '') if [ "$cnt" -gt 0 ]; then diff --git a/.claude/skills/unifi-wifi/scripts/gw-audit.sh b/.claude/skills/unifi-wifi/scripts/gw-audit.sh index 190dcfa..2c182c2 100644 --- a/.claude/skills/unifi-wifi/scripts/gw-audit.sh +++ b/.claude/skills/unifi-wifi/scripts/gw-audit.sh @@ -59,7 +59,7 @@ else: # internet (www) print(f" internet: status={st(www.get('status'))} latency={www.get('latency')}ms drops={www.get('drops')} " f"speedtest={www.get('xput_down')}/{www.get('xput_up')} Mbps ping={www.get('speedtest_ping')}ms") -if www and st(www.get('status'))!='ok': flags.append(f"internet(www) status={www.get('status')}") +if ngw and www and st(www.get('status'))!='ok': flags.append(f"internet(www) status={www.get('status')}") try: if www.get('latency') and float(www.get('latency'))>80: flags.append(f"internet latency {www.get('latency')}ms") if www.get('drops') and float(www.get('drops'))>5: flags.append(f"internet drops={www.get('drops')}") diff --git a/.claude/skills/unifi-wifi/scripts/monitor-run.sh b/.claude/skills/unifi-wifi/scripts/monitor-run.sh new file mode 100644 index 0000000..a065671 --- /dev/null +++ b/.claude/skills/unifi-wifi/scripts/monitor-run.sh @@ -0,0 +1,44 @@ +#!/usr/bin/env bash +# monitor-run.sh — scheduled fleet/site health sweep. Runs the CONTROLLER-SIDE audits (no AP cred / +# VPN needed) and prints a compact per-site digest of flags only, so it's cron-friendly for ongoing +# monitoring of every client we manage. Pure read-only. +# per site -> gw-audit flags (WAN/internet/disconnected/pending/firmware) + switch-audit flags +# (underspeed/PoE/errors/offline) + audit-site WiFi flag count. 
+# +# Usage: +# bash .claude/skills/unifi-wifi/scripts/monitor-run.sh # one site +# bash .claude/skills/unifi-wifi/scripts/monitor-run.sh all # every UOS site (slow) +# Cron example (nightly digest): 0 6 * * * bash .../monitor-run.sh all >> /var/log/unifi-monitor.log 2>&1 +set -uo pipefail +REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)" +UOS="$REPO/.claude/scripts/uos-mongo.sh"; D="$REPO/.claude/skills/unifi-wifi/scripts" +ARG="${1:?usage: monitor-run.sh }" +clean(){ grep -viE 'post-quantum|store now|upgraded|openssh.com|WARNING: connection'; } + +sweep_one(){ # $1 = site id, $2 = display name + local sid="$1" nm="$2" + echo "================================================================" + echo "SITE: ${nm:-$sid} ($sid)" + # gateway/WAN/health flags + local g; g="$(bash "$D/gw-audit.sh" "$sid" 2>&1 | clean | sed -n '/^== FLAGS/,$p' | grep -E '^\s*\[' )" + echo " gateway/health:"; [ -n "$g" ] && echo "$g" | sed 's/^/ /' || echo " [OK]" + # switch/PoE flags (summary line + any flags) + local sflag; sflag="$(bash "$D/switch-audit.sh" "$sid" 2>&1 | clean | grep -E 'UNDERSPEED|ERRORS|DROPS|POE-|OFFLINE|flag\(s\)' )" + echo " switches:"; [ -n "$sflag" ] && echo "$sflag" | sed 's/^/ /' | head -20 || echo " [OK] no switches / no issues" + # wifi config flag count + local wc; wc="$(bash "$D/audit-site.sh" "$sid" 2>&1 | clean | grep -cE '^\s*\[!\]' )" + echo " wifi config flags: ${wc:-0}" +} + +if [ "$ARG" = all ]; then + echo "[INFO] fleet health sweep — $(date '+%Y-%m-%d %H:%M') — all UOS sites (controller-side, read-only)" + bash "$UOS" --sites 2>/dev/null | clean | grep -E '^[0-9a-f]{24}' | while read -r sid rest; do + sweep_one "$sid" "$rest" + done +else + if [[ "$ARG" =~ ^[0-9a-f]{24}$ ]]; then SID="$ARG"; NM=""; else + line="$(bash "$UOS" --sites 2>/dev/null | clean | grep -i "$ARG" | head -1)"; SID="${line%% *}"; NM="${line#* }"; fi + [ -n "$SID" ] || { echo "[ERROR] site not found: $ARG"; exit 1; } + echo "[INFO] health sweep — $(date '+%Y-%m-%d 
%H:%M') — $NM" + sweep_one "$SID" "$NM" +fi diff --git a/.claude/skills/unifi-wifi/scripts/neighbor-collect.sh b/.claude/skills/unifi-wifi/scripts/neighbor-collect.sh index 9fac190..76b5e08 100644 --- a/.claude/skills/unifi-wifi/scripts/neighbor-collect.sh +++ b/.claude/skills/unifi-wifi/scripts/neighbor-collect.sh @@ -89,11 +89,10 @@ while IFS=$'\t' read -r name ip; do ip="${ip%$'\r'}"; name="${name%$'\r'}" # strip any CR (Windows line endings) so ssh target is valid [ -z "$ip" ] && continue; n=$((n+1)) echo "###AP $name $ip" >> "$RAW" - if ap_ssh "$AU@$ip" 'echo "@@ESS"; cat /proc/ui_neighbor/ess_ap_list 2>/dev/null; for s in /proc/ui_neighbor/ssid/*; do echo "@@SSID $s"; cat "$s" 2>/dev/null; done' >> "$RAW" 2>/dev/null; then - ok=$((ok+1)) - else - echo "@@UNREACHABLE" >> "$RAW" - fi + # retry per AP (transient VPN flaps); capture to var so a failed try never appends partial data + out=""; t=0 + while [ $t -lt 3 ]; do if out="$(ap_ssh "$AU@$ip" 'echo "@@ESS"; cat /proc/ui_neighbor/ess_ap_list 2>/dev/null; for s in /proc/ui_neighbor/ssid/*; do echo "@@SSID $s"; cat "$s" 2>/dev/null; done' 2>/dev/null)" && [ -n "$out" ]; then break; fi; t=$((t+1)); sleep 2; done + if [ -n "$out" ]; then printf '%s\n' "$out" >> "$RAW"; ok=$((ok+1)); else echo "@@UNREACHABLE" >> "$RAW"; fi printf '\r[INFO] harvested %d/%d (reachable %d) ' "$n" "$(wc -l < "$TMP/aps.tsv")" "$ok" >&2 done < "$TMP/aps.tsv" echo "" >&2 diff --git a/.claude/skills/unifi-wifi/scripts/survey-collect.sh b/.claude/skills/unifi-wifi/scripts/survey-collect.sh index fea96c3..d71aaea 100644 --- a/.claude/skills/unifi-wifi/scripts/survey-collect.sh +++ b/.claude/skills/unifi-wifi/scripts/survey-collect.sh @@ -64,7 +64,9 @@ RAW="$TMP/raw.txt"; n=0; ok=0; tot=$(wc -l < "$TMP/aps.tsv") while IFS=$'\t' read -r name ip; do ip="${ip%$'\r'}"; name="${name%$'\r'}"; [ -z "$ip" ] && continue; n=$((n+1)) echo "###AP $name" >> "$RAW" - if ap_ssh "$AU@$ip" 'for r in wifi0 wifi1 wifi2 ath0 ath1; do iw dev $r survey dump 
2>/dev/null; done' >> "$RAW" 2>/dev/null; then ok=$((ok+1)); else echo "@@UNREACHABLE" >> "$RAW"; fi + out=""; t=0 # retry per AP (transient VPN flaps); capture-to-var avoids partial appends + while [ $t -lt 3 ]; do if out="$(ap_ssh "$AU@$ip" 'for r in wifi0 wifi1 wifi2 ath0 ath1; do iw dev $r survey dump 2>/dev/null; done' 2>/dev/null)" && [ -n "$out" ]; then break; fi; t=$((t+1)); sleep 2; done + if [ -n "$out" ]; then printf '%s\n' "$out" >> "$RAW"; ok=$((ok+1)); else echo "@@UNREACHABLE" >> "$RAW"; fi printf '\r[INFO] surveyed %d/%d (ok %d) ' "$n" "$tot" "$ok" >&2 done < "$TMP/aps.tsv"; echo "" >&2 diff --git a/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md b/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md index 5788118..c231b32 100644 --- a/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md +++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-15-howard-cascades-wifi-rf-audit.md @@ -502,3 +502,24 @@ NEW channel-plan.sh ng|na [--apply] (NEIGHBOR_JSON + SURVEY_JSON): SKILL now: WiFi (monitor+tune+full apply+device-lock+client/device control+channel-plan) + switch/PoE audit + gateway/WAN/site-health + multi-client. ROADMAP nearly clear (deeper firewall/VPN policy + per-client AP creds/VPN remain). Coord: this msg. + +--- + +## Update: 2026-06-16 07:44 PT — robustness (ROADMAP D): monitor-run.sh + per-AP retry; gw-audit pfSense fix + +Synced first to pick up Mike's gw-control.sh (eb87710, firewall/port-forward router actions — the +"deeper firewall/VPN policy" item; no dup with my robustness work). + +NEW scripts/monitor-run.sh — cron-friendly controller-side read-only fleet health digest: +per site -> gateway/WAN flags + switch/PoE flags + WiFi config flag count. Validated Sonoran (healthy) ++ Cascades (flags 2 disc APs / 3 disc switches / underspeed / firmware). Cron 'all' nightly. 
+ +VPN-flap resilience: neighbor-collect / survey-collect / dfs-check now RETRY per AP (3x, capture-to-var +so a failed attempt never appends partial data; dfs-check distinguishes UNREACHABLE vs no-events). +Validated neighbor-collect end-to-end (reachable 74/74, redundancy 73/74, JSON 74 APs - identical). + +Fix: gw-audit no longer false-flags internet status=unknown on third-party-firewall sites (gated on num_gw). + +SKILL.md + ROADMAP updated (D items done). Skill is feature-complete for monitoring+tuning+apply across +WiFi/switch/gateway, multi-client, with scheduling + flap resilience. Remaining: per-client AP creds/VPN, +read-only cred (Mike to create UI admin), gateway VPN-server/DHCP-DNS (Mike). Coord: this msg.
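
Reviewer note: the per-AP retry that this change adds to neighbor-collect, survey-collect, and dfs-check can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the scripts' actual code. `fetch_ap`, `retry_capture`, `REPLY_OUT`, and `ATTEMPTS_FILE` are hypothetical names introduced only for the sketch. `fetch_ap` stands in for the real `ap_ssh` call and simulates two failed attempts that emit partial output before a third attempt succeeds.

```shell
#!/usr/bin/env bash
# Sketch: retry a collector up to N times, capturing output into a variable
# so that nothing is appended to the raw file until an attempt succeeds.
set -u

RAW="$(mktemp)"            # stand-in for the collectors' raw output file
ATTEMPTS_FILE="$(mktemp)"  # call counter kept in a file, because each
echo 0 > "$ATTEMPTS_FILE"  # command substitution runs in a subshell

# Hypothetical collector (assumption, replaces ap_ssh): fails twice while
# printing partial output, then succeeds on the third call.
fetch_ap() {
  local calls
  calls=$(( $(cat "$ATTEMPTS_FILE") + 1 ))
  echo "$calls" > "$ATTEMPTS_FILE"
  if [ "$calls" -lt 3 ]; then
    echo "partial-garbage"
    return 1
  fi
  echo "survey-data for $1"
}

# retry_capture MAX DELAY CMD [ARGS...]
# On success: REPLY_OUT holds the command's stdout, returns 0.
# On failure: REPLY_OUT is empty, returns 1.
retry_capture() {
  local max="$1" delay="$2" t=0
  shift 2
  REPLY_OUT=""
  while [ "$t" -lt "$max" ]; do
    if REPLY_OUT="$("$@" 2>/dev/null)" && [ -n "$REPLY_OUT" ]; then
      return 0
    fi
    REPLY_OUT=""           # discard partial output from the failed attempt
    t=$((t+1))
    sleep "$delay"
  done
  return 1
}

echo "###AP demo-ap" >> "$RAW"
if retry_capture 3 0 fetch_ap demo-ap; then
  printf '%s\n' "$REPLY_OUT" >> "$RAW"
else
  echo "@@UNREACHABLE" >> "$RAW"
fi
cat "$RAW"
```

The sketch clears the variable after every failed attempt, so partial output is discarded even when all attempts fail. That is a design choice of the sketch. The non-empty check matches the neighbor-collect and survey-collect variants in the diff. The dfs-check variant omits it, since an empty result from a reachable AP means "no events" and only the exit status marks the AP unreachable.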