unifi-wifi: pfSense gateway access via SSH (pfsense-ssh.sh) + pfSense health section; layer OFF HOLD

DECISION (Mike, 2026-06-16): drop the RESTAPI package — VPN + SSH shell reads the same data and makes
changes. Confirmed Cascades pfSense is Plus 25.07-RELEASE (current; the "too old" premise was wrong) and
admin SSH = real shell (no menu). The upgrade/package blocker is moot; compat layer is off hold.

- NEW scripts/pfsense-ssh.sh: audit (version/WAN-media/gateway-events/DHCP-exhaustion/states/DNS/load/NIC),
  dhcp (pool utilization + no-free-leases), run "<cmd>" (arbitrary, incl changes; operator-gated). Cred
  from clients/<slug>/pfsense-firewall; system OpenSSH via askpass. Validated live on Cascades.
- audit report: added "pfSense health check (2026-06-16)" — DHCP NOT exhausted (192.168.0.0/22 pool 270/507,
  0 no-free-leases), DNS up, dual-WAN stable (no gateway flaps), states/load healthy => gateway is NOT a
  WiFi factor; the 2.4 GHz RF work is the sole fix. (Minor: igc3/WAN2 I225 2.5G counter quirk, not a fault.)
- ROADMAP §E + SKILL.md updated to the SSH backend decision; REST pfsense-backend.sh kept dormant/optional.
- Remaining: named gated CONTROL verbs over SSH (easyrule block-ips, pf/fw toggles) + optional gw-* dispatch.
- Closed obsolete coord todo (upgrade-pfSense-for-RESTAPI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 18:42:54 -07:00
parent e42ad8f163
commit 58ecc5ad40
4 changed files with 119 additions and 17 deletions


@@ -43,9 +43,12 @@ path is Cascades — override with the script's vault-path arg per client.
to ride out transient VPN flaps without wasting a sweep.
- **[WIP] Client DHCP/DNS policy, deeper VPN (server) config, adoption *remediation* depth** — port-forward
+ WAN firewall is now covered (gw-control); remaining gateway config (VPN server stand-up, DHCP/DNS) is future.
- **[SCAFFOLDED — ON HOLD] pfSense gateway compatibility layer** — `scripts/pfsense-backend.sh` (REST API pkg backend).
ON HOLD (Howard 2026-06-16): the RESTAPI package needs a newer pfSense than Cascades runs — **blocked on a
pfSense upgrade** before any live use. Code is complete; see ROADMAP §E "BLOCKER / Resume trigger".
- **[WORKING] pfSense gateway access via SSH** — `scripts/pfsense-ssh.sh <slug> audit|dhcp|run "<cmd>"`.
DECISION (Mike 2026-06-16): **no RESTAPI package needed** — VPN + SSH shell reads the same data and makes
changes. Cred = `clients/<slug>/pfsense-firewall`. Validated live on Cascades (pfSense Plus 25.07; admin
SSH = real shell). `audit`/`dhcp` are read-only; `run` executes arbitrary commands (incl. changes —
operator-gated, no dry-run). Structured/gated CONTROL verbs (block-ips via easyrule, pf/fw toggles) are
the remaining build — ROADMAP §E. (REST `pfsense-backend.sh` kept as a dormant optional alternative.)
`gw-audit.sh`/`gw-control.sh` now **auto-dispatch** to it when a site has no UniFi gateway (num_gw=0) AND a
pfSense API cred is vaulted at `clients/<slug>/pfsense-api` (or pass `--pfsense <slug>` when the UOS site
name differs from the client slug) — the SAME verbs (`gw-audit`, `pf-list/disable/enable/set-ports`,


@@ -119,22 +119,26 @@ exists for at least two sites; per-client pfSense cred vaulting mirrors the AP-S
collectors). DONE: writes are `--apply`-gated and save a per-object rollback to `.claude/tmp/`, and
pfSense `firewall/apply` is called after each change. config.xml backup-first is the SSH-fallback's job.
**STATUS: SCAFFOLDED — ON HOLD (blocked on pfSense upgrade).** Build complete (backend + dispatch +
setup helper); the BLOCKED/setup/no-cred-hint paths are tested. The live REST calls
(audit/pf-*/fw-*/block-ips) need a reachable pfSense with the API pkg installed + a key vaulted; REST
endpoint paths follow the v2 schema and must be verified against the installed API version on first live run.
**DECISION (Mike, 2026-06-16): backend = SSH, NOT the REST API package.** "We don't need the RESTAPI —
with VPN + SSH we can read the same data and make changes." Confirmed: Cascades pfSense is **Plus
25.07-RELEASE** (current, not old — the earlier "too old" premise was wrong) and **admin SSH drops
straight to a shell** (no menu gotcha). So the upgrade/package blocker is **MOOT** and the layer is
**OFF HOLD**.
**[BLOCKER — Howard 2026-06-16]** `pfSense-pkg-RESTAPI` is third-party and the **Cascades pfSense is too
old to install it**. PREREQUISITE: **upgrade the Cascades pfSense** (firmware) before the package will
install. Work is ON HOLD until that upgrade is done. After the upgrade: install RESTAPI → mint a read-only
key (write-capable for control) → `pfsense-backend.sh clients/cascades-tucson/pfsense-api setup` →
vault url+apikey at `clients/cascades-tucson/pfsense-api` → first live `gw-audit cascades` to verify
v2 endpoints. (Also blocked from Howard-Home by the `.0.0/24` home-LAN shadow over pfSense `192.168.0.1`;
run the first live validation from/through the Cascades network.) ACG office pfSense (`infrastructure/
pfsense-firewall`) may be a newer box usable as the first live test once it has the pkg + a vaulted key.
**STATUS: WORKING (read) via `scripts/pfsense-ssh.sh` — control verbs WIP.**
- `pfsense-ssh.sh <slug> audit` — version/WAN-media/gateway-events/DHCP-exhaustion/states/DNS/load/NIC-errors.
- `pfsense-ssh.sh <slug> dhcp` — pool utilization + "no free leases" check.
- `pfsense-ssh.sh <slug> run "<cmd>"` — arbitrary command (reads OR changes; operator-gated, no dry-run).
- Cred = `clients/<slug>/pfsense-firewall` (host + admin user/pass), system OpenSSH via askpass. Validated
live on Cascades 2026-06-16 (the pfSense-health audit in the unifi-full-audit report came from this).
**Resume trigger:** Cascades (or another client) pfSense upgraded + RESTAPI installable. The code is done;
resuming = the setup/vault steps above + endpoint verification, no further build expected unless v2 paths differ.
**Remaining build (SSH backend):** named, reviewed, gated CONTROL verbs mapping the gw-control contract to
SSH primitives — `block-ips` → `easyrule block wan <ip>`; `pf-list`/`fw-list` → read config.xml / `pfSsh.php`;
toggles → config edit + `filter_configure`/`rc.reload_all`; backup config.xml first. Then optionally wire
gw-audit/gw-control dispatch to the SSH backend when `clients/<slug>/pfsense-firewall` exists + num_gw=0.
**Superseded/optional:** the REST `pfsense-backend.sh` + `clients/<slug>/pfsense-api` path stays in-tree
as a dormant alternative (works if a site ever installs the pkg) but is no longer the plan.
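A minimal sketch of how the `block-ips` verb could be gated, assuming a dry-run default and an `--apply` flag in the spirit of the `--apply`-gated REST writes. The names `is_ipv4`, `plan_block_ips` and `block_ips` are hypothetical and not part of `pfsense-ssh.sh`; the apply path assumes the script's existing `pfssh` helper is in scope, and the config.xml backup step is left out.

```bash
# Hypothetical sketch, not part of pfsense-ssh.sh.

# Return 0 only for a dotted-quad IPv4 address with every octet in 0..255.
is_ipv4() {
  local o
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # digits and single dots only
  esac
  local IFS=.
  set -- $1                                # split into octets on "."
  [ "$#" -eq 4 ] || return 1
  for o in "$@"; do
    [ -n "$o" ] && [ "${#o}" -le 3 ] && [ "$o" -le 255 ] || return 1
  done
}

# Validate every argument first, then print one easyrule line per address.
plan_block_ips() {
  local ip
  [ "$#" -gt 0 ] || { echo "[ERROR] block-ips needs at least one IP" >&2; return 1; }
  for ip in "$@"; do
    is_ipv4 "$ip" || { echo "[ERROR] not an IPv4 address: $ip" >&2; return 1; }
  done
  for ip in "$@"; do
    printf 'easyrule block wan %s\n' "$ip"
  done
}

# Dry-run by default; only --apply pipes the plan to the remote shell.
block_ips() {
  local apply=0 plan
  if [ "${1:-}" = "--apply" ]; then apply=1; shift; fi
  plan="$(plan_block_ips "$@")" || return 1
  if [ "$apply" -eq 1 ]; then
    printf '%s\n' "$plan" | pfssh          # assumed: pfssh from pfsense-ssh.sh
  else
    printf '%s\n' "$plan" | sed 's/^/[DRY-RUN] /'
  fi
}
```

Usage under these assumptions: `block_ips 203.0.113.7` only prints the planned command, while `block_ips --apply 203.0.113.7` would send it. Validating before building the remote script keeps input such as `1.2.3.4; reboot` from ever reaching the firewall shell.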
- [ ] **Site→gateway map:** record per-site gateway type + access (UOS site_id ↔ pfSense host/cred) so the
driver auto-selects. Could live alongside `sites.sh` output.
- [ ] **VPN convergence:** the "Deeper VPN — gateway-hosted VPN server" item (C) is *easier and better* on


@@ -0,0 +1,68 @@
#!/usr/bin/env bash
# pfsense-ssh.sh — talk to a client's pfSense over SSH (site VPN reachable). This is the SSH backend for
# the gateway compatibility layer. DECISION (Mike, 2026-06-16): the RESTAPI package is NOT needed — with
# VPN + SSH shell we can read the same data and make changes directly. (Confirmed on Cascades pfSense Plus
# 25.07: admin SSH drops straight to a shell, no menu gotcha.)
#
# Cred from the vault: clients/<slug>/pfsense-firewall (top-level `host`, credentials.username/password).
# Uses SYSTEM OpenSSH via an SSH_ASKPASS helper (no sshpass dependency); runs each call as `sh -s` over a
# heredoc so awk/quoting is clean.
#
# Usage:
# pfsense-ssh.sh <slug> audit # read-only health: version/WAN/DHCP-exhaustion/DNS/states/load
# pfsense-ssh.sh <slug> dhcp # DHCP pool utilization + "no free leases" check
# pfsense-ssh.sh <slug> run "<command>" # arbitrary command (CAN mutate — operator-gated; e.g. run "pfctl -si")
# pfsense-ssh.sh <slug> shell # (prints the interactive ssh command to paste)
# NOTE: `run` executes whatever you pass, including changes — there is no dry-run for it. For repeatable
# changes prefer adding a named, reviewed verb here over ad-hoc `run`.
set -uo pipefail
REPO="$(git rev-parse --show-toplevel 2>/dev/null || echo .)"
VAULT="$REPO/.claude/scripts/vault.sh"
SLUG="${1:?usage: pfsense-ssh.sh <slug> <audit|dhcp|run|shell> [args]}"
ACT="${2:?action: audit|dhcp|run|shell}"; shift 2 || true
VP="clients/$SLUG/pfsense-firewall"
HOST="$(bash "$VAULT" get-field "$VP" host 2>/dev/null || bash "$VAULT" get-field "$VP" credentials.host 2>/dev/null || true)"
U="$(bash "$VAULT" get-field "$VP" credentials.username 2>/dev/null || true)"
PP="$(bash "$VAULT" get-field "$VP" credentials.password 2>/dev/null || true)"; export PP
if [ -z "$HOST" ] || [ -z "$U" ] || [ -z "$PP" ]; then
echo "[BLOCKED] need host + admin creds at vault:$VP (fields: host, credentials.username, credentials.password)"; exit 2; fi
if [ "$ACT" = "shell" ]; then echo "ssh ${U}@${HOST} # password in vault:$VP"; exit 0; fi
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
ASKP="$TMP/a.sh"; printf '#!/bin/sh\nprintf "%%s\\n" "$PP"\n' >"$ASKP"; chmod +x "$ASKP"
# pfssh: feed remote sh a script on stdin; password via askpass (stderr noise dropped)
pfssh(){ SSH_ASKPASS="$ASKP" SSH_ASKPASS_REQUIRE=force DISPLAY=:0 ssh \
-o ConnectTimeout=12 -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/dev/null \
-o PreferredAuthentications=password -o PubkeyAuthentication=no -o NumberOfPasswordPrompts=1 \
"$U@$HOST" 'sh -s' 2>/dev/null; }
echo "[INFO] pfSense $ACT @ $U@$HOST (vault:$VP)"
case "$ACT" in
run)
CMD="$*"; [ -n "$CMD" ] || { echo "[ERROR] run needs a command"; exit 1; }
printf '%s\n' "$CMD" | pfssh ;;
dhcp)
pfssh <<'RSCRIPT'
echo "## DHCP backend"; { pgrep -lf dhcpd >/dev/null && echo "ISC dhcpd active"; }; { pgrep -lf kea >/dev/null && echo "Kea active"; }
echo "## 'no free leases' events (exhaustion)"; clog /var/log/dhcpd.log 2>/dev/null | grep -ic 'no free leases'
echo "## active leases per /24 (top 20)"
awk '/^lease /{ip=$2} /binding state active/{a[ip]=1} END{for(i in a){n=i; sub(/\.[0-9]+$/,"",n); c[n]++} for(k in c) print c[k], k}' /var/dhcpd/var/db/dhcpd.leases 2>/dev/null | sort -rn | head -20
echo "## pool ranges (subnet -> range)"; grep -hE 'subnet|range ' /var/dhcpd/etc/dhcpd.conf 2>/dev/null | paste - - | head -40
RSCRIPT
;;
audit)
pfssh <<'RSCRIPT'
echo "## VERSION"; cat /etc/version 2>/dev/null
echo "## UPTIME/LOAD"; uptime
echo "## physical interfaces (media/status; VLAN sub-ifs skipped)"; for i in $(ifconfig -l); do case $i in *.*) continue;; igc[0-9]|em[0-9]|ix[0-9]|vmx[0-9]) m=$(ifconfig $i 2>/dev/null | grep -E 'media|status' | tr '\n' ' '); [ -n "$m" ] && echo " $i: $m";; esac; done
echo "## GATEWAY loss/down events (last 8)"; clog /var/log/gateways.log 2>/dev/null | tail -8
echo "## DHCP exhaustion ('no free leases' count)"; clog /var/log/dhcpd.log 2>/dev/null | grep -ic 'no free leases'
echo "## DHCP busiest /24s (top 8)"; awk '/^lease /{ip=$2} /binding state active/{a[ip]=1} END{for(i in a){n=i; sub(/\.[0-9]+$/,"",n); c[n]++} for(k in c) print c[k], k}' /var/dhcpd/var/db/dhcpd.leases 2>/dev/null | sort -rn | head -8
echo "## PF states"; pfctl -si 2>/dev/null | grep -iE 'current entries|searches'; pfctl -sm 2>/dev/null | grep -E '^states'
echo "## DNS resolver"; pgrep -lf unbound >/dev/null && echo "unbound running" || echo "unbound NOT running"
echo "## mbuf"; netstat -m 2>/dev/null | head -1
echo "## NIC errors (Ierrs/Oerrs/Coll)"; netstat -i 2>/dev/null | awk 'NR==1 || ($1 ~ /^(igc|em|ix|vmx)[0-9]$/)'
RSCRIPT
;;
*) echo "action: audit|dhcp|run|shell"; exit 1;;
esac
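
The per-/24 lease count used by `dhcp` and `audit` can be tried locally against a sample ISC lease file. This is a sketch only: the awk program is the one from the script, while the `count_active_per_24` wrapper, the `sample_leases` helper and the lease entries themselves are made up for illustration.

```bash
# Hypothetical wrapper around the awk program used in pfsense-ssh.sh.
# Reads an ISC dhcpd.leases file (argument or stdin) and prints "<count> <first three octets>".
count_active_per_24() {
  awk '/^lease /{ip=$2} /binding state active/{a[ip]=1}
       END{for(i in a){n=i; sub(/\.[0-9]+$/,"",n); c[n]++} for(k in c) print c[k], k}' "$@" |
    sort -rn
}

# Made-up sample data: three active leases and one free lease.
sample_leases() {
  cat <<'EOF'
lease 192.168.2.10 {
  binding state active;
}
lease 192.168.2.11 {
  binding state active;
}
lease 192.168.3.5 {
  binding state active;
}
lease 10.0.20.4 {
  binding state free;
}
EOF
}
```

Running `sample_leases | count_active_per_24` lists `192.168.2` with 2 active leases and `192.168.3` with 1; the free lease is not counted. As written, the awk marks an address as soon as any of its entries is `active`, so an address that appears again later in the file as `free` would still be counted.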


@@ -47,3 +47,30 @@ no UniFi gateway (pfSense firewall). All collectors ran clean.
(All changes via the gated `apply-radio`/`apply-wlan`/`channel-plan` scripts — per zone, with rollback +
live validation. Nothing applied in this audit.)
---
## pfSense health check (2026-06-16) — ruling out the gateway as a WiFi factor
Investigated the Cascades pfSense (`192.168.0.1`, **pfSense Plus 25.07-RELEASE**, Netgate) over the site
VPN via SSH, to confirm whether any gateway-side issue contributes to the "WiFi bad for some users"
symptom. **Verdict: pfSense is healthy and is NOT a contributor — the problem is RF-side (2.4 GHz).**
| Area | Finding | WiFi impact |
|---|---|---|
| **DHCP exhaustion** | **0** "no free leases" events in dhcpd.log. WiFi/AP pool `192.168.0.0/22` (range 192.168.2.2-3.254, cap ~507) only **270 active (~53%)**; per-unit /28s + `10.0.20/.50` all have headroom | **Ruled out** (was the top suspect) |
| **DNS** | unbound resolver running | Fine |
| **WAN** | Dual Cox — WAN1 `184.191.143.62/30`, WAN2 `72.211.21.217/27`, both active **full-duplex**, `WAN_Group` gateway group, **no loss/down events** logged | Fine |
| **Firewall states** | 28,368 / 790,000 limit | Fine |
| **CPU / mbuf / uptime** | load 0.6, mbufs nominal, 10-day uptime | Healthy |
**Architecture:** per-unit design — **199 DHCP subnets**, mostly `10.x.y.0/28` per apartment (assisted-
living L2 isolation) + the `192.168.0.0/22` staff/AP network (APs + most WiFi clients). Active DHCP
backend is **ISC** (Kea config present but dormant).
**Minor (not WiFi-related):** `igc3`/WAN2 logged 1707 input-errors + 1707 "collisions", but the link is
2.5GbE full-duplex/active with zero gateway loss — consistent with the known Intel I225/I226 2.5G counter
quirk, not a real fault. No action needed unless WAN2 misbehaves.
**Conclusion:** gateway/DHCP/DNS/WAN are not bottlenecking the wireless. The 2.4 GHz remediation
(power-down + coverage-redundancy disables) remains the correct and sole fix for the client-experience tail.
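
The DHCP utilization figure in the table is plain integer arithmetic on the active-lease count and the pool capacity. A small sketch for re-checking it on later audits; `pool_pct` and `pool_free` are illustrative names, not part of the scripts.

```bash
# Hypothetical helpers: integer utilization percentage and remaining leases.
# Arguments: <active leases> <pool capacity>
pool_pct() {
  [ "$2" -gt 0 ] || { echo "[ERROR] capacity must be > 0" >&2; return 1; }
  echo $(( $1 * 100 / $2 ))
}

pool_free() {
  echo $(( $2 - $1 ))
}
```

With the figures above, `pool_pct 270 507` prints `53` and `pool_free 270 507` prints `237`.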