
Howard Enos e0f9b1e221 sync: auto-sync from HOWARD-HOME at 2026-06-18 12:21:23

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-18 12:21:23

2026-06-18 12:22:42 -07:00


Cascades — Network Logging / Observability Plan (SPEC — build later)

Created: 2026-06-17 (Howard-Home / claude-main)
Status: PLAN ONLY — no infra changes made. For a scheduled build.
Goal: Capture + retain a searchable record of device drops / kicks / disconnects and the telemetry to root-cause the ongoing Cascades network issues (2.4 GHz congestion, sticky clients, roaming/min-RSSI deauths — see reports/2026-06-16-unifi-full-audit.md).

The problem we found (2026-06-17)

The UniFi controller is NOT retaining client history. A 7-day pull of the Cascades site's stat/event and stat/alarm endpoints returned zero records. Auth and site selection are fine, since client/device queries against the same site return data. So when a phone/device drops or is kicked, nothing is recorded, and the network is a black box after the fact.
pfSense logs locally but in tiny circular buffers (clog) that roll over in hours — no useful history, no search.
=> We must capture events at the source and ship them to a store with retention + search. pfSense and UniFi are log sources; neither is a retention/search platform on its own.

Where the collector lives — decision

Candidate	Verdict
CS-SERVER	NO. Fragile EOL DC (Dell R610, ~16 yr, degraded OS RAID-1, single-DC data-loss risk, I/O-bound). Adding syslog ingestion load is unacceptable.
pfSense / UniFi alone	Sources only. pfSense local retention ~hours; UniFi retains ~0 client events. Live view yes, forensics no.
Synology cascadesDS (`192.168.0.120`)	PREFERRED on-site collector. DSM up on :5001 (vault `clients/cascades-tucson/synology-cascadesds`). Built-in Log Center = a syslog server (retention + search + notifications), no Docker needed. Becoming backup-only anyway -> light syslog duty fits, keeps logs local + off CS-SERVER.
Jupiter (`172.16.3.20`, ACG office Docker)	Fallback if a richer stack (Graylog/Loki) is wanted; cross-site (Cascades -> office). Use only if on-site Synology is ruled out.

Recommendation: Synology Log Center as the on-site syslog collector. If cascadesDS turns out to be a Plus/x86 model with spare RAM, Container Manager can later add Graylog or Grafana Loki for richer search/dashboards/alerting — but Log Center alone meets the core ask.

Sources to configure (ship syslog -> Synology Log Center, UDP/TCP 514)

pfSense (192.168.0.1): Status -> System Logs -> Settings -> Remote Logging: server = Synology IP:514; select System, Firewall, DHCP, Gateway. (DHCP lease grant/expire/decline = device-drop + IP-churn signal; firewall = blocked traffic.)
UniFi controller + Cascades APs (UOS 172.16.3.29, site va6iba3v): Settings -> System -> enable Remote Logging / syslog to the Synology. Include client events / debug so the APs emit assoc/deauth events with reason code, RSSI at disconnect, and roam events: the gold data for "who got kicked and why." Confirm the APs' own syslog is forwarded, not just the controller application log.
(Optional) switches — port up/down/flap events (the ~25 underspeed ports + 3 offline switches in the audit are suspects).
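Before touching pfSense or UniFi, it may help to confirm that the collector accepts syslog at all. The following is a minimal sketch using only the Python standard library; the function name `send_test_syslog`, the logger name, and the default message text are illustrative, and the LOCAL0 facility is an arbitrary choice rather than anything the sources require.

```python
import logging
import logging.handlers
import socket


def send_test_syslog(host: str, port: int = 514, use_tcp: bool = False,
                     message: str = "cascades-syslog-test") -> None:
    """Send a single INFO-level test message to a syslog collector.

    Facility LOCAL0 is an assumption made for this sketch; any facility
    the collector accepts would do.
    """
    socktype = socket.SOCK_STREAM if use_tcp else socket.SOCK_DGRAM
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
        socktype=socktype,
    )
    logger = logging.getLogger("cascades.syslog.test")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the test line out of any root handlers
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
        handler.close()


# Example (run from a machine on the Cascades LAN):
# send_test_syslog("192.168.0.120")                 # UDP 514
# send_test_syslog("192.168.0.120", use_tcp=True)   # TCP 514
```

If the message text then shows up in a Log Center search, the receive path works and any later gap is on the source side. UDP gives no delivery feedback, so a silent failure here only shows up as a missing entry in the collector.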

Client time-series snapshotter (fills the controller's history gap)

Because the controller isn't keeping client history, add a small poller (every 1-2 min) that hits the controller API /stat/sta for the Cascades site and appends per-client rows: ts, mac, hostname, ap, band, channel, rssi, tx_retry%, satisfaction, is_wired.

Where to run: Synology Task Scheduler + a script, or a small container; or cross-site on GuruRMM (172.16.3.30) via cron; or a coord-scheduled job. Store as SQLite/CSV (or into the collector if Graylog/Loki is chosen).
Why: lets us answer "did device X drop because RSSI cratered / it stuck to a far AP / 2.4 GHz airtime saturated" — correlating drops with the documented RF problems. Pairs with the existing unifi-wifi skill collectors (watch-ap.sh, radio-usage.sh, neighbor-collect.sh).
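The storage half of the snapshotter could look roughly like the sketch below, assuming the SQLite option. The table name `client_snapshot` and the helpers `open_store` / `append_snapshot` are hypothetical. The input dicts use the column names listed above (with `tx_retry_pct` standing in for tx_retry%); mapping the controller's actual /stat/sta response fields onto those keys is left out, since the response field names are not confirmed here.

```python
import sqlite3
import time
from typing import Iterable, Mapping, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS client_snapshot (
    ts            INTEGER NOT NULL,  -- poll time, Unix seconds
    mac           TEXT    NOT NULL,
    hostname      TEXT,
    ap            TEXT,
    band          TEXT,
    channel       INTEGER,
    rssi          INTEGER,
    tx_retry_pct  REAL,
    satisfaction  INTEGER,
    is_wired      INTEGER            -- 0 or 1
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the snapshot database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_snapshot_mac_ts "
        "ON client_snapshot (mac, ts)"
    )
    return conn


def append_snapshot(conn: sqlite3.Connection,
                    clients: Iterable[Mapping[str, object]],
                    ts: Optional[int] = None) -> int:
    """Append one row per client for a single poll; returns rows written."""
    ts = int(time.time()) if ts is None else ts
    rows = [
        (
            ts,
            c["mac"],
            c.get("hostname"),
            c.get("ap"),
            c.get("band"),
            c.get("channel"),
            c.get("rssi"),
            c.get("tx_retry_pct"),
            c.get("satisfaction"),
            int(bool(c.get("is_wired", False))),
        )
        for c in clients
    ]
    with conn:  # one transaction per poll
        conn.executemany(
            "INSERT INTO client_snapshot VALUES (?,?,?,?,?,?,?,?,?,?)", rows
        )
    return len(rows)
```

With rows in this shape, a query such as `SELECT ts, ap, rssi FROM client_snapshot WHERE mac = ? ORDER BY ts` gives the RSSI and AP history leading up to a drop, which is what the correlation described above needs.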

Alerting (phase 2)

From Log Center (or Graylog/Loki): notify (Discord via post-bot-alert.sh / discord-dm) on AP reboot, switch-port flap, repeated deauths for a tracked device, or DHCP pool pressure.
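The "repeated deauths for a tracked device" condition can be expressed as a sliding-window count. The sketch below is one possible form, assuming deauth events have already been parsed into `(timestamp_seconds, mac)` pairs in time order. The function name, the 600-second window, and the threshold of 3 are placeholders, since alert thresholds are still an open item.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Set, Tuple


def repeated_deauths(events: Iterable[Tuple[int, str]],
                     tracked: Set[str],
                     window_s: int = 600,
                     threshold: int = 3) -> List[Tuple[int, str]]:
    """Return (ts, mac) for each time a tracked device reaches `threshold`
    deauths within a trailing window of `window_s` seconds.

    `events` must be sorted by timestamp. Window and threshold defaults
    are hypothetical values for illustration only.
    """
    recent: Dict[str, Deque[int]] = defaultdict(deque)
    alerts: List[Tuple[int, str]] = []
    for ts, mac in events:
        if mac not in tracked:
            continue
        q = recent[mac]
        q.append(ts)
        # Drop deauths that have aged out of the trailing window.
        while q and ts - q[0] > window_s:
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, mac))
            q.clear()  # reset so one burst produces one alert
    return alerts
```

Each returned pair would then be handed to whatever notifier is chosen. Clearing the window after an alert is a design choice to avoid one alert per additional deauth during a single bad spell.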

Retention

Target 30-90 days searchable (HIPAA-adjacent network metadata; syslog is expected to carry no PHI, to be confirmed under Open items). Size the Synology Log Center archive / volume accordingly; rotate/compress older logs.
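For sizing, a back-of-envelope estimate is message rate x average message size x seconds per day x retention days. The helper below is a sketch of that arithmetic only. The function name and the `overhead` multiplier are hypothetical, and real message rates and sizes should be measured from the collector after a few days of ingest rather than assumed.

```python
def archive_size_gib(msgs_per_sec: float,
                     avg_msg_bytes: float,
                     retention_days: int,
                     overhead: float = 1.0) -> float:
    """Estimate raw log volume in GiB for a retention window.

    `overhead` is a hypothetical multiplier for indexing/metadata
    (1.0 = none); compression is not modeled.
    """
    seconds_per_day = 86_400
    total_bytes = (msgs_per_sec * avg_msg_bytes
                   * seconds_per_day * retention_days * overhead)
    return total_bytes / 2**30


# Example with placeholder inputs (not measured values):
# archive_size_gib(msgs_per_sec=20, avg_msg_bytes=250, retention_days=90)
```

Running it for both ends of the 30-90 day target gives a range to check against free space on the Synology volume.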

Build steps (when scheduled)

1. Confirm cascadesDS model + RAM + DSM version (determines Log Center-only vs Container Manager for Graylog/Loki). Cred: vault clients/cascades-tucson/synology-cascadesds.
2. Install/enable Log Center (Package Center) -> enable syslog server (514), set retention.
3. Point pfSense remote syslog at it (sources above); verify receipt.
4. Enable UniFi controller + AP remote syslog (with client/deauth events); verify AP deauth events arrive with reason + RSSI.
5. Deploy the client snapshotter (cron/Task Scheduler); verify rows are accumulating.
6. (Optional) Container Manager -> Graylog/Loki+Grafana for dashboards; wire alerting.
7. Validate: force a test device off WiFi -> confirm a searchable deauth event with reason + RSSI.

Open items

Confirm cascadesDS model/RAM/Docker capability (step 1).
Confirm no PHI traverses syslog (network metadata only) for the HIPAA file.
Decide retention window + alert thresholds.
If on-site is rejected -> fall back to Jupiter (Graylog/Loki) cross-site.
