# Cascades - Network Logging / Observability Plan (SPEC - build later)

- **Created:** 2026-06-17 (Howard-Home / claude-main)
- **Status:** PLAN ONLY - no infra changes made. For a scheduled build.
- **Goal:** Capture + retain a searchable record of **device drops / kicks / disconnects** and the telemetry to **root-cause the ongoing Cascades network issues** (2.4 GHz congestion, sticky clients, roaming/min-RSSI deauths; see `reports/2026-06-16-unifi-full-audit.md`).

## The problem we found (2026-06-17)

- **The UniFi controller is NOT retaining client history.** A 7-day pull of the Cascades site's `stat/event` AND `stat/alarm` returned **zero** records (auth/site fine: client/device queries return data). So when a phone/device drops or is kicked, **nothing is recorded** -> the network is a black box after the fact.
- **pfSense logs locally but in tiny circular buffers** (clog) that roll over in hours: no useful history, no search.
- => We must **capture events at the source and ship them to a store with retention + search**. pfSense and UniFi are log *sources*; neither is a retention/search platform on its own.

## Where the collector lives: decision

| Candidate | Verdict |
|---|---|
| **CS-SERVER** | **NO.** Fragile EOL DC (Dell R610, ~16 yr, **degraded OS RAID-1**, single-DC data-loss risk, I/O-bound). Adding syslog ingestion load is unacceptable. |
| **pfSense / UniFi alone** | Sources only. pfSense local retention ~hours; UniFi retains ~0 client events. Live view yes, forensics no. |
| **Synology cascadesDS (`192.168.0.120`)** | **PREFERRED on-site collector.** DSM up on :5001 (vault `clients/cascades-tucson/synology-cascadesds`). Built-in **Log Center** = a syslog server (retention + search + notifications), no Docker needed. Becoming backup-only anyway -> light syslog duty fits, keeps logs local + off CS-SERVER. |
| Jupiter (`172.16.3.20`, ACG office Docker) | Fallback if a richer stack (Graylog/Loki) is wanted; cross-site (Cascades -> office). Use only if on-site Synology is ruled out. |

**Recommendation:** Synology **Log Center** as the on-site syslog collector. If cascadesDS turns out to be a Plus/x86 model with spare RAM, **Container Manager** can later add **Graylog** or **Grafana Loki** for richer search/dashboards/alerting, but Log Center alone meets the core ask.

## Sources to configure (ship syslog -> Synology Log Center, UDP/TCP 514)

1. **pfSense** (`192.168.0.1`): Status -> System Logs -> Settings -> **Remote Logging**: server = Synology IP:514; select **System, Firewall, DHCP, Gateway**. (DHCP lease grant/expire/decline = device-drop + IP-churn signal; firewall = blocked traffic.)
2. **UniFi controller + Cascades APs** (UOS `172.16.3.29`, site `va6iba3v`): Settings -> System -> enable **Remote Logging / syslog** to the Synology, include **client events / debug** so the **APs emit assoc/DEAUTH-with-reason-code + RSSI-at-disconnect + roam** events, the gold data for "who got kicked and why." Confirm AP syslog is forwarded (not just controller app log).
3. **(Optional) switches**: port up/down/flap events (the ~25 underspeed ports + 3 offline switches in the audit are suspects).

## Client time-series snapshotter (fills the controller's history gap)

Because the controller isn't keeping client history, add a small **poller** (every 1-2 min) that hits the controller API `/stat/sta` for the Cascades site and appends per-client rows: `ts, mac, hostname, ap, band, channel, rssi, tx_retry%, satisfaction, is_wired`.

- **Where to run:** Synology Task Scheduler + a script, or a small container; or cross-site on GuruRMM (`172.16.3.30`) via cron; or a coord-scheduled job. Store as SQLite/CSV (or into the collector if Graylog/Loki is chosen).
- **Why:** lets us answer "did device X drop because RSSI cratered / it stuck to a far AP / 2.4 GHz airtime saturated", correlating drops with the documented RF problems. Pairs with the existing `unifi-wifi` skill collectors (`watch-ap.sh`, `radio-usage.sh`, `neighbor-collect.sh`).

## Alerting (phase 2)

From Log Center (or Graylog/Loki): notify (Discord via `post-bot-alert.sh` / `discord-dm`) on AP reboot, switch-port flap, repeated deauths for a tracked device, or DHCP pool pressure.

## Retention

Target 30-90 days searchable (HIPAA-adjacent network metadata; no PHI in syslog). Size the Synology Log Center archive / volume accordingly; rotate/compress older.

## Build steps (when scheduled)

1. Confirm cascadesDS **model + RAM + DSM version** (determines Log Center-only vs Container Manager for Graylog/Loki). Cred: vault `clients/cascades-tucson/synology-cascadesds`.
2. Install/enable **Log Center** (Package Center) -> enable **syslog server** (514), set retention.
3. Point **pfSense** remote syslog at it (sources above); verify receipt.
4. Enable **UniFi controller + AP** remote syslog (with client/deauth events); verify AP deauth events arrive with reason + RSSI.
5. Deploy the **client snapshotter** (cron/Task Scheduler); verify rows accumulating.
6. (Optional) Container Manager -> Graylog/Loki+Grafana for dashboards; wire alerting.
7. Validate: force a test device off WiFi -> confirm a searchable deauth event with reason + RSSI.

## Open items

- Confirm cascadesDS model/RAM/Docker capability (step 1).
- Confirm no PHI traverses syslog (network metadata only) for the HIPAA file.
- Decide retention window + alert thresholds.
- If on-site is rejected -> fall back to Jupiter (Graylog/Loki) cross-site.
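
## Sketch: client snapshotter row format (illustrative)

For the client snapshotter (build step 5), the storage half could look roughly like the sketch below: it flattens a list of client records, as a `/stat/sta` response is expected to return them, into the row layout named in the plan and appends them to SQLite. Everything here is an assumption to verify against a real response from the Cascades controller before building: the JSON field names (`ap_mac`, `radio`, `tx_retries`, `tx_packets`, and so on), the radio-code mapping, the table name `client_snapshot`, and the sample data, which is invented. Login and the HTTP fetch are deliberately left out because they depend on the controller version.

```python
import sqlite3
import time

# ASSUMPTION: radio codes as commonly reported per client; confirm on a real pull.
BAND_BY_RADIO = {"ng": "2.4GHz", "na": "5GHz"}

# Invented sample records, shaped like the per-client objects we expect back.
SAMPLE_CLIENTS = [
    {"mac": "aa:bb:cc:00:00:01", "hostname": "test-phone", "ap_mac": "11:22:33:00:00:01",
     "radio": "ng", "channel": 6, "rssi": 18, "tx_retries": 12, "tx_packets": 400,
     "satisfaction": 71, "is_wired": False},
    {"mac": "aa:bb:cc:00:00:02", "hostname": "test-desktop", "is_wired": True},
]


def client_to_row(ts, client):
    """Flatten one client record into the plan's row layout.

    Missing fields become None so wired clients and partial records still store.
    """
    tx_packets = client.get("tx_packets") or 0
    tx_retries = client.get("tx_retries") or 0
    # ASSUMPTION: both counters are cumulative, so this is a lifetime retry
    # percentage; a real poller would probably diff successive samples instead.
    retry_pct = round(100.0 * tx_retries / tx_packets, 2) if tx_packets else None
    radio = client.get("radio")
    return (
        ts,
        client.get("mac"),
        client.get("hostname"),
        client.get("ap_mac"),
        BAND_BY_RADIO.get(radio, radio),  # unknown codes pass through unchanged
        client.get("channel"),
        client.get("rssi"),
        retry_pct,
        client.get("satisfaction"),
        int(bool(client.get("is_wired"))),
    )


def append_snapshot(conn, clients, ts=None):
    """Append one poll's worth of client rows; returns the number of rows written."""
    ts = int(time.time()) if ts is None else ts
    conn.execute(
        "CREATE TABLE IF NOT EXISTS client_snapshot ("
        "ts INTEGER, mac TEXT, hostname TEXT, ap TEXT, band TEXT, "
        "channel INTEGER, rssi INTEGER, tx_retry_pct REAL, "
        "satisfaction INTEGER, is_wired INTEGER)"
    )
    rows = [client_to_row(ts, c) for c in clients]
    conn.executemany(
        "INSERT INTO client_snapshot VALUES (?,?,?,?,?,?,?,?,?,?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would be used in practice
    written = append_snapshot(conn, SAMPLE_CLIENTS)
    print(f"wrote {written} rows")
```

Keeping the row mapping separate from the fetch means the same function can feed CSV or a Graylog/Loki shipper later if that route is chosen, and it can be checked against a saved JSON response without touching the controller.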