sync: auto-sync from HOWARD-HOME at 2026-07-02 11:16:12

Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-07-02 11:16:12
2026-07-02 11:16:40 -07:00
parent ac23f17e23
commit ccfa4f7b21
4 changed files with 119 additions and 165 deletions
--- a/clients/scileppi-law/session-logs/2026-07/2026-07-02-howard-scileppi-internet-outage-cox-wan-rca.md
+++ b/clients/scileppi-law/session-logs/2026-07/2026-07-02-howard-scileppi-internet-outage-cox-wan-rca.md
@@ -0,0 +1,83 @@
+# Scileppi Law: "no internet" reports root-caused to Cox WAN line (not a device); UniFi layer documented; Fable 5 independent verification
+
+## User
+- **User:** Howard Enos (howard)
+- **Machine:** Howard-Home
+- **Role:** tech
+
+## Session Summary
+
+Started as a repeat of the 2026-07-01 "Sylvia can't connect to the server" call, this time with a second, undocumented laptop also affected. Loaded the Scileppi wiki + the 2026-07-01 session log, then checked live state via GuruRMM. Unlike yesterday (NAS reboot), `SL-SERVER` was healthy: uptime ~1 day, SMB/AFP all listening, `/volume1` mounted 35%. Sylvia's Mac (`Mac-mini-2`) was flapping offline (Wi-Fi/display-sleep), and yesterday's fixes (`pmset displaysleep 0`, `.local` mount plist) were confirmed intact once it reconnected. The NAS showed active AFP sessions for three users — `sylvia`, `rose`, `andrew` — i.e. the firm has more users/devices than the wiki records; the "other laptop" is not enrolled in RMM. Initial working theory was stale ~23h AFP sessions from yesterday's NAS reboot.
+
+The user then reframed the issue: the USG "reports no internet", so the task became determining whether the cause was a device reboot/crash or the Cox ISP. Diagnosed the WAN from inside the site via `SL-SERVER` (wired, default route through the gateway): gateway, `8.8.8.8`, `1.1.1.1`, and the Cox resolver all showed 0% loss; a 20-ping sustained test was clean; traceroute completed cleanly end-to-end through Cox (hop 2 `10.68.48.1` → `68.105.x`/`72.215.x` → Google, ~30 ms). So there was no live outage at check time. Confirmed the gateway `192.168.242.1` is a UniFi OS device (title "UniFi OS", ports 443/8443/8080).
+
+Scileppi is not on the self-hosted UOS controller; it is cloud-managed under ACG's Ubiquiti account. Found the device as `UCG-Scileppi` (UniFi Cloud Gateway Ultra) via the Site Manager cloud API. The documented vault key `infrastructure/unifi-site-manager-api` returned 401 (stale); the working key is `services/unifi-site-manager`. Pulled WAN telemetry: `startupTime` 2026-06-24 (gateway up ~8 days — no reboot), all four UniFi devices online (UCG, USW Pro Max 16 PoE, two APs), WAN IP `184.183.92.165`. `internetIssues5min` showed a WAN packet-loss/downtime cluster ~03:50–04:25 Phoenix (`wan_downtime` ~04:15) and a recent `not_reported` telemetry gap ~10:45–10:55 (the "one it just had that didn't report"). Concluded: Cox WAN line quality (brief blips + packet loss), not a device fault.
+
+Sent the findings to Winter via Discord DM. Then documented Scileppi's full UniFi layer in the wiki (previously undocumented), added a P1 open item for a Cox service ticket, and logged the 2026-07-02 event.
+
+Finally, per the user's request, ran an independent re-gather on the **Fable 5 model** (subagent). It confirmed the core verdict (Cox line quality, not a device reboot) along with the gateway's 8-day uptime, all four devices online, the WAN IP, the morning loss cluster, and the clean live test. It found two data discrepancies vs the first pull: the ISP label now reads "Cox Business / AS22773" (not "Comcast Cable / AS7922"), and the ISP-metrics counts differ (21 lossy windows vs ~90, max loss 13% vs 6%, plus two real downtime samples of 2 s + 18 s at ~04:15–04:20 that the first pull didn't show). The differences stem from UniFi backfilling/recomputing the 5-min series and its flaky ISP auto-detection. Reconciled the wiki to the verified numbers before saving.
+
+## Key Decisions
+
+- Diagnosed the WAN from `SL-SERVER` (wired, in-site) before looking at the gateway directly: the server's RMM heartbeat is a live internet up/down signal, and a traceroute from it cleanly separates LAN, WAN, and ISP faults.
+- Used the cloud Site Manager API (`services/unifi-site-manager`) as the authoritative source for the gateway's WAN event history, since Scileppi is cloud-managed and not on the self-hosted UOS controller.
+- Attributed the "USG reports no internet, but a wired box has clean internet" pattern to a brief WAN blip + the UCG's own health-check/telemetry gap, not a real sustained outage — supported by the `not_reported` bucket and clean live tests.
+- Concluded Cox line quality (not equipment) from the 8-day gateway uptime + WAN-only loss/downtime; recommended a Cox service call rather than any UniFi change.
+- Ran the recheck on Fable 5 via a subagent (cannot swap the main-loop model mid-session) and held the wiki save until it returned, then corrected the two discrepancies rather than saving the first-pull numbers.
+
+## Problems Encountered
+
+- Documented vault key `infrastructure/unifi-site-manager-api` returned 401 (stale/rotated). Resolved by using `services/unifi-site-manager`. Logged to `errorlog.md` (`--friction`).
+- Scileppi not found on the self-hosted UOS controller or by name in the cloud host list initially. Resolved by enumerating all cloud hosts — the device is named `UCG-Scileppi`.
+- First-pull ISP-metrics numbers (90/276 lossy windows, max 6%, no downtime>0) did not reproduce on the Fable 5 re-gather (21/283, max 13%, two downtime samples 2 s + 18 s). Cause: UniFi backfills/recomputes the 5-min series between pulls; its ISP name detection also flip-flops (Cox Business/22773 vs Comcast/7922). Reconciled the wiki to the verified data and added a note to read live metrics at diagnosis time.
+- Sylvia's Mac (`Mac-mini-2`) repeatedly flapped offline in RMM mid-work (Wi-Fi/display-sleep). Queued read-only diagnostics that ran on reconnect; confirmed prior fixes intact. Separate issue from the internet question.
+
+## Configuration Changes
+
+- `wiki/clients/scileppi-law.md` — added "Network gear (UniFi)" subsection (UCG-Scileppi + switch + 2 APs, cloud host id, siteId, cloud-API access via `services/unifi-site-manager`); expanded Network with WAN/ISP + line-quality detail and a diagnosis playbook; added P1 Cox-ticket open item; added 2026-07-02 Key Event; `last_compiled` -> 2026-07-02; added this log to `sources`.
+- `errorlog.md` — one `--friction` entry (stale `infrastructure/unifi-site-manager-api` key returns 401; use `services/unifi-site-manager`).
+- No changes to any endpoint/device — all RMM commands on `SL-SERVER` and `Mac-mini-2` were read-only diagnostics.
+
+## Credentials & Secrets
+
+- No credentials created or rotated. Used existing vault entries: `services/unifi-site-manager` (working UniFi Site Manager cloud API key; sent as the `X-API-KEY` header to `https://api.ui.com`); `infrastructure/gururmm-server.sops.yaml` (RMM admin). Noted `infrastructure/unifi-site-manager-api` is **stale (401)**; do not use.
+- Discord bot token resolved from vault for the DM (`projects/discord-bot/bot-token.sops.yaml`).
+- Temp JWT/state files written to repo root during work were removed at each step.
+
+## Infrastructure & Servers
+
+- **Scileppi subnet:** `192.168.242.0/24`; gateway/DNS `192.168.242.1` = `UCG-Scileppi`.
+- **UCG-Scileppi** — UniFi Cloud Gateway Ultra (UDRULT), fw 5.1.19 (UniFi OS 5.1.117 / Network 10.4.57). LAN `192.168.242.1`, WAN `184.183.92.165` on `eth4` port 5 (2.5 GbE). MAC `1c:6a:1b:4b:14:xx` (LAN bridge shows locally-administered `1e:6a:1b:...` in ARP). `startupTime` 2026-06-24T20:23:02Z; adopted 2025-10-20. Cloud host id `1C6A1B4B14DD0000000008CE3561000000000946A4750000000067C563E4:1543906151`; siteId `68f6f0d89dda23444538a492`.
+- **USW Pro Max 16 PoE** — core switch, `192.168.242.199`, fw 7.4.1.
+- **APs** — "File Room" `192.168.242.121` (fw 6.8.2); "Sylvia Desk" `192.168.242.10` (fw 6.8.2).
+- **ISP:** Cox (Cox Business, **AS22773**, Phoenix). WAN IP `184.183.92.165`. UniFi ISP name unreliable (has read both Cox Business/22773 and Comcast/7922).
+- **SL-SERVER** — Synology NAS `192.168.242.5`, GuruRMM agent `0186e9d5-e1cc-4603-a81a-adb1f2230702`. Uptime ~1 day (own Jul-01 reboot).
+- **Mac-mini-2** — Sylvia's Mac, `192.168.242.154` (Wi-Fi en1), GuruRMM agent `1386d9fd-ac16-423c-ada0-5abad5b61838`.
+- **GuruRMM:** internal `http://172.16.3.30:3001`; public `https://rmm.azcomputerguru.com`.
+- **UniFi Site Manager cloud API:** `https://api.ui.com` (`X-API-KEY`, key `services/unifi-site-manager`).
+
+## Commands & Outputs
+
+- RMM auth: `POST /api/auth/login` (internal endpoint worked from Howard-Home this session).
+- SL-SERVER internet path test: `traceroute -n -w 2 -m 8 8.8.8.8` → hop1 `192.168.242.1` 0.3 ms, hop2 `10.68.48.1` (Cox CMTS), `68.105.30.142`/`72.215.224.179` (Cox), `8.8.8.8` 30 ms; `ping -c 20` gateway + 8.8.8.8 = 0% loss.
+- Gateway identity: `curl -sk https://192.168.242.1/` → `<title>UniFi OS</title>`, nginx; ports 80/443/8443/8080/53 open.
+- Cloud API: `GET /v1/hosts` → found `UCG-Scileppi`; `GET /v1/devices?hostIds[]=<id>` → 4 devices online, `startupTime` 2026-06-24; `GET /ea/isp-metrics/5m?beginTimestamp=&endTimestamp=` → per-5-min WAN loss/latency/downtime (range must be within last 24h, 5-min aligned; otherwise HTTP 400 with the allowed range echoed).
+- `internetIssues5min` bucket decode: `index * 300 = epoch (UTC)`, Phoenix = UTC-7. idx 5943303 = 11:15 UTC = 04:15 Phoenix (`wan_downtime`); `not_reported` bucket ~17:45–17:55 UTC = ~10:45–10:55 Phoenix.
+- Fable 5 re-gather (independent): confirmed A (startup 2026-06-24), B (4 online), C (Cox; Comcast label not reproducible now), D (morning cluster + not_reported), F (clean live test), G (verdict). Refined E: 21 lossy windows, max 13%, two downtime samples 2 s + 18 s at ~04:15–04:20.
+
+## Pending / Incomplete Tasks
+
+- **P1: Open a Cox service ticket** for circuit `184.183.92.165` (Cox Business, AS22773): intermittent WAN packet loss (spiking to ~13%) plus brief WAN drops on 2026-07-02 (~20 s at ~04:15 and a fast blip at ~10:45–10:55); have Cox check upstream/downstream signal + SNR at the modem/ONT.
+- **Identify the "other laptop"** (likely `rose` or `andrew` — both had ~23h AFP sessions) and consider enrolling `rose`/`andrew` machines in RMM; they are active users the wiki doesn't cover.
+- **Move `Mac-mini-2` to wired Ethernet** — durable fix for Wi-Fi idle drops (`displaysleep 0` is interim).
+- **Consider re-notifying Winter** with the corrected numbers — the DM sent earlier said up-to-6% / ~1/3 of windows; the verified figures are up-to-13% + a ~20 s downtime event. Verdict unchanged (Cox line).
+- Run `/wiki-compile client:scileppi-law --full` to fold this into the article deliberately (decoupled from save).
+
+## Reference Information
+
+- **Winter DM:** message_id `1522301948395651295`.
+- **GuruRMM agents:** SL-SERVER `0186e9d5-e1cc-4603-a81a-adb1f2230702`; Mac-mini-2 `1386d9fd-ac16-423c-ada0-5abad5b61838`.
+- **UCG-Scileppi:** host id `1C6A1B4B14DD0000000008CE3561000000000946A4750000000067C563E4:1543906151`; siteId `68f6f0d89dda23444538a492`; WAN `184.183.92.165`.
+- **Vault:** `services/unifi-site-manager` (working cloud key); `infrastructure/unifi-site-manager-api` (STALE/401); `infrastructure/gururmm-server.sops.yaml`.
+- **Prior log:** `clients/scileppi-law/session-logs/2026-07/2026-07-01-howard-sylvia-server-mount-nosleep-screenconnect.md`.
+- **Wiki:** `wiki/clients/scileppi-law.md` (updated this session); related `wiki/systems/uos-server.md`.
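
Editor's sketch (not part of the committed log): the `internetIssues5min` bucket decode recorded under "Commands & Outputs" (`index * 300 = epoch (UTC)`, Phoenix = UTC-7) can be reproduced with a few lines of Python. This is a minimal illustration, assuming only the decode rule as stated in the log. The function names (`bucket_to_utc`, `bucket_to_phoenix`, `utc_to_bucket`) are hypothetical helpers and are not part of any UniFi API.

```python
from datetime import datetime, timedelta, timezone

# Assumption taken from the log: one bucket per 5 minutes, index * 300 = Unix epoch (UTC).
BUCKET_SECONDS = 300

# Phoenix stays on UTC-7 all year (no daylight saving), so a fixed offset is enough here.
PHOENIX = timezone(timedelta(hours=-7), "MST")


def bucket_to_utc(index: int) -> datetime:
    """Convert a 5-minute bucket index to the UTC start time of that bucket."""
    return datetime.fromtimestamp(index * BUCKET_SECONDS, tz=timezone.utc)


def bucket_to_phoenix(index: int) -> datetime:
    """Convert a 5-minute bucket index to Phoenix local time."""
    return bucket_to_utc(index).astimezone(PHOENIX)


def utc_to_bucket(moment: datetime) -> int:
    """Inverse mapping: the bucket index that contains a timezone-aware datetime."""
    return int(moment.timestamp()) // BUCKET_SECONDS


if __name__ == "__main__":
    idx = 5943303  # the `wan_downtime` bucket cited in the log
    print(bucket_to_utc(idx).isoformat())      # 2026-07-02T11:15:00+00:00
    print(bucket_to_phoenix(idx).isoformat())  # 2026-07-02T04:15:00-07:00
```

The inverse helper is handy when going the other way, for example finding which bucket index to look for given a user-reported outage time in local time.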