From 3f01efb6bf350cc494312d083a0073b6f4df1ab7 Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Mon, 15 Jun 2026 18:32:32 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-15 18:32:17 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-15 18:32:17 --- ...mike-unifi-wifi-skill-and-gururmm-fixes.md | 140 ++++++++++++++++++ 1 file changed, 140 insertions(+) create mode 100644 session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md diff --git a/session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md b/session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md new file mode 100644 index 0000000..eeb1163 --- /dev/null +++ b/session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md @@ -0,0 +1,140 @@ +## User +- **User:** Mike Swanson (mike) +- **Machine:** GURU-5070 +- **Role:** admin + +## Session Summary + +Multi-stream session. Two GuruRMM server bugs were diagnosed, fixed, and deployed to production, +and a substantial new fleet capability — the `unifi-wifi` tuning skill — was researched and built +against the self-hosted UOS controller. + +**GuruRMM BSOD duplicate alerts (fixed + deployed).** Triaged a dashboard showing two identical +`VIDEO_TDR_FAILURE (0x116) on MSI` CRITICAL alerts. Root cause: the BSOD alert `dedup_key` was +`bsod::` — unique per crash, so every recurrence spawned a new alert. Worse, +because `alert_mutes` keys on `dedup_key`, the "ignore permanently" Mike had set only matched the one +dump it was placed on, so each new crash re-alerted (a perma-ignore failure, not just a cosmetic +duplicate). Changed the key to `bsod::` (stable across recurrences). Committed +`f0a4b7f`, pushed → the build pipeline deployed server v0.3.73. Then corrected the live state on +`.30` via psql: retired the stale per-dump mute, inserted the correct stable-key mute for MSI +(`bsod:a685af29-...:0x116`), and resolved the 2 active duplicate alerts. Verified: MSI BSOD alerts +now 0 active. + +**GuruRMM MSI cache EXDEV (fixed + deployed).** Explained a `gururmm_server::api::install: Failed to +move MSI to cache` server_error. The site-MSI builder staged the signed MSI in `std::env::temp_dir()` +(`/tmp`, a tmpfs on `.30`) then `rename`d it to `/opt/gururmm/downloads` (root LV) — a cross-device +rename that fails with EXDEV, so every site-specific MSI build 500'd. The signed-EXE path already +staged in `downloads_dir` for this reason; the MSI path was the outlier. Fixed (stage temp in +`downloads_dir`), committed `95ef901`, deployed server v0.3.74. + +**UOS dedicated SSH key (Howard unblocked).** Howard was blocked on UOS controller (.29) access for +Cascades RF work. Generated a dedicated ed25519 keypair, installed its pubkey on `.29` root, and +vaulted it (`infrastructure/uos-server-ssh-key`, base64 in `ssh-private-key-b64`). Wired +`uos-mongo.sh` to auto-resolve it so any fleet machine works. Replied via coord. + +**`unifi-wifi` skill (the main build).** Researched what the UOS controller exposes for RF tuning, +corrected an early wrong conclusion (the history is NOT in the `ace` config DB — it's in **`ace_stat`**: +`stat_hourly` per-AP/band `cu_total`/`cu_interf`/`num_sta`, and `wifi_connectivity_event` = the roam +graph). Built: `audit-site.sh` (config + foreign-interference audit), `model-rank.sh` (airtime-reduction +ranking), `optimize-radios.sh` (coverage-safe power-down/disable planner, multi-model-hardened via +Grok+Gemini), `live-stats.sh` (controller live API, needs a vaulted admin), `watch-ap.sh` (per-AP +real-time RF watch via direct AP SSH). Confirmed direct AP SSH is feasible (device-auth vaulted +`clients/cascades-tucson/unifi-ap-ssh`); needs the Cascades VPN for L3 reach. Messaged Howard the +handoff. + +## Key Decisions + +- **BSOD/mute key on `(agent,bugcheck)` not dump hash.** One fix resolves both the duplicate alerts + and the broken perma-ignore (both ride on `dedup_key`). Counting is preserved (every dump still in + `bsod_events`); muting only suppresses the active alert + email. +- **Deploy via push (webhook pipeline), DB cleanup via psql on `.30`.** The pipeline auto-builds on + push to `guru-rmm` main; the existing duplicate alerts and the corrected mute don't self-fix, so + applied them directly in Postgres. +- **UOS key: dedicated keypair, not the standard key.** Vaulting GURU-5070's broad personal key + fleet-wide was rejected; a dedicated, revocable key scoped to `.29` was generated instead. +- **Vault multiline keys as base64.** `vault-helper --set` collapses multiline values to one line + (corrupts SSH keys); store as `*-b64` and decode on use. (Root cause of a failed key round-trip.) +- **WiFi coverage model = the roam graph, not distance.** Materials-aware by construction: Cascades' + steel-reinforced hallway walls block cross-hall RF, so clients never roam across them and the model + never calls those APs redundant. Distance/floorplan is only a prior; RF/roam evidence is the truth. +- **Power-down now, disable later.** Cascades airtime data robustly supports powering down ~all 2.4 + radios (safe, keeps BSSID); roam data is too sparse to PROVE coverage redundancy for disables, so the + optimizer recommends 0 disables until the live AP-to-AP RF neighbor table (API wireup) exists. +- **Multi-AI on design AND implementation.** Grok+Gemini critiqued the optimizer design (caught the + capacity-cascade risk → added load-shift simulation; bidirectional roams; band-specific RSSI; + 40%/zone cap; retries normalization). + +## Problems Encountered + +- **Vaulted SSH key didn't round-trip** (`libcrypto: unsupported`): `vault-helper --set` mangled the + multiline key to one line. Fixed by storing base64 (`ssh-private-key-b64`) + decode on use. +- **`tx_retries` shown as 958%/6317%** in the optimizer: it's a raw count, not a %. Normalized by + `wifi_tx_attempts`. +- **Optimizer over-classified "isolated-essential"**: sparse roam data → almost no strong neighbor → + everything looked isolated. Resolved by making POWER-DOWN (coverage-safe) the default for saturated + radios regardless of neighbor evidence, reserving DISABLE for radios with positive coverage evidence. +- **`6e` as a JS object key = SyntaxError** ("missing exponent") in mongo-shell JS (parsed as a number). + Quote it (`'6e'`) or avoid it as a bare key. +- **RMM SYSTEM context vs user mapped drives** (earlier in session, logged as correction): an AP/UNC + redirect check under SYSTEM read False; the share existed in the user session. Diagnose in `user_session`. + +## Configuration Changes + +Created (skill `.claude/skills/unifi-wifi/`): `SKILL.md`; `references/data-access.md`, +`references/methodology.md`, `references/interference-model.md`; `scripts/audit-site.sh`, +`scripts/model-rank.sh`, `scripts/optimize-radios.sh`, `scripts/live-stats.sh`, `scripts/watch-ap.sh`. +Modified: `.claude/scripts/uos-mongo.sh` (auto-resolve the vaulted UOS key); `wiki/systems/uos-server.md` +(dedicated key), `wiki/systems/jupiter.md`, `wiki/clients/internal-infrastructure.md`. + +guru-rmm (submodule, deployed): `server/src/ws/mod.rs` (BSOD dedup_key, `f0a4b7f`/v0.3.73); +`server/src/api/install.rs` (MSI EXDEV, `95ef901`/v0.3.74). + +Vault (all pushed): `infrastructure/uos-server-ssh-key`, `clients/cascades-tucson/unifi-ap-ssh`. +DB (gururmm Postgres on `.30`): alert_mutes for MSI corrected; 2 BSOD alerts resolved. +Memory: `feedback_rmm_system_context_mapped_drives.md` (+ MEMORY.md line). errorlog: RMM-SYSTEM correction. + +## Credentials & Secrets + +- **UOS dedicated root SSH key** — vault `infrastructure/uos-server-ssh-key` (private key base64 in + `credentials.ssh-private-key-b64`; pubkey in `/root/.ssh/authorized_keys` on `.29`). Decode: + `vault.sh get-field ... credentials.ssh-private-key-b64 | base64 -d`. +- **Cascades UniFi AP device-auth SSH** — vault `clients/cascades-tucson/unifi-ap-ssh`: user + `gUJiB84lr6C4`, password `RJE3VIqXiA8Gj` (all Cascades APs; needs Cascades VPN to reach 192.168.2.x/3.x). +- **UOS cloud Site Manager API key** (from earlier) — vault `infrastructure/unifi-site-manager-api` + (`amY54KqX0i0OuGEYNykLdH9M1Kd4jhzt`); works on api.ui.com for adopted devices only, 401s the local API. +- **gururmm Postgres** (used for the alert DB fix) — vault `infrastructure/gururmm-server-physical` + `credentials.databases.postgresql-*` (db gururmm / user gururmm). + +## Infrastructure & Servers + +- **GuruRMM server `.30`** (hostname gururmm, Ubuntu): systemd `gururmm-server`; repo `/home/guru/gururmm`; + binary `/opt/gururmm/gururmm-server`; build log `/var/log/gururmm-build.log`; push-to-main webhook + auto-builds+deploys. SSH `infrastructure/gururmm-server-physical`. Now at v0.3.74. +- **UOS Server `.29`** (unifi.azcomputerguru.com → `:11443` via NPM; Rocky 9; UniFi Network = ace.jar + + Mongo `ace`/`ace_stat`/`ace_audit` on 127.0.0.1:27117 inside rootless podman `uosserver`). Cascades + site `_id 685f39068e65331c46ef6dd2`, short `va6iba3v`, 77 APs (U7-Pro/6E) + 12 switches. +- **MSI BSOD agent** `a685af29-ef35-46da-ac3d-431e713b70ab` (recurring 0x116; being replaced). + +## Commands & Outputs + +- WiFi audit/rank/optimize: `bash .claude/skills/unifi-wifi/scripts/{audit-site,model-rank,optimize-radios}.sh cascades`. +- UOS Mongo (incl. ace_stat): `bash .claude/scripts/uos-mongo.sh` (pipe JS; `db.getSiblingDB('ace_stat')`). +- Cascades 2.4 finding: 75 APs, cu_total 74–94%, cu_interf 61–81%, ~1 client each → power-down 74/75, 0 disables yet. +- MSI EXDEV: `err=Invalid cross-device link (os error 18)`; `/tmp`=tmpfs, `/opt/gururmm/downloads`=root LV. + +## Pending / Incomplete Tasks + +- **Enable the Cascades site VPN** → unlocks `watch-ap.sh` (per-AP real-time) — APs on 192.168.2.x/3.x. +- **Create a read-only UOS Network admin**, vault `infrastructure/uos-server-network-api` → + `live-stats.sh` (controller-wide live RF + the AP-to-AP RF neighbor table that enables confident disables). +- Howard placing APs on the UniFi floorplan → adds distance-prior edges to the model. +- `optimize-radios.sh` v-next: greedy disable becomes meaningful once the RF table exists. +- (Carried, earlier in day) SP-SharonW11 M365 license removal — coord todo `79d291db`, EOW 2026-06-19. + +## Reference Information + +- guru-rmm commits: BSOD `f0a4b7f` (v0.3.73), MSI `95ef901` (v0.3.74). Coord msgs to Howard: + `a589f230` (wifi skill), `a589...`/UOS key reply. MSI alert dedup_key now `bsod::0x116`. +- Skill: `.claude/skills/unifi-wifi/` (SKILL.md + references/ + scripts/). Data planes: `ace` (config), + `ace_stat` (history: stat_hourly/daily + wifi_connectivity_event), live Network API (optional). +- UOS access: `infrastructure/uos-server-ssh-key` + `.claude/scripts/uos-mongo.sh`; wiki `systems/uos-server.md`.