sync: auto-sync from GURU-5070 at 2026-06-15 18:32:17

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-15 18:32:17
This commit is contained in:
2026-06-15 18:32:32 -07:00
parent 9a2f806c67
commit 3f01efb6bf

View File

@@ -0,0 +1,140 @@
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Multi-stream session. Two GuruRMM server bugs were diagnosed, fixed, and deployed to production,
and a substantial new fleet capability — the `unifi-wifi` tuning skill — was researched and built
against the self-hosted UOS controller.
**GuruRMM BSOD duplicate alerts (fixed + deployed).** Triaged a dashboard showing two identical
`VIDEO_TDR_FAILURE (0x116) on MSI` CRITICAL alerts. Root cause: the BSOD alert `dedup_key` was
`bsod:<agent>:<dump_sha256>` — unique per crash, so every recurrence spawned a new alert. Worse,
because `alert_mutes` keys on `dedup_key`, the "ignore permanently" Mike had set only matched the one
dump it was placed on, so each new crash re-alerted (a perma-ignore failure, not just a cosmetic
duplicate). Changed the key to `bsod:<agent>:<bugcheck_code>` (stable across recurrences). Committed
`f0a4b7f`, pushed → the build pipeline deployed server v0.3.73. Then corrected the live state on
`.30` via psql: retired the stale per-dump mute, inserted the correct stable-key mute for MSI
(`bsod:a685af29-...:0x116`), and resolved the 2 active duplicate alerts. Verified: MSI BSOD alerts
now 0 active.
**GuruRMM MSI cache EXDEV (fixed + deployed).** Explained a `gururmm_server::api::install: Failed to
move MSI to cache` server_error. The site-MSI builder staged the signed MSI in `std::env::temp_dir()`
(`/tmp`, a tmpfs on `.30`) then `rename`d it to `/opt/gururmm/downloads` (root LV) — a cross-device
rename that fails with EXDEV, so every site-specific MSI build 500'd. The signed-EXE path already
staged in `downloads_dir` for this reason; the MSI path was the outlier. Fixed (stage temp in
`downloads_dir`), committed `95ef901`, deployed server v0.3.74.
**UOS dedicated SSH key (Howard unblocked).** Howard was blocked on UOS controller (.29) access for
Cascades RF work. Generated a dedicated ed25519 keypair, installed its pubkey on `.29` root, and
vaulted it (`infrastructure/uos-server-ssh-key`, base64 in `ssh-private-key-b64`). Wired
`uos-mongo.sh` to auto-resolve it so any fleet machine works. Replied via coord.
**`unifi-wifi` skill (the main build).** Researched what the UOS controller exposes for RF tuning,
corrected an early wrong conclusion (the history is NOT in the `ace` config DB — it's in **`ace_stat`**:
`stat_hourly` per-AP/band `cu_total`/`cu_interf`/`num_sta`, and `wifi_connectivity_event` = the roam
graph). Built: `audit-site.sh` (config + foreign-interference audit), `model-rank.sh` (airtime-reduction
ranking), `optimize-radios.sh` (coverage-safe power-down/disable planner, multi-model-hardened via
Grok+Gemini), `live-stats.sh` (controller live API, needs a vaulted admin), `watch-ap.sh` (per-AP
real-time RF watch via direct AP SSH). Confirmed direct AP SSH is feasible (device-auth vaulted
`clients/cascades-tucson/unifi-ap-ssh`); needs the Cascades VPN for L3 reach. Messaged Howard the
handoff.
## Key Decisions
- **BSOD/mute key on `(agent,bugcheck)` not dump hash.** One fix resolves both the duplicate alerts
and the broken perma-ignore (both ride on `dedup_key`). Counting is preserved (every dump still in
`bsod_events`); muting only suppresses the active alert + email.
- **Deploy via push (webhook pipeline), DB cleanup via psql on `.30`.** The pipeline auto-builds on
push to `guru-rmm` main; the existing duplicate alerts and the corrected mute don't self-fix, so
applied them directly in Postgres.
- **UOS key: dedicated keypair, not the standard key.** Vaulting GURU-5070's broad personal key
fleet-wide was rejected; a dedicated, revocable key scoped to `.29` was generated instead.
- **Vault multiline keys as base64.** `vault-helper --set` collapses multiline values to one line
(corrupts SSH keys); store as `*-b64` and decode on use. (Root cause of a failed key round-trip.)
- **WiFi coverage model = the roam graph, not distance.** Materials-aware by construction: Cascades'
steel-reinforced hallway walls block cross-hall RF, so clients never roam across them and the model
never calls those APs redundant. Distance/floorplan is only a prior; RF/roam evidence is the truth.
- **Power-down now, disable later.** Cascades airtime data robustly supports powering down ~all 2.4
radios (safe, keeps BSSID); roam data is too sparse to PROVE coverage redundancy for disables, so the
optimizer recommends 0 disables until the live AP-to-AP RF neighbor table (API wireup) exists.
- **Multi-AI on design AND implementation.** Grok+Gemini critiqued the optimizer design (caught the
capacity-cascade risk → added load-shift simulation; bidirectional roams; band-specific RSSI;
40%/zone cap; retries normalization).
## Problems Encountered
- **Vaulted SSH key didn't round-trip** (`libcrypto: unsupported`): `vault-helper --set` mangled the
multiline key to one line. Fixed by storing base64 (`ssh-private-key-b64`) + decode on use.
- **`tx_retries` shown as 958%/6317%** in the optimizer: it's a raw count, not a %. Normalized by
`wifi_tx_attempts`.
- **Optimizer over-classified "isolated-essential"**: sparse roam data → almost no strong neighbor →
everything looked isolated. Resolved by making POWER-DOWN (coverage-safe) the default for saturated
radios regardless of neighbor evidence, reserving DISABLE for radios with positive coverage evidence.
- **`6e` as a JS object key = SyntaxError** ("missing exponent") in mongo-shell JS (parsed as a number).
Quote it (`'6e'`) or avoid it as a bare key.
- **RMM SYSTEM context vs user mapped drives** (earlier in session, logged as correction): an AP/UNC
redirect check under SYSTEM read False; the share existed in the user session. Diagnose in `user_session`.
## Configuration Changes
Created (skill `.claude/skills/unifi-wifi/`): `SKILL.md`; `references/data-access.md`,
`references/methodology.md`, `references/interference-model.md`; `scripts/audit-site.sh`,
`scripts/model-rank.sh`, `scripts/optimize-radios.sh`, `scripts/live-stats.sh`, `scripts/watch-ap.sh`.
Modified: `.claude/scripts/uos-mongo.sh` (auto-resolve the vaulted UOS key); `wiki/systems/uos-server.md`
(dedicated key), `wiki/systems/jupiter.md`, `wiki/clients/internal-infrastructure.md`.
guru-rmm (submodule, deployed): `server/src/ws/mod.rs` (BSOD dedup_key, `f0a4b7f`/v0.3.73);
`server/src/api/install.rs` (MSI EXDEV, `95ef901`/v0.3.74).
Vault (all pushed): `infrastructure/uos-server-ssh-key`, `clients/cascades-tucson/unifi-ap-ssh`.
DB (gururmm Postgres on `.30`): alert_mutes for MSI corrected; 2 BSOD alerts resolved.
Memory: `feedback_rmm_system_context_mapped_drives.md` (+ MEMORY.md line). errorlog: RMM-SYSTEM correction.
## Credentials & Secrets
- **UOS dedicated root SSH key** — vault `infrastructure/uos-server-ssh-key` (private key base64 in
`credentials.ssh-private-key-b64`; pubkey in `/root/.ssh/authorized_keys` on `.29`). Decode:
`vault.sh get-field ... credentials.ssh-private-key-b64 | base64 -d`.
- **Cascades UniFi AP device-auth SSH** — vault `clients/cascades-tucson/unifi-ap-ssh`: user
`gUJiB84lr6C4`, password `RJE3VIqXiA8Gj` (all Cascades APs; needs Cascades VPN to reach 192.168.2.x/3.x).
- **UOS cloud Site Manager API key** (from earlier) — vault `infrastructure/unifi-site-manager-api`
(`amY54KqX0i0OuGEYNykLdH9M1Kd4jhzt`); works on api.ui.com for adopted devices only, 401s the local API.
- **gururmm Postgres** (used for the alert DB fix) — vault `infrastructure/gururmm-server-physical`
`credentials.databases.postgresql-*` (db gururmm / user gururmm).
## Infrastructure & Servers
- **GuruRMM server `.30`** (hostname gururmm, Ubuntu): systemd `gururmm-server`; repo `/home/guru/gururmm`;
binary `/opt/gururmm/gururmm-server`; build log `/var/log/gururmm-build.log`; push-to-main webhook
auto-builds+deploys. SSH `infrastructure/gururmm-server-physical`. Now at v0.3.74.
- **UOS Server `.29`** (unifi.azcomputerguru.com → `:11443` via NPM; Rocky 9; UniFi Network = ace.jar +
Mongo `ace`/`ace_stat`/`ace_audit` on 127.0.0.1:27117 inside rootless podman `uosserver`). Cascades
site `_id 685f39068e65331c46ef6dd2`, short `va6iba3v`, 77 APs (U7-Pro/6E) + 12 switches.
- **MSI BSOD agent** `a685af29-ef35-46da-ac3d-431e713b70ab` (recurring 0x116; being replaced).
## Commands & Outputs
- WiFi audit/rank/optimize: `bash .claude/skills/unifi-wifi/scripts/{audit-site,model-rank,optimize-radios}.sh cascades`.
- UOS Mongo (incl. ace_stat): `bash .claude/scripts/uos-mongo.sh` (pipe JS; `db.getSiblingDB('ace_stat')`).
- Cascades 2.4 finding: 75 APs, cu_total 7494%, cu_interf 6181%, ~1 client each → power-down 74/75, 0 disables yet.
- MSI EXDEV: `err=Invalid cross-device link (os error 18)`; `/tmp`=tmpfs, `/opt/gururmm/downloads`=root LV.
## Pending / Incomplete Tasks
- **Enable the Cascades site VPN** → unlocks `watch-ap.sh` (per-AP real-time) — APs on 192.168.2.x/3.x.
- **Create a read-only UOS Network admin**, vault `infrastructure/uos-server-network-api`
`live-stats.sh` (controller-wide live RF + the AP-to-AP RF neighbor table that enables confident disables).
- Howard placing APs on the UniFi floorplan → adds distance-prior edges to the model.
- `optimize-radios.sh` v-next: greedy disable becomes meaningful once the RF table exists.
- (Carried, earlier in day) SP-SharonW11 M365 license removal — coord todo `79d291db`, EOW 2026-06-19.
## Reference Information
- guru-rmm commits: BSOD `f0a4b7f` (v0.3.73), MSI `95ef901` (v0.3.74). Coord msgs to Howard:
`a589f230` (wifi skill), `a589...`/UOS key reply. MSI alert dedup_key now `bsod:<agent>:0x116`.
- Skill: `.claude/skills/unifi-wifi/` (SKILL.md + references/ + scripts/). Data planes: `ace` (config),
`ace_stat` (history: stat_hourly/daily + wifi_connectivity_event), live Network API (optional).
- UOS access: `infrastructure/uos-server-ssh-key` + `.claude/scripts/uos-mongo.sh`; wiki `systems/uos-server.md`.