sync: auto-sync from GURU-5070 at 2026-06-15 18:32:17
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-15 18:32:17
This commit is contained in:
@@ -0,0 +1,140 @@
|
|||||||
|
## User
|
||||||
|
- **User:** Mike Swanson (mike)
|
||||||
|
- **Machine:** GURU-5070
|
||||||
|
- **Role:** admin
|
||||||
|
|
||||||
|
## Session Summary
|
||||||
|
|
||||||
|
Multi-stream session. Two GuruRMM server bugs were diagnosed, fixed, and deployed to production,
|
||||||
|
and a substantial new fleet capability — the `unifi-wifi` tuning skill — was researched and built
|
||||||
|
against the self-hosted UOS controller.
|
||||||
|
|
||||||
|
**GuruRMM BSOD duplicate alerts (fixed + deployed).** Triaged a dashboard showing two identical
|
||||||
|
`VIDEO_TDR_FAILURE (0x116) on MSI` CRITICAL alerts. Root cause: the BSOD alert `dedup_key` was
|
||||||
|
`bsod:<agent>:<dump_sha256>` — unique per crash, so every recurrence spawned a new alert. Worse,
|
||||||
|
because `alert_mutes` keys on `dedup_key`, the "ignore permanently" Mike had set only matched the one
|
||||||
|
dump it was placed on, so each new crash re-alerted (a perma-ignore failure, not just a cosmetic
|
||||||
|
duplicate). Changed the key to `bsod:<agent>:<bugcheck_code>` (stable across recurrences). Committed
|
||||||
|
`f0a4b7f`, pushed → the build pipeline deployed server v0.3.73. Then corrected the live state on
|
||||||
|
`.30` via psql: retired the stale per-dump mute, inserted the correct stable-key mute for MSI
|
||||||
|
(`bsod:a685af29-...:0x116`), and resolved the 2 active duplicate alerts. Verified: MSI BSOD alerts
|
||||||
|
now 0 active.
|
||||||
|
|
||||||
|
**GuruRMM MSI cache EXDEV (fixed + deployed).** Explained a `gururmm_server::api::install: Failed to
|
||||||
|
move MSI to cache` server_error. The site-MSI builder staged the signed MSI in `std::env::temp_dir()`
|
||||||
|
(`/tmp`, a tmpfs on `.30`) then `rename`d it to `/opt/gururmm/downloads` (root LV) — a cross-device
|
||||||
|
rename that fails with EXDEV, so every site-specific MSI build 500'd. The signed-EXE path already
|
||||||
|
staged in `downloads_dir` for this reason; the MSI path was the outlier. Fixed (stage temp in
|
||||||
|
`downloads_dir`), committed `95ef901`, deployed server v0.3.74.
|
||||||
|
|
||||||
|
**UOS dedicated SSH key (Howard unblocked).** Howard was blocked on UOS controller (.29) access for
|
||||||
|
Cascades RF work. Generated a dedicated ed25519 keypair, installed its pubkey on `.29` root, and
|
||||||
|
vaulted it (`infrastructure/uos-server-ssh-key`, base64 in `ssh-private-key-b64`). Wired
|
||||||
|
`uos-mongo.sh` to auto-resolve it so any fleet machine works. Replied via coord.
|
||||||
|
|
||||||
|
**`unifi-wifi` skill (the main build).** Researched what the UOS controller exposes for RF tuning,
|
||||||
|
corrected an early wrong conclusion (the history is NOT in the `ace` config DB — it's in **`ace_stat`**:
|
||||||
|
`stat_hourly` per-AP/band `cu_total`/`cu_interf`/`num_sta`, and `wifi_connectivity_event` = the roam
|
||||||
|
graph). Built: `audit-site.sh` (config + foreign-interference audit), `model-rank.sh` (airtime-reduction
|
||||||
|
ranking), `optimize-radios.sh` (coverage-safe power-down/disable planner, multi-model-hardened via
|
||||||
|
Grok+Gemini), `live-stats.sh` (controller live API, needs a vaulted admin), `watch-ap.sh` (per-AP
|
||||||
|
real-time RF watch via direct AP SSH). Confirmed direct AP SSH is feasible (device-auth vaulted
|
||||||
|
`clients/cascades-tucson/unifi-ap-ssh`); needs the Cascades VPN for L3 reach. Messaged Howard the
|
||||||
|
handoff.
|
||||||
|
|
||||||
|
## Key Decisions
|
||||||
|
|
||||||
|
- **BSOD/mute key on `(agent,bugcheck)` not dump hash.** One fix resolves both the duplicate alerts
|
||||||
|
and the broken perma-ignore (both ride on `dedup_key`). Counting is preserved (every dump still in
|
||||||
|
`bsod_events`); muting only suppresses the active alert + email.
|
||||||
|
- **Deploy via push (webhook pipeline), DB cleanup via psql on `.30`.** The pipeline auto-builds on
|
||||||
|
push to `guru-rmm` main; the existing duplicate alerts and the corrected mute don't self-fix, so
|
||||||
|
applied them directly in Postgres.
|
||||||
|
- **UOS key: dedicated keypair, not the standard key.** Vaulting GURU-5070's broad personal key
|
||||||
|
fleet-wide was rejected; a dedicated, revocable key scoped to `.29` was generated instead.
|
||||||
|
- **Vault multiline keys as base64.** `vault-helper --set` collapses multiline values to one line
|
||||||
|
(corrupts SSH keys); store as `*-b64` and decode on use. (Root cause of a failed key round-trip.)
|
||||||
|
- **WiFi coverage model = the roam graph, not distance.** Materials-aware by construction: Cascades'
|
||||||
|
steel-reinforced hallway walls block cross-hall RF, so clients never roam across them and the model
|
||||||
|
never calls those APs redundant. Distance/floorplan is only a prior; RF/roam evidence is the truth.
|
||||||
|
- **Power-down now, disable later.** Cascades airtime data robustly supports powering down ~all 2.4
|
||||||
|
radios (safe, keeps BSSID); roam data is too sparse to PROVE coverage redundancy for disables, so the
|
||||||
|
optimizer recommends 0 disables until the live AP-to-AP RF neighbor table (API wireup) exists.
|
||||||
|
- **Multi-AI on design AND implementation.** Grok+Gemini critiqued the optimizer design (caught the
|
||||||
|
capacity-cascade risk → added load-shift simulation; bidirectional roams; band-specific RSSI;
|
||||||
|
40%/zone cap; retries normalization).
|
||||||
|
|
||||||
|
## Problems Encountered
|
||||||
|
|
||||||
|
- **Vaulted SSH key didn't round-trip** (`libcrypto: unsupported`): `vault-helper --set` mangled the
|
||||||
|
multiline key to one line. Fixed by storing base64 (`ssh-private-key-b64`) + decode on use.
|
||||||
|
- **`tx_retries` shown as 958%/6317%** in the optimizer: it's a raw count, not a %. Normalized by
|
||||||
|
`wifi_tx_attempts`.
|
||||||
|
- **Optimizer over-classified "isolated-essential"**: sparse roam data → almost no strong neighbor →
|
||||||
|
everything looked isolated. Resolved by making POWER-DOWN (coverage-safe) the default for saturated
|
||||||
|
radios regardless of neighbor evidence, reserving DISABLE for radios with positive coverage evidence.
|
||||||
|
- **`6e` as a JS object key = SyntaxError** ("missing exponent") in mongo-shell JS (parsed as a number).
|
||||||
|
Quote it (`'6e'`) or avoid it as a bare key.
|
||||||
|
- **RMM SYSTEM context vs user mapped drives** (earlier in session, logged as correction): an AP/UNC
|
||||||
|
redirect check under SYSTEM read False; the share existed in the user session. Diagnose in `user_session`.
|
||||||
|
|
||||||
|
## Configuration Changes
|
||||||
|
|
||||||
|
Created (skill `.claude/skills/unifi-wifi/`): `SKILL.md`; `references/data-access.md`,
|
||||||
|
`references/methodology.md`, `references/interference-model.md`; `scripts/audit-site.sh`,
|
||||||
|
`scripts/model-rank.sh`, `scripts/optimize-radios.sh`, `scripts/live-stats.sh`, `scripts/watch-ap.sh`.
|
||||||
|
Modified: `.claude/scripts/uos-mongo.sh` (auto-resolve the vaulted UOS key); `wiki/systems/uos-server.md`
|
||||||
|
(dedicated key), `wiki/systems/jupiter.md`, `wiki/clients/internal-infrastructure.md`.
|
||||||
|
|
||||||
|
guru-rmm (submodule, deployed): `server/src/ws/mod.rs` (BSOD dedup_key, `f0a4b7f`/v0.3.73);
|
||||||
|
`server/src/api/install.rs` (MSI EXDEV, `95ef901`/v0.3.74).
|
||||||
|
|
||||||
|
Vault (all pushed): `infrastructure/uos-server-ssh-key`, `clients/cascades-tucson/unifi-ap-ssh`.
|
||||||
|
DB (gururmm Postgres on `.30`): alert_mutes for MSI corrected; 2 BSOD alerts resolved.
|
||||||
|
Memory: `feedback_rmm_system_context_mapped_drives.md` (+ MEMORY.md line). errorlog: RMM-SYSTEM correction.
|
||||||
|
|
||||||
|
## Credentials & Secrets
|
||||||
|
|
||||||
|
- **UOS dedicated root SSH key** — vault `infrastructure/uos-server-ssh-key` (private key base64 in
|
||||||
|
`credentials.ssh-private-key-b64`; pubkey in `/root/.ssh/authorized_keys` on `.29`). Decode:
|
||||||
|
`vault.sh get-field ... credentials.ssh-private-key-b64 | base64 -d`.
|
||||||
|
- **Cascades UniFi AP device-auth SSH** — vault `clients/cascades-tucson/unifi-ap-ssh`: user
|
||||||
|
`gUJiB84lr6C4`, password `RJE3VIqXiA8Gj` (all Cascades APs; needs Cascades VPN to reach 192.168.2.x/3.x).
|
||||||
|
- **UOS cloud Site Manager API key** (from earlier) — vault `infrastructure/unifi-site-manager-api`
|
||||||
|
(`amY54KqX0i0OuGEYNykLdH9M1Kd4jhzt`); works on api.ui.com for adopted devices only, 401s the local API.
|
||||||
|
- **gururmm Postgres** (used for the alert DB fix) — vault `infrastructure/gururmm-server-physical`
|
||||||
|
`credentials.databases.postgresql-*` (db gururmm / user gururmm).
|
||||||
|
|
||||||
|
## Infrastructure & Servers
|
||||||
|
|
||||||
|
- **GuruRMM server `.30`** (hostname gururmm, Ubuntu): systemd `gururmm-server`; repo `/home/guru/gururmm`;
|
||||||
|
binary `/opt/gururmm/gururmm-server`; build log `/var/log/gururmm-build.log`; push-to-main webhook
|
||||||
|
auto-builds+deploys. SSH `infrastructure/gururmm-server-physical`. Now at v0.3.74.
|
||||||
|
- **UOS Server `.29`** (unifi.azcomputerguru.com → `:11443` via NPM; Rocky 9; UniFi Network = ace.jar +
|
||||||
|
Mongo `ace`/`ace_stat`/`ace_audit` on 127.0.0.1:27117 inside rootless podman `uosserver`). Cascades
|
||||||
|
site `_id 685f39068e65331c46ef6dd2`, short `va6iba3v`, 77 APs (U7-Pro/6E) + 12 switches.
|
||||||
|
- **MSI BSOD agent** `a685af29-ef35-46da-ac3d-431e713b70ab` (recurring 0x116; being replaced).
|
||||||
|
|
||||||
|
## Commands & Outputs
|
||||||
|
|
||||||
|
- WiFi audit/rank/optimize: `bash .claude/skills/unifi-wifi/scripts/{audit-site,model-rank,optimize-radios}.sh cascades`.
|
||||||
|
- UOS Mongo (incl. ace_stat): `bash .claude/scripts/uos-mongo.sh` (pipe JS; `db.getSiblingDB('ace_stat')`).
|
||||||
|
- Cascades 2.4 finding: 75 APs, cu_total 74–94%, cu_interf 61–81%, ~1 client each → power-down 74/75, 0 disables yet.
|
||||||
|
- MSI EXDEV: `err=Invalid cross-device link (os error 18)`; `/tmp`=tmpfs, `/opt/gururmm/downloads`=root LV.
|
||||||
|
|
||||||
|
## Pending / Incomplete Tasks
|
||||||
|
|
||||||
|
- **Enable the Cascades site VPN** → unlocks `watch-ap.sh` (per-AP real-time) — APs on 192.168.2.x/3.x.
|
||||||
|
- **Create a read-only UOS Network admin**, vault `infrastructure/uos-server-network-api` →
|
||||||
|
`live-stats.sh` (controller-wide live RF + the AP-to-AP RF neighbor table that enables confident disables).
|
||||||
|
- Howard placing APs on the UniFi floorplan → adds distance-prior edges to the model.
|
||||||
|
- `optimize-radios.sh` v-next: greedy disable becomes meaningful once the RF table exists.
|
||||||
|
- (Carried, earlier in day) SP-SharonW11 M365 license removal — coord todo `79d291db`, EOW 2026-06-19.
|
||||||
|
|
||||||
|
## Reference Information
|
||||||
|
|
||||||
|
- guru-rmm commits: BSOD `f0a4b7f` (v0.3.73), MSI `95ef901` (v0.3.74). Coord msgs to Howard:
|
||||||
|
`a589f230` (wifi skill), `a589...`/UOS key reply. MSI alert dedup_key now `bsod:<agent>:0x116`.
|
||||||
|
- Skill: `.claude/skills/unifi-wifi/` (SKILL.md + references/ + scripts/). Data planes: `ace` (config),
|
||||||
|
`ace_stat` (history: stat_hourly/daily + wifi_connectivity_event), live Network API (optional).
|
||||||
|
- UOS access: `infrastructure/uos-server-ssh-key` + `.claude/scripts/uos-mongo.sh`; wiki `systems/uos-server.md`.
|
||||||
Reference in New Issue
Block a user