From 05c17b476f9a092cafd53b9e1e516aef5940b0a6 Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Sun, 7 Jun 2026 16:47:05 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-07 16:47:01 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-07 16:47:01 --- ...6-07-mike-gururmm-offline-alerting-mute.md | 170 ++++++++++++++++++ wiki/index.md | 2 +- wiki/projects/gururmm.md | 43 ++--- wiki/systems/jupiter.md | 3 +- 4 files changed, 191 insertions(+), 27 deletions(-) create mode 100644 session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md diff --git a/session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md b/session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md new file mode 100644 index 0000000..8c47d9a --- /dev/null +++ b/session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md @@ -0,0 +1,170 @@ +# GuruRMM — Role-Aware Offline Alerting + Alert Ignore/Mute (+ MSP360 API probe, Jupiter VM diag) + +## User +- **User:** Mike Swanson (mike) +- **Machine:** GURU-5070 +- **Role:** admin + +## Session Summary + +Shipped two GuruRMM alerting features end-to-end and investigated an MSP360 API +capability question, then diagnosed an unrelated Unraid VM network fault. After the +morning backup-alert cleanup (separate log), Mike asked whether the MSP360 API could +delete plans / trigger backups / configure storage for the three console follow-ups. +Probed `api.mspbackups.com` with the vaulted Provider token: it is monitoring-only +(Companies/Users/Monitoring, GET; every management path 404s; OPTIONS confirms +read-only). The agent-CLI fallback also failed -- the white-labeled MSP360 agent on +SERVER has no `cbb.exe` at any standard path. Concluded the three items stay +MSP360-console tasks and filed them as coord todos for Mike. + +Built **role-aware offline alerting + correlated mass-offline detection** (spec +`role-aware-offline-alerting`). Servers offline are incidents; workstations offline +are routine. Implemented server-side (Tasks 1-5): migration 054 `role_override` +column + a shared `agent_role` classifier delegating to the canonical +`agent_is_server`; `OfflineAlertingConfig` in the policy JSONB; a new scheduled +offline-sweep evaluator generating server-only `agent_offline` alerts plus site +(>=50% + >=3) and fleet (>=10) `mass_offline_*` aggregates; a warm-up restart guard; +and `PUT /api/agents/:id/role-override`. Code Review REJECTED the first cut for a +CRITICAL defect in the spec's own restart guard (`last_seen < started_at` +permanently disabled detection in steady state because `last_seen` advances on every +heartbeat); fixed by making the warm-up window the sole guard. Re-review APPROVED. +Then the dashboard unit (Tasks 6-7): role-aware triage (offline servers individual + +elevated; offline workstations collapsed into a quiet "N workstations offline" +roll-up) and the role-override control on the agent detail page; plus a small server +DTO fix so `GET /api/agents/:id` returns `role_override`. Verified live: a real +offline server (WIN-TG2STMODJG8, "Windows Server 2019 Standard Evaluation", +auto-classified via `os_name`) fired `agent_offline`; zero false mass-offline despite +~55 chronically-offline agents; restart guard held across deploys. + +Built **alert ignore/mute (perma-silence)** (spec `alert-mute`). Distinct from +ack/resolve, which only quiet the current cycle: a mute keyed on `dedup_key` +suppresses a recurring condition until un-ignored, with a required reason. Server +unit (Tasks 1-4): migration 055 `alert_mutes` table + `muted` status; an +`is_dedup_muted` gate inserted into the universal `create_or_update_alert` AND the +one `create_check_alert` bypass so muted conditions write `status='muted'` (never +active, never email); transactional `mute_condition`/`unmute_condition`; +`POST /api/alerts/:id/mute` (reason required -> 400) and `/unmute`. Code Review +APPROVED (the muted gate is a top-of-function early return, so the active path is +byte-for-byte unchanged for normal alerts). Verified live against WIN-TG2STMODJG8's +re-firing `agent_offline`: mute -> `muted`, survived a full offline sweep without +re-firing, unmute -> `active`. The dashboard unit (Task 5) is NOT yet built. + +Finally, diagnosed an Unraid VM network fault on Jupiter (172.16.3.20): the libvirt +domain "Windows Server 2016" (guest hostname `ACG-DWP-X-BB`) had no LAN. Found the +host side healthy (vnet8 bridged to br0, forwarding, receiving LAN broadcast), and +the guest holding APIPA `169.254.157.152` with no gateway -- its single e1000 NIC is +DHCP-enabled but not getting a lease from pfSense (172.16.0.1). Paused for Mike's +call on DHCP-vs-static before changing the guest. + +## Key Decisions + +- Offline classification = auto (`os_product_type` 2/3, else `os_name`/`os_version` + ~/server/i) + a manual `role_override` column. `os_product_type` is populated on + only 16/168 agents, so `os_name` is the workhorse signal; inventory-less offline + servers (SIF-SERVER, Server2013) auto-classify as workstations and need the manual + override -- Mike held on tagging them. +- Mass/aggregate alerts pin to a representative offline agent with a site/fleet + `dedup_key` rather than making `alerts.agent_id` nullable (avoids a rippling + migration across every alert path). +- Restart guard = warm-up window only (max grace after boot). Dropped the broken + `last_seen < started_at` clause. Individual server alerts fire independently of the + site-outage alert (so a still-down server keeps paging after an outage clears). +- Alert mute keyed on `dedup_key` (universal recurring-condition id, always set); + permanent until un-ignored; stays muted on severity escalation. Gate placed at the + two creation choke points so it is universal across every alert type. +- MSP360 console items stay manual: the Provider API token is monitoring-tier only; + no REST path for plan delete / run / storage config at our access level. + +## Problems Encountered + +- CRITICAL restart-guard defect (my spec error): `last_seen < started_at` silently + disabled offline detection in steady state. Caught in Code Review; fixed to + warm-up-only and re-reviewed. The bug-enshrining unit test was rewritten. +- Split-brain classifier: the new `agent_role` diverged from the existing + `agent_is_server`. Unified by delegating to the canonical fn (threaded + `os_version`). +- `GET /api/agents/:id` omitted `role_override` (returns `AgentResponse`, not + `AgentWithDetails`), so the override card rendered blind. Fixed by adding the field + to the base `Agent`/`AgentResponse` DTOs (all `Agent` SELECTs use `SELECT *`). +- Orphaned offline alerts + per-agent policy N+1 in the sweep: replaced scattered + resolves with one authoritative resolve-except pass; read per-agent grace only for + servers; mass membership uses global grace/window. +- MSP360 white-label: no `cbb.exe` on SERVER, so the agent-CLI automation route for + the backup console tasks was not viable without per-box fingerprinting. +- Submodule gitlink: detached to the pinned commit before `/save` so the session-log + commit does not fold a gitlink bump. + +## Configuration Changes + +GuruRMM submodule (`azcomputerguru/gururmm`), all merged to `main`: +- `server/migrations/054_agent_role_override.sql` (new) -- `role_override` column. +- `server/migrations/055_alert_mutes.sql` (new) -- `alert_mutes` table + partial + unique index `ON dedup_key WHERE active`. +- `server/src/alerts/offline.rs` (new) -- offline sweep + mass detection. +- `server/src/db/alerts.rs` -- `AlertStatus::Muted`; `is_dedup_muted`/ + `mute_condition`/`unmute_condition`; muted gate in `create_or_update_alert`; + `create_or_update_alert` title/message/severity refresh (from morning). +- `server/src/alerts/check_alerts.rs` -- mute gate + notify suppression. +- `server/src/db/agents.rs` -- `role_override` on `Agent`/`AgentResponse`/ + `AgentWithDetails`; `agent_role` + `set_agent_role_override`. +- `server/src/db/policies.rs`, `server/src/policy/{effective,merge}.rs` -- + `OfflineAlertingConfig` + merge. +- `server/src/main.rs` -- offline-sweep spawn (60s). +- `server/src/api/{alerts,agents,mod}.rs` -- mute/unmute + role-override endpoints. +- `dashboard/src/lib/agentRole.ts` (new), `dashboard/src/components/ExceptionStream.tsx`, + `dashboard/src/pages/AgentDetail.tsx`, `dashboard/src/api/client.ts` -- triage + + override control. +- specs/`role-aware-offline-alerting/` and specs/`alert-mute/` (new spec folders). + +## Credentials & Secrets + +No new secrets. MSP360 Provider API: vault `msp-tools/msp360-api.sops.yaml` +(`credentials.login` / `credentials.password`); base `https://api.mspbackups.com`, +`POST /api/Provider/Login` -> `access_token`; monitoring-only tier. GuruRMM API: +vault `infrastructure/gururmm-server.sops.yaml`. Jupiter: vault +`infrastructure/jupiter-unraid-primary.sops.yaml` (root@172.16.3.20:22, key auth +works from GURU-5070). + +## Infrastructure & Servers + +- GuruRMM API/server: 172.16.3.30:3001 (Linux VM on Jupiter). Dashboards + rmm-beta / rmm.azcomputerguru.com (shared API serves both). +- Jupiter Unraid: 172.16.3.20, root:22, Unraid 6.12.85. VMs via virsh; bridge `br0` + (uplink eth2), `172.16.3.20/22`. VM "Windows Server 2016" = guest `ACG-DWP-X-BB`, + vnet8/br0/e1000, MAC 52:54:00:d4:8e:59, APIPA 169.254.157.152 (no DHCP lease). +- Office LAN 172.16.0.0/22; pfSense 172.16.0.1 (router + DNS + DHCP). + +## Commands & Outputs + +- Offline alerting commits: `f1cdf5d`, `30e4f23` (fix), `21d63bd` (DTO), `3eedf91` + (dashboard). Alert-mute: `29c405e` (spec), `a120e71` (server). +- Live mute verify: empty-reason 400; mute -> status=muted; after a 60s sweep the + muted alert did NOT re-fire active (count=0); unmute -> active. +- Jupiter diag: `virsh domiflist 9` (vnet8/br0/e1000); `virsh domifaddr 9 --source + agent` -> 169.254.157.152; guest-exec `ipconfig /all` -> DHCP Enabled: Yes, APIPA, + no gateway. + +## Pending / Incomplete Tasks + +- **alert-mute dashboard (Task 5)** -- NOT started. Ignore button + required-reason + prompt + Muted filter + Un-ignore, on the alerts page AND the agent Alerts tab; + muted rows show Un-ignore instead of Ack/Resolve. (Mike: "then continue" -> do this + next.) Plus alert-mute Task 6 (roadmap) + Task 7 (final doc). +- **offline alerting** -- Task 8 (roadmap entry) outstanding; classification gap: + tag SIF-SERVER / Server2013 (and WIN-TG2STMODJG8 if it should be a workstation) via + role_override -- held for Mike. +- **Jupiter VM** -- decide DHCP vs static for ACG-DWP-X-BB; if DHCP, run `ipconfig + /renew` via guest agent; if static, set the intended IP. Paused for Mike. +- **MSP360 console (Mike-side todos):** delete SERVER's Nov-2024 plan; AD1 full + backup for retention; LAB-Becky storage-or-delete. + +## Reference Information + +- Commits (main): `f1cdf5d`,`30e4f23`,`21d63bd`,`3eedf91`,`29c405e`,`a120e71`. +- Specs: `projects/msp-tools/guru-rmm/specs/role-aware-offline-alerting/`, + `.../specs/alert-mute/`. +- MSP360 PlanType map: 3=Files,7=SQL,8=Restore,11=Image,13=Consistency,16=HyperV. +- New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`; new + status `muted`; new tables `alert_mutes`; new columns `agents.role_override`. +- Test box: WIN-TG2STMODJG8 agent id b6c715df-09fe-4e97-b09a-82a1b535f041 (offline + eval server, used for live verification). diff --git a/wiki/index.md b/wiki/index.md index e0a33ce..040e604 100644 --- a/wiki/index.md +++ b/wiki/index.md @@ -55,7 +55,7 @@ Run `/wiki-lint` to check for stale entries and broken backlinks. | Article | Summary | Last Compiled | |---|---|---| -| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.45; 55 enrolled agents; backup-alert quality pass shipped + credential inheritance deployed + offboarding wizard spec complete; clickable alert badges with client filtering; tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-07 | +| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.45; 55 enrolled agents; backup-alert quality pass shipped (false backup_failed 15->2; backup_storage_low removed); credential inheritance deployed (hierarchical Global->Client->Site with is_inheritable + de-dup, /effective endpoints) + clickable alert badges with client filtering; SPEC-028 offboarding wizard spec complete (835 lines); role-aware offline alerting + correlated mass-offline detection shipped (agent_offline/mass_offline_site/mass_offline_fleet; migration 054; dashboard triage); alert ignore/mute (perma-silence, migration 055, muted status) shipped server-only (dashboard pending); tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-07 | | [Dataforth DOS — Test Datasheet Pipeline](projects/dataforth-dos.md) | DOS update system + TestDataDB pipeline (Node.js, PostgreSQL, Hoffman API); 469K records, 458.5K live on website; 2025 crypto attack recovery; security incident 2026-03-27; SCMVAS/SCMHVAS extension; email notifications via Graph API | 2026-05-24 | | [ClaudeTools Discord Bot](projects/discord-bot.md) | Claude Agent SDK bot in Discord; one persistent session per thread; Phase 1.5 complete (native tools, no hand-written tools); Phases 2-4 (API integration, remediation, UX) pending; runs as NSSM service on BEAST | 2026-05-24 | | [The Computer Guru Show](projects/radio-show.md) | Radio show archive processing pipeline (Whisper + pyannote + SQLite FTS5) + post-show content workflow; 572 episodes indexed; FastAPI UI redesigned; Jupiter audio-file gap open | 2026-05-24 | diff --git a/wiki/projects/gururmm.md b/wiki/projects/gururmm.md index de0550d..33cc73d 100644 --- a/wiki/projects/gururmm.md +++ b/wiki/projects/gururmm.md @@ -9,7 +9,7 @@ aliases: sources: - "gururmm@main: server/src/api/*.rs (REST API surface, ~30 route modules)" - "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs)" - - "gururmm@main: server/migrations/*.sql (48 migrations — feature checkpoints, incl. 048_bsod_events)" + - "gururmm@main: server/migrations/*.sql (55+ migrations — feature checkpoints, incl. 048_bsod_events, 054_agent_role_override, 055_alert_mutes)" - "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/" - "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)" - "gururmm@main: server/migrations/048_bsod_events.sql" @@ -52,6 +52,7 @@ sources: - session-logs/2026-06-07-mike-gururmm-offboarding-spec.md - "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)" - session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md + - session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md backlinks: - clients/cascades-tucson - systems/gururmm-build @@ -69,6 +70,8 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G **Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010` on main). `backup_storage_low` alert type removed entirely — the `DataCopied/TotalData` ratio measures backup-dataset completeness, not destination capacity, and produced 5 fleet-wide false alerts. See Backup Integration section for full detail. +**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07 (second session):** Scheduled offline-sweep evaluator (60s tokio interval, server-only). `agent_offline` alerts for servers only; classifier: `os_product_type` 2/3, else `os_name`/`os_version` ~/server/i, else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to a representative offline agent. Warm-up restart guard. Dashboard: servers elevated in triage individually; workstations collapsed. Alert mute (perma-silence, migration 055 `alert_mutes` + `muted` status): keyed on `dedup_key`, permanent until un-ignored, reason required; gates both `create_or_update_alert` and `create_check_alert` bypass. Dashboard Ignore/Muted/Un-ignore UI NOT yet built. Commits f1cdf5d/30e4f23/21d63bd/3eedf91 (offline), 29c405e/a120e71 (mute). Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier. + **See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen). **Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there. @@ -147,6 +150,8 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r ### Alerting & Watchdog - Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (`022`) with effective resolution; per-client email settings (`020`). Maintenance mode (`021`) to suppress alerting per scope. - Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve. +- **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Generates server-only `agent_offline` alerts based on `agent_role` classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054, `agents.role_override`). Site rule (>=50% of a site's servers offline AND >=3 absolute) -> `mass_offline_site` aggregate; fleet rule (>=10 servers offline) -> `mass_offline_fleet` aggregate; aggregates pinned to a representative offline agent with site/fleet `dedup_key` (avoids making `alerts.agent_id` nullable). Restart guard = warm-up window after boot only (NOT `last_seen < started_at`, which is permanently false in steady state -- code review caught and fixed this spec defect). Dashboard: offline servers individual + elevated in triage; offline workstations collapsed into a "N workstations offline" roll-up; role-override control on agent detail page. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier; inventory-less offline servers (e.g., SIF-SERVER, Server2013) auto-classify as workstations until manually overridden. +- **Alert ignore/mute -- perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Mute keyed on `dedup_key` (universal recurring-condition id, always set). Permanent until un-ignored; reason required (400 if missing). Gate inserted at the top of `create_or_update_alert` AND the `create_check_alert` bypass so muted conditions write `status='muted'` and never email -- the active alert path is byte-for-byte unchanged. Transactional `mute_condition`/`unmute_condition`. `POST /api/alerts/:id/mute` + `/unmute`. Distinct from ack/resolve (which quiet only the current cycle). Dashboard Ignore button, Muted filter, and Un-ignore NOT yet built. New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`. New status: `muted`. Key functions: `is_dedup_muted`, `mute_condition`, `unmute_condition` (db/alerts.rs); `offline_sweep`, `agent_role` (alerts/offline.rs). ### Credentials Management - Encrypted credentials vault (`016`): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate `/reveal` decrypt endpoint (known HIGH item: `/reveal` ownership-scope check — [verify current state]). @@ -236,10 +241,12 @@ gururmm/ │ └── main.rs systemd unit template generation ├── server/ Rust/Axum API server │ └── src/ -│ ├── api/ REST handlers -│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs +│ ├── api/ REST handlers (alerts.rs: mute/unmute; agents.rs: role-override) +│ ├── alerts/ Alerting modules (offline.rs: offline_sweep, mass_offline detection, agent_role classifier, warm-up restart guard) +│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs, db/alerts.rs (is_dedup_muted, mute_condition, unmute_condition, AlertStatus::Muted) │ ├── ws/ WebSocket handler (BsodEvent dispatch) │ └── mspbackups/ MSP360 backup integration (sync.rs: derive_backup_status, summarize_backup_error, resolve_all_backup_storage_alerts; client.rs: BackupPlan.plan_type) +├── dashboard/ React/TypeScript UI (lib/agentRole.ts: server classifier; components/ExceptionStream.tsx: offline triage with workstation roll-up; pages/AgentDetail.tsx: role-override control) ├── tray/ System tray binary ├── installer/ WiX v4 MSI (gururmm-agent.wxs) ├── deploy/ @@ -254,23 +261,20 @@ gururmm/ ### Current Focus -<<<<<<< HEAD -As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.37+): - -- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.] -======= As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45): - **Credential inheritance (deployed 2026-06-07):** Production server running v0.3.45 with full credential inheritance and de-duplication. `/effective` endpoints validated. Dashboard clickable alert badges and client-scoped filtering implemented. - **SPEC-028 offboarding wizard (specification complete):** 835-line spec created for site and client offboarding workflows. Includes data export, dependency analysis, typed confirmation, and audit logging. Roadmap updated with "Client & Site Lifecycle Management" section. Implementation pending. -- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main → beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event → 3s wait → `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.] ->>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13) +- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.] - **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab + BSOD in Alerts stream (issue #10, dashboard bullets unchecked); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map). - **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall. - **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19). - **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24. - **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit). - **MSP360 backup integration** — Alert quality pass shipped 2026-06-07 (commits `779f7f6` + `b82c010`): false `backup_failed` alerts 15 -> 2; `backup_storage_low` removed (structurally false signal; 5 -> 0 false alerts). Dashboard UI shipped 2026-05-31. Phase 2 (management) not started; genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). +- **MSP360 API scope (confirmed 2026-06-07):** Provider API token (vault `msp-tools/msp360-api.sops.yaml`) is monitoring-only at our tier: Companies/Users/Monitoring GET endpoints; every management path (plan delete, manual run, storage config) returns 404. White-labeled MSP360 agent has no `cbb.exe` at any standard path. Plan delete/run/storage remain MSP360-console tasks. +- **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt on alert row; Muted filter on alerts page + agent Alerts tab; Un-ignore control; muted rows show Un-ignore instead of Ack/Resolve. Server endpoints ready: `POST /api/alerts/:id/mute` + `/unmute`. +- **Offline alerting -- classification gap:** SIF-SERVER (SifOidak.local DC), Server2013 (Sombra), and other inventory-less offline servers currently auto-classify as workstations (no `os_product_type`; `os_name` doesn't match ~/server/i for those names). Needs manual `role_override` via `PUT /api/agents/:id/role-override` -- Mike held on tagging them. - **`/backup-status` endpoint shape gap:** Returns only one plan per agent (not all plans); makes agents with a dead old plan + healthy current plan look stale-but-green in the backup tab. Compliance domain evaluates all plans correctly. Not fixed this session — noted for future. - **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH). @@ -371,11 +375,7 @@ Gitea push to main ## Active State -<<<<<<< HEAD **Fleet (as of 2026-06-04, live Postgres verified; no enrollment changes in 2026-06-07 session):** -======= -**Fleet (as of 2026-06-07):** ->>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13) - 55 enrolled agents total - Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default). - Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline. @@ -406,11 +406,7 @@ Gitea push to main - Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted) **Dashboard — complete and working:** -<<<<<<< HEAD -Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts, Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI. -======= -Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent↔backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support. ->>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13) +Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support. **Dashboard — incomplete (see UI_GAPS.md):** - Enrollment management UI (revoke keys, audit log, duplicate hostname warnings) @@ -419,6 +415,7 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI - BSOD in Alerts stream (Phase 2 deferred) - Tunnel session management (interactive terminal — backend skeleton, not production-ready) - Offboarding wizard UI (SPEC-028 complete, implementation pending) +- Alert mute UI -- Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard Task 5 NOT started) **Open Gitea issues:** - #10 — BSOD detection Phase 2/3 (dashboard + fetch_bsod_dump + full bugcheck table) @@ -478,11 +475,9 @@ These decisions are locked. Do not reverse without explicit user approval. | 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. | | 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] | | 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. | -<<<<<<< HEAD | 2026-06-07 | Backup-alert quality pass shipped. FU1 (`summarize_backup_error` decodes MSP360 message JSON; `create_or_update_alert` now refreshes title/message/severity on re-trigger, also fixes latent severity-escalation freeze) + FU2 (exclude non-backup PlanTypes 8=Restore/13=Consistency-check from alerting/compliance): false `backup_failed` alerts 15 -> 2 fleet-wide (survivors AD1, LAB-Becky are genuine and self-describing), commit `779f7f6`. `backup_storage_low` alert type removed entirely (commit `b82c010`): `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity — produced 5 fleet-wide false alerts including DF-HYPERV-B "100% Full" on a 4 GB plan; `resolve_all_backup_storage_alerts` (type-scoped, idempotent, once-per-tick) clears stragglers; 5 -> 0 verified after 17:21:41 UTC restart. Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). `BACKUP_STALE` evaluator confirmed already correct — no new code. Both commits on main. Submodule pinned at `226ba9f` in parent. | -======= | 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical credential propagation (Global → Client → Site) with `is_inheritable` flag and de-duplication by (credential_type, label). `/effective` endpoints validated. Dashboard UI: clickable alert severity badges with client filtering, offline badge now scopes to client-specific agents. SPEC-028 offboarding wizard specification created (835 lines) covering site and client offboarding workflows with data export, dependency analysis, typed confirmation, and audit logging. FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section. | ->>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13) +| 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped (second session). Offline sweep evaluator (60s tokio interval, `server/src/alerts/offline.rs`): server-only `agent_offline` alerts (classifier: `os_product_type` 2/3, else `os_name` ~/server/i, else `role_override`); site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to representative agent with site/fleet `dedup_key`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect — permanently false in steady state). Migration 054 (`agents.role_override`). Dashboard triage: servers elevated individually; workstations collapsed into roll-up. `PUT /api/agents/:id/role-override`. Verified live vs WIN-TG2STMODJG8 ("Windows Server 2019 Standard Evaluation", auto-classified via `os_name`); zero false mass-offline against ~55 chronically-offline agents; restart guard held across deploys. Known gap: `os_product_type` on only 16/168 agents; `os_name` is workhorse. Commits f1cdf5d/30e4f23/21d63bd/3eedf91. Alert ignore/mute (perma-silence): migration 055 (`alert_mutes` table + `muted` status); `dedup_key`-keyed; reason required (400 if missing); gates `create_or_update_alert` + `create_check_alert` bypass; `POST /api/alerts/:id/mute` + `/unmute`; dashboard UI NOT started. Verified live (mute -> `muted`; 60s sweep did not re-fire; unmute -> `active`). Commits 29c405e/a120e71. MSP360 Provider API probed live: monitoring-only tier (Companies/Users/Monitoring GET; management paths 404); plan delete/run/storage remain console tasks; white-label agent has no `cbb.exe`. | --- @@ -495,11 +490,9 @@ These decisions are locked. Do not reverse without explicit user approval. - Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified] - **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 -> 48. - **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`. -<<<<<<< HEAD - **2026-06-07 recompile:** Folded in backup-alert quality pass (commits `779f7f6` + `b82c010`, both on main). Updated Backup Integration capability section: added FU1/FU2 alert quality pass detail (false backup_failed 15->2; summarize_backup_error; create_or_update_alert refresh); documented backup_storage_low removal (structurally false DataCopied/TotalData signal; 5->0 false alerts; resolve_all_backup_storage_alerts); confirmed BACKUP_STALE evaluator correct (no new code); added key functions list and MSP360 PlanType exclusion map. Updated Repo Structure to include db/mspbackups.rs and mspbackups/ key functions. Updated Current Focus MSP360 line and added /backup-status endpoint shape gap. Updated Summary date and added backup-alert quality pass note. Active State date note updated. Added 2026-06-07 History row. Patterns and History existing rows preserved verbatim. -======= - **2026-06-07 recompile:** Updated for credential inheritance production deployment (server v0.3.45), clickable alert badges with client filtering, and SPEC-028 offboarding wizard specification. Added Recent Work section documenting 2026-06-07 session accomplishments. Updated Current Focus to reflect credential inheritance as deployed and offboarding wizard as spec-complete/implementation-pending. Updated Dashboard status to include credentials management with inheritance. Updated version numbers throughout (server 0.3.37 → 0.3.45). Added session-logs/2026-06-07-mike-gururmm-offboarding-spec.md to sources. Updated History Highlights with 2026-06-07 entry. ->>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13) +- **2026-06-07 recompile (second session):** Folded in role-aware offline alerting (server/src/alerts/offline.rs: offline_sweep + agent_role classifier + mass_offline site/fleet detection; migration 054 agents.role_override; dashboard triage + role-override control on agent detail; warm-up restart guard; four commits f1cdf5d/30e4f23/21d63bd/3eedf91) and alert ignore/mute perma-silence (alert_mutes table, muted status, is_dedup_muted/mute_condition/unmute_condition, two-choke-point gate at create_or_update_alert + create_check_alert bypass, mute/unmute API; migration 055; dashboard pending; two commits 29c405e/a120e71). Added MSP360 API scope finding (monitoring-only tier confirmed by live probe; management paths 404). Updated Alerting & Watchdog section with offline alerting and mute detail including new alert types (agent_offline, mass_offline_site, mass_offline_fleet) and new status (muted). Updated Repo Structure (alerts/ directory; db/alerts.rs key functions; dashboard/ entry with agentRole.ts). Updated Development / Current Focus with alert mute dashboard (Task 5 not started), offline classification gap, and MSP360 API scope item. Added alert mute UI to Dashboard incomplete list. Added second 2026-06-07 History row. Updated migration count to 55+ (054/055 confirmed). Added session log source. Patterns section and all existing History rows preserved verbatim. ## Backlinks diff --git a/wiki/systems/jupiter.md b/wiki/systems/jupiter.md index 89218fe..670b38a 100644 --- a/wiki/systems/jupiter.md +++ b/wiki/systems/jupiter.md @@ -48,7 +48,7 @@ Not documented. iDRAC available at 172.16.1.73 (DHCP) for OOB management. | OwnCloud | 172.16.3.22 | running | OwnCloud file sync VM (cloud.acghosting.com) | | Unifi | (IP not documented) | running | UniFi Network controller | | Windows 7 | — | shut off | — | -| Windows Server 2016 | — | shut off | — | +| Windows Server 2016 | (none — APIPA) | running | Windows guest `ACG-DWP-X-BB`; e1000 NIC `vnet8` on br0, DHCP not leasing — see Known Issues | | Windows Server 2016_Template | — | shut off | — | ## Access @@ -89,6 +89,7 @@ Not documented. iDRAC available at 172.16.1.73 (DHCP) for OOB management. - **iDRAC IP is DHCP** (172.16.1.73) — may drift. Verify before relying on it for OOB access. - **guruRMM API proxy stale** — see NPM table above. Fix before it causes a routing incident. - **Post-power-failure recovery order matters** — see `.claude/POWER_FAILURE_RUNBOOK.md` for the full recovery sequence (Tailscale routes, libvirt/VMs, Seafile, NPM/DNS in order). +- **VM "Windows Server 2016" (`ACG-DWP-X-BB`) — no LAN (2026-06-07):** guest stuck on APIPA `169.254.157.152`, no DHCP lease. Host side is healthy (vnet8 bridged to br0, forwarding, receiving LAN broadcast); fault is guest-side — single e1000 NIC set to DHCP, pfSense (172.16.0.1) not leasing it. Diagnose via `virsh domifaddr 9 --source agent` and qemu guest-exec `ipconfig /all`. Fix path: `ipconfig /renew` in-guest (stuck-client case) or assign a static IP if that is the intended config. PAUSED pending Mike's DHCP-vs-static decision. ## Backlinks