sync: auto-sync from GURU-5070 at 2026-06-07 16:47:01
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-07 16:47:01
170
session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
Normal file
@@ -0,0 +1,170 @@
# GuruRMM — Role-Aware Offline Alerting + Alert Ignore/Mute (+ MSP360 API probe, Jupiter VM diag)

## User

- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin

## Session Summary

Shipped two GuruRMM alerting features end-to-end and investigated an MSP360 API
capability question, then diagnosed an unrelated Unraid VM network fault. After the
morning backup-alert cleanup (separate log), Mike asked whether the MSP360 API could
delete plans / trigger backups / configure storage for the three console follow-ups.
Probed `api.mspbackups.com` with the vaulted Provider token: it is monitoring-only
(Companies/Users/Monitoring, GET; every management path 404s; OPTIONS confirms
read-only). The agent-CLI fallback also failed -- the white-labeled MSP360 agent on
SERVER has no `cbb.exe` at any standard path. Concluded the three items stay
MSP360-console tasks and filed them as coord todos for Mike.

Built **role-aware offline alerting + correlated mass-offline detection** (spec
`role-aware-offline-alerting`). Servers offline are incidents; workstations offline
are routine. Implemented server-side (Tasks 1-5): migration 054 `role_override`
column + a shared `agent_role` classifier delegating to the canonical
`agent_is_server`; `OfflineAlertingConfig` in the policy JSONB; a new scheduled
offline-sweep evaluator generating server-only `agent_offline` alerts plus site
(>=50% + >=3) and fleet (>=10) `mass_offline_*` aggregates; a warm-up restart guard;
and `PUT /api/agents/:id/role-override`. Code Review REJECTED the first cut for a
CRITICAL defect in the spec's own restart guard (`last_seen < started_at`
permanently disabled detection in steady state because `last_seen` advances on every
heartbeat); fixed by making the warm-up window the sole guard. Re-review APPROVED.
Then the dashboard unit (Tasks 6-7): role-aware triage (offline servers individual +
elevated; offline workstations collapsed into a quiet "N workstations offline"
roll-up) and the role-override control on the agent detail page; plus a small server
DTO fix so `GET /api/agents/:id` returns `role_override`. Verified live: a real
offline server (WIN-TG2STMODJG8, "Windows Server 2019 Standard Evaluation",
auto-classified via `os_name`) fired `agent_offline`; zero false mass-offline despite
~55 chronically-offline agents; restart guard held across deploys.

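The site and fleet thresholds above can be sketched as two pure predicates. This is an illustration only, not the code in `server/src/alerts/offline.rs`: the function names are hypothetical, and which agents count as "offline" or go into the site total (for example, only agents that dropped within the sweep window) is an assumption left to the caller.

```rust
/// Site rule as described in the log: at least 50% of the site's counted
/// agents are offline AND at least 3 are offline in absolute terms.
/// `offline` and `total` are whatever population the sweep decides to count
/// (assumption: the caller has already applied grace/window filtering).
pub fn site_rule(offline: usize, total: usize) -> bool {
    // Compare offline * 2 against total to avoid floating-point percentages.
    total > 0 && offline >= 3 && offline * 2 >= total
}

/// Fleet rule as described in the log: at least 10 counted agents offline.
pub fn fleet_rule(offline: usize) -> bool {
    offline >= 10
}
```

The absolute floor of 3 is what keeps a two-agent site from raising a site outage when both machines are off; the percentage alone would trip at 2 of 2.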

Built **alert ignore/mute (perma-silence)** (spec `alert-mute`). Distinct from
ack/resolve, which only quiet the current cycle: a mute keyed on `dedup_key`
suppresses a recurring condition until un-ignored, with a required reason. Server
unit (Tasks 1-4): migration 055 `alert_mutes` table + `muted` status; an
`is_dedup_muted` gate inserted into the universal `create_or_update_alert` AND the
one `create_check_alert` bypass so muted conditions write `status='muted'` (never
active, never email); transactional `mute_condition`/`unmute_condition`;
`POST /api/alerts/:id/mute` (reason required -> 400) and `/unmute`. Code Review
APPROVED (the muted gate is a top-of-function early return, so the active path is
byte-for-byte unchanged for normal alerts). Verified live against WIN-TG2STMODJG8's
re-firing `agent_offline`: mute -> `muted`, survived a full offline sweep without
re-firing, unmute -> `active`. The dashboard unit (Task 5) is NOT yet built.

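The shape of the mute gate (a top-of-function early return keyed on `dedup_key`) can be sketched in memory as below. Everything here is a stand-in: the real gate queries the `alert_mutes` table through `is_dedup_muted`, while this sketch uses a `HashSet`, and the `Outcome` type, the function name, and the dedup key string format are hypothetical.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlertStatus {
    Active,
    Muted,
}

/// What the creation path decided: the status to write and whether to notify.
#[derive(Debug, PartialEq, Eq)]
pub struct Outcome {
    pub status: AlertStatus,
    pub notify: bool,
}

/// Hypothetical stand-in for the creation choke point.
/// `muted_keys` plays the role of the active rows in `alert_mutes`.
pub fn create_or_update_alert_sketch(muted_keys: &HashSet<String>, dedup_key: &str) -> Outcome {
    // Gate first: a muted condition is recorded as muted and never notifies.
    if muted_keys.contains(dedup_key) {
        return Outcome { status: AlertStatus::Muted, notify: false };
    }
    // Normal path, untouched by the gate.
    Outcome { status: AlertStatus::Active, notify: true }
}
```

Because the gate returns before any of the normal logic runs, unmuting (removing the key) restores the original behavior on the next sweep without further changes.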

Finally, diagnosed an Unraid VM network fault on Jupiter (172.16.3.20): the libvirt
domain "Windows Server 2016" (guest hostname `ACG-DWP-X-BB`) had no LAN. Found the
host side healthy (vnet8 bridged to br0, forwarding, receiving LAN broadcast), and
the guest holding APIPA `169.254.157.152` with no gateway -- its single e1000 NIC is
DHCP-enabled but not getting a lease from pfSense (172.16.0.1). Paused for Mike's
call on DHCP-vs-static before changing the guest.

## Key Decisions

- Offline classification = auto (`os_product_type` 2/3, else `os_name`/`os_version`
~/server/i) + a manual `role_override` column. `os_product_type` is populated on
only 16/168 agents, so `os_name` is the workhorse signal; inventory-less offline
servers (SIF-SERVER, Server2013) auto-classify as workstations and need the manual
override -- Mike held on tagging them.
- Mass/aggregate alerts pin to a representative offline agent with a site/fleet
`dedup_key` rather than making `alerts.agent_id` nullable (avoids a rippling
migration across every alert path).
- Restart guard = warm-up window only (max grace after boot). Dropped the broken
`last_seen < started_at` clause. Individual server alerts fire independently of the
site-outage alert (so a still-down server keeps paging after an outage clears).
- Alert mute keyed on `dedup_key` (universal recurring-condition id, always set);
permanent until un-ignored; stays muted on severity escalation. Gate placed at the
two creation choke points so it is universal across every alert type.
- MSP360 console items stay manual: the Provider API token is monitoring-tier only;
no REST path for plan delete / run / storage config at our access level.
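
The classification rule from the first decision above can be sketched as follows. This is a minimal illustration, not the canonical `agent_is_server`: the `Role` enum, `AgentInfo` struct, and `classify` function are hypothetical names, and two points are assumptions: that a set `role_override` wins over the automatic signals, and that an `os_product_type` outside 2/3 falls through to the name check rather than forcing a workstation result.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Server,
    Workstation,
}

/// Hypothetical view of the inventory fields the log mentions.
pub struct AgentInfo<'a> {
    pub os_product_type: Option<i32>,
    pub os_name: Option<&'a str>,
    pub os_version: Option<&'a str>,
    pub role_override: Option<Role>,
}

/// Case-insensitive substring match, standing in for the ~/server/i test.
fn mentions_server(field: Option<&str>) -> bool {
    field.map_or(false, |v| v.to_lowercase().contains("server"))
}

pub fn classify(agent: &AgentInfo) -> Role {
    // Assumption: a manual override takes precedence when present.
    if let Some(role) = agent.role_override {
        return role;
    }
    // Product type 2 or 3 is treated as a server, per the log.
    if matches!(agent.os_product_type, Some(2) | Some(3)) {
        return Role::Server;
    }
    // Otherwise fall back to the OS name/version strings.
    if mentions_server(agent.os_name) || mentions_server(agent.os_version) {
        Role::Server
    } else {
        Role::Workstation
    }
}
```

An agent with no inventory at all lands in the final `else`, which is why SIF-SERVER and Server2013 show up as workstations until someone sets the override.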

## Problems Encountered

- CRITICAL restart-guard defect (my spec error): `last_seen < started_at` silently
disabled offline detection in steady state. Caught in Code Review; fixed to
warm-up-only and re-reviewed. The bug-enshrining unit test was rewritten.
- Split-brain classifier: the new `agent_role` diverged from the existing
`agent_is_server`. Unified by delegating to the canonical fn (threaded
`os_version`).
- `GET /api/agents/:id` omitted `role_override` (returns `AgentResponse`, not
`AgentWithDetails`), so the override card rendered blind. Fixed by adding the field
to the base `Agent`/`AgentResponse` DTOs (all `Agent` SELECTs use `SELECT *`).
- Orphaned offline alerts + per-agent policy N+1 in the sweep: replaced scattered
resolves with one authoritative resolve-except pass; read per-agent grace only for
servers; mass membership uses global grace/window.
- MSP360 white-label: no `cbb.exe` on SERVER, so the agent-CLI automation route for
the backup console tasks was not viable without per-box fingerprinting.
- Submodule gitlink: detached to the pinned commit before `/save` so the session-log
commit does not fold a gitlink bump.
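
The restart-guard defect in the first item can be shown side by side with the fix. This is a reconstruction from the description in this log, not the reviewed code: the exact form of the rejected condition is inferred, timestamps are plain seconds for simplicity, and the function and parameter names are hypothetical.

```rust
/// Rejected shape (as inferred from the log): an agent only counts as offline
/// if it is past grace AND was last seen before the server started.
/// Once an agent has sent any heartbeat after boot, `last_seen >= started_at`,
/// so the second clause is false and the agent can never alert again.
pub fn offline_broken(now: u64, started_at: u64, last_seen: u64, grace_secs: u64) -> bool {
    let past_grace = now.saturating_sub(last_seen) > grace_secs;
    past_grace && last_seen < started_at
}

/// Fixed shape: the warm-up window after boot is the only restart guard.
/// After warm-up, any agent past its grace period counts as offline.
pub fn offline_fixed(
    now: u64,
    started_at: u64,
    last_seen: u64,
    grace_secs: u64,
    warmup_secs: u64,
) -> bool {
    let warmed_up = now.saturating_sub(started_at) >= warmup_secs;
    warmed_up && now.saturating_sub(last_seen) > grace_secs
}
```

With a server that has been up for hours and an agent that heartbeated after boot and then went silent, the broken form stays false forever while the fixed form fires once grace is exceeded; during the warm-up window right after boot, the fixed form stays quiet.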

## Configuration Changes

GuruRMM submodule (`azcomputerguru/gururmm`), all merged to `main`:
- `server/migrations/054_agent_role_override.sql` (new) -- `role_override` column.
- `server/migrations/055_alert_mutes.sql` (new) -- `alert_mutes` table + partial
unique index `ON dedup_key WHERE active`.
- `server/src/alerts/offline.rs` (new) -- offline sweep + mass detection.
- `server/src/db/alerts.rs` -- `AlertStatus::Muted`; `is_dedup_muted`/
`mute_condition`/`unmute_condition`; muted gate in `create_or_update_alert`;
`create_or_update_alert` title/message/severity refresh (from morning).
- `server/src/alerts/check_alerts.rs` -- mute gate + notify suppression.
- `server/src/db/agents.rs` -- `role_override` on `Agent`/`AgentResponse`/
`AgentWithDetails`; `agent_role` + `set_agent_role_override`.
- `server/src/db/policies.rs`, `server/src/policy/{effective,merge}.rs` --
`OfflineAlertingConfig` + merge.
- `server/src/main.rs` -- offline-sweep spawn (60s).
- `server/src/api/{alerts,agents,mod}.rs` -- mute/unmute + role-override endpoints.
- `dashboard/src/lib/agentRole.ts` (new), `dashboard/src/components/ExceptionStream.tsx`,
`dashboard/src/pages/AgentDetail.tsx`, `dashboard/src/api/client.ts` -- triage +
override control.
- specs/`role-aware-offline-alerting/` and specs/`alert-mute/` (new spec folders).

## Credentials & Secrets

No new secrets. MSP360 Provider API: vault `msp-tools/msp360-api.sops.yaml`
(`credentials.login` / `credentials.password`); base `https://api.mspbackups.com`,
`POST /api/Provider/Login` -> `access_token`; monitoring-only tier. GuruRMM API:
vault `infrastructure/gururmm-server.sops.yaml`. Jupiter: vault
`infrastructure/jupiter-unraid-primary.sops.yaml` (root@172.16.3.20:22, key auth
works from GURU-5070).

## Infrastructure & Servers

- GuruRMM API/server: 172.16.3.30:3001 (Linux VM on Jupiter). Dashboards
rmm-beta / rmm.azcomputerguru.com (shared API serves both).
- Jupiter Unraid: 172.16.3.20, root:22, Unraid 6.12.85. VMs via virsh; bridge `br0`
(uplink eth2), `172.16.3.20/22`. VM "Windows Server 2016" = guest `ACG-DWP-X-BB`,
vnet8/br0/e1000, MAC 52:54:00:d4:8e:59, APIPA 169.254.157.152 (no DHCP lease).
- Office LAN 172.16.0.0/22; pfSense 172.16.0.1 (router + DNS + DHCP).

## Commands & Outputs

- Offline alerting commits: `f1cdf5d`, `30e4f23` (fix), `21d63bd` (DTO), `3eedf91`
(dashboard). Alert-mute: `29c405e` (spec), `a120e71` (server).
- Live mute verify: empty-reason 400; mute -> status=muted; after a 60s sweep the
muted alert did NOT re-fire active (count=0); unmute -> active.
- Jupiter diag: `virsh domiflist 9` (vnet8/br0/e1000); `virsh domifaddr 9 --source
agent` -> 169.254.157.152; guest-exec `ipconfig /all` -> DHCP Enabled: Yes, APIPA,
no gateway.

## Pending / Incomplete Tasks

- **alert-mute dashboard (Task 5)** -- NOT started. Ignore button + required-reason
prompt + Muted filter + Un-ignore, on the alerts page AND the agent Alerts tab;
muted rows show Un-ignore instead of Ack/Resolve. (Mike: "then continue" -> do this
next.) Plus alert-mute Task 6 (roadmap) + Task 7 (final doc).
- **offline alerting** -- Task 8 (roadmap entry) outstanding; classification gap:
tag SIF-SERVER / Server2013 (and WIN-TG2STMODJG8 if it should be a workstation) via
`role_override` -- held for Mike.
- **Jupiter VM** -- decide DHCP vs static for ACG-DWP-X-BB; if DHCP, run `ipconfig
/renew` via guest agent; if static, set the intended IP. Paused for Mike.
- **MSP360 console (Mike-side todos):** delete SERVER's Nov-2024 plan; AD1 full
backup for retention; LAB-Becky storage-or-delete.

## Reference Information

- Commits (main): `f1cdf5d`, `30e4f23`, `21d63bd`, `3eedf91`, `29c405e`, `a120e71`.
- Specs: `projects/msp-tools/guru-rmm/specs/role-aware-offline-alerting/`,
`.../specs/alert-mute/`.
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency, 16=HyperV.
- New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`; new
status `muted`; new table `alert_mutes`; new column `agents.role_override`.
- Test box: WIN-TG2STMODJG8 agent id b6c715df-09fe-4e97-b09a-82a1b535f041 (offline
eval server, used for live verification).
@@ -55,7 +55,7 @@ Run `/wiki-lint` to check for stale entries and broken backlinks.
 
 | Article | Summary | Last Compiled |
 |---|---|---|
-| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.45; 55 enrolled agents; backup-alert quality pass shipped + credential inheritance deployed + offboarding wizard spec complete; clickable alert badges with client filtering; tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-07 |
+| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.45; 55 enrolled agents; backup-alert quality pass shipped (false backup_failed 15->2; backup_storage_low removed); credential inheritance deployed (hierarchical Global->Client->Site with is_inheritable + de-dup, /effective endpoints) + clickable alert badges with client filtering; SPEC-028 offboarding wizard spec complete (835 lines); role-aware offline alerting + correlated mass-offline detection shipped (agent_offline/mass_offline_site/mass_offline_fleet; migration 054; dashboard triage); alert ignore/mute (perma-silence, migration 055, muted status) shipped server-only (dashboard pending); tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-07 |
 | [Dataforth DOS — Test Datasheet Pipeline](projects/dataforth-dos.md) | DOS update system + TestDataDB pipeline (Node.js, PostgreSQL, Hoffman API); 469K records, 458.5K live on website; 2025 crypto attack recovery; security incident 2026-03-27; SCMVAS/SCMHVAS extension; email notifications via Graph API | 2026-05-24 |
 | [ClaudeTools Discord Bot](projects/discord-bot.md) | Claude Agent SDK bot in Discord; one persistent session per thread; Phase 1.5 complete (native tools, no hand-written tools); Phases 2-4 (API integration, remediation, UX) pending; runs as NSSM service on BEAST | 2026-05-24 |
 | [The Computer Guru Show](projects/radio-show.md) | Radio show archive processing pipeline (Whisper + pyannote + SQLite FTS5) + post-show content workflow; 572 episodes indexed; FastAPI UI redesigned; Jupiter audio-file gap open | 2026-05-24 |
@@ -9,7 +9,7 @@ aliases:
 sources:
 - "gururmm@main: server/src/api/*.rs (REST API surface, ~30 route modules)"
 - "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs)"
-- "gururmm@main: server/migrations/*.sql (48 migrations — feature checkpoints, incl. 048_bsod_events)"
+- "gururmm@main: server/migrations/*.sql (55+ migrations — feature checkpoints, incl. 048_bsod_events, 054_agent_role_override, 055_alert_mutes)"
 - "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/"
 - "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)"
 - "gururmm@main: server/migrations/048_bsod_events.sql"
@@ -52,6 +52,7 @@ sources:
 - session-logs/2026-06-07-mike-gururmm-offboarding-spec.md
 - "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)"
 - session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
+- session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
 backlinks:
 - clients/cascades-tucson
 - systems/gururmm-build
@@ -69,6 +70,8 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G
 
 **Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010` on main). `backup_storage_low` alert type removed entirely — the `DataCopied/TotalData` ratio measures backup-dataset completeness, not destination capacity, and produced 5 fleet-wide false alerts. See Backup Integration section for full detail.
 
+**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07 (second session):** Scheduled offline-sweep evaluator (60s tokio interval, server-only). `agent_offline` alerts for servers only; classifier: `os_product_type` 2/3, else `os_name`/`os_version` ~/server/i, else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to a representative offline agent. Warm-up restart guard. Dashboard: servers elevated in triage individually; workstations collapsed. Alert mute (perma-silence, migration 055 `alert_mutes` + `muted` status): keyed on `dedup_key`, permanent until un-ignored, reason required; gates both `create_or_update_alert` and `create_check_alert` bypass. Dashboard Ignore/Muted/Un-ignore UI NOT yet built. Commits f1cdf5d/30e4f23/21d63bd/3eedf91 (offline), 29c405e/a120e71 (mute). Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier.
+
 **See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen).
 
 **Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there.
@@ -147,6 +150,8 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r
 ### Alerting & Watchdog
 - Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (`022`) with effective resolution; per-client email settings (`020`). Maintenance mode (`021`) to suppress alerting per scope.
 - Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve.
+- **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Generates server-only `agent_offline` alerts based on `agent_role` classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054, `agents.role_override`). Site rule (>=50% of a site's servers offline AND >=3 absolute) -> `mass_offline_site` aggregate; fleet rule (>=10 servers offline) -> `mass_offline_fleet` aggregate; aggregates pinned to a representative offline agent with site/fleet `dedup_key` (avoids making `alerts.agent_id` nullable). Restart guard = warm-up window after boot only (NOT `last_seen < started_at`, which is permanently false in steady state -- code review caught and fixed this spec defect). Dashboard: offline servers individual + elevated in triage; offline workstations collapsed into a "N workstations offline" roll-up; role-override control on agent detail page. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier; inventory-less offline servers (e.g., SIF-SERVER, Server2013) auto-classify as workstations until manually overridden.
+- **Alert ignore/mute -- perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Mute keyed on `dedup_key` (universal recurring-condition id, always set). Permanent until un-ignored; reason required (400 if missing). Gate inserted at the top of `create_or_update_alert` AND the `create_check_alert` bypass so muted conditions write `status='muted'` and never email -- the active alert path is byte-for-byte unchanged. Transactional `mute_condition`/`unmute_condition`. `POST /api/alerts/:id/mute` + `/unmute`. Distinct from ack/resolve (which quiet only the current cycle). Dashboard Ignore button, Muted filter, and Un-ignore NOT yet built. New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`. New status: `muted`. Key functions: `is_dedup_muted`, `mute_condition`, `unmute_condition` (db/alerts.rs); `offline_sweep`, `agent_role` (alerts/offline.rs).
 
 ### Credentials Management
 - Encrypted credentials vault (`016`): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate `/reveal` decrypt endpoint (known HIGH item: `/reveal` ownership-scope check — [verify current state]).
@@ -236,10 +241,12 @@ gururmm/
 │ └── main.rs systemd unit template generation
 ├── server/ Rust/Axum API server
 │ └── src/
-│ ├── api/ REST handlers
-│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs
+│ ├── api/ REST handlers (alerts.rs: mute/unmute; agents.rs: role-override)
+│ ├── alerts/ Alerting modules (offline.rs: offline_sweep, mass_offline detection, agent_role classifier, warm-up restart guard)
+│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs, db/alerts.rs (is_dedup_muted, mute_condition, unmute_condition, AlertStatus::Muted)
 │ ├── ws/ WebSocket handler (BsodEvent dispatch)
 │ └── mspbackups/ MSP360 backup integration (sync.rs: derive_backup_status, summarize_backup_error, resolve_all_backup_storage_alerts; client.rs: BackupPlan.plan_type)
+├── dashboard/ React/TypeScript UI (lib/agentRole.ts: server classifier; components/ExceptionStream.tsx: offline triage with workstation roll-up; pages/AgentDetail.tsx: role-override control)
 ├── tray/ System tray binary
 ├── installer/ WiX v4 MSI (gururmm-agent.wxs)
 ├── deploy/
@@ -254,23 +261,20 @@ gururmm/
|
|||||||
|
|
||||||
### Current Focus
|
### Current Focus
|
||||||
|
|
||||||
<<<<<<< HEAD
|
|
||||||
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.37+):
|
|
||||||
|
|
||||||
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
|
||||||
=======
|
|
||||||
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||||
|
|
||||||
- **Credential inheritance (deployed 2026-06-07):** Production server running v0.3.45 with full credential inheritance and de-duplication. `/effective` endpoints validated. Dashboard clickable alert badges and client-scoped filtering implemented.
|
- **Credential inheritance (deployed 2026-06-07):** Production server running v0.3.45 with full credential inheritance and de-duplication. `/effective` endpoints validated. Dashboard clickable alert badges and client-scoped filtering implemented.
|
||||||
- **SPEC-028 offboarding wizard (specification complete):** 835-line spec created for site and client offboarding workflows. Includes data export, dependency analysis, typed confirmation, and audit logging. Roadmap updated with "Client & Site Lifecycle Management" section. Implementation pending.
|
- **SPEC-028 offboarding wizard (specification complete):** 835-line spec created for site and client offboarding workflows. Includes data export, dependency analysis, typed confirmation, and audit logging. Roadmap updated with "Client & Site Lifecycle Management" section. Implementation pending.
|
||||||
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main → beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event → 3s wait → `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
||||||
>>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13)
|
|
||||||
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab + BSOD in Alerts stream (issue #10, dashboard bullets unchecked); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab + BSOD in Alerts stream (issue #10, dashboard bullets unchecked); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
||||||
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
|
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
|
||||||
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
|
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
|
||||||
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
|
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
|
||||||
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit).
- **MSP360 backup integration** — Alert quality pass shipped 2026-06-07 (commits `779f7f6` + `b82c010`): false `backup_failed` alerts 15 -> 2; `backup_storage_low` removed (structurally false signal; 5 -> 0 false alerts). Dashboard UI shipped 2026-05-31. Phase 2 (management) not started; genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint).
|
- **MSP360 API scope (confirmed 2026-06-07):** Provider API token (vault `msp-tools/msp360-api.sops.yaml`) is monitoring-only at our tier: Companies/Users/Monitoring GET endpoints; every management path (plan delete, manual run, storage config) returns 404. White-labeled MSP360 agent has no `cbb.exe` at any standard path. Plan delete/run/storage remain MSP360-console tasks.
|
||||||
|
- **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt on alert row; Muted filter on alerts page + agent Alerts tab; Un-ignore control; muted rows show Un-ignore instead of Ack/Resolve. Server endpoints ready: `POST /api/alerts/:id/mute` + `/unmute`.
|
||||||
|
- **Offline alerting -- classification gap:** SIF-SERVER (SifOidak.local DC), Server2013 (Sombra), and other inventory-less offline servers currently auto-classify as workstations (no `os_product_type`; `os_name` doesn't match ~/server/i for those names). Needs manual `role_override` via `PUT /api/agents/:id/role-override` -- Mike held on tagging them.
|
||||||
- **`/backup-status` endpoint shape gap:** Returns only one plan per agent (not all plans); makes agents with a dead old plan + healthy current plan look stale-but-green in the backup tab. Compliance domain evaluates all plans correctly. Not fixed this session — noted for future.
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH).
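
The offline-alerting classification gap in the list above is easier to see with the classifier written out. The following is a minimal Python sketch, not the server's Rust code: it reads the documented order literally (`os_product_type` 2/3, else `os_name` matching `server` case-insensitively, else `role_override`). The function name, the string values `"server"` / `"workstation"`, and the exact precedence of `role_override` are assumptions for illustration.

```python
import re
from typing import Optional

# Case-insensitive match on the OS name, mirroring the ~/server/i rule.
SERVER_NAME = re.compile(r"server", re.IGNORECASE)


def classify_agent(
    os_product_type: Optional[int],
    os_name: Optional[str],
    role_override: Optional[str],
) -> str:
    """Return "server" or "workstation" for an agent (hypothetical helper)."""
    # Windows ProductType: 1 = workstation, 2 = domain controller, 3 = server.
    if os_product_type in (2, 3):
        return "server"
    # Fallback when inventory has no product type: look at the OS name.
    if os_name and SERVER_NAME.search(os_name):
        return "server"
    # Manual tag; the accepted values here are assumed, not confirmed.
    if role_override in ("server", "workstation"):
        return role_override
    # Nothing identifies the agent as a server, so it is treated as routine.
    return "workstation"
```

Under these assumptions an agent with no `os_product_type` and no usable `os_name` falls through to `workstation` unless `role_override` is set, which is the behaviour described for SIF-SERVER and Server2013.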
|
|
||||||
@@ -371,11 +375,7 @@ Gitea push to main
|
|||||||
|
|
||||||
## Active State
|
|
||||||
<<<<<<< HEAD
|
|
||||||
**Fleet (as of 2026-06-04, live Postgres verified; no enrollment changes in 2026-06-07 session):**
=======
|
|
||||||
**Fleet (as of 2026-06-07):**
|
|
||||||
>>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13)
|
|
||||||
- 55 enrolled agents total
- Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default).
- Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline.
@@ -406,11 +406,7 @@ Gitea push to main
|
|||||||
- Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted)
|
|
||||||
**Dashboard — complete and working:**
<<<<<<< HEAD
|
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support.
|
||||||
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts, Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI.
|
|
||||||
=======
|
|
||||||
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent↔backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support.
|
|
||||||
>>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13)
|
|
||||||
|
|
||||||
**Dashboard — incomplete (see UI_GAPS.md):**
- Enrollment management UI (revoke keys, audit log, duplicate hostname warnings)
@@ -419,6 +415,7 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI
|
|||||||
- BSOD in Alerts stream (Phase 2 deferred)
- Tunnel session management (interactive terminal — backend skeleton, not production-ready)
- Offboarding wizard UI (SPEC-028 complete, implementation pending)
|
- Alert mute UI -- Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard Task 5 NOT started)
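
The mute behaviour the pending UI will sit on top of can be sketched in a few lines. This is an illustrative Python model only, assuming an in-memory store: the class and exception names are hypothetical, and the real server persists mutes in the `alert_mutes` table and returns HTTP 400 for a missing reason. It shows the three documented properties: mutes are keyed by `dedup_key`, a reason is required, and a muted condition is not re-created by later evaluator passes.

```python
from typing import Dict, Optional


class MuteError(ValueError):
    """Raised when a mute is requested without a reason (stand-in for the 400)."""


class AlertStore:
    """Hypothetical in-memory model of the dedup_key-keyed mute gate."""

    def __init__(self) -> None:
        self.mutes: Dict[str, str] = {}    # dedup_key -> reason
        self.alerts: Dict[str, dict] = {}  # dedup_key -> alert record

    def mute(self, dedup_key: str, reason: str) -> None:
        # A blank or whitespace-only reason is rejected.
        if not reason or not reason.strip():
            raise MuteError("reason is required")
        self.mutes[dedup_key] = reason.strip()
        if dedup_key in self.alerts:
            self.alerts[dedup_key]["status"] = "muted"

    def unmute(self, dedup_key: str) -> None:
        self.mutes.pop(dedup_key, None)
        if dedup_key in self.alerts:
            self.alerts[dedup_key]["status"] = "active"

    def create_or_update_alert(self, dedup_key: str, title: str) -> Optional[dict]:
        # Gate: a muted condition is skipped, so a periodic sweep cannot re-fire it.
        if dedup_key in self.mutes:
            return None
        alert = self.alerts.setdefault(dedup_key, {"dedup_key": dedup_key})
        alert["title"] = title
        alert["status"] = "active"
        return alert
```

Keying the mute on `dedup_key` rather than on the alert row id is what lets the silence outlive any single alert instance.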
|
||||||
|
|
||||||
**Open Gitea issues:**
- #10 — BSOD detection Phase 2/3 (dashboard + fetch_bsod_dump + full bugcheck table)
@@ -478,11 +475,9 @@ These decisions are locked. Do not reverse without explicit user approval.
|
|||||||
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. |
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] |
| 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
<<<<<<< HEAD
|
|
||||||
| 2026-06-07 | Backup-alert quality pass shipped. FU1 (`summarize_backup_error` decodes MSP360 message JSON; `create_or_update_alert` now refreshes title/message/severity on re-trigger, also fixes latent severity-escalation freeze) + FU2 (exclude non-backup PlanTypes 8=Restore/13=Consistency-check from alerting/compliance): false `backup_failed` alerts 15 -> 2 fleet-wide (survivors AD1, LAB-Becky are genuine and self-describing), commit `779f7f6`. `backup_storage_low` alert type removed entirely (commit `b82c010`): `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity — produced 5 fleet-wide false alerts including DF-HYPERV-B "100% Full" on a 4 GB plan; `resolve_all_backup_storage_alerts` (type-scoped, idempotent, once-per-tick) clears stragglers; 5 -> 0 verified after 17:21:41 UTC restart. Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). `BACKUP_STALE` evaluator confirmed already correct — no new code. Both commits on main. Submodule pinned at `226ba9f` in parent. |
=======
|
|
||||||
| 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical credential propagation (Global → Client → Site) with `is_inheritable` flag and de-duplication by (credential_type, label). `/effective` endpoints validated. Dashboard UI: clickable alert severity badges with client filtering, offline badge now scopes to client-specific agents. SPEC-028 offboarding wizard specification created (835 lines) covering site and client offboarding workflows with data export, dependency analysis, typed confirmation, and audit logging. FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section. |
>>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13)
|
| 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped (second session). Offline sweep evaluator (60s tokio interval, `server/src/alerts/offline.rs`): server-only `agent_offline` alerts (classifier: `os_product_type` 2/3, else `os_name` ~/server/i, else `role_override`); site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to representative agent with site/fleet `dedup_key`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect — permanently false in steady state). Migration 054 (`agents.role_override`). Dashboard triage: servers elevated individually; workstations collapsed into roll-up. `PUT /api/agents/:id/role-override`. Verified live vs WIN-TG2STMODJG8 ("Windows Server 2019 Standard Evaluation", auto-classified via `os_name`); zero false mass-offline against ~55 chronically-offline agents; restart guard held across deploys. Known gap: `os_product_type` on only 16/168 agents; `os_name` is workhorse. Commits f1cdf5d/30e4f23/21d63bd/3eedf91. Alert ignore/mute (perma-silence): migration 055 (`alert_mutes` table + `muted` status); `dedup_key`-keyed; reason required (400 if missing); gates `create_or_update_alert` + `create_check_alert` bypass; `POST /api/alerts/:id/mute` + `/unmute`; dashboard UI NOT started. Verified live (mute -> `muted`; 60s sweep did not re-fire; unmute -> `active`). Commits 29c405e/a120e71. MSP360 Provider API probed live: monitoring-only tier (Companies/Users/Monitoring GET; management paths 404); plan delete/run/storage remain console tasks; white-label agent has no `cbb.exe`. |
|
||||||
|
|
||||||
---
|
|
||||||
@@ -495,11 +490,9 @@ These decisions are locked. Do not reverse without explicit user approval.
|
|||||||
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified]
- **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 -> 48.
- **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`.
<<<<<<< HEAD
|
|
||||||
- **2026-06-07 recompile:** Folded in backup-alert quality pass (commits `779f7f6` + `b82c010`, both on main). Updated Backup Integration capability section: added FU1/FU2 alert quality pass detail (false backup_failed 15->2; summarize_backup_error; create_or_update_alert refresh); documented backup_storage_low removal (structurally false DataCopied/TotalData signal; 5->0 false alerts; resolve_all_backup_storage_alerts); confirmed BACKUP_STALE evaluator correct (no new code); added key functions list and MSP360 PlanType exclusion map. Updated Repo Structure to include db/mspbackups.rs and mspbackups/ key functions. Updated Current Focus MSP360 line and added /backup-status endpoint shape gap. Updated Summary date and added backup-alert quality pass note. Active State date note updated. Added 2026-06-07 History row. Patterns and History existing rows preserved verbatim.
=======
|
|
||||||
- **2026-06-07 recompile:** Updated for credential inheritance production deployment (server v0.3.45), clickable alert badges with client filtering, and SPEC-028 offboarding wizard specification. Added Recent Work section documenting 2026-06-07 session accomplishments. Updated Current Focus to reflect credential inheritance as deployed and offboarding wizard as spec-complete/implementation-pending. Updated Dashboard status to include credentials management with inheritance. Updated version numbers throughout (server 0.3.37 → 0.3.45). Added session-logs/2026-06-07-mike-gururmm-offboarding-spec.md to sources. Updated History Highlights with 2026-06-07 entry.
>>>>>>> 5869da2 (sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-07 12:59:13)
|
- **2026-06-07 recompile (second session):** Folded in role-aware offline alerting (server/src/alerts/offline.rs: offline_sweep + agent_role classifier + mass_offline site/fleet detection; migration 054 agents.role_override; dashboard triage + role-override control on agent detail; warm-up restart guard; four commits f1cdf5d/30e4f23/21d63bd/3eedf91) and alert ignore/mute perma-silence (alert_mutes table, muted status, is_dedup_muted/mute_condition/unmute_condition, two-choke-point gate at create_or_update_alert + create_check_alert bypass, mute/unmute API; migration 055; dashboard pending; two commits 29c405e/a120e71). Added MSP360 API scope finding (monitoring-only tier confirmed by live probe; management paths 404). Updated Alerting & Watchdog section with offline alerting and mute detail including new alert types (agent_offline, mass_offline_site, mass_offline_fleet) and new status (muted). Updated Repo Structure (alerts/ directory; db/alerts.rs key functions; dashboard/ entry with agentRole.ts). Updated Development / Current Focus with alert mute dashboard (Task 5 not started), offline classification gap, and MSP360 API scope item. Added alert mute UI to Dashboard incomplete list. Added second 2026-06-07 History row. Updated migration count to 55+ (054/055 confirmed). Added session log source. Patterns section and all existing History rows preserved verbatim.
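
The mass-offline thresholds folded in above (site rule: at least 50% of the site and at least 3 agents; fleet rule: at least 10 agents) can be illustrated as follows. This is a hedged Python sketch, not the contents of `server/src/alerts/offline.rs`. It assumes the input already contains only agents that dropped offline in the current sweep window; how the real evaluator excludes chronically-offline agents is not described here. The function name and the `dedup_key` string formats are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

SITE_MIN_FRACTION = 0.5  # at least half of the site's agents
SITE_MIN_COUNT = 3       # and at least three agents
FLEET_MIN_COUNT = 10     # fleet-wide threshold


def evaluate_mass_offline(
    site_totals: Dict[str, int],
    offline_sites: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (alert_type, dedup_key) pairs for correlated outages.

    site_totals maps site id -> enrolled agent count.
    offline_sites has one site id per newly offline agent.
    """
    per_site = Counter(offline_sites)
    alerts: List[Tuple[str, str]] = []
    for site, offline in sorted(per_site.items()):
        total = site_totals.get(site, 0)
        # Both conditions must hold, so a 2-agent site never trips the rule.
        if total > 0 and offline >= SITE_MIN_COUNT and offline >= SITE_MIN_FRACTION * total:
            alerts.append(("mass_offline_site", f"site:{site}"))
    if sum(per_site.values()) >= FLEET_MIN_COUNT:
        alerts.append(("mass_offline_fleet", "fleet"))
    return alerts
```

The minimum-count guard is what keeps small sites from raising a site-level incident when one or two machines are switched off.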
|
||||||
|
|
||||||
## Backlinks
|
|
||||||
|
|||||||
@@ -48,7 +48,7 @@ Not documented. iDRAC available at 172.16.1.73 (DHCP) for OOB management.
|
|||||||
| OwnCloud | 172.16.3.22 | running | OwnCloud file sync VM (cloud.acghosting.com) |
| Unifi | (IP not documented) | running | UniFi Network controller |
| Windows 7 | — | shut off | — |
| Windows Server 2016 | — | shut off | — |
|
| Windows Server 2016 | (none — APIPA) | running | Windows guest `ACG-DWP-X-BB`; e1000 NIC `vnet8` on br0, DHCP not leasing — see Known Issues |
|
||||||
| Windows Server 2016_Template | — | shut off | — |
|
|
||||||
## Access
@@ -89,6 +89,7 @@ Not documented. iDRAC available at 172.16.1.73 (DHCP) for OOB management.
|
|||||||
- **iDRAC IP is DHCP** (172.16.1.73) — may drift. Verify before relying on it for OOB access.
- **guruRMM API proxy stale** — see NPM table above. Fix before it causes a routing incident.
- **Post-power-failure recovery order matters** — see `.claude/POWER_FAILURE_RUNBOOK.md` for the full recovery sequence (Tailscale routes, libvirt/VMs, Seafile, NPM/DNS in order).
|
- **VM "Windows Server 2016" (`ACG-DWP-X-BB`) — no LAN (2026-06-07):** guest stuck on APIPA `169.254.157.152`, no DHCP lease. Host side is healthy (vnet8 bridged to br0, forwarding, receiving LAN broadcast); fault is guest-side — single e1000 NIC set to DHCP, pfSense (172.16.0.1) not leasing it. Diagnose via `virsh domifaddr 9 --source agent` and qemu guest-exec `ipconfig /all`. Fix path: `ipconfig /renew` in-guest (stuck-client case) or assign a static IP if that is the intended config. PAUSED pending Mike's DHCP-vs-static decision.
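
For the guest-exec step in the VM issue above, the QEMU guest agent takes JSON commands, which on a libvirt host are typically passed through `virsh qemu-agent-command <domain> '<json>'`. The sketch below only builds and decodes those payloads; it does not call `virsh`. The helper names and the guest path `ipconfig.exe` are assumptions for illustration, and the exact commands used in the session are not recorded here.

```python
import base64
import json
from typing import Sequence


def guest_exec_request(path: str, args: Sequence[str]) -> str:
    """Build a guest-exec command that captures the program's output."""
    return json.dumps({
        "execute": "guest-exec",
        "arguments": {"path": path, "arg": list(args), "capture-output": True},
    })


def guest_exec_status_request(pid: int) -> str:
    """Build the follow-up command that polls the pid returned by guest-exec."""
    return json.dumps({"execute": "guest-exec-status", "arguments": {"pid": pid}})


def decode_stdout(status_reply: str) -> str:
    """Extract stdout from a guest-exec-status reply (out-data is base64)."""
    ret = json.loads(status_reply)["return"]
    return base64.b64decode(ret.get("out-data", "")).decode("utf-8", errors="replace")
```

A typical flow would send `guest_exec_request("ipconfig.exe", ["/all"])`, read the `pid` from the reply, then poll with `guest_exec_status_request(pid)` until `exited` is true and pass the reply to `decode_stdout`.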
|
||||||
|
|
||||||
## Backlinks
|
|
||||||
|
|||||||