sync: auto-sync from GURU-5070 at 2026-06-07 16:47:01
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-07 16:47:01
170
session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
Normal file
@@ -0,0 +1,170 @@
# GuruRMM — Role-Aware Offline Alerting + Alert Ignore/Mute (+ MSP360 API probe, Jupiter VM diag)

## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin

## Session Summary

Shipped two GuruRMM alerting features end-to-end and investigated an MSP360 API
capability question, then diagnosed an unrelated Unraid VM network fault. After the
morning backup-alert cleanup (separate log), Mike asked whether the MSP360 API could
delete plans / trigger backups / configure storage for the three console follow-ups.
Probed `api.mspbackups.com` with the vaulted Provider token: it is monitoring-only
(Companies/Users/Monitoring, GET; every management path 404s; OPTIONS confirms
read-only). The agent-CLI fallback also failed -- the white-labeled MSP360 agent on
SERVER has no `cbb.exe` at any standard path. Concluded the three items stay
MSP360-console tasks and filed them as coord todos for Mike.

Built **role-aware offline alerting + correlated mass-offline detection** (spec
`role-aware-offline-alerting`). Servers offline are incidents; workstations offline
are routine. Implemented server-side (Tasks 1-5): migration 054 `role_override`
column + a shared `agent_role` classifier delegating to the canonical
`agent_is_server`; `OfflineAlertingConfig` in the policy JSONB; a new scheduled
offline-sweep evaluator generating server-only `agent_offline` alerts plus site
(>=50% + >=3) and fleet (>=10) `mass_offline_*` aggregates; a warm-up restart guard;
and `PUT /api/agents/:id/role-override`. Code Review REJECTED the first cut for a
CRITICAL defect in the spec's own restart guard (`last_seen < started_at`
permanently disabled detection in steady state because `last_seen` advances on every
heartbeat); fixed by making the warm-up window the sole guard. Re-review APPROVED.
Then the dashboard unit (Tasks 6-7): role-aware triage (offline servers individual +
elevated; offline workstations collapsed into a quiet "N workstations offline"
roll-up) and the role-override control on the agent detail page; plus a small server
DTO fix so `GET /api/agents/:id` returns `role_override`. Verified live: a real
offline server (WIN-TG2STMODJG8, "Windows Server 2019 Standard Evaluation",
auto-classified via `os_name`) fired `agent_offline`; zero false mass-offline despite
~55 chronically-offline agents; restart guard held across deploys.
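
The site and fleet thresholds above can be sketched as follows. This is a minimal Python illustration, not the server's Rust evaluator: the function name, the input shapes, and the reading of "(>=50% + >=3)" as "at least half of the site AND at least three agents" are assumptions. It also assumes that only agents which dropped inside the detection window are passed in, which would explain why the ~55 chronically-offline agents do not trip the fleet aggregate.

```python
from collections import Counter

SITE_MIN_FRACTION = 0.5  # from the log: >=50% of a site's agents
SITE_MIN_OFFLINE = 3     # from the log: and at least 3 of them
FLEET_MIN_OFFLINE = 10   # from the log: >=10 across the fleet


def mass_offline_alerts(site_totals, newly_offline_sites):
    """Return the aggregate alerts that should fire (hypothetical helper).

    site_totals: {site_id: total agent count}.
    newly_offline_sites: one site_id per agent that went offline inside the
        detection window; chronically offline agents are assumed to have
        been filtered out before this point.
    """
    offline = Counter(newly_offline_sites)
    alerts = []
    for site, count in sorted(offline.items()):
        total = site_totals.get(site, 0)
        if total and count >= SITE_MIN_OFFLINE and count / total >= SITE_MIN_FRACTION:
            alerts.append(("mass_offline_site", site))
    # Assumption: the fleet aggregate is evaluated independently of site alerts.
    if sum(offline.values()) >= FLEET_MIN_OFFLINE:
        alerts.append(("mass_offline_fleet", None))
    return alerts
```

Requiring both a fraction and a floor keeps a two-agent site from raising a site outage when both machines happen to be off.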

Built **alert ignore/mute (perma-silence)** (spec `alert-mute`). Distinct from
ack/resolve, which only quiet the current cycle: a mute keyed on `dedup_key`
suppresses a recurring condition until un-ignored, with a required reason. Server
unit (Tasks 1-4): migration 055 `alert_mutes` table + `muted` status; an
`is_dedup_muted` gate inserted into the universal `create_or_update_alert` AND the
one `create_check_alert` bypass so muted conditions write `status='muted'` (never
active, never email); transactional `mute_condition`/`unmute_condition`;
`POST /api/alerts/:id/mute` (reason required -> 400) and `/unmute`. Code Review
APPROVED (the muted gate is a top-of-function early return, so the active path is
byte-for-byte unchanged for normal alerts). Verified live against WIN-TG2STMODJG8's
re-firing `agent_offline`: mute -> `muted`, survived a full offline sweep without
re-firing, unmute -> `active`. The dashboard unit (Task 5) is NOT yet built.
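
The early-return gate can be pictured with a small in-memory model. This is a hedged Python sketch, not the Rust in `server/src/db/alerts.rs`: the `AlertStore` class, its dictionaries, and the example key format are hypothetical, and only the function names and the `muted`/`active` statuses come from the log.

```python
class AlertStore:
    """In-memory stand-in for the alerts and alert_mutes tables (hypothetical)."""

    def __init__(self):
        self.alerts = {}  # dedup_key -> status
        self.mutes = {}   # dedup_key -> reason (active mutes only)
        self.emails = []  # dedup_keys that triggered a notification

    def create_or_update_alert(self, dedup_key):
        # Muted gate: return before the normal path runs, so a muted
        # condition is never written as active and never notifies.
        if dedup_key in self.mutes:
            self.alerts[dedup_key] = "muted"
            return "muted"
        is_new = self.alerts.get(dedup_key) != "active"
        self.alerts[dedup_key] = "active"
        if is_new:
            self.emails.append(dedup_key)
        return "active"

    def mute_condition(self, dedup_key, reason):
        if not reason or not reason.strip():
            # The HTTP layer is described as mapping this case to a 400.
            raise ValueError("reason is required")
        self.mutes[dedup_key] = reason.strip()
        if dedup_key in self.alerts:
            self.alerts[dedup_key] = "muted"

    def unmute_condition(self, dedup_key):
        self.mutes.pop(dedup_key, None)
        if self.alerts.get(dedup_key) == "muted":
            self.alerts[dedup_key] = "active"
```

Because the gate sits ahead of everything else in the function, the code below it is untouched for unmuted keys, which is the property the review called out.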

Finally, diagnosed an Unraid VM network fault on Jupiter (172.16.3.20): the libvirt
domain "Windows Server 2016" (guest hostname `ACG-DWP-X-BB`) had no LAN. Found the
host side healthy (vnet8 bridged to br0, forwarding, receiving LAN broadcast), and
the guest holding APIPA `169.254.157.152` with no gateway -- its single e1000 NIC is
DHCP-enabled but not getting a lease from pfSense (172.16.0.1). Paused for Mike's
call on DHCP-vs-static before changing the guest.

## Key Decisions

- Offline classification = auto (`os_product_type` 2/3, else `os_name`/`os_version`
  ~/server/i) + a manual `role_override` column. `os_product_type` is populated on
  only 16/168 agents, so `os_name` is the workhorse signal; inventory-less offline
  servers (SIF-SERVER, Server2013) auto-classify as workstations and need the manual
  override -- Mike held on tagging them.
- Mass/aggregate alerts pin to a representative offline agent with a site/fleet
  `dedup_key` rather than making `alerts.agent_id` nullable (avoids a rippling
  migration across every alert path).
- Restart guard = warm-up window only (max grace after boot). Dropped the broken
  `last_seen < started_at` clause. Individual server alerts fire independently of the
  site-outage alert (so a still-down server keeps paging after an outage clears).
- Alert mute keyed on `dedup_key` (universal recurring-condition id, always set);
  permanent until un-ignored; stays muted on severity escalation. Gate placed at the
  two creation choke points so it is universal across every alert type.
- MSP360 console items stay manual: the Provider API token is monitoring-tier only;
  no REST path for plan delete / run / storage config at our access level.
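
The classification order in the first decision (override, then `os_product_type`, then a case-insensitive "server" match) can be sketched as below. This is an illustrative Python version, not the Rust `agent_role`/`agent_is_server` pair; the keyword arguments and the `"server"`/`"workstation"` override values are assumptions.

```python
import re

# Windows ProductType values: 1 = workstation, 2 = domain controller, 3 = server.
SERVER_PRODUCT_TYPES = {2, 3}
SERVER_PATTERN = re.compile(r"server", re.IGNORECASE)


def agent_role(os_product_type=None, os_name=None, os_version=None, role_override=None):
    """Classify an agent as "server" or "workstation" (hypothetical signature)."""
    # 1. A manual override always wins.
    if role_override in ("server", "workstation"):
        return role_override
    # 2. The product type is authoritative when the inventory reported it.
    if os_product_type in SERVER_PRODUCT_TYPES:
        return "server"
    # 3. Otherwise fall back to matching /server/i on the OS strings.
    for text in (os_name, os_version):
        if text and SERVER_PATTERN.search(text):
            return "server"
    # 4. No inventory at all lands here, which is why such servers need the override.
    return "workstation"
```

Under these assumptions an agent with no inventory falls through to `"workstation"` whatever its hostname, matching the SIF-SERVER and Server2013 gap noted above.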

## Problems Encountered

- CRITICAL restart-guard defect (my spec error): `last_seen < started_at` silently
  disabled offline detection in steady state. Caught in Code Review; fixed to
  warm-up-only and re-reviewed. The bug-enshrining unit test was rewritten.
- Split-brain classifier: the new `agent_role` diverged from the existing
  `agent_is_server`. Unified by delegating to the canonical fn (threaded
  `os_version`).
- `GET /api/agents/:id` omitted `role_override` (returns `AgentResponse`, not
  `AgentWithDetails`), so the override card rendered blind. Fixed by adding the field
  to the base `Agent`/`AgentResponse` DTOs (all `Agent` SELECTs use `SELECT *`).
- Orphaned offline alerts + per-agent policy N+1 in the sweep: replaced scattered
  resolves with one authoritative resolve-except pass; read per-agent grace only for
  servers; mass membership uses global grace/window.
- MSP360 white-label: no `cbb.exe` on SERVER, so the agent-CLI automation route for
  the backup console tasks was not viable without per-box fingerprinting.
- Submodule gitlink: detached to the pinned commit before `/save` so the session-log
  commit does not fold a gitlink bump.
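
For the restart-guard fix, a minimal Python sketch of a warm-up-only guard follows. It shows only the replacement behaviour and does not reproduce how the dropped `last_seen < started_at` clause was wired. The function names and the separate per-agent grace check are assumptions, as is treating the warm-up length as a single duration.

```python
from datetime import datetime, timedelta


def sweep_should_evaluate(now, started_at, warmup):
    """Restart guard: skip every sweep until the warm-up window has passed."""
    return now - started_at >= warmup


def is_offline(now, last_seen, grace):
    """An agent counts as offline once its last heartbeat is older than grace."""
    return now - last_seen > grace


# Example: the server booted at noon with a 10-minute warm-up window.
boot = datetime(2026, 6, 7, 12, 0, 0)
warmup = timedelta(minutes=10)
steady_state = boot + timedelta(hours=6)
```

In steady state any agent that has heartbeated since boot has `last_seen >= started_at`, so the guard here depends only on the server's uptime and never on per-agent timestamps.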

## Configuration Changes

GuruRMM submodule (`azcomputerguru/gururmm`), all merged to `main`:
- `server/migrations/054_agent_role_override.sql` (new) -- `role_override` column.
- `server/migrations/055_alert_mutes.sql` (new) -- `alert_mutes` table + partial
  unique index `ON dedup_key WHERE active`.
- `server/src/alerts/offline.rs` (new) -- offline sweep + mass detection.
- `server/src/db/alerts.rs` -- `AlertStatus::Muted`; `is_dedup_muted`/
  `mute_condition`/`unmute_condition`; muted gate in `create_or_update_alert`;
  `create_or_update_alert` title/message/severity refresh (from morning).
- `server/src/alerts/check_alerts.rs` -- mute gate + notify suppression.
- `server/src/db/agents.rs` -- `role_override` on `Agent`/`AgentResponse`/
  `AgentWithDetails`; `agent_role` + `set_agent_role_override`.
- `server/src/db/policies.rs`, `server/src/policy/{effective,merge}.rs` --
  `OfflineAlertingConfig` + merge.
- `server/src/main.rs` -- offline-sweep spawn (60s).
- `server/src/api/{alerts,agents,mod}.rs` -- mute/unmute + role-override endpoints.
- `dashboard/src/lib/agentRole.ts` (new), `dashboard/src/components/ExceptionStream.tsx`,
  `dashboard/src/pages/AgentDetail.tsx`, `dashboard/src/api/client.ts` -- triage +
  override control.
- `specs/role-aware-offline-alerting/` and `specs/alert-mute/` (new spec folders).
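
The partial unique index in migration 055 allows at most one active mute per `dedup_key` while keeping earlier, inactive rows. The Python sketch below shows the idea with an in-memory SQLite database standing in for the real one; the column list, index name, and helper functions are hypothetical, and only the table name and the `ON dedup_key WHERE active` shape come from the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE alert_mutes (
        id        INTEGER PRIMARY KEY,
        dedup_key TEXT NOT NULL,
        reason    TEXT NOT NULL,
        active    INTEGER NOT NULL DEFAULT 1
    );
    -- Uniqueness is enforced only among rows where active = 1.
    CREATE UNIQUE INDEX alert_mutes_active_key
        ON alert_mutes (dedup_key) WHERE active = 1;
    """
)


def try_mute(key, reason):
    """Insert an active mute; return False if one already exists for the key."""
    try:
        conn.execute(
            "INSERT INTO alert_mutes (dedup_key, reason) VALUES (?, ?)",
            (key, reason),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def unmute(key):
    """Deactivate the current mute but keep the row as history."""
    conn.execute(
        "UPDATE alert_mutes SET active = 0 WHERE dedup_key = ? AND active = 1",
        (key,),
    )
```

A second `try_mute` on the same key fails while the first mute is active and succeeds again after `unmute`, leaving both rows in the table.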

## Credentials & Secrets

No new secrets. MSP360 Provider API: vault `msp-tools/msp360-api.sops.yaml`
(`credentials.login` / `credentials.password`); base `https://api.mspbackups.com`,
`POST /api/Provider/Login` -> `access_token`; monitoring-only tier. GuruRMM API:
vault `infrastructure/gururmm-server.sops.yaml`. Jupiter: vault
`infrastructure/jupiter-unraid-primary.sops.yaml` (root@172.16.3.20:22, key auth
works from GURU-5070).

## Infrastructure & Servers

- GuruRMM API/server: 172.16.3.30:3001 (Linux VM on Jupiter). Dashboards
  rmm-beta / rmm.azcomputerguru.com (shared API serves both).
- Jupiter Unraid: 172.16.3.20, root:22, Unraid 6.12.85. VMs via virsh; bridge `br0`
  (uplink eth2), `172.16.3.20/22`. VM "Windows Server 2016" = guest `ACG-DWP-X-BB`,
  vnet8/br0/e1000, MAC 52:54:00:d4:8e:59, APIPA 169.254.157.152 (no DHCP lease).
- Office LAN 172.16.0.0/22; pfSense 172.16.0.1 (router + DNS + DHCP).
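
The addressing above can be sanity-checked with Python's standard `ipaddress` module. The `diagnose` helper and its return labels are hypothetical; the addresses are the ones recorded in this section.

```python
import ipaddress

host_if = ipaddress.ip_interface("172.16.3.20/22")  # Jupiter br0
lan = host_if.network                                # the office LAN
gateway = ipaddress.ip_address("172.16.0.1")         # pfSense
guest = ipaddress.ip_address("169.254.157.152")      # ACG-DWP-X-BB


def diagnose(addr, network):
    """Label an address as APIPA, on the LAN, or outside it."""
    if addr.is_link_local:
        # 169.254.0.0/16: the guest self-assigned after DHCP failed.
        return "apipa"
    return "on-lan" if addr in network else "off-lan"
```

Here `lan` evaluates to `172.16.0.0/22`, which contains both the host and the pfSense gateway, while the guest address is link-local. That is consistent with a healthy bridge and a guest that never obtained a lease.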

## Commands & Outputs

- Offline alerting commits: `f1cdf5d`, `30e4f23` (fix), `21d63bd` (DTO), `3eedf91`
  (dashboard). Alert-mute: `29c405e` (spec), `a120e71` (server).
- Live mute verify: empty-reason 400; mute -> status=muted; after a 60s sweep the
  muted alert did NOT re-fire active (count=0); unmute -> active.
- Jupiter diag: `virsh domiflist 9` (vnet8/br0/e1000);
  `virsh domifaddr 9 --source agent` -> 169.254.157.152; guest-exec `ipconfig /all`
  -> DHCP Enabled: Yes, APIPA, no gateway.

## Pending / Incomplete Tasks

- **alert-mute dashboard (Task 5)** -- NOT started. Ignore button + required-reason
  prompt + Muted filter + Un-ignore, on the alerts page AND the agent Alerts tab;
  muted rows show Un-ignore instead of Ack/Resolve. (Mike: "then continue" -> do this
  next.) Plus alert-mute Task 6 (roadmap) + Task 7 (final doc).
- **offline alerting** -- Task 8 (roadmap entry) outstanding; classification gap:
  tag SIF-SERVER / Server2013 (and WIN-TG2STMODJG8 if it should be a workstation) via
  `role_override` -- held for Mike.
- **Jupiter VM** -- decide DHCP vs static for ACG-DWP-X-BB; if DHCP, run
  `ipconfig /renew` via guest agent; if static, set the intended IP. Paused for Mike.
- **MSP360 console (Mike-side todos):** delete SERVER's Nov-2024 plan; AD1 full
  backup for retention; LAB-Becky storage-or-delete.

## Reference Information

- Commits (main): `f1cdf5d`, `30e4f23`, `21d63bd`, `3eedf91`, `29c405e`, `a120e71`.
- Specs: `projects/msp-tools/guru-rmm/specs/role-aware-offline-alerting/`,
  `.../specs/alert-mute/`.
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency, 16=HyperV.
- New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`; new
  status `muted`; new table `alert_mutes`; new column `agents.role_override`.
- Test box: WIN-TG2STMODJG8 agent id b6c715df-09fe-4e97-b09a-82a1b535f041 (offline
  eval server, used for live verification).