sync: auto-sync from GURU-5070 at 2026-06-07 10:33:04
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-07 10:33:04
72
session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
Normal file
@@ -0,0 +1,72 @@
# GuruRMM Backup-Alert Cleanup — Review, Merge, Storage-Alert Removal

## User

- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin

## Session Summary

Continued the GuruRMM backup false-alert effort. Reviewed and merged the FU1+FU2 alert-quality change (commit `779f7f6`). A Code Review Agent confirmed the bind order, dedup, and `triggered_at`/`status`/`email_sent_at` safety of the generic `create_or_update_alert` refresh, confirmed the PlanType non-backup guard, and flagged that the change incidentally fixes a latent severity-escalation freeze. Merged `fe44dee..779f7f6` to `main` (shared API -> beta + prod). A verification watcher confirmed that the server rebuilt (restart 16:47:06 UTC) and synced, and that the active `backup_failed` count fell from 15 to 2. NEPTUNE (consistency-check, PlanType 13) and SAGE-SQL (restore, PlanType 8) were cleared by FU2's exclusion. AD1 now reads its real "Retention policy cannot be applied (Warning); Files and folders were skipped: 99 (Info)" and is correctly downgraded to Partial instead of showing "Unknown error" (FU1's decoder + refresh). LAB-Becky shows its true "Storage Account is not specified (Critical)". Both survivors are genuine and now self-describing.

Investigated agent SERVER (Gonzvar Tax Services, `9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5`) per Mike's backup-tab link and found a `backup_storage_low` "95% Full" alert. Root-caused it as a systematic false alert: `check_storage_threshold` in `mspbackups/sync.rs` computed `usage_percent = DataCopied / TotalData * 100`, but those MSP360 fields describe the backup dataset (how much of the source was uploaded), not the cloud destination's capacity. For image/full backups the ratio is naturally near 100%. Fleet-wide this produced 5 false alerts (3 Critical), the clearest being DF-HYPERV-B's "100% Full" on a 4 GB Hyper-V plan. MSP360 Managed Backup does not expose destination capacity in those fields, so the check can never be correct.

Shipped the removal (commit `b82c010`): deleted the `check_storage_threshold` call and function, and added `resolve_all_backup_storage_alerts` (type-scoped, idempotent, called once per sync tick after prune, mirroring the existing `resolve_orphaned_backup_alerts`) to clear the 5 stale alerts, since nothing regenerates them. Code Review Agent APPROVED (type-scoped UPDATE cannot touch other alert types; clean removal with zero new warnings). Merged `87db008..b82c010` to `main`, rebasing cleanly over the intervening CI version-bump from the prior deploy.

Retracted an earlier incorrect claim that SERVER's compliance showed a false green. Checking the live compliance domain showed it already reads `non_compliant / BACKUP_STALE`: the abandoned Nov-2024 image plan is correctly flagged stale (past the 7-day backstop), while SERVER's real current plan ("Backup Image-Based on 9/26/2025", ran 06-07 02:00, next 06-08) is `compliant`. The aggregate is `non_compliant` because of the dead plan. The `BACKUP_STALE` machinery (cadence-derived window + 7-day backstop, with unit tests) already exists and works, so item #2 required no code. The `/backup-status` endpoint returning only the stale plan is what made the agent look stale-but-green at first glance.
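
The compliance outcome described above can be sketched roughly as follows. This is an illustration only, not the server's `evaluate_plan`: the `Plan` struct, the function names, and the "twice the cadence, capped at the backstop" window rule are all assumptions. Only the 7-day backstop and the "one stale plan makes the aggregate non-compliant" behaviour come from this log.

```rust
/// Hypothetical compliance states; the names echo the strings seen in this log.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compliance {
    Compliant,
    BackupStale,
}

/// Hypothetical, simplified view of one MSP360 plan.
pub struct Plan {
    /// Hours since the plan last ran.
    pub hours_since_last_run: u64,
    /// Expected hours between runs, if a schedule is known.
    pub cadence_hours: Option<u64>,
}

/// The 7-day backstop mentioned in the log, expressed in hours.
pub const BACKSTOP_HOURS: u64 = 7 * 24;

/// ASSUMED rule: allow twice the cadence, never more than the backstop;
/// with no known cadence, fall back to the backstop alone.
pub fn evaluate_plan_sketch(plan: &Plan) -> Compliance {
    let window = plan
        .cadence_hours
        .map(|c| c.saturating_mul(2).min(BACKSTOP_HOURS))
        .unwrap_or(BACKSTOP_HOURS);
    if plan.hours_since_last_run > window {
        Compliance::BackupStale
    } else {
        Compliance::Compliant
    }
}

/// Aggregate over all plans: a single stale plan makes the agent non-compliant.
pub fn evaluate_agent_sketch(plans: &[Plan]) -> Compliance {
    if plans
        .iter()
        .any(|p| evaluate_plan_sketch(p) == Compliance::BackupStale)
    {
        Compliance::BackupStale
    } else {
        Compliance::Compliant
    }
}
```

Under these assumptions, an agent with one long-dead plan and one plan that ran a few hours ago evaluates as stale in aggregate, which matches what SERVER showed.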

## Key Decisions

- Removed the `backup_storage_low` alert type entirely rather than trying to fix the threshold math: `DataCopied/TotalData` is structurally the wrong signal, and MSP360 Managed Backup does not expose true destination capacity. Genuine "storage filling up" alerting would need MSP360's storage-accounts endpoint as a separate feature (deferred, not scoped).
- Cleared existing false alerts with a once-per-tick type-scoped resolver (`resolve_all_backup_storage_alerts`) instead of a one-shot manual SQL run, so the cleanup is idempotent and self-heals on every deploy without operator intervention.
- Did NOT build the proposed staleness feature (#2) after verifying the existing `BACKUP_STALE` evaluator already handles it correctly. Surfaced the real-world fix (delete the dead MSP360 plan) as a Mike-side item instead of adding redundant code.
- Detached the submodule gitlink back to the pinned commit (`226ba9f`) before `/save` so the session-log commit does not fold an incidental submodule pointer bump into the parent repo.
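
To make the first decision concrete, here is a small sketch of why the removed ratio cannot signal a full destination. The formula is the one quoted in this log; the function names, the zero-handling, and the 1 TiB destination in the usage note are illustrative assumptions, not the deleted code.

```rust
/// One gibibyte in bytes, used only for the worked example.
pub const GIB: u64 = 1024 * 1024 * 1024;

/// Reconstruction of the removed ratio (function name is hypothetical).
/// `data_copied` / `total_data` stand for MSP360's DataCopied / TotalData,
/// which describe the source dataset, not the destination.
pub fn dataset_upload_percent(data_copied: u64, total_data: u64) -> f64 {
    if total_data == 0 {
        return 0.0; // assumption: treat an empty dataset as 0%
    }
    data_copied as f64 / total_data as f64 * 100.0
}

/// What a genuine "storage low" check would need instead: destination usage
/// and capacity. Both inputs are hypothetical here, because per this log the
/// MSP360 fields above do not carry them.
pub fn destination_usage_percent(used_bytes: u64, capacity_bytes: u64) -> Option<f64> {
    if capacity_bytes == 0 {
        return None; // unknown capacity: no meaningful percentage
    }
    Some(used_bytes as f64 / capacity_bytes as f64 * 100.0)
}
```

A fully uploaded 4 GiB plan gives `dataset_upload_percent(4 * GIB, 4 * GIB) == 100.0`, while the same 4 GiB on a hypothetical 1 TiB destination would be under 1% by `destination_usage_percent`. The old check reported the former as "100% Full".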

## Problems Encountered

- Initial false-green assertion on SERVER: I claimed compliance would show the dead backup as healthy without checking the compliance domain. The live `/compliance` response showed `non_compliant / BACKUP_STALE` already correct. Corrected before writing any staleness code; #2 collapsed to a no-op.
- The `/api/agents/<id>/backup-status` endpoint returned only the single abandoned plan (not SERVER's healthy current plan), which made the agent look like a stale-but-green single-plan host. The compliance detail (which evaluates all plans) showed both. Noted as a minor UI/data-shape gap, not fixed this session.

## Configuration Changes

GuruRMM submodule (`azcomputerguru/gururmm`), all server-side Rust:

- `server/src/mspbackups/sync.rs`: (779f7f6) NON_BACKUP_PLAN_TYPES guard, `summarize_backup_error`, enriched `generate_backup_alert` wording; (b82c010) removed `check_storage_threshold` call + function, added once-per-tick `resolve_all_backup_storage_alerts` call after prune.
- `server/src/db/mspbackups.rs`: (b82c010) added `resolve_all_backup_storage_alerts(db) -> Result<u64, sqlx::Error>`.
- `server/src/db/alerts.rs`: (779f7f6) `create_or_update_alert` existing-active branch now refreshes `title`/`message`/`severity`.
- `server/src/mspbackups/client.rs`: (779f7f6) `BackupPlan.plan_type` (serde "PlanType", default).
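
The `create_or_update_alert` change is easiest to see on an in-memory model. The sketch below is not the sqlx code in `db/alerts.rs`: the `Alert` struct, the `dedup_key` field, the string-typed status and severity, and the function name are assumptions made for illustration. It shows only the behaviour recorded in this log: an existing active alert has its `title`/`message`/`severity` refreshed, while `triggered_at`, `status`, and `email_sent_at` are left alone.

```rust
/// Hypothetical in-memory stand-in for a row in the alerts table.
#[derive(Debug, Clone, PartialEq)]
pub struct Alert {
    pub dedup_key: String, // assumed: whatever identifies "the same alert"
    pub title: String,
    pub message: String,
    pub severity: String,
    pub status: String,
    pub triggered_at: u64,
    pub email_sent_at: Option<u64>,
}

/// Sketch of the create-or-refresh behaviour.
pub fn create_or_update_alert_sketch(
    store: &mut Vec<Alert>,
    key: &str,
    title: &str,
    message: &str,
    severity: &str,
    now: u64,
) {
    // Existing-active branch: refresh the descriptive fields only.
    if let Some(existing) = store
        .iter_mut()
        .find(|a| a.dedup_key == key && a.status == "active")
    {
        existing.title = title.to_string();
        existing.message = message.to_string();
        existing.severity = severity.to_string();
        // triggered_at, status and email_sent_at are deliberately untouched.
        return;
    }
    // No active alert for this key: create a fresh one.
    store.push(Alert {
        dedup_key: key.to_string(),
        title: title.to_string(),
        message: message.to_string(),
        severity: severity.to_string(),
        status: "active".to_string(),
        triggered_at: now,
        email_sent_at: None,
    });
}
```

In this model, re-triggering with a different severity updates the stored severity rather than keeping the first one, which is the escalation-freeze fix the review flagged.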

## Credentials & Secrets

No new credentials. GuruRMM API admin creds: vault `infrastructure/gururmm-server.sops.yaml` (`credentials.gururmm-api.admin-email` / `admin-password`). MSP360 provider creds unchanged.

## Infrastructure & Servers

- GuruRMM server: Rust/Axum @ 172.16.3.30:3001; dashboards rmm-beta.azcomputerguru.com / rmm.azcomputerguru.com (shared API for beta + prod).
- Internal Gitea: http://172.16.3.20:3000 (azcomputerguru/gururmm). Public mirror: git.azcomputerguru.com.
- Agent SERVER: `9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5`, Gonzvar Tax Services / Main, Windows, agent v0.6.57.

## Commands & Outputs

- Merge FU1/FU2: `git push origin fix/backup-alert-quality:main` -> `fe44dee..779f7f6`.
- Merge storage fix: `git push origin fix/remove-false-storage-alert:main` -> `87db008..b82c010` (rebased over CI bump 87db008).
- Verify FU1/FU2: active `backup_failed` 15 -> 2 after restart 16:47:06 UTC. Survivors: AD1 (Partial/Warning, retention) + LAB-Becky (Critical, no storage account).
- New resolver SQL: `UPDATE alerts SET status='resolved', resolved_at=NOW() WHERE alert_type='backup_storage_low' AND status IN ('active','acknowledged')`.
- `SQLX_OFFLINE=true cargo check` on both branches: exit 0, 87 pre-existing warnings, zero new.
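
The resolver SQL above is type-scoped and idempotent. A minimal in-memory model of the same semantics is sketched below. It is not the sqlx function in `db/mspbackups.rs`: `AlertRow`, the function name, and the integer timestamps are assumptions for illustration. The return value plays the role of the rows-affected count.

```rust
/// Hypothetical in-memory stand-in for a row in the alerts table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AlertRow {
    pub alert_type: String,
    pub status: String,
    pub resolved_at: Option<u64>,
}

/// Mirrors the UPDATE above: resolve every open alert of one type and
/// report how many rows changed. Rows of other types are never touched.
pub fn resolve_all_of_type(rows: &mut [AlertRow], alert_type: &str, now: u64) -> u64 {
    let mut affected = 0;
    for row in rows.iter_mut() {
        // Same filter as `status IN ('active','acknowledged')`.
        let open = row.status == "active" || row.status == "acknowledged";
        if row.alert_type == alert_type && open {
            row.status = "resolved".to_string();
            row.resolved_at = Some(now);
            affected += 1;
        }
    }
    affected
}
```

Calling it a second time changes nothing and returns 0, which is why running it on every sync tick is harmless once the stale alerts are gone.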

## Pending / Incomplete Tasks

- **Verified:** the 5 `backup_storage_low` alerts dropped 5 -> 0 after the `b82c010` build restarted (17:21:41 UTC): SERVER 95%, DF-HYPERV-B 100% & 92%, AD1 80%, IMC1 85% all cleared by the resolver.
- **Mike-side (MSP360 console, not code):**
  - SERVER (Gonzvar): delete the abandoned Nov-2024 image plan so its backup aggregate clears to compliant.
  - AD1: schedule a full backup so retention can apply (clears the retention Warning).
  - LAB-Becky: configure a storage account for the 2023 plan, or delete the abandoned plan.
- **Deferred features:** genuine destination-capacity alerting via MSP360 storage-accounts endpoint; `/backup-status` endpoint returning all plans (not just one) for the agent backup tab.
- **Unrelated, still open:** Robert Wolkin Tailscale enrollment (paused awaiting Mike); FunctionRail DRY refactor + aria-current; promote dashboard beta -> prod when ready.

## Reference Information

- Commits: `779f7f6` (FU1+FU2 backup alert quality), `b82c010` (remove false backup_storage_low alert). Submodule pinned at `226ba9f` in parent.
- Backup tab URL: https://rmm-beta.azcomputerguru.com/agents/9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5?tab=backup
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup excluded: 8, 13.
- Key functions: `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE), `create_or_update_alert`.
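
The PlanType map and the FU2 exclusion can be written down as a small lookup. The values come from the list above. The integer type, the function names, and the shape of the constant are assumptions, since the real `NON_BACKUP_PLAN_TYPES` definition in `sync.rs` is not reproduced in this log.

```rust
/// Plan types that are not backups (8=Restore, 13=Consistency-check).
/// Values are from this log; the array type is an assumption.
pub const NON_BACKUP_PLAN_TYPES: [i32; 2] = [8, 13];

/// Human-readable label for an MSP360 PlanType, per the map in this log.
pub fn plan_type_label(plan_type: i32) -> &'static str {
    match plan_type {
        3 => "Files",
        7 => "SQL",
        8 => "Restore",
        11 => "Image",
        13 => "Consistency-check",
        16 => "HyperV",
        _ => "Unknown", // anything not listed in the log
    }
}

/// Guard sketch: only real backup plans should feed backup alerting.
pub fn is_backup_plan(plan_type: i32) -> bool {
    !NON_BACKUP_PLAN_TYPES.contains(&plan_type)
}
```

With this guard, NEPTUNE's consistency-check plan (13) and SAGE-SQL's restore plan (8) are skipped, while Image (11) and HyperV (16) plans are still evaluated.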
@@ -55,7 +55,7 @@ Run `/wiki-lint` to check for stale entries and broken backlinks.
|
|||||||
|
|
||||||
| Article | Summary | Last Compiled |
|
| Article | Summary | Last Compiled |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.37; 55 enrolled agents; tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-04 |
|
| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; stable fleet pinned v0.6.47; lone beta agent GURU-5070 on v0.6.54 (per-agent channel override); server v0.3.37; 55 enrolled agents; backup-alert quality pass shipped 2026-06-07 (false backup_failed 15->2; backup_storage_low removed); tray BUG-020 (duplicate/ghost icons) fixed to beta (commit 137dd85); active development | 2026-06-07 |
|
||||||
| [Dataforth DOS — Test Datasheet Pipeline](projects/dataforth-dos.md) | DOS update system + TestDataDB pipeline (Node.js, PostgreSQL, Hoffman API); 469K records, 458.5K live on website; 2025 crypto attack recovery; security incident 2026-03-27; SCMVAS/SCMHVAS extension; email notifications via Graph API | 2026-05-24 |
|
| [Dataforth DOS — Test Datasheet Pipeline](projects/dataforth-dos.md) | DOS update system + TestDataDB pipeline (Node.js, PostgreSQL, Hoffman API); 469K records, 458.5K live on website; 2025 crypto attack recovery; security incident 2026-03-27; SCMVAS/SCMHVAS extension; email notifications via Graph API | 2026-05-24 |
|
||||||
| [ClaudeTools Discord Bot](projects/discord-bot.md) | Claude Agent SDK bot in Discord; one persistent session per thread; Phase 1.5 complete (native tools, no hand-written tools); Phases 2-4 (API integration, remediation, UX) pending; runs as NSSM service on BEAST | 2026-05-24 |
|
| [ClaudeTools Discord Bot](projects/discord-bot.md) | Claude Agent SDK bot in Discord; one persistent session per thread; Phase 1.5 complete (native tools, no hand-written tools); Phases 2-4 (API integration, remediation, UX) pending; runs as NSSM service on BEAST | 2026-05-24 |
|
||||||
| [The Computer Guru Show](projects/radio-show.md) | Radio show archive processing pipeline (Whisper + pyannote + SQLite FTS5) + post-show content workflow; 572 episodes indexed; FastAPI UI redesigned; Jupiter audio-file gap open | 2026-05-24 |
|
| [The Computer Guru Show](projects/radio-show.md) | Radio show archive processing pipeline (Whisper + pyannote + SQLite FTS5) + post-show content workflow; 572 episodes indexed; FastAPI UI redesigned; Jupiter audio-file gap open | 2026-05-24 |
|
||||||
|
|||||||
@@ -2,7 +2,7 @@
|
|||||||
type: project
|
type: project
|
||||||
name: gururmm
|
name: gururmm
|
||||||
display_name: GuruRMM
|
display_name: GuruRMM
|
||||||
last_compiled: 2026-06-04
|
last_compiled: 2026-06-07
|
||||||
compiled_by: GURU-5070/claude-main
|
compiled_by: GURU-5070/claude-main
|
||||||
aliases:
|
aliases:
|
||||||
- guru-rmm
|
- guru-rmm
|
||||||
@@ -50,6 +50,7 @@ sources:
|
|||||||
- session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md
|
- session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md
|
||||||
- session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md
|
- session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md
|
||||||
- "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)"
|
- "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)"
|
||||||
|
- session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
|
||||||
backlinks:
|
backlinks:
|
||||||
- clients/cascades-tucson
|
- clients/cascades-tucson
|
||||||
- systems/gururmm-build
|
- systems/gururmm-build
|
||||||
@@ -65,6 +66,8 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G
|
|||||||
|
|
||||||
**Current version:** agent 0.6.54 (beta) / 0.6.47 (stable) / server 0.3.37 as of 2026-06-04. Fleet on stable target 0.6.47 (pinned 2026-05-28); GURU-5070 is the lone beta agent (explicit per-agent override), running 0.6.54 and auto-riding each new beta build. Note: committed changelogs are stale (stop at agent v0.6.22 / server v0.3.1) — migrations + commit log are the authoritative feature record, not changelogs.
|
**Current version:** agent 0.6.54 (beta) / 0.6.47 (stable) / server 0.3.37 as of 2026-06-04. Fleet on stable target 0.6.47 (pinned 2026-05-28); GURU-5070 is the lone beta agent (explicit per-agent override), running 0.6.54 and auto-riding each new beta build. Note: committed changelogs are stale (stop at agent v0.6.22 / server v0.3.1) — migrations + commit log are the authoritative feature record, not changelogs.
|
||||||
|
|
||||||
|
**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010` on main). `backup_storage_low` alert type removed entirely — the `DataCopied/TotalData` ratio measures backup-dataset completeness, not destination capacity, and produced 5 fleet-wide false alerts. See Backup Integration section for full detail.
|
||||||
|
|
||||||
**See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen).
|
**See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen).
|
||||||
|
|
||||||
**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there.
|
**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there.
|
||||||
@@ -77,7 +80,7 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G
|
|||||||
|
|
||||||
*Synthesized from authoritative artifacts (API routes, agent modules, 48 migrations, roadmap, commit log) at live `main` — not from session logs. See Compilation Notes.*
|
*Synthesized from authoritative artifacts (API routes, agent modules, 48 migrations, roadmap, commit log) at live `main` — not from session logs. See Compilation Notes.*
|
||||||
|
|
||||||
Agent↔server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to `interrupted`. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).
|
Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to `interrupted`. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).
|
||||||
|
|
||||||
### Monitoring & Telemetry
|
### Monitoring & Telemetry
|
||||||
- Core metrics per interval (policy-tunable per section): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win `GetLastInputInfo`, Linux `xprintidle`), public/WAN IP (cached, multi-service fallback). Cross-platform via `sysinfo`.
|
- Core metrics per interval (policy-tunable per section): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win `GetLastInputInfo`, Linux `xprintidle`), public/WAN IP (cached, multi-service fallback). Cross-platform via `sysinfo`.
|
||||||
@@ -95,7 +98,7 @@ Agent↔server communication is a persistent authenticated WebSocket with auto-r
|
|||||||
### Inventory & Discovery
|
### Inventory & Discovery
|
||||||
- Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
|
- Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
|
||||||
- VM / hypervisor / container detection (`032/033`): `is_virtual_machine`, `hypervisor_type`, `vm_uuid`, `is_hypervisor` + hosted VM UUIDs, `is_container`, `is_unraid`.
|
- VM / hypervisor / container detection (`032/033`): `is_virtual_machine`, `hypervisor_type`, `vm_uuid`, `is_hypervisor` + hosted VM UUIDs, `is_container`, `is_unraid`.
|
||||||
- User/group inventory (`037`–`040`): local + domain + Azure AD accounts (enabled, pw-never-expires, last-logon, is_admin, AD email/UPN/dept), domain-join classification (none/ad/aad/hybrid), domain name, M365 tenant ID, **domain-controller detection (`is_dc`)**, group membership. Policy-scheduled (default 24h).
|
- User/group inventory (`037`-`040`): local + domain + Azure AD accounts (enabled, pw-never-expires, last-logon, is_admin, AD email/UPN/dept), domain-join classification (none/ad/aad/hybrid), domain name, M365 tenant ID, **domain-controller detection (`is_dc`)**, group membership. Policy-scheduled (default 24h).
|
||||||
- Network discovery: server-dispatched scans — TCP probes over configurable ranges/ports, ARP MAC, reverse DNS, basic OS fingerprint; devices stream back and are persisted/editable.
|
- Network discovery: server-dispatched scans — TCP probes over configurable ranges/ports, ARP MAC, reverse DNS, basic OS fingerprint; devices stream back and are persisted/editable.
|
||||||
|
|
||||||
### Patch / Agent Update Management
|
### Patch / Agent Update Management
|
||||||
@@ -104,7 +107,7 @@ Agent↔server communication is a persistent authenticated WebSocket with auto-r
|
|||||||
- Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Scaffolding only — promotion is manual; health-gated automation is written-but-unwired (Phase 2).**
|
- Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Scaffolding only — promotion is manual; health-gated automation is written-but-unwired (Phase 2).**
|
||||||
|
|
||||||
### Policy & Configuration Management
|
### Policy & Configuration Management
|
||||||
- Inheritance chain global → client → site → agent; server computes merged effective policy, pushes via `ConfigUpdate`. Effective policy queryable per scope.
|
- Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via `ConfigUpdate`. Effective policy queryable per scope.
|
||||||
- Checks engine (`018`/`019`): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win `sc.exe`, Linux `systemctl`). Policy-attached check templates (`024`) with push-to-agent sync. On-demand `run-checks`.
|
- Checks engine (`018`/`019`): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win `sc.exe`, Linux `systemctl`). Policy-attached check templates (`024`) with push-to-agent sync. On-demand `run-checks`.
|
||||||
- Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but aren't routed yet [verify].**
|
- Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but aren't routed yet [verify].**
|
||||||
|
|
||||||
@@ -113,7 +116,12 @@ Agent↔server communication is a persistent authenticated WebSocket with auto-r
|
|||||||
- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve.
|
- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve.
|
||||||
|
|
||||||
### Backup Integration (MSP360 / MSPBackups)
|
### Backup Integration (MSP360 / MSPBackups)
|
||||||
- Multi-provider config (`034`/`035`) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent↔MSP360 mapping (`044`) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
|
- Multi-provider config (`034`/`035`) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (`044`) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
|
||||||
|
- **Alert quality pass (2026-06-07, commit `779f7f6`):** Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from backup alerting and compliance evaluation (FU2 guard). MSP360 message JSON decoded into readable alert text via `summarize_backup_error` (FU1). `create_or_update_alert` now refreshes `title`/`message`/`severity` on re-trigger, also fixing a latent severity-escalation freeze where re-triggered alerts kept stale severity. Fleet result: false `backup_failed` alerts 15 -> 2; survivors (AD1: retention warning + file skips, LAB-Becky: no storage account configured) are genuine and self-describing.
|
||||||
|
- **`backup_storage_low` alert type REMOVED (2026-06-07, commit `b82c010`):** `check_storage_threshold` computed `usage_percent = DataCopied / TotalData * 100`, but those MSP360 fields describe how much of the source dataset was uploaded, not the cloud destination's remaining capacity. Image/full-backup plans are naturally near 100% by design; the check produced 5 fleet-wide false alerts (3 Critical), the clearest being DF-HYPERV-B "100% Full" on a 4 GB Hyper-V plan. MSP360 Managed Backup does not expose true destination capacity in those fields. Removed `check_storage_threshold` call + function; added `resolve_all_backup_storage_alerts` (type-scoped UPDATE, idempotent, once-per-sync-tick after prune) to clear the 5 stale alerts. Fleet: 5 -> 0 `backup_storage_low` alerts verified. Genuine destination-capacity alerting deferred (would need MSP360 storage-accounts endpoint — separate feature, not scoped).
|
||||||
|
- **Backup staleness:** Handled by the existing `BACKUP_STALE` evaluator (cadence-derived window + 7-day backstop + unit tests). A plan reporting status=success can still be `non_compliant/BACKUP_STALE` if its last run is past the window (e.g. an abandoned plan on the same agent as a healthy current plan). No new code required.
|
||||||
|
- **Key functions** (`server/src/mspbackups/` + `server/src/db/`): `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE), `create_or_update_alert`.
|
||||||
|
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded from alerting/compliance): 8, 13.
|
||||||
|
|
||||||
### Remote Access (Tunnel)
|
### Remote Access (Tunnel)
|
||||||
- Agent side substantially built (`TunnelManager` state machine; Open/Close/Data). **Server side is a dead-code skeleton — not declared in `api/mod.rs`, no `/tunnel` routes, WS handler logs "not yet implemented." Not production-ready.**
|
- Agent side substantially built (`TunnelManager` state machine; Open/Close/Data). **Server side is a dead-code skeleton — not declared in `api/mod.rs`, no `/tunnel` routes, WS handler logs "not yet implemented." Not production-ready.**
|
||||||
@@ -193,9 +201,9 @@ gururmm/
|
|||||||
├── server/ Rust/Axum API server
|
├── server/ Rust/Axum API server
|
||||||
│ └── src/
|
│ └── src/
|
||||||
│ ├── api/ REST handlers
|
│ ├── api/ REST handlers
|
||||||
│ ├── db/ Database layer (sqlx); db/bsod_events.rs
|
│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs
|
||||||
│ ├── ws/ WebSocket handler (BsodEvent dispatch)
|
│ ├── ws/ WebSocket handler (BsodEvent dispatch)
|
||||||
│ └── mspbackups/ MSP360 backup integration
|
│ └── mspbackups/ MSP360 backup integration (sync.rs: derive_backup_status, summarize_backup_error, resolve_all_backup_storage_alerts; client.rs: BackupPlan.plan_type)
|
||||||
├── tray/ System tray binary
|
├── tray/ System tray binary
|
||||||
├── installer/ WiX v4 MSI (gururmm-agent.wxs)
|
├── installer/ WiX v4 MSI (gururmm-agent.wxs)
|
||||||
├── deploy/
|
├── deploy/
|
||||||
@@ -210,15 +218,16 @@ gururmm/
|
|||||||
|
|
||||||
### Current Focus
|
### Current Focus
|
||||||
|
|
||||||
As of 2026-06-04 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.37):
|
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.37+):
|
||||||
|
|
||||||
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main → beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event → 3s wait → `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
||||||
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab + BSOD in Alerts stream (issue #10, dashboard bullets unchecked); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab + BSOD in Alerts stream (issue #10, dashboard bullets unchecked); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
||||||
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
|
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
|
||||||
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
|
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
|
||||||
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
|
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
|
||||||
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit).
|
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit).
|
||||||
- **MSP360 backup integration** — Phase 1 complete (monitoring, alerts, mapping, storage thresholds; dashboard UI shipped 2026-05-31). Phase 2 (management) not started.
|
- **MSP360 backup integration** — Alert quality pass shipped 2026-06-07 (commits `779f7f6` + `b82c010`): false `backup_failed` alerts 15 -> 2; `backup_storage_low` removed (structurally false signal; 5 -> 0 false alerts). Dashboard UI shipped 2026-05-31. Phase 2 (management) not started; genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint).
|
||||||
|
- **`/backup-status` endpoint shape gap:** Returns only one plan per agent (not all plans); makes agents with a dead old plan + healthy current plan look stale-but-green in the backup tab. Compliance domain evaluates all plans correctly. Not fixed this session — noted for future.
|
||||||
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH).
|
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH).
|
||||||
|
|
||||||
### Patterns & Anti-Patterns
|
### Patterns & Anti-Patterns
|
||||||
@@ -298,7 +307,7 @@ Gitea push to main
|
|||||||
| beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (now dispatched by the webhook alongside agent/server builds, change-gated on `last-built-commit-dashboard`) |
|
| beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (now dispatched by the webhook alongside agent/server builds, change-gated on `last-built-commit-dashboard`) |
|
||||||
| prod | https://rmm.azcomputerguru.com | `/var/www/gururmm/dashboard` | explicit only — `sudo /opt/gururmm/promote-dashboard.sh --confirm` (backs up prod; `--rollback` restores) |
|
| prod | https://rmm.azcomputerguru.com | `/var/www/gururmm/dashboard` | explicit only — `sudo /opt/gururmm/promote-dashboard.sh --confirm` (backs up prod; `--rollback` restores) |
|
||||||
|
|
||||||
**Do NOT hand-rsync into the prod web root** (the old `npm run build && rsync ... dashboard/` is superseded). One artifact serves both channels — the Vite build bakes in the absolute prod API URL (`rmm-api.azcomputerguru.com`), so beta uses shared prod data and is byte-identical to prod; beta is branded by an nginx-layer `sub_filter` BETA banner (`deploy/nginx/rmm-beta.conf`), so promotion is a plain rsync. **Serving/TLS:** second nginx vhost on `.30` (`server_name rmm-beta`, specific name beats prod `_`), Cloudflare `rmm-beta` A→`72.194.62.10` proxied (mirrors `rmm`), Jupiter NPM proxy host **id=11** → `.30:80` presenting cert **id=10** (zone SSL mode is Full; if ever Full-Strict, beta needs its own SAN/cert).
|
**Do NOT hand-rsync into the prod web root** (the old `npm run build && rsync ... dashboard/` is superseded). One artifact serves both channels: the Vite build bakes in the absolute prod API URL (`rmm-api.azcomputerguru.com`), so beta uses shared prod data and is byte-identical to prod; beta is branded by an nginx-layer `sub_filter` BETA banner (`deploy/nginx/rmm-beta.conf`), so promotion is a plain rsync. **Serving/TLS:** second nginx vhost on `.30` (`server_name rmm-beta`, specific name beats prod `_`), Cloudflare `rmm-beta` A -> `72.194.62.10` proxied (mirrors `rmm`), Jupiter NPM proxy host **id=11** -> `.30:80` presenting cert **id=10** (zone SSL mode is Full; if ever Full-Strict, beta needs its own SAN/cert).
|
||||||
|
|
||||||
**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start.
|
**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start.
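
A minimal sketch of the checksum step, in Python for illustration only. It assumes the value stored in `_sqlx_migrations` is the SHA-384 digest of the migration's SQL text exactly as it sits on disk; the helper name and the sample SQL are hypothetical, so compare the output against an existing row before relying on it.

```python
import hashlib


def migration_checksum(sql_text: str) -> str:
    """Return the SHA-384 digest of a migration's SQL text as lowercase hex.

    Assumption: the digest covers the file contents byte for byte, so any
    whitespace or line-ending change produces a different checksum.
    """
    return hashlib.sha384(sql_text.encode("utf-8")).hexdigest()


# Hypothetical migration body, used only to exercise the helper.
example_sql = "ALTER TABLE agents ADD COLUMN example_flag BOOLEAN;\n"
checksum_hex = migration_checksum(example_sql)

# SHA-384 is 48 bytes, i.e. 96 hex characters.
print(len(checksum_hex), checksum_hex)
```

Because the digest is byte-exact, it is worth hashing the file as deployed (not a copy that an editor may have re-saved with different line endings).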
|
||||||
|
|
||||||
@@ -318,7 +327,7 @@ Gitea push to main
|
|||||||
|
|
||||||
## Active State
|
## Active State
|
||||||
|
|
||||||
**Fleet (as of 2026-06-04, live Postgres verified):**
|
**Fleet (as of 2026-06-04, live Postgres verified; no enrollment changes in 2026-06-07 session):**
|
||||||
- 55 enrolled agents total
|
- 55 enrolled agents total
|
||||||
- Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default).
|
- Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default).
|
||||||
- Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline.
|
- Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline.
|
||||||
@@ -342,14 +351,14 @@ Gitea push to main
|
|||||||
| Swanson, Len | Residential | Home | LAS-GAMER |
|
| Swanson, Len | Residential | Home | LAS-GAMER |
|
||||||
|
|
||||||
**API auth:**
|
**API auth:**
|
||||||
- `POST /api/auth/login` → JWT (~24h)
|
- `POST /api/auth/login` -> JWT (~24h)
|
||||||
- Creds: vault `infrastructure/gururmm-server.sops.yaml` → `credentials.gururmm-api.admin-email` / `admin-password`
|
- Creds: vault `infrastructure/gururmm-server.sops.yaml` -> `credentials.gururmm-api.admin-email` / `admin-password`
|
||||||
- Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`
|
- Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`
|
||||||
- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context` — **`system`** (default; Session 0 / SYSTEM) or **`user_session`** (runs in the logged-on user's desktop session via WTS token impersonation; Windows-only, needs an active session), plus `timeout_seconds`/`elevated`. The agent does NOT run everything as LocalSystem — `user_session` is the per-user path (migration `041_add_command_context`, `agent/src/watchdog/wts.rs`).
|
- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context` — **`system`** (default; Session 0 / SYSTEM) or **`user_session`** (runs in the logged-on user's desktop session via WTS token impersonation; Windows-only, needs an active session), plus `timeout_seconds`/`elevated`. The agent does NOT run everything as LocalSystem — `user_session` is the per-user path (migration `041_add_command_context`, `agent/src/watchdog/wts.rs`).
|
||||||
- Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted)
|
- Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted)
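
The command fields above can be sketched as a small request builder. This is an illustration under assumptions, not the server's client code: the endpoint path and field names come from the list above, while the function name, the choice to omit unset optional fields, and the placeholder agent ID are hypothetical. No network call is made.

```python
import json

# Values listed in the API notes above.
COMMAND_TYPES = {"shell", "powershell", "python", "script", "claude_task"}
CONTEXTS = {"system", "user_session"}


def build_command_request(agent_id, command_type, command,
                          context="system", timeout_seconds=None,
                          elevated=None):
    """Return (path, json_body) for POST /api/agents/:id/command.

    Assumption: optional fields are left out when unset rather than
    sent as null.
    """
    if command_type not in COMMAND_TYPES:
        raise ValueError(f"unknown command_type: {command_type}")
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context}")

    body = {
        "command_type": command_type,
        "command": command,
        "context": context,
    }
    if timeout_seconds is not None:
        body["timeout_seconds"] = timeout_seconds
    if elevated is not None:
        body["elevated"] = elevated

    # json.dumps escapes the script text, which covers the
    # "JSON-encoded" requirement for multi-line scripts.
    return f"/api/agents/{agent_id}/command", json.dumps(body)


# Placeholder agent ID, not a real fleet member.
path, payload = build_command_request(
    "00000000-0000-0000-0000-000000000000",
    "powershell",
    "Get-Process | Select-Object -First 5",
    context="user_session",
    timeout_seconds=60,
)
print(path)
print(payload)
```

Validating `command_type` and `context` on the client side keeps a typo from reaching the agent; `user_session` still requires Windows and an active logon session, which only the agent can check.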
|
||||||
|
|
||||||
**Dashboard — complete and working:**
|
**Dashboard — complete and working:**
|
||||||
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts, Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent↔backup mappings/verify UI, Organizations management + dev-admin impersonation UI.
|
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts, Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI.
|
||||||
|
|
||||||
**Dashboard — incomplete (see UI_GAPS.md):**
|
**Dashboard — incomplete (see UI_GAPS.md):**
|
||||||
- Enrollment management UI (revoke keys, audit log, duplicate hostname warnings)
|
- Enrollment management UI (revoke keys, audit log, duplicate hostname warnings)
|
||||||
@@ -367,9 +376,9 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI
|
|||||||
- #19 — subscriber broadcast
|
- #19 — subscriber broadcast
|
||||||
|
|
||||||
**BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):**
|
**BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):**
|
||||||
- Symptom: duplicate AND ghost `gururmm-tray.exe` tray icons. Live evidence: 5 stacked tray processes in Session 1 on GURU-5070 (one per watchdog restart over 6/1–6/2).
|
- Symptom: duplicate AND ghost `gururmm-tray.exe` tray icons. Live evidence: 5 stacked tray processes in Session 1 on GURU-5070 (one per watchdog restart over 6/1-6/2).
|
||||||
- Root cause: `TrayLauncher` (`agent/src/watchdog/wts.rs`) tracked launches only in an in-memory `HashMap<sid,HANDLE>` that resets on watchdog restart (esp. agent auto-update), so it relaunched trays into sessions that already had one; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping the tray's `Drop` → `NIM_DELETE` (ghost).
|
- Root cause: `TrayLauncher` (`agent/src/watchdog/wts.rs`) tracked launches only in an in-memory `HashMap<sid,HANDLE>` that resets on watchdog restart (esp. agent auto-update), so it relaunched trays into sessions that already had one; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping the tray's `Drop` -> `NIM_DELETE` (ghost).
|
||||||
- Fix (commit `137dd85`, gururmm@main → beta): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW` (idempotent); (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event → 3s wait → `TerminateProcess` fallback.
|
- Fix (commit `137dd85`, gururmm@main -> beta): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW` (idempotent); (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback.
|
||||||
- Verified: independent Grok review + Code Review Agent APPROVE.
|
- Verified: independent Grok review + Code Review Agent APPROVE.
|
||||||
- Follow-up (coord todo `25fdf31a`): wire `terminate_all` graceful-shutdown into the watchdog policy-disable/uninstall path so fix #3 becomes active.
|
- Follow-up (coord todo `25fdf31a`): wire `terminate_all` graceful-shutdown into the watchdog policy-disable/uninstall path so fix #3 becomes active.
|
||||||
|
|
||||||
@@ -384,13 +393,13 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI
|
|||||||
These decisions are locked. Do not reverse without explicit user approval.
|
These decisions are locked. Do not reverse without explicit user approval.
|
||||||
|
|
||||||
1. **Per-agent enrollment keys** — MSI contains server URL + site_id only. Agent calls `POST /api/enroll` on first run; server issues unique per-agent key stored hashed. Enables revocation, clone detection, audit trail.
|
1. **Per-agent enrollment keys** — MSI contains server URL + site_id only. Agent calls `POST /api/enroll` on first run; server issues unique per-agent key stored hashed. Enables revocation, clone detection, audit trail.
|
||||||
2. **Site-specific MSI generation** — Universal base MSI from CI; dashboard endpoint generates site-specific MSI with site_id baked in via WiX property → `HKLM\SOFTWARE\GuruRMM\SiteId`.
|
2. **Site-specific MSI generation** — Universal base MSI from CI; dashboard endpoint generates site-specific MSI with site_id baked in via WiX property -> `HKLM\SOFTWARE\GuruRMM\SiteId`.
|
||||||
3. **No TOML/config for endpoints** — Server URL compiled into binary. No runtime config files for server URL or site_id.
|
3. **No TOML/config for endpoints** — Server URL compiled into binary. No runtime config files for server URL or site_id.
|
||||||
4. **Policy inheritance chain** — global → site → client → agent. Server computes merged effective policy and pushes via `ConfigUpdate` WebSocket message.
|
4. **Policy inheritance chain** — global -> site -> client -> agent. Server computes merged effective policy and pushes via `ConfigUpdate` WebSocket message.
|
||||||
5. **Platform parity rule** — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible.
|
5. **Platform parity rule** — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible.
|
||||||
6. **Watchdog as separate process** — Main agent cannot reliably restart itself after a crash.
|
6. **Watchdog as separate process** — Main agent cannot reliably restart itself after a crash.
|
||||||
7. **Build pipeline is the only path to production** — Enforces signing, checksum generation, consistent artifact layout.
|
7. **Build pipeline is the only path to production** — Enforces signing, checksum generation, consistent artifact layout.
|
||||||
8. **Multi-tenancy identity model (ADR-001)** — Dev team with partner impersonation. Three levels: Dev → Partner → Client. Computer Guru is partner #1.
|
8. **Multi-tenancy identity model (ADR-001)** — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1.
|
||||||
9. **Holistic feature development (DESIGN.md)** — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
|
9. **Holistic feature development (DESIGN.md)** — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
|
||||||
10. **AI-optional operation** — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
|
10. **AI-optional operation** — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
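
Decision 4's inheritance chain can be sketched as a layered merge. This is a minimal illustration, assuming that a later level in the chain overrides an earlier one key by key; the real merge rules live in the server and may differ, and the policy keys shown are hypothetical.

```python
from typing import Any

# Order copied from the chain as written in decision 4,
# lowest precedence first.
LEVELS = ("global", "site", "client", "agent")


def effective_policy(layers: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge policy layers so that later levels override earlier ones.

    Assumption: a flat, last-writer-wins merge per key. Levels that
    define nothing are simply skipped.
    """
    merged: dict[str, Any] = {}
    for level in LEVELS:
        merged.update(layers.get(level, {}))
    return merged


# Hypothetical policy keys, for illustration only.
layers = {
    "global": {"metrics_interval_s": 60, "auto_update": True},
    "site": {"metrics_interval_s": 30},
    "agent": {"auto_update": False},
}
effective = effective_policy(layers)
print(effective)
```

Here the site layer shortens the metrics interval and the agent layer turns auto-update off, while every other key falls through from the global layer.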
|
||||||
|
|
||||||
@@ -416,6 +425,7 @@ These decisions are locked. Do not reverse without explicit user approval.
|
|||||||
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. |
|
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. |
|
||||||
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] |
|
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] |
|
||||||
| 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
|
| 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
|
||||||
|
| 2026-06-07 | Backup-alert quality pass shipped. FU1 (`summarize_backup_error` decodes MSP360 message JSON; `create_or_update_alert` now refreshes title/message/severity on re-trigger, also fixes latent severity-escalation freeze) + FU2 (exclude non-backup PlanTypes 8=Restore/13=Consistency-check from alerting/compliance): false `backup_failed` alerts 15 -> 2 fleet-wide (survivors AD1, LAB-Becky are genuine and self-describing), commit `779f7f6`. `backup_storage_low` alert type removed entirely (commit `b82c010`): `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity — produced 5 fleet-wide false alerts including DF-HYPERV-B "100% Full" on a 4 GB plan; `resolve_all_backup_storage_alerts` (type-scoped, idempotent, once-per-tick) clears stragglers; 5 -> 0 verified after 17:21:41 UTC restart. Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). `BACKUP_STALE` evaluator confirmed already correct — no new code. Both commits on main. Submodule pinned at `226ba9f` in parent. |
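
The reason the removed `backup_storage_low` check could not be correct is easiest to see with numbers. The sketch below is a reconstruction for illustration, not the deleted Rust code: the ratio follows the `DataCopied / TotalData` formula described in the 2026-06-07 row, while the function name and the destination capacity figure are hypothetical.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def dataset_ratio_percent(data_copied: int, total_data: int) -> float:
    """Percentage of the backup dataset that was uploaded.

    This is the ratio the removed check treated as "storage used".
    It never looks at the destination's capacity.
    """
    if total_data <= 0:
        return 0.0
    return data_copied / total_data * 100.0


# A 4 GiB image plan that uploaded everything reads as "100% full".
full_image = dataset_ratio_percent(4 * GIB, 4 * GIB)

# Hypothetical destination capacity, which MSP360 does not expose in
# these fields: the same 4 GiB would be a tiny share of a 1 TiB bucket.
destination_percent = 4 * GIB / (1 * TIB) * 100.0

print(full_image)
print(round(destination_percent, 2))
```

Any successful full or image backup pushes the ratio toward 100, so the value tracks backup completeness rather than remaining space, which is why the alert type was removed instead of re-tuned.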
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -426,8 +436,9 @@ These decisions are locked. Do not reverse without explicit user approval.
|
|||||||
- macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). `build-mac.sh` is a stub as of 2026-05-24 — unclear if automated pipeline includes macOS yet. [unverified]
|
- macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). `build-mac.sh` is a stub as of 2026-05-24 — unclear if automated pipeline includes macOS yet. [unverified]
|
||||||
- Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23) — likely still unfixed. [unverified]
|
- Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23) — likely still unfixed. [unverified]
|
||||||
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified]
|
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified]
|
||||||
- **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 → 48.
|
- **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 -> 48.
|
||||||
- **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`.
|
- **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`.
|
||||||
|
- **2026-06-07 recompile:** Folded in backup-alert quality pass (commits `779f7f6` + `b82c010`, both on main). Updated Backup Integration capability section: added FU1/FU2 alert quality pass detail (false backup_failed 15->2; summarize_backup_error; create_or_update_alert refresh); documented backup_storage_low removal (structurally false DataCopied/TotalData signal; 5->0 false alerts; resolve_all_backup_storage_alerts); confirmed BACKUP_STALE evaluator correct (no new code); added key functions list and MSP360 PlanType exclusion map. Updated Repo Structure to include db/mspbackups.rs and mspbackups/ key functions. Updated Current Focus MSP360 line and added /backup-status endpoint shape gap. Updated Summary date and added backup-alert quality pass note. Active State date note updated. Added 2026-06-07 History row. Patterns and History existing rows preserved verbatim.
|
||||||
|
|
||||||
## Backlinks
|
## Backlinks
|
||||||
|
|
||||||
|
|||||||