sync: auto-sync from GURU-5070 at 2026-06-07 10:33:04

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-07 10:33:04
2026-06-07 10:33:08 -07:00
parent 7ba2f26fde
commit b848e34a8e
3 changed files with 106 additions and 23 deletions

@@ -0,0 +1,72 @@
# GuruRMM Backup-Alert Cleanup — Review, Merge, Storage-Alert Removal
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Continued the GuruRMM backup false-alert effort. Reviewed and merged the FU1+FU2 alert-quality change (commit `779f7f6`): a Code Review Agent confirmed the bind order, dedup, and `triggered_at`/`status`/`email_sent_at` safety of the generic `create_or_update_alert` refresh and the PlanType non-backup guard, and flagged that the change incidentally fixes a latent severity-escalation freeze. Merged `fe44dee..779f7f6` to `main` (shared API -> beta + prod). A verification watcher confirmed that the server rebuilt (restart 16:47:06 UTC) and synced, and that the active `backup_failed` count fell from 15 to 2. NEPTUNE (consistency-check, PlanType 13) and SAGE-SQL (restore, PlanType 8) were cleared by FU2's exclusion. Of the two survivors, AD1 now reads its real "Retention policy cannot be applied (Warning); Files and folders were skipped: 99 (Info)" and is correctly downgraded to Partial instead of showing "Unknown error" (FU1's decoder + refresh), and LAB-Becky shows its true "Storage Account is not specified (Critical)". Both survivors are genuine and now self-describing.
Investigated agent SERVER (Gonzvar Tax Services, `9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5`) per Mike's backup-tab link and found a `backup_storage_low` "95% Full" alert. Root-caused it as a systematic false alert: `check_storage_threshold` in `mspbackups/sync.rs` computed `usage_percent = DataCopied / TotalData * 100`, but those MSP360 fields describe the backup dataset (how much of the source was uploaded), not the cloud destination's capacity. For image/full backups the ratio is naturally near 100%. Fleet-wide this produced 5 false alerts (3 Critical), the clearest being DF-HYPERV-B's "100% Full" on a 4 GB Hyper-V plan. MSP360 Managed Backup does not expose destination capacity in those fields, so the check can never be correct.
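To make the root cause concrete, here is a minimal sketch of the ratio the removed check computed next to the ratio a real capacity check would need. The struct, field names, and function names are hypothetical stand-ins (the actual `sync.rs` types are not shown in this log), and the 1 TiB destination is an invented figure for illustration only.

```rust
/// Hypothetical mirror of the two MSP360 plan fields the removed check read.
/// Both describe the backup dataset, in bytes, not the destination.
#[derive(Debug, Clone, Copy)]
pub struct PlanStats {
    /// Bytes of source data uploaded so far.
    pub data_copied: u64,
    /// Bytes of source data selected for backup.
    pub total_data: u64,
}

/// The ratio the removed check treated as "percent full".
/// It actually measures upload progress of the dataset.
pub fn upload_progress_percent(stats: PlanStats) -> Option<f64> {
    if stats.total_data == 0 {
        return None; // avoid dividing by zero on empty plans
    }
    Some(stats.data_copied as f64 / stats.total_data as f64 * 100.0)
}

/// What a genuine capacity check would need instead: bytes used on the
/// destination and the destination's capacity (neither is in PlanStats).
pub fn destination_usage_percent(used_bytes: u64, capacity_bytes: u64) -> Option<f64> {
    if capacity_bytes == 0 {
        return None;
    }
    Some(used_bytes as f64 / capacity_bytes as f64 * 100.0)
}
```

For a fully uploaded 4 GiB image plan, `upload_progress_percent` is 100 regardless of how much room the destination has, which matches the shape of the DF-HYPERV-B false alert. The same 4 GiB on an assumed 1 TiB destination would be well under 1% by `destination_usage_percent`.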
Shipped the removal (commit `b82c010`): deleted the `check_storage_threshold` call and function, and added `resolve_all_backup_storage_alerts` (type-scoped, idempotent, called once per sync tick after prune, mirroring the existing `resolve_orphaned_backup_alerts`) to clear the 5 stale alerts since nothing regenerates them. Code Review Agent APPROVED (type-scoped UPDATE cannot touch other alert types; clean removal with zero new warnings). Merged `87db008..b82c010` to `main`, rebasing cleanly over the intervening CI version-bump from the prior deploy.
Retracted an earlier incorrect claim that SERVER's compliance showed a false green. Checking the live compliance domain showed it already reads `non_compliant / BACKUP_STALE`: the abandoned Nov-2024 image plan is correctly flagged stale (past the 7-day backstop) while SERVER's real current plan ("Backup Image-Based on 9/26/2025", ran 06-07 02:00, next 06-08) is `compliant`, with the aggregate non_compliant because of the dead plan. The `BACKUP_STALE` machinery (cadence-derived window + 7-day backstop, with unit tests) already exists and works, so item #2 required no code. The `/backup-status` endpoint returning only the stale plan is what made the agent look stale-but-green at first glance.
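The SERVER result (one stale plan, one compliant plan, aggregate non-compliant) can be illustrated with a small sketch. This is not the real `evaluate_plan`: the function names, the "twice the schedule interval" cadence rule, and the enum are assumptions made for illustration. Only the 7-day backstop and the "any stale plan makes the aggregate non-compliant" behavior come from the observations above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlanCompliance {
    Compliant,
    BackupStale,
}

/// 7-day backstop, as described in the session notes.
const BACKSTOP: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// ASSUMPTION: the cadence-derived window is twice the schedule interval,
/// capped at the backstop. Plans with no known cadence use the backstop.
pub fn staleness_window(cadence: Option<Duration>) -> Duration {
    match cadence {
        Some(interval) => interval.saturating_mul(2).min(BACKSTOP),
        None => BACKSTOP,
    }
}

/// Hypothetical per-plan check: stale when the last successful run is
/// older than the plan's staleness window.
pub fn evaluate_plan_sketch(age: Duration, cadence: Option<Duration>) -> PlanCompliance {
    if age > staleness_window(cadence) {
        PlanCompliance::BackupStale
    } else {
        PlanCompliance::Compliant
    }
}

/// Aggregate over all of an agent's plans: one stale plan is enough
/// to mark the whole agent as stale.
pub fn aggregate_sketch(plans: &[PlanCompliance]) -> PlanCompliance {
    if plans.iter().any(|p| *p == PlanCompliance::BackupStale) {
        PlanCompliance::BackupStale
    } else {
        PlanCompliance::Compliant
    }
}
```

Under these assumptions, a plan last run months ago with no usable cadence is stale, a daily plan that ran a few hours ago is compliant, and the aggregate of the two is stale, which is the shape the live compliance response showed for SERVER.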
## Key Decisions
- Removed the `backup_storage_low` alert type entirely rather than trying to fix the threshold math: `DataCopied/TotalData` is structurally the wrong signal, and MSP360 Managed Backup does not expose true destination capacity. Genuine "storage filling up" alerting would need MSP360's storage-accounts endpoint as a separate feature (deferred, not scoped).
- Cleared existing false alerts with a once-per-tick type-scoped resolver (`resolve_all_backup_storage_alerts`) instead of a one-shot manual SQL run, so the cleanup is idempotent and self-heals on every deploy without operator intervention.
- Did NOT build the proposed staleness feature (#2) after verifying the existing `BACKUP_STALE` evaluator already handles it correctly. Surfaced the real-world fix (delete the dead MSP360 plan) as a Mike-side item instead of adding redundant code.
- Detached the submodule gitlink back to the pinned commit (`226ba9f`) before `/save` so the session-log commit does not fold an incidental submodule pointer bump into the parent repo.
## Problems Encountered
- Initial false-green assertion on SERVER: I claimed compliance would show the dead backup as healthy without checking the compliance domain. The live `/compliance` response already showed the correct `non_compliant / BACKUP_STALE`. Corrected before writing any staleness code; #2 collapsed to a no-op.
- The `/api/agents/<id>/backup-status` endpoint returned only the single abandoned plan (not SERVER's healthy current plan), which made the agent look like a stale-but-green single-plan host. The compliance detail (which evaluates all plans) showed both. Noted as a minor UI/data-shape gap, not fixed this session.
## Configuration Changes
GuruRMM submodule (`azcomputerguru/gururmm`), all server-side Rust:
- `server/src/mspbackups/sync.rs` — (779f7f6) NON_BACKUP_PLAN_TYPES guard, `summarize_backup_error`, enriched `generate_backup_alert` wording; (b82c010) removed `check_storage_threshold` call + function, added once-per-tick `resolve_all_backup_storage_alerts` call after prune.
- `server/src/db/mspbackups.rs` — (b82c010) added `resolve_all_backup_storage_alerts(db) -> Result<u64, sqlx::Error>`.
- `server/src/db/alerts.rs` — (779f7f6) `create_or_update_alert` existing-active branch now refreshes `title`/`message`/`severity`.
- `server/src/mspbackups/client.rs` — (779f7f6) `BackupPlan.plan_type` (serde "PlanType", default).
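The `create_or_update_alert` change listed above (refresh `title`/`message`/`severity` on an existing active alert while leaving `triggered_at`, `status`, and `email_sent_at` alone) can be modeled in memory as follows. This is a sketch only: the struct, the dedup key, the `_sketch` function name, and the use of a `Vec` in place of the database are all assumptions, and the real function is a sqlx query whose exact shape is not shown in this log.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Critical,
}

/// Hypothetical in-memory stand-in for a row in the alerts table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Alert {
    pub dedup_key: String,
    pub title: String,
    pub message: String,
    pub severity: Severity,
    pub status: String,
    pub triggered_at: u64,
    pub email_sent_at: Option<u64>,
}

/// Returns true when a new alert was created, false when an existing
/// active alert was refreshed in place.
pub fn create_or_update_alert_sketch(
    alerts: &mut Vec<Alert>,
    dedup_key: &str,
    title: &str,
    message: &str,
    severity: Severity,
    now: u64,
) -> bool {
    if let Some(existing) = alerts
        .iter_mut()
        .find(|a| a.dedup_key == dedup_key && a.status == "active")
    {
        // Refresh only the descriptive fields; triggered_at, status and
        // email_sent_at are deliberately left untouched.
        existing.title = title.to_string();
        existing.message = message.to_string();
        existing.severity = severity;
        return false;
    }
    alerts.push(Alert {
        dedup_key: dedup_key.to_string(),
        title: title.to_string(),
        message: message.to_string(),
        severity,
        status: "active".to_string(),
        triggered_at: now,
        email_sent_at: None,
    });
    true
}
```

Calling it twice with the same key models the AD1 case: the second call replaces an "Unknown error" Critical with the decoded retention Warning without creating a duplicate or resetting the original trigger time.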
## Credentials & Secrets
No new credentials. GuruRMM API admin creds: vault `infrastructure/gururmm-server.sops.yaml` (`credentials.gururmm-api.admin-email` / `admin-password`). MSP360 provider creds unchanged.
## Infrastructure & Servers
- GuruRMM server: Rust/Axum @ 172.16.3.30:3001; dashboards rmm-beta.azcomputerguru.com / rmm.azcomputerguru.com (shared API for beta + prod).
- Internal Gitea: http://172.16.3.20:3000 (azcomputerguru/gururmm). Public mirror: git.azcomputerguru.com.
- Agent SERVER: `9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5`, Gonzvar Tax Services / Main, Windows, agent v0.6.57.
## Commands & Outputs
- Merge FU1/FU2: `git push origin fix/backup-alert-quality:main` -> `fe44dee..779f7f6`.
- Merge storage fix: `git push origin fix/remove-false-storage-alert:main` -> `87db008..b82c010` (rebased over CI bump 87db008).
- Verify FU1/FU2: active `backup_failed` 15 -> 2 after restart 16:47:06 UTC. Survivors: AD1 (Partial/Warning, retention) + LAB-Becky (Critical, no storage account).
- New resolver SQL: `UPDATE alerts SET status='resolved', resolved_at=NOW() WHERE alert_type='backup_storage_low' AND status IN ('active','acknowledged')`.
- `SQLX_OFFLINE=true cargo check` on both branches: exit 0, 87 pre-existing warnings, zero new.
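The resolver SQL above is type-scoped and idempotent. A small in-memory model shows both properties without a database; the row struct and the `_in_memory` function name are hypothetical, and the real `resolve_all_backup_storage_alerts` runs the UPDATE through sqlx and returns the affected-row count as `Result<u64, sqlx::Error>`.

```rust
/// Hypothetical in-memory stand-in for the columns the UPDATE touches.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AlertRow {
    pub alert_type: String,
    pub status: String,
    pub resolved_at: Option<u64>,
}

/// Models:
///   UPDATE alerts SET status='resolved', resolved_at=NOW()
///   WHERE alert_type='backup_storage_low'
///     AND status IN ('active','acknowledged')
/// and returns the number of rows changed.
pub fn resolve_storage_alerts_in_memory(rows: &mut [AlertRow], now: u64) -> u64 {
    let mut affected = 0;
    for row in rows.iter_mut() {
        let open = row.status == "active" || row.status == "acknowledged";
        // Type scope: only backup_storage_low rows are ever considered.
        if row.alert_type == "backup_storage_low" && open {
            row.status = "resolved".to_string();
            row.resolved_at = Some(now);
            affected += 1;
        }
    }
    affected
}
```

The first pass resolves every open `backup_storage_low` row. A second pass changes nothing and returns 0, because resolved rows no longer match the status filter, so running it once per sync tick is harmless. Rows of other types such as `backup_failed` are never touched.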
## Pending / Incomplete Tasks
- **Verified:** the 5 `backup_storage_low` alerts dropped 5 -> 0 after the `b82c010` build restarted (17:21:41 UTC) — SERVER 95%, DF-HYPERV-B 100% & 92%, AD1 80%, IMC1 85% all cleared by the resolver.
- **Mike-side (MSP360 console, not code):**
  - SERVER (Gonzvar): delete the abandoned Nov-2024 image plan so its backup aggregate clears to compliant.
  - AD1: schedule a full backup so retention can apply (clears the retention Warning).
  - LAB-Becky: configure a storage account for the 2023 plan, or delete the abandoned plan.
- **Deferred features:** genuine destination-capacity alerting via MSP360 storage-accounts endpoint; `/backup-status` endpoint returning all plans (not just one) for the agent backup tab.
- **Unrelated, still open:** Robert Wolkin Tailscale enrollment (paused awaiting Mike); FunctionRail DRY refactor + aria-current; promote dashboard beta -> prod when ready.
## Reference Information
- Commits: `779f7f6` (FU1+FU2 backup alert quality), `b82c010` (remove false backup_storage_low alert). Submodule pinned at `226ba9f` in parent.
- Backup tab URL: https://rmm-beta.azcomputerguru.com/agents/9fe137ba-6164-4b7a-8a9d-4e8c4b9e40a5?tab=backup
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup excluded: 8, 13.
- Key functions: `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE), `create_or_update_alert`.
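
The PlanType map and the non-backup exclusion above can be written down as a small guard. This is a sketch under assumptions: the constant name matches the one mentioned in the configuration changes, but its element type, the helper function names, and the choice to treat unmapped values (including a serde default of 0) as backup plans are guesses, not the actual `sync.rs` code.

```rust
/// PlanType values that are not backups (8 = Restore, 13 = Consistency-check).
/// ASSUMPTION: stored as u32; the real constant's type is not shown.
pub const NON_BACKUP_PLAN_TYPES: [u32; 2] = [8, 13];

/// Human-readable label for the PlanType values listed in the reference map.
pub fn plan_type_label(plan_type: u32) -> &'static str {
    match plan_type {
        3 => "Files",
        7 => "SQL",
        8 => "Restore",
        11 => "Image",
        13 => "Consistency-check",
        16 => "HyperV",
        _ => "Unknown",
    }
}

/// True when a failure on this plan should be eligible for a
/// backup_failed alert. ASSUMPTION: unmapped values count as backups,
/// so a missing PlanType never silences a real failure.
pub fn is_backup_plan(plan_type: u32) -> bool {
    !NON_BACKUP_PLAN_TYPES.contains(&plan_type)
}
```

With this guard, NEPTUNE's consistency-check plan (13) and SAGE-SQL's restore plan (8) are skipped, while image (11) and Hyper-V (16) plans still alert.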