claudetools/session-logs/2026-05-19-gururmm-backup-fixes.md
Mike Swanson 5ead5d4dee sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-19 17:56:56
Author: Mike Swanson
Machine: DESKTOP-0O8A1RL
Timestamp: 2026-05-19 17:56:56
2026-05-19 17:57:02 -07:00


# Session Log: GuruRMM — Bug Fixes & MSP360 Backup Integration
**Date:** 2026-05-19
**Duration:** ~3 hours
## User
- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin
## Summary
Two areas of work: fixed 4 agent/server bugs identified from AD2's crash loop, then diagnosed and fixed the MSP360 backup integration, which had never been configured.
## Part 1: 4-Bug Fix (v0.6.25)
Investigated why AD2's RMM agent was crash-looping and why the watchdog never fired. Root cause: agent 0.6.22/0.6.23 sent `user_inventory_report` WS messages the server couldn't deserialize. Also found a 48-minute update gap where the 30s grace period was too short for a Windows Defender scan of the new binary.
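
The 48-minute gap makes the grace-period problem concrete. Below is a minimal sketch of the offline decision, assuming a hypothetical `should_mark_offline` helper in which a boolean stands in for the `agent_updates` poll. The 30s and 2-hour values come from this log; the constant names and function shape are illustrative, not the server's actual code.

```rust
use std::time::Duration;

/// Grace period before a disconnected agent is marked offline (30s, per this log).
const BASE_GRACE: Duration = Duration::from_secs(30);
/// Upper bound while an update is in progress (2 hours, per this log).
const UPDATE_GRACE: Duration = Duration::from_secs(2 * 60 * 60);

/// Hypothetical decision helper: `has_in_progress_update` stands in for the
/// result of polling `agent_updates`.
fn should_mark_offline(since_disconnect: Duration, has_in_progress_update: bool) -> bool {
    let limit = if has_in_progress_update {
        UPDATE_GRACE
    } else {
        BASE_GRACE
    };
    since_disconnect > limit
}
```

With a 48-minute disconnect, the fixed 30s window marks the agent offline, while the update-aware window keeps waiting.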
### Bugs fixed (commits 56723b1, 2a7b74b):
1. **Grace period too short during updates** — extended to poll `agent_updates` for up to 2 hours before marking agent offline
2. **AgentMessage unknown variant crash** — silently skips unknown WS message types (forward-compat); previously crashed the WS handler
3. **WatchdogEvent not persisted** — WatchdogEvent messages now written to `watchdog_events` DB table
4. **Watchdog never started** — `ensure_watchdog_running()` was implemented but never called from `run_agent()`; the `agent-id.txt` sidecar (required by `post_watchdog_alert`) was never written after WS auth
5. **Reviewer notes** (commit 2a7b74b): `has_in_progress_update` NULL gap fixed; `warn!` on WatchdogEvent DB insert failure
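
Fix 2 can be sketched as a catch-all variant that the handler skips. This is a simplified stand-in, not the server's code: the `type:payload` frame format, the `Heartbeat` variant, and the `parse_frame` and `handle_frames` helpers are assumptions made for illustration. The real messages are presumably JSON, and the real enum has more variants.

```rust
/// Hypothetical, simplified stand-in for the server's `AgentMessage` enum.
#[derive(Debug, PartialEq)]
enum AgentMessage {
    Heartbeat,
    WatchdogEvent(String),
    /// Catch-all for message types this server build does not know about.
    Unknown(String),
}

/// Parse a `type:payload` frame (assumed wire format, for illustration only).
fn parse_frame(frame: &str) -> AgentMessage {
    let (kind, payload) = frame.split_once(':').unwrap_or((frame, ""));
    match kind {
        "heartbeat" => AgentMessage::Heartbeat,
        "watchdog_event" => AgentMessage::WatchdogEvent(payload.to_string()),
        other => AgentMessage::Unknown(other.to_string()),
    }
}

/// Handle a batch of frames, skipping unknown types instead of aborting the loop.
fn handle_frames(frames: &[&str]) -> Vec<AgentMessage> {
    let mut handled = Vec::new();
    for frame in frames {
        match parse_frame(frame) {
            // A newer agent talking to an older server: ignore and keep going.
            AgentMessage::Unknown(_) => continue,
            msg => handled.push(msg),
        }
    }
    handled
}
```

A `user_inventory_report` frame from a newer agent is dropped here, and the frames around it are still processed.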
## Part 2: MSP360 Backup Integration
Backup tab on AD2 showed nothing. Root cause chain:
1. `mspbackups_config` was empty — API credentials never configured. Fixed: loaded credentials from vault (`msp-tools/msp360-api.sops.yaml`), configured via API.
2. `POST /api/mspbackups/config` failed with `partner_id` NOT NULL violation — handler was passing `None`. Fixed in commit `3b29acc`.
3. Build pipeline only builds agents, not server. Discovered `build-server.sh` at `/opt/gururmm/build-server.sh`.
4. SOPS vault file had unquoted YAML timestamp (`created: 2026-05-18T00:00:00Z`) causing `time.Time` walk error. Fixed by quoting it in the raw YAML.
5. MSP360 `/api/Monitoring` returns `null` for `LastStart`/`NextStart` on 14 records — struct had `String` not `Option<String>`. Fixed in commit `91630cb`.
6. Hostname match picked offline phantom AD2 agent (f6a99fe7, crash-loop duplicate) instead of online agent (49c66d8b). Fixed in commit `86e7ade`: `find_agent_by_hostname_ci` now orders by `status='online'` first.
7. `last_backup_at`/`next_backup_at` stored as NULL — MSP360 dates lack timezone (`2026-05-19T07:00:04`, not RFC3339). Fixed in commit `f146bd9`: fallback parser treats naive timestamps as UTC.
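
Step 6 amounts to "among case-insensitive hostname matches, prefer an online agent". The fix in this log does that by ordering the query on `status='online'`. The in-memory sketch below shows the same preference; the `Agent` struct and the `pick_agent_for_hostname` name are hypothetical.

```rust
/// Hypothetical, trimmed-down agent record.
#[derive(Debug, Clone, PartialEq)]
struct Agent {
    id: String,
    hostname: String,
    status: String,
}

/// Case-insensitive hostname match that prefers an online agent over offline duplicates.
fn pick_agent_for_hostname<'a>(agents: &'a [Agent], hostname: &str) -> Option<&'a Agent> {
    agents
        .iter()
        .filter(|a| a.hostname.eq_ignore_ascii_case(hostname))
        // `false < true`, so an online agent wins whenever one exists.
        .max_by_key(|a| a.status == "online")
}
```

If no matching agent is online, the function still returns an offline match, so a lookup never fails only because the machine is down.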
### Result
AD2 backup tab now shows: `status: success`, last backup `2026-05-19T07:00:04Z`, next `2026-05-20T07:00:00Z`, plan `AD2 Image`, 6 files, ~355 GB. Syncs every 15 minutes.
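
The `Z`-suffixed timestamps in the result come from treating MSP360's naive values as UTC. Below is a minimal, std-only sketch of that fallback, assuming a hypothetical `normalize_to_utc` helper. It only normalizes the string and does not validate the date fields; the actual fix (commit `f146bd9`) presumably uses a date/time library.

```rust
/// Treat a timestamp with no UTC offset as UTC by appending `Z`.
/// Returns `None` when the input has no `T` date/time separator.
fn normalize_to_utc(raw: &str) -> Option<String> {
    // Inspect only the time part so the hyphens in the date
    // are not mistaken for an offset sign.
    let (_, time_part) = raw.split_once('T')?;
    let has_offset = time_part.ends_with('Z')
        || time_part.contains('+')
        || time_part.contains('-');
    if has_offset {
        Some(raw.to_string())
    } else {
        Some(format!("{}Z", raw))
    }
}
```

Values that already carry `Z` or a numeric offset pass through unchanged, so the fallback is safe to apply to every incoming date.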
## Server Builds (manual — not triggered by agent pipeline)
- `sudo /opt/gururmm/build-server.sh` — used for all server-only deploys
- Server binary at `/opt/gururmm/gururmm-server`, service: `gururmm-server`
## Commits (gururmm repo)
- `56723b1` — fix: 4-bug fix (grace period, AgentMessage forward-compat, WatchdogEvent, watchdog start)
- `2a7b74b` — fix: reviewer notes (NULL gap, warn! on watchdog event)
- `3b29acc` — fix: mspbackups config partner_id lookup
- `91630cb` — fix: handle null LastStart/NextStart in MSP360 BackupPlan
- `86e7ade` — fix: prefer online agent in MSP360 hostname match
- `f146bd9` — fix: parse MSP360 no-timezone dates as UTC
## Anti-Pattern Added
`build-server.sh` is separate from `build-agents.sh`. Server code changes require a manual `sudo /opt/gururmm/build-server.sh` after pushing to Gitea.
---
## Update: ~17:30 PT — Self-heal alert view + agent alerts tab
### Session Summary
Resumed from a context compaction boundary. The self-heal alert changes (committed to server but not built in the previous context) were deployed first: server rebuilt to v0.3.3 and dashboard deployed. CONTEXT.md was updated to reflect the split versioning (agent 0.6.25 / server 0.3.3) and to document the `build-server.sh` anti-pattern.
A second feature request came in: the top-level Alerts tab should show only active (unacknowledged) alerts, while the agent detail page should have its own verbose, filterable alert history. Three commits landed in total across the two work blocks.
For the agent detail alerts tab: the `alertsApi.list` endpoint already supported `agent_id` filtering in `AlertFilter`. `AlertRow`, `StatusBadge`, and `formatRelative` were exported from `Alerts.tsx` for reuse in `AgentDetail.tsx`. A new `AgentAlertsPanel` component was added inline (following the same pattern as `AgentLogsPanel`), defaulting to all statuses to show full history.
### Key Decisions
- **Default to `active` not `unresolved` on top-level Alerts**: "Unresolved" (active + acknowledged) was the previous session's choice, but acknowledged alerts have already been triaged — the at-a-glance view should only show what needs attention. Acknowledged and resolved are still a dropdown away.
- **Agent detail shows all statuses by default**: Contrast with the fleet view — the per-machine tab is the history view, so defaulting to all statuses (including resolved) gives a complete picture of what happened on that machine.
- **Exported shared components from Alerts.tsx rather than creating a new file**: `AlertRow`, `StatusBadge`, `formatRelative`, `SeverityBadge` were already complete and tested. Extracting to a shared component file was not worth the churn; direct exports kept the diff minimal.
- **No server-side changes needed**: `GET /api/alerts` already accepts `agent_id` in `AlertFilter`. The feature was purely a dashboard change.
### Configuration Changes
- `dashboard/src/pages/Alerts.tsx` — default status filter changed from `"unresolved"` to `"active"`; dropdown reordered; `AlertRow`, `StatusBadge`, `SeverityBadge`, `formatRelative` exported
- `dashboard/src/pages/AgentDetail.tsx` — `"alerts"` added to `TabId` and `VALID_TABS`; `AgentAlertsPanel` component added; "Alerts" tab wired into tab bar and `TabPanel` tree
- `server/src/db/alerts.rs` — `"unresolved"` meta-filter maps to `IN ('active', 'acknowledged')`; `status_contributes_param` boolean guards bind-slot indexing (deployed in previous context, built this session)
- `projects/msp-tools/guru-rmm/CONTEXT.md` — version split to agent 0.6.25 / server 0.3.3; `build-server.sh` anti-pattern documented
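
The `status_contributes_param` guard in `server/src/db/alerts.rs` exists because the `"unresolved"` meta-filter expands to a literal `IN` list and consumes no bind slot. The sketch below shows why that matters. It assumes Postgres-style `$n` placeholders and a hypothetical `build_alert_filter` function; the real query code is not shown in this log and may be structured differently.

```rust
/// Build a WHERE clause plus its bind values for an alert listing.
/// Assumes `$n` placeholders; column names are illustrative.
fn build_alert_filter(status: Option<&str>, agent_id: Option<&str>) -> (String, Vec<String>) {
    let mut clauses: Vec<String> = Vec::new();
    let mut binds: Vec<String> = Vec::new();

    // "unresolved" becomes a fixed IN list, so it uses no bind slot.
    let status_contributes_param = matches!(status, Some(s) if s != "unresolved");

    match status {
        Some("unresolved") => {
            clauses.push("status IN ('active', 'acknowledged')".to_string());
        }
        Some(s) => {
            binds.push(s.to_string());
            clauses.push("status = $1".to_string());
        }
        None => {}
    }

    if let Some(id) = agent_id {
        // The agent filter is $2 only when the status filter took $1.
        let slot = if status_contributes_param { 2 } else { 1 };
        binds.push(id.to_string());
        clauses.push(format!("agent_id = ${}", slot));
    }

    let where_sql = if clauses.is_empty() {
        String::new()
    } else {
        format!("WHERE {}", clauses.join(" AND "))
    };
    (where_sql, binds)
}
```

Without the guard, an `"unresolved"` filter combined with `agent_id` would emit `agent_id = $2` with only one bound value.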
### Infrastructure
- Server: 172.16.3.30 | gururmm-server service | `/usr/local/bin/gururmm-server`
- Dashboard: nginx @ `/var/www/gururmm/dashboard/` | proxied via https://rmm.azcomputerguru.com
### Commits (gururmm repo)
- `2b10d17` — feat: self-heal alert view — unresolved default filter (`server/src/db/alerts.rs`, `dashboard/src/pages/Alerts.tsx`)
- `e5ac537` — (previous session boundary commit)
- `f888788` — feat: agent alerts tab + active-only default on top-level view (`dashboard/src/pages/Alerts.tsx`, `dashboard/src/pages/AgentDetail.tsx`)
### Server Builds
- `sudo /opt/gururmm/build-server.sh` ran at 00:14 UTC (17:14 PT) → v0.3.3 deployed
- Dashboard built (`npm run build`) and deployed to `/var/www/gururmm/dashboard/` twice (once per feature batch)