Files
claudetools/wiki/projects/gururmm.md
Mike Swanson 14362628a2 sync: auto-sync from GURU-BEAST-ROG at 2026-06-07 21:26:22
Author: Mike Swanson
Machine: GURU-BEAST-ROG
Timestamp: 2026-06-07 21:26:22
2026-06-07 21:26:26 -07:00

564 lines
68 KiB
Markdown

---
type: project
name: gururmm
display_name: GuruRMM
last_compiled: 2026-06-07
compiled_by: GURU-BEAST-ROG/discord-bot
aliases:
- guru-rmm
sources:
- "gururmm@main: server/src/api/*.rs (REST API surface, ~30 route modules)"
- "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs)"
- "gururmm@main: server/migrations/*.sql (55+ migrations — feature checkpoints, incl. 048_bsod_events, 054_agent_role_override, 055_alert_mutes)"
- "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/"
- "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)"
- "gururmm@main: server/migrations/048_bsod_events.sql"
- "gururmm@main: agent/src/bsod.rs"
- "gururmm@main: deploy/build-pipeline/webhook-handler.py"
- "gururmm@main: deploy/build-pipeline/build-server.sh"
- "gururmm@main: commit 137dd85 (BUG-020 tray fix: single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event)"
- projects/msp-tools/guru-rmm/CONTEXT.md
- projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md
- projects/msp-tools/guru-rmm/docs/UI_GAPS.md
- projects/msp-tools/guru-rmm/docs/ARCHITECTURE_DECISIONS.md
- projects/msp-tools/guru-rmm/docs/tech-stack.md
- projects/msp-tools/guru-rmm/docs/DESIGN.md
- .claude/memory/reference_gururmm_server.md
- .claude/memory/reference_gururmm_api.md
- .claude/memory/gururmm-development-principles.md
- .claude/memory/feedback_gururmm_agent_parity.md
- .claude/memory/reference_pluto_build_server.md
- .claude/memory/project_mac_gururmm_setup_pending.md
- .claude/memory/feedback_gururmm_build_channel_default.md
- .claude/memory/reference_gururmm.md
- credentials.md
- session-logs/2025-12-15-session.md
- session-logs/2025-12-20-session.md
- session-logs/2026-04-19-session.md
- session-logs/2026-04-21-session.md
- session-logs/2026-04-29-session.md
- session-logs/2026-05-12-guru-rmm-macos-agent-phase1.md
- session-logs/2026-05-15-session.md
- session-logs/2026-05-16-session.md
- session-logs/2026-05-17-session.md
- session-logs/2026-05-19-gururmm-backup-fixes.md
- session-logs/2026-05-19-session.md
- session-logs/2026-05-21-session.md
- session-logs/2026-05-23-session.md
- session-logs/2026-05-24-session.md
- session-logs/2026-05-24-GURU-KALI-session.md
- session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md
- session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md
- session-logs/2026-06-07-mike-gururmm-offboarding-spec.md
- "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)"
- session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
- session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
- session-logs/2026-06-07-mike-gururmm-ui-gaps-enrollment.md
backlinks:
- clients/cascades-tucson
- systems/gururmm-build
- systems/jupiter
- systems/pluto
---
# GuruRMM
## Summary
GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 55 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints.
**Current version:** agent 0.6.54 (beta) / 0.6.47 (stable) / server 0.3.45 as of 2026-06-07. Fleet on stable target 0.6.47 (pinned 2026-05-28); GURU-5070 is the lone beta agent (explicit per-agent override), running 0.6.54 and auto-riding each new beta build. Note: committed changelogs are stale (stop at agent v0.6.22 / server v0.3.1) — migrations + commit log are the authoritative feature record, not changelogs.
**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010` on main). `backup_storage_low` alert type removed entirely — the `DataCopied/TotalData` ratio measures backup-dataset completeness, not destination capacity, and produced 5 fleet-wide false alerts. See Backup Integration section for full detail.
**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07 (second session):** Scheduled offline-sweep evaluator (60s tokio interval, server-only). `agent_offline` alerts for servers only; classifier: `os_product_type` 2/3, else `os_name`/`os_version` ~/server/i, else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to a representative offline agent. Warm-up restart guard. Dashboard: servers elevated in triage individually; workstations collapsed. Alert mute (perma-silence, migration 055 `alert_mutes` + `muted` status): keyed on `dedup_key`, permanent until un-ignored, reason required; gates both `create_or_update_alert` and `create_check_alert` bypass. Dashboard Ignore/Muted/Un-ignore UI NOT yet built. Commits f1cdf5d/30e4f23/21d63bd/3eedf91 (offline), 29c405e/a120e71 (mute). Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier.
**See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen).
**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there.
**Goal:** Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system — not bolted-together products.
---
## Recent Work
### 2026-06-07 — UI Gap Batch + Enrollment Audit (GURU-BEAST-ROG session)
**API changes:**
- `get_agent` handler now returns `AgentWithDetails` — includes `client_id` and `client_name` via JOIN query (was previously missing client context)
- New endpoints: `GET /api/agents/:id/bsod-events`, `GET /api/agents/:id/version-history`
- New endpoints: `GET /api/install-reports`, `GET /api/install-reports/:id`
- New endpoint: `GET /api/discovery/all-devices`
- `DELETE /api/agents/:id/key` redesigned as dual-revoke: nulls `agents.api_key_hash` (legacy Mode 1/2) AND sets `enrolled_agents.revoked = TRUE` (modern agk_ Mode 3). Previous implementation only touched the legacy field, leaving modern agents able to reconnect.
- New endpoint: `GET /api/sites/:id/enrolled-agents` — full enrollment audit with duplicate-hostname annotation (scoped to active/non-revoked rows)
- New endpoint: `POST /api/enrolled-agents/:id/revoke` — per-enrollment targeted revocation with cascade to `agents.api_key_hash`
**Dashboard additions:**
- AgentDetail: Crashes tab (BSOD events), version history table in Updates tab
- Dashboard.tsx: fleet stats wired to `/agents/stats` endpoint (was computed client-side)
- SiteDetail: Revoke Key button + Enrollment tab with full audit table, status badges, duplicate-hostname warnings, per-row revoke with confirm dialog
- New pages: InstallReports (`/install-reports`), Discovery (`/discovery`)
- FunctionRail: nav links for Install Reports + Fleet Discovery
**Hotfix (production 500s):**
- `get_agent_with_details_by_id` SQL was missing `a.role_override` — migration 054 added `role_override: String` (non-optional) to `AgentWithDetails` and the pre-054 SQL template was used for the new function. `sqlx::FromRow` panics at runtime when a required field has no column. All `GET /api/agents/:id` calls returned 500 until hotfix commit `6faa382`.
**Key commits:** `5854c63` (UI gap batch), `6faa382` (hotfix: role_override in SQL), `49a7109` (enrollment audit batch), `00129af` (fix: wire revoke_agent_key_handler to route)
**Tunnel status:** Server `tunnel.rs` is a dead-code skeleton logging "not yet implemented" — deferred; needs xterm.js design + WS protocol spec before implementation.
---
### 2026-06-07 — Credential Inheritance Deployment & Offboarding Spec
**Deployed to production:**
- Server v0.3.45 with credential inheritance feature enabled
- Credential inheritance allows hierarchical credential propagation (Global → Client → Site) with opt-in `is_inheritable` flag
- De-duplication logic by (credential_type, label) with most-specific-scope-wins resolution
- `/effective` endpoints validated for clients and sites showing proper inheritance and conflict resolution
**Dashboard UI enhancements:**
- Clickable alert severity badges in ClientExceptionsBand component
- Badge clicks filter /alerts page by severity + client_id for scoped alert viewing
- Offline badge filters /agents page to show client-specific offline agents only
- Deep-linking support via URL parameters for client filtering on Alerts and Agents pages
**Specification work:**
- SPEC-028 offboarding wizard specification created (835 lines)
- Covers site and client offboarding workflows with 6-step and 5-step modals respectively
- Includes data export, dependency analysis, typed confirmation, audit logging, and cascading deletions
- FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section covering offboarding/onboarding features
**Key features of offboarding spec:**
- Multi-step modal workflow with clear progression
- Pre-flight dependency checks (alerts, pending commands, active connections)
- Comprehensive data export (credentials, policies, network devices, audit trail) with temp tokens and 1-hour expiry
- Typed name confirmation for destructive final step
- Immutable audit_logs table for compliance and traceability
- Enforced cascade deletion validation for clients
---
## Capabilities / Feature Set
*Synthesized from authoritative artifacts (API routes, agent modules, 48 migrations, roadmap, commit log) at live `main` — not from session logs. See Compilation Notes.*
Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to `interrupted`. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).
### Monitoring & Telemetry
- Core metrics per interval (policy-tunable per section): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win `GetLastInputInfo`, Linux `xprintidle`), public/WAN IP (cached, multi-service fallback). Cross-platform via `sysinfo`.
- Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Windows via bundled **LibreHardwareMonitor** + WMI (`ohw.rs`); Linux via `/sys/class/thermal`; `sysinfo::Components` fallback.
- Process drill-down: top 10 by CPU + top 10 by memory in every metrics payload (cross-platform).
- Network state: per-interface IPv4/IPv6, MAC, derived CIDR subnets — sent on change only.
- **Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01):** Windows-only `agent/src/bsod.rs` polls `C:\Windows\Minidump` for new `.dmp` files (5-min filetime poll), parses the kernel dump header at fixed `DUMP_HEADER64`/`DUMP_HEADER32` offsets (bugcheck code @0x38, 4 parameters @0x40/48/50/58, FILETIME @0xFA8 — the `minidump` crate parses only Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references the System event log (WER 1001 / Kernel-Power 41) for Report Id and faulting driver, deduplicates via a `C:\ProgramData\GuruRMM\bsod-seen.json` watermark (first run baselines existing dumps as seen, alerts on none), and sends `AgentMessage::BsodEvent`. Server: migration `048_bsod_events.sql` + `server/src/db/bsod_events.rs` + `ws/mod.rs` handler inserts the row and raises an **always-Critical** alert, deduplicated by unique `(agent_id, dump_sha256)` + alert `dedup_key`. Verified end-to-end against a real `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys) on GURU-5070. Phase 2/3 deferred: dashboard "Crashes" tab + BSOD in Alerts stream, `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
### Remote Execution
- Command types: `shell`, `powershell`, `python`, raw `script{interpreter}`, and `claude_task`. Options: `timeout_seconds`, `elevated`.
- **Execution context** (`041_add_command_context`): `system` (default — Session 0 / service SYSTEM) **or `user_session`** (runs in the active logged-on user's desktop session via WTS token impersonation: `WTSQueryUserToken` + `CreateProcessAsUserW` + per-user env block). Windows-only; requires an active session.
- Commands individually cancellable (`POST /commands/:id/cancel`) and aborted on disconnect. Auditable history; status running/completed/failed/timeout/interrupted.
- Script library (`017_scripts`): stored scripts dispatched to agents with args/env/timeout/`run_as_user`; per-run history.
### Inventory & Discovery
- Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
- VM / hypervisor / container detection (`032/033`): `is_virtual_machine`, `hypervisor_type`, `vm_uuid`, `is_hypervisor` + hosted VM UUIDs, `is_container`, `is_unraid`.
- User/group inventory (`037`-`040`): local + domain + Azure AD accounts (enabled, pw-never-expires, last-logon, is_admin, AD email/UPN/dept), domain-join classification (none/ad/aad/hybrid), domain name, M365 tenant ID, **domain-controller detection (`is_dc`)**, group membership. Policy-scheduled (default 24h).
- Network discovery: server-dispatched scans — TCP probes over configurable ranges/ports, ARP MAC, reverse DNS, basic OS fingerprint; devices stream back and are persisted/editable.
### Patch / Agent Update Management
- Self-updater: server sends version + URL + SHA256; agent downloads, verifies checksum, atomically swaps binary, restarts, and **auto-rolls back** to a kept backup if it fails to reconnect (~180s window).
- Auto-update gated on effective policy `auto_update` (channel + defer_hours; maintenance-window field received but not yet enforced [verify]). Force via `POST /agents/:id/update`. Update channels at agent/site/client (`026`).
- Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Scaffolding only — promotion is manual; health-gated automation is written-but-unwired (Phase 2).**
### Policy & Configuration Management
- Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via `ConfigUpdate`. Effective policy queryable per scope.
- Checks engine (`018`/`019`): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win `sc.exe`, Linux `systemctl`). Policy-attached check templates (`024`) with push-to-agent sync. On-demand `run-checks`.
- Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but aren't routed yet [verify].**
### Alerting & Watchdog
- Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (`022`) with effective resolution; per-client email settings (`020`). Maintenance mode (`021`) to suppress alerting per scope.
- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve.
- **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Generates server-only `agent_offline` alerts based on `agent_role` classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054, `agents.role_override`). Site rule (>=50% of a site's servers offline AND >=3 absolute) -> `mass_offline_site` aggregate; fleet rule (>=10 servers offline) -> `mass_offline_fleet` aggregate; aggregates pinned to a representative offline agent with site/fleet `dedup_key` (avoids making `alerts.agent_id` nullable). Restart guard = warm-up window after boot only (NOT `last_seen < started_at`, which is permanently false in steady state -- code review caught and fixed this spec defect). Dashboard: offline servers individual + elevated in triage; offline workstations collapsed into a "N workstations offline" roll-up; role-override control on agent detail page. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier; inventory-less offline servers (e.g., SIF-SERVER, Server2013) auto-classify as workstations until manually overridden.
- **Alert ignore/mute -- perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Mute keyed on `dedup_key` (universal recurring-condition id, always set). Permanent until un-ignored; reason required (400 if missing). Gate inserted at the top of `create_or_update_alert` AND the `create_check_alert` bypass so muted conditions write `status='muted'` and never email -- the active alert path is byte-for-byte unchanged. Transactional `mute_condition`/`unmute_condition`. `POST /api/alerts/:id/mute` + `/unmute`. Distinct from ack/resolve (which quiet only the current cycle). Dashboard Ignore button, Muted filter, and Un-ignore NOT yet built. New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`. New status: `muted`. Key functions: `is_dedup_muted`, `mute_condition`, `unmute_condition` (db/alerts.rs); `offline_sweep`, `agent_role` (alerts/offline.rs).
### Credentials Management
- Encrypted credentials vault (`016`): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate `/reveal` decrypt endpoint (known HIGH item: `/reveal` ownership-scope check — [verify current state]).
- **Credential inheritance (deployed 2026-06-07):** Opt-in hierarchical cascade with `is_inheritable` flag allowing credentials to propagate from Global → Client → Site. De-duplication by (credential_type, label) with most-specific scope winning. `/effective` endpoints merge and return inherited + direct credentials with `inherited_from` indicator.
### Backup Integration (MSP360 / MSPBackups)
- Multi-provider config (`034`/`035`) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (`044`) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
- **Alert quality pass (2026-06-07, commit `779f7f6`):** Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from backup alerting and compliance evaluation (FU2 guard). MSP360 message JSON decoded into readable alert text via `summarize_backup_error` (FU1). `create_or_update_alert` now refreshes `title`/`message`/`severity` on re-trigger, also fixing a latent severity-escalation freeze where re-triggered alerts kept stale severity. Fleet result: false `backup_failed` alerts 15 -> 2; survivors (AD1: retention warning + file skips, LAB-Becky: no storage account configured) are genuine and self-describing.
- **`backup_storage_low` alert type REMOVED (2026-06-07, commit `b82c010`):** `check_storage_threshold` computed `usage_percent = DataCopied / TotalData * 100`, but those MSP360 fields describe how much of the source dataset was uploaded, not the cloud destination's remaining capacity. Image/full-backup plans are naturally near 100% by design; the check produced 5 fleet-wide false alerts (3 Critical), the clearest being DF-HYPERV-B "100% Full" on a 4 GB Hyper-V plan. MSP360 Managed Backup does not expose true destination capacity in those fields. Removed `check_storage_threshold` call + function; added `resolve_all_backup_storage_alerts` (type-scoped UPDATE, idempotent, once-per-sync-tick after prune) to clear the 5 stale alerts. Fleet: 5 -> 0 `backup_storage_low` alerts verified. Genuine destination-capacity alerting deferred (would need MSP360 storage-accounts endpoint — separate feature, not scoped).
- **Backup staleness:** Handled by the existing `BACKUP_STALE` evaluator (cadence-derived window + 7-day backstop + unit tests). A plan reporting status=success can still be `non_compliant/BACKUP_STALE` if its last run is past the window (e.g. an abandoned plan on the same agent as a healthy current plan). No new code required.
- **Key functions** (`server/src/mspbackups/` + `server/src/db/`): `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE), `create_or_update_alert`.
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded from alerting/compliance): 8, 13.
### Remote Access (Tunnel)
- Agent side substantially built (`TunnelManager` state machine; Open/Close/Data). **Server side is a dead-code skeleton — not declared in `api/mod.rs`, no `/tunnel` routes, WS handler logs "not yet implemented." Not production-ready.**
### Identity / Multi-tenancy / Security
- Auth: JWT (login/register/me); agents auth over WS via per-agent API key + hardware device_id.
- **Microsoft Entra ID SSO** (OAuth2/OIDC + PKCE), gated on server config. Multi-provider incl. Google is spec'd (SPEC-008) but **Google not implemented [verify]**.
- Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin **user impersonation** (`/auth/impersonate/:id`). Backend present; dashboard UI shipped 2026-05-31.
- Enrollment & keys (`012`): per-agent key issuance on first run, site API keys (regenerable), site-specific MSI with SITEKEY injected at download, public install-report ingestion. Legacy PowerShell agent path for Server 2008 R2.
- Logs: agent log upload (periodic + on-demand), per-agent events (`042`), fleet log view, AI-assisted log analysis (`/logs/analyze`) — AI-optional per locked decision.
### Enrollment Detail (migration 012 + 2026-06-07 audit)
The `enrolled_agents` table (migration 012) is the authoritative enrollment history log:
| Field | Notes |
|---|---|
| `site_id` | Site the agent enrolled under |
| `agent_id` | Set on WebSocket connect (links enrollment record to the agent row) |
| `hostname` | Hostname at enrollment time |
| `agent_key_hash` | Hash of the `agk_` enrollment key (intentionally hidden; not surfaced in API) |
| `enrolled_at` | Timestamp of initial enrollment |
| `last_seen` | Updated on each WS connection |
| `ip_address` | IP at last connection |
| `os_version` | OS at enrollment time |
| `revoked` | Boolean; set TRUE by revocation endpoints |
**WS auth modes:**
- **Mode 3** (modern, `agk_` keys): checks `enrolled_agents.agent_key_hash` first
- **Mode 1/2** (legacy): falls back to `agents.api_key_hash`
**Revocation — dual-layer (fixed 2026-06-07):**
- `DELETE /api/agents/:id/key` (bulk): revokes BOTH `agents.api_key_hash` (Mode 1/2) AND sets `enrolled_agents.revoked = TRUE` for all enrollment records (Mode 3). Previously only touched the legacy field.
- `POST /api/enrolled-agents/:id/revoke` (targeted): per-enrollment revocation with cascade to `agents.api_key_hash`.
**Enrollment audit endpoint:** `GET /api/sites/:id/enrolled-agents` — returns full enrollment history with `is_duplicate_hostname` annotation. Duplicate flag is scoped to active (non-revoked) rows only; a revoked + active pair is not flagged.
**Dashboard:** Enrollment tab on SiteDetail — audit table with status badges, duplicate-hostname warnings, and per-row Revoke button with confirmation dialog (shipped 2026-06-07).
### Platform coverage
- **Cross-platform (Win/Linux/macOS):** metrics, network state, hardware/software/service inventory, user/group + DC/domain detection, checks (service checks Win+Linux), discovery, self-update, scripts/commands.
- **Windows-only:** `user_session` command context (WTS), LibreHardwareMonitor temps, remote registry, tray-into-session, watchdog SCM supervision, BSOD detection.
- **macOS:** agent deployed (Phase 1); tray is a stub; automated Mac build pipeline is an intentional stub (no build host) [verify before claiming CI Mac releases].
---
## Architecture
### Components
| Component | Location | Tech | State |
|---|---|---|---|
| Server | 172.16.3.30:3001, systemd `gururmm-server`, binary `/usr/local/bin/gururmm-server` | Rust, Axum | deployed, production |
| Dashboard | https://rmm.azcomputerguru.com, nginx at `/var/www/gururmm/dashboard/` | React + TypeScript + Vite, shadcn/ui, Tailwind CSS v4 | deployed, production |
| Agent (Windows) | Endpoints, installed as `GuruRMMAgent` Windows service via WiX MSI | Rust, Windows MSVC | deployed; stable fleet on 0.6.47; GURU-5070 (beta) on 0.6.54 |
| Agent (Linux) | Endpoints, systemd `gururmm-agent`, binary `/usr/local/bin/gururmm-agent` | Rust, musl static | deployed |
| Agent (macOS) | Endpoints, LaunchDaemon `com.azcomputerguru.gururmm-agent.plist` | Rust, aarch64/x86_64 | Phase 1 deployed 2026-05-12; code signing issue on Apple Silicon |
| Tray (Windows) | System tray, named pipe IPC | Rust | deployed |
| Tray (Linux) | System tray, Unix socket IPC, libappindicator/GTK | Rust, GTK | deployed 2026-05-24 (PR #13+#14 merged) |
| Tray (macOS) | Menu bar | Rust | stub/TODO (issue #18) |
| PostgreSQL DB | localhost:5432 on 172.16.3.30, database `gururmm` | PostgreSQL | deployed |
| Coord API | 172.16.3.30:8001/api/coord | FastAPI (part of ClaudeTools API) | deployed |
| Build pipeline | 172.16.3.30:9000 webhook + `/opt/gururmm/` scripts | Python (webhook-handler.py), Bash | deployed; split into per-platform scripts 2026-05-24; server build wired into webhook 2026-06-02 |
| Pluto (Windows build VM) | 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid) | Rust MSVC, WiX v4 | operational |
### Key Files & Repos
- **Active repo:** `azcomputerguru/gururmm` — http://172.16.3.20:3000/azcomputerguru/gururmm
- **Submodule working tree:** `D:\claudetools\projects\msp-tools\guru-rmm` — tracks active repo; develop here and push to Gitea
- **Server binary:** `/usr/local/bin/gururmm-server` on 172.16.3.30
- **Agent binary (Linux):** `/usr/local/bin/gururmm-agent`
- **Agent config (Linux/macOS):** `/etc/gururmm/agent.toml` (root, mode 600); macOS uses `/usr/local/etc/gururmm/site.plist`
- **Agent registry (Windows):** `HKLM\SOFTWARE\GuruRMM\SiteId` (written by MSI)
- **Windows service name:** `GuruRMMAgent` (NOT `gururmm-agent`)
- **Downloads dir:** `/var/www/gururmm/downloads/` on 172.16.3.30
- **Webhook handler:** `/opt/gururmm/webhook-handler.py` (port 9000, systemd `gururmm-webhook`)
- **Build scripts:** `/opt/gururmm/build-shared.sh`, `build-linux.sh`, `build-windows.sh`, `build-mac.sh` (split 2026-05-24; `build-agents.sh` is now a compat wrapper)
- **Server build script:** `/opt/gururmm/build-server.sh` (now dispatched by webhook on `server/` changes; has change-gate marker + binary backup + auto-rollback)
- **Per-platform SHA tracking:** `/opt/gururmm/last-built-commit-{linux,windows,mac}`
- **Server build SHA tracking:** `/opt/gururmm/last-built-commit-server`
- **Pluto known-hosts:** `/opt/gururmm/pluto_known_hosts` (pinned SSH keys; installed 2026-05-24)
- **Build log (Linux):** `/var/log/gururmm-build-linux.log`
- **Build log (Windows):** `/var/log/gururmm-build-windows.log`
- **Build log (Server):** `/var/log/gururmm-build-server.log`
- **API (internal):** http://172.16.3.30:3001
- **API (external):** https://rmm-api.azcomputerguru.com (Cloudflare)
- **Dashboard:** https://rmm.azcomputerguru.com
- **DB URL:** `postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm`
- **Vault path:** `infrastructure/gururmm-server.sops.yaml`
### Repo Structure
```
gururmm/
├── agent/ Rust agent (managed endpoints)
│ └── src/
│ ├── bsod.rs Windows BSOD/kernel-crash detection (DUMP_HEADER64/32 offset parser)
│ ├── ipc.rs Unix socket IPC (Linux); Windows named pipe
│ ├── tunnel/ TunnelManager state machine
│ ├── metrics/ sysinfo collection + temps (LibreHardwareMonitor/WMI on Win, /sys/class/thermal on Linux) — BUG-001 resolved
│ ├── registry_ops/ Windows registry read/write
│ ├── updater/ Self-update handler
│ └── main.rs systemd unit template generation
├── server/ Rust/Axum API server
│ └── src/
│ ├── api/ REST handlers (alerts.rs: mute/unmute; agents.rs: role-override)
│ ├── alerts/ Alerting modules (offline.rs: offline_sweep, mass_offline detection, agent_role classifier, warm-up restart guard)
│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs, db/alerts.rs (is_dedup_muted, mute_condition, unmute_condition, AlertStatus::Muted)
│ ├── ws/ WebSocket handler (BsodEvent dispatch)
│ └── mspbackups/ MSP360 backup integration (sync.rs: derive_backup_status, summarize_backup_error, resolve_all_backup_storage_alerts; client.rs: BackupPlan.plan_type)
├── dashboard/ React/TypeScript UI (lib/agentRole.ts: server classifier; components/ExceptionStream.tsx: offline triage with workstation roll-up; pages/AgentDetail.tsx: role-override control)
├── tray/ System tray binary
├── installer/ WiX v4 MSI (gururmm-agent.wxs)
├── deploy/
│ └── build-pipeline/ webhook-handler.py, build-*.sh, build-server.sh
├── scripts/ Build/ops scripts
└── docs/ FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md, DESIGN.md, specs/
```
---
## Development
### Current Focus
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
- **Credential inheritance (deployed 2026-06-07):** Production server running v0.3.45 with full credential inheritance and de-duplication. `/effective` endpoints validated. Dashboard clickable alert badges and client-scoped filtering implemented.
- **SPEC-028 offboarding wizard (specification complete):** 835-line spec created for site and client offboarding workflows. Includes data export, dependency analysis, typed confirmation, and audit logging. Roadmap updated with "Client & Site Lifecycle Management" section. Implementation pending.
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab shipped 2026-06-07 (AgentDetail Crashes tab + version history in Updates tab). Remaining: BSOD in Alerts stream (issue #10); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit).
- **MSP360 backup integration** — Alert quality pass shipped 2026-06-07 (commits `779f7f6` + `b82c010`): false `backup_failed` alerts 15 -> 2; `backup_storage_low` removed (structurally false signal; 5 -> 0 false alerts). Dashboard UI shipped 2026-05-31. Phase 2 (management) not started; genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint).
- **MSP360 API scope (confirmed 2026-06-07):** Provider API token (vault `msp-tools/msp360-api.sops.yaml`) is monitoring-only at our tier: Companies/Users/Monitoring GET endpoints; every management path (plan delete, manual run, storage config) returns 404. White-labeled MSP360 agent has no `cbb.exe` at any standard path. Plan delete/run/storage remain MSP360-console tasks.
- **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt on alert row; Muted filter on alerts page + agent Alerts tab; Un-ignore control; muted rows show Un-ignore instead of Ack/Resolve. Server endpoints ready: `POST /api/alerts/:id/mute` + `/unmute`.
- **Tunnel session management (P2, deferred):** Server `tunnel.rs` is a dead-code skeleton that logs "not yet implemented"; wiring the route would expose a non-functional endpoint. Needs: xterm.js design, WS protocol spec for dashboard client, proper Phase 2 implementation. Estimated 3-5 days separate effort.
- **SSH from BEAST to production broken:** `known_hosts` issue prevents SSH key auth from GURU-BEAST-ROG to `172.16.3.30`. Workaround during incident: JWT token + curl for endpoint diagnosis. Needs production server host key added to BEAST's `~/.ssh/known_hosts`.
- **`revoke_agent_key` dead code:** Old single-layer handler is dead code now that the route points to `revoke_agent_key_handler`. Should be removed in a cleanup pass.
- **Enrollment key display gap:** `enrolled_agents` table doesn't expose a key prefix — key hash is intentionally hidden. UI shows hostname, dates, IPs but not a redacted key identifier. Could add first-8-chars of hash as a non-sensitive display field in a future pass.
- **Offline alerting -- classification gap:** SIF-SERVER (SifOidak.local DC), Server2013 (Sombra), and other inventory-less offline servers currently auto-classify as workstations (no `os_product_type`; `os_name` doesn't match ~/server/i for those names). Needs manual `role_override` via `PUT /api/agents/:id/role-override` -- Mike held on tagging them.
- **`/backup-status` endpoint shape gap:** Returns only one plan per agent (not all plans); makes agents with a dead old plan + healthy current plan look stale-but-green in the backup tab. Compliance domain evaluates all plans correctly. Not fixed this session — noted for future.
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH).
### Patterns & Anti-Patterns
**Anti-patterns — never repeat:**
| Pattern | What Went Wrong |
|---|---|
| `useMemo` with stable deps for data-dependent values | queryClient is stable, memo never recomputes after queries resolve. Use `useQuery` instead. |
| CSS variable text colors inside the sidebar | Sidebar bg is hardcoded dark; CSS vars flip in light mode. Use `text-white` explicitly inside sidebar. |
| Deploying without stopping the server first | "text file busy" kernel error. Always `systemctl stop` before `cp`. |
| Building without `DATABASE_URL` | sqlx compile-time macros fail. `DATABASE_URL` is in `/home/guru/.cargo/env`. |
| DB migrations without inserting into `_sqlx_migrations` | Server crashes on start. Must insert SHA-384 checksum manually. |
| WiX MSI builds on Linux | WiX requires `msi.dll`. MSI must be built on Pluto (Windows). |
| Manual builds via SSH | All builds go through `webhook-handler.py`. Never SSH and run `cargo build` + artifact copy manually. |
| TOML/config for agent endpoint or site_id | Server URL compiled into binary, site_id baked into MSI. No runtime config files for these values. |
| `path.find('\\')` in `#[cfg(windows)]` files | Compiles on Linux silently, fails on Pluto MSVC with unterminated char literal. Use `'\\\\'`. |
| `STATUS_BADGE_CLASSES` Record const | Vite/Rollup may optimize away the lookup. Use explicit `getStatusBadgeClass()` if/else function. |
| SSH heredoc for TypeScript edits | Shell strips double-quote characters. Edit locally in submodule, push to Gitea, pull on server. |
| `Restart-Service GuruRMMAgent -Force` in command scripts | Kills agent before it can report result. Commands stay forever `running`. Use scheduled task with delay instead. |
| `sudo -u guru git` in systemd build context | git rejects repo as dubious ownership when running as root on guru-owned repo. Use `safe.directory` config or `sudo -u guru git`. |
| Self-updating running bash script | bash reads line-by-line from disk; replacing mid-execution silently skips remaining blocks. |
| `+1.77` legacy builds without `--ignore-rust-version` | Fail MSRV check after adding `rust-version` to Cargo.toml. Add `--ignore-rust-version` to legacy build lines only. |
| `StrictHostKeyChecking=no` for Pluto SSH | Replaced with pinned known-hosts at `/opt/gururmm/pluto_known_hosts`. MITM would compromise build artifacts. |
| CRLF line endings in migration files | sqlx SHA-384 checksum mismatch causes server crash on start. `.gitattributes` + `core.autocrlf=false` + pre-commit hook prevents this. |
| Dead WebSocket write half | WS write fails, send task dies, receive loop keeps agent in `ConnectedAgents` with dead write half. Commands silently fail. Fix: `tokio::select!` monitoring both tasks. |
| Using the `minidump` crate for Windows kernel dumps | The crate only parses Breakpad MDMP format, not Windows kernel PAGEDU64 dumps. Parse `DUMP_HEADER64`/`DUMP_HEADER32` at fixed offsets directly (validated against real dumps). |
| Build classification defaulting to stable | New agent builds should default to `beta`; promotion to `stable` is an explicit re-tag of the `<binary>.channel` sidecar. Defaulting stable races the auto-update fleet ahead before any beta soak. |
| Webhook dispatching only agent builds | The webhook historically triggered `build-linux.sh`/`build-windows.sh`/`build-mac.sh` but never `build-server.sh`; server changes silently went unbuilt until manual intervention. Now fixed — server build is dispatched alongside agent builds, gated by `last-built-commit-server`. |
| Auto-update-on-connect + default-stable tagging racing the fleet | If a build is tagged `stable` (even briefly by mistake), agents on the stable channel auto-update immediately on next heartbeat. Once fleet has updated, rolling back requires a new build — the prior binary is cleaned up. Default beta, soak, promote explicitly. |
**Good patterns:**
- **Platform parity rule** — any agent feature goes on Windows + Linux + macOS in the same commit. If a real implementation isn't feasible, add a working stub + `// TODO(platform): <os> — <reason>`. No silent no-ops.
- **Per-platform last-built-commit tracking** — Linux builds succeed and record progress independently of Windows builds.
- **Holistic feature development** — every feature ships backend + API + dashboard UI + docs together. Backend-only features are rejected.
- **sqlx offline mode** — compile-time query validation requires DB reachable or offline cache present.
- **`RuntimeDirectory=gururmm` in systemd unit** — systemd-native way to give agent writable `/run/gururmm/` for IPC socket.
- **Registry-first path resolution** — read `HKLM:\SOFTWARE\GuruRMM` for install dir, fall back to service PathName, then hardcoded default.
- **`interrupt_running_commands()` at reconnect** — flips all `status='running'` commands for reconnecting agent to `status='interrupted'`.
- **Build change-gate + backup/rollback in `build-server.sh`** — skips rebuild when `server/` is unchanged (marker `last-built-commit-server`); backs up previous binary; restores it if the new binary fails `is-active`. Prevents unnecessary rebuilds and covers the BUG-003 no-rollback gap for server.
- **Server's own root RMM agent for privileged ops** — the server (172.16.3.30) runs the GuruRMM Linux agent as root (hostname `gururmm`); it can read/write `/var/www/gururmm/downloads`, re-tag `.channel` sidecars, and trigger `build-server.sh` without SSH or `sshpass`.
- **GURU-5070 as permanent beta-channel canary** — per-agent `update_channel = 'beta'` override (only agent in the fleet with an explicit channel; site/all-other-agents default to `NULL` = stable). Gets every new beta build immediately; stable fleet is protected by the explicit `update_rollouts` pin.
### Build & Deploy
**CRITICAL: Never trigger builds manually via SSH. All builds go through the webhook pipeline.**
```
Gitea push to main
-> webhook-handler.py (172.16.3.30:9000, parallel threads per platform)
-> build-shared.sh (auto-version bump, git sync — runs once)
-> build-linux.sh (cargo build on .30; log: /var/log/gururmm-build-linux.log)
-> build-windows.sh (SSH -> Pluto 172.16.3.36 via pinned known-hosts
cargo build --release x64 MSVC + i686 MSVC
+1.77 legacy builds with --ignore-rust-version
WiX MSI build for site-specific base
sign-windows.sh (jsign + Azure Trusted Signing)
SCP artifacts back; log: /var/log/gururmm-build-windows.log)
-> build-mac.sh (stub — no build machine configured yet)
-> build-server.sh (gated: skips if no server/ changes since last-built-commit-server;
backs up current binary; builds + deploys; auto-rollback if fails is-active;
log: /var/log/gururmm-build-server.log)
-> artifacts -> /var/www/gururmm/downloads/ with sha256 + -latest symlinks
-> per-platform last-built-commit files updated
-> systemctl restart gururmm-agent (local agent on .30)
```
**Auto-version:** `build-shared.sh` diffs `agent/`, `server/`, `dashboard/` against last built SHA. For each changed component, bumps patch version in `Cargo.toml` or `package.json`, commits `[ci-version-bump]`, pushes. Webhook skips builds where all commits are version bumps.
**Build channel classification:** New agent builds are tagged `beta` by default (`build-windows.sh` and `build-linux.sh` fixed 2026-06-02; macOS already did this). Promotion to `stable` is an explicit step: `echo stable > /var/www/gururmm/downloads/<binary>.channel`. This is distinct from agents defaulting to the `stable` *channel* (correct and unchanged) — agents on the stable channel receive only the latest `stable`-tagged binary; beta agents receive the absolute-latest.
**Dashboard channels — BETA-FIRST (2026-06-02):** mirrors the agent beta/stable model. Every push to `main` touching `dashboard/` auto-builds to **beta**; **prod is explicit-promote only**.
| Channel | URL | Web root | Updates |
|---|---|---|---|
| beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (now dispatched by the webhook alongside agent/server builds, change-gated on `last-built-commit-dashboard`) |
| prod | https://rmm.azcomputerguru.com | `/var/www/gururmm/dashboard` | explicit only — `sudo /opt/gururmm/promote-dashboard.sh --confirm` (backs up prod; `--rollback` restores) |
**Do NOT hand-rsync into the prod web root** (the old `npm run build && rsync ... dashboard/` is superseded). One artifact serves both channels — the Vite build bakes in the absolute prod API URL (`rmm-api.azcomputerguru.com`), so beta uses shared prod data and is byte-identical to prod; beta is branded by an nginx-layer `sub_filter` BETA banner (`deploy/nginx/rmm-beta.conf`), so promotion is a plain rsync. **Serving/TLS:** second nginx vhost on `.30` (`server_name rmm-beta`, specific name beats prod `_`), Cloudflare `rmm-beta` A->``72.194.62.10` proxied (mirrors `rmm`), Jupiter NPM proxy host **id=11** -> `.30:80` presenting cert **id=10** (zone SSL mode is Full; if ever Full-Strict, beta needs its own SAN/cert).
**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start.
**Pluto (172.16.3.36):**
- Windows Server 2019 VM on Jupiter (Unraid)
- SSH: `ssh -o UserKnownHostsFile=/opt/gururmm/pluto_known_hosts Administrator@172.16.3.36`
- Rust stable 1.95.0 + 1.77 pinned for legacy builds
- VS Build Tools (MSVC), sccache at `C:\sccache`, WiX v4, Gitea clone at `C:\gururmm\`
**Auto-update delivery:**
- Server scans every 300s; dispatches update command on agent heartbeat
- Gated on effective policy `auto_update` (default on when policy is null)
- Agent: downloads to PrivateTmp, verifies SHA-256, replaces binary, restarts service
- Force-trigger: `POST /api/agents/:id/update`
---
## Active State
**Fleet (as of 2026-06-04, live Postgres verified; no enrollment changes in 2026-06-07 session):**
- 55 enrolled agents total
- Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default).
- Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline.
- GURU-5070 running 0.6.54 (beta). Permanent canary; gets every new beta build immediately upon reconnect.
**Enrolled clients/sites (live API, 2026-05-24 baseline; no removals since):**
| Client | Type | Sites | Notable agents |
|---|---|---|---|
| AZ Computer Guru (internal) | Internal | DF Server Storage, Howard-VM, Main Office, Mike's Car, Mikes House | Jupiter, PLUTO, gururmm, GURU-KALI, GURU-5070, Mikes-MacBook-Air.local, ACG-DC16, NEPTUNE, ix.azcomputerguru.com |
| BirthBiologic | Corporate | Main Office | BB-SERVER |
| Cascades of Tucson | Corporate | CascadesTucson | 27 agents — CS-SERVER, RECEPTIONIST-PC, ASSISTMAN-PC, MDIRECTOR-PC, MEMRECEPT-PC, and ~22 others |
| Dataforth Corp | Corporate | D1 | AD2, DF-GAGETRAK |
| Grabb & Durando Law Office | Corporate | Main Office | GND-SERVER |
| Instrumental Music Center | Corporate | IMCMain | IMC1 |
| Key, Paul | Residential | Home | KEY-MEDIA |
| Peaceful Spirit | Residential | Bridgette Home, Country Club, Mara Home | BridgettePSHomeComputer, PST-SERVER, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew |
| Safesite | Corporate | Glendale | MSI |
| Sombra Residential LLC | Corporate | main office | DESKTOP-UQRN4K3, Server2013 |
| Stamback Septic | Corporate | StambackSeptic | DESKTOP-BTR2AM3, StambackLaptopNew |
| Swanson, Len | Residential | Home | LAS-GAMER |
**API auth:**
- `POST /api/auth/login` -> JWT (~24h)
- Creds: vault `infrastructure/gururmm-server.sops.yaml` -> `credentials.gururmm-api.admin-email` / `admin-password`
- Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`
- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context`**`system`** (default; Session 0 / SYSTEM) or **`user_session`** (runs in the logged-on user's desktop session via WTS token impersonation; Windows-only, needs an active session), plus `timeout_seconds`/`elevated`. The agent does NOT run everything as LocalSystem — `user_session` is the per-user path (migration `041_add_command_context`, `agent/src/watchdog/wts.rs`).
- Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted)
**Dashboard — complete and working:**
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support, AgentDetail Crashes tab (BSOD events), AgentDetail version history in Updates tab, fleet stats from `/agents/stats`, SiteDetail Revoke Key button + Enrollment audit tab (with duplicate-hostname badge), Install Reports page, Fleet Discovery page.
**Dashboard — incomplete (see UI_GAPS.md):**
- Watchdog alerts UI — blocked on 2 missing server routes
- BSOD in Alerts stream (Phase 2 deferred)
- Tunnel session management (interactive terminal — server skeleton "not yet implemented"; needs xterm.js design + WS protocol spec)
- Offboarding wizard UI (SPEC-028 complete, implementation pending)
- Alert mute UI -- Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard Task 5 NOT started)
- Documentation / user guide / inline help tooltips (P3, not started)
**Open Gitea issues:**
- #10 — BSOD detection Phase 2/3 (dashboard + fetch_bsod_dump + full bugcheck table)
- #15 — Pipeline tray build (publish tray binary to downloads)
- #16 — Windows IPC peer authz
- #17 — logind console user resolution
- #18 — macOS tray
- #19 — subscriber broadcast
**BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):**
- Symptom: duplicate AND ghost `gururmm-tray.exe` tray icons. Live evidence: 5 stacked tray processes in Session 1 on GURU-5070 (one per watchdog restart over 6/1-6/2).
- Root cause: `TrayLauncher` (`agent/src/watchdog/wts.rs`) tracked launches only in an in-memory `HashMap<sid,HANDLE>` that resets on watchdog restart (esp. agent auto-update), so it relaunched trays into sessions that already had one; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping the tray's `Drop` -> `NIM_DELETE` (ghost).
- Fix (commit `137dd85`, gururmm@main -> beta): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW` (idempotent); (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback.
- Verified: independent Grok review + Code Review Agent APPROVE.
- Follow-up (coord todo `25fdf31a`): wire `terminate_all` graceful-shutdown into the watchdog policy-disable/uninstall path so fix #3 becomes active.
**Security backlog (HIGH):**
- `credentials/:id/reveal` — horizontal privilege escalation (no ownership scope check)
- `internal_err()` — ~130 call sites returning raw DB errors to callers
---
## Key Architecture Decisions (LOCKED)
These decisions are locked. Do not reverse without explicit user approval.
1. **Per-agent enrollment keys** — MSI contains server URL + site_id only. Agent calls `POST /api/enroll` on first run; server issues unique per-agent key stored hashed. Enables revocation, clone detection, audit trail.
2. **Site-specific MSI generation** — Universal base MSI from CI; dashboard endpoint generates site-specific MSI with site_id baked in via WiX property -> `HKLM\SOFTWARE\GuruRMM\SiteId`.
3. **No TOML/config for endpoints** — Server URL compiled into binary. No runtime config files for server URL or site_id.
4. **Policy inheritance chain** — global -> site -> client -> agent. Server computes merged effective policy and pushes via `ConfigUpdate` WebSocket message.
5. **Platform parity rule** — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible.
6. **Watchdog as separate process** — Main agent cannot reliably restart itself after a crash.
7. **Build pipeline is the only path to production** — Enforces signing, checksum generation, consistent artifact layout.
8. **Multi-tenancy identity model (ADR-001)** — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1.
9. **Holistic feature development (DESIGN.md)** — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
10. **AI-optional operation** — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
---
## History Highlights
| Date | Event |
|---|---|
| 2025-12-15 | Project genesis: Windows service + Linux installer + site code auth + build server. DB migrated from Jupiter Docker to local PostgreSQL. |
| 2026-04-19 | Full drill-down navigation, auto-install on first run, Pluto build VM setup started. |
| 2026-04-21 | MSI build fix (missing WiX extension flag). DESIGN.md created (holistic development mandate). BirthBiologic onboarded. |
| 2026-04-29 | UI_GAPS.md created. Holistic development principle formalized. |
| 2026-05-12 | macOS agent Phase 1 deployed from Mikes-MacBook-Air. Code signing issue on Apple Silicon noted. |
| 2026-05-15 | Dead WebSocket write-half bug fixed. Temperature struct field name mismatch fixed. |
| 2026-05-16 | Watchdog bugs fixed (sc.exe fallback, suppress_until, hypervisor detection). /feature-request skill created. |
| 2026-05-17 | Syncro PSA Integration added to roadmap (P1) after Howard /feature-request. Office power failure recovery — all VMs recovered. |
| 2026-05-18 | Multi-tenancy architecture (ADR-001) decided. 5 SPEC documents created (SPEC-001 through SPEC-006). |
| 2026-05-19 | 4-bug fix for AD2 crash loop. MSP360 backup integration completed (6 fixes). Clickable CPU/Memory gauge cards + process drill-down modal. |
| 2026-05-23 | /rmm-audit pass. Agent optimization Phases 1A-3. Auto-version bump mechanism. MSRV bumped to 1.85. Fleet at 0.6.29. |
| 2026-05-24 | Linux tray IPC + GTK (PR #13+#14) and peer-cred authz (PR #14) merged. PR #21 (ReadWritePaths fix) merged. Build pipeline split into per-platform scripts. Pluto known-hosts pinned. Fleet converged to 0.6.38. |
| 2026-05-31 | Roadmap reconciliation (17 corrections — roadmap understated built state). MSPBackups mapping/verify UI + dev-admin impersonation UI deployed (dashboard v0.2.32). BUG-008/013/014 status corrected to fixed. SPEC-021 (logged-in user domain detection) written after Howard feature request. |
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. |
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] |
| 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
| 2026-06-07 | Backup-alert quality pass shipped. FU1 (`summarize_backup_error` decodes MSP360 message JSON; `create_or_update_alert` now refreshes title/message/severity on re-trigger, also fixes latent severity-escalation freeze) + FU2 (exclude non-backup PlanTypes 8=Restore/13=Consistency-check from alerting/compliance): false `backup_failed` alerts 15 -> 2 fleet-wide (survivors AD1, LAB-Becky are genuine and self-describing), commit `779f7f6`. `backup_storage_low` alert type removed entirely (commit `b82c010`): `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity — produced 5 fleet-wide false alerts including DF-HYPERV-B "100% Full" on a 4 GB plan; `resolve_all_backup_storage_alerts` (type-scoped, idempotent, once-per-tick) clears stragglers; 5 -> 0 verified after 17:21:41 UTC restart. Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). `BACKUP_STALE` evaluator confirmed already correct — no new code. Both commits on main. Submodule pinned at `226ba9f` in parent. |
| 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical credential propagation (Global → Client → Site) with `is_inheritable` flag and de-duplication by (credential_type, label). `/effective` endpoints validated. Dashboard UI: clickable alert severity badges with client filtering, offline badge now scopes to client-specific agents. SPEC-028 offboarding wizard specification created (835 lines) covering site and client offboarding workflows with data export, dependency analysis, typed confirmation, and audit logging. FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section. |
| 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped (second session). Offline sweep evaluator (60s tokio interval, `server/src/alerts/offline.rs`): server-only `agent_offline` alerts (classifier: `os_product_type` 2/3, else `os_name` ~/server/i, else `role_override`); site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to representative agent with site/fleet `dedup_key`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect — permanently false in steady state). Migration 054 (`agents.role_override`). Dashboard triage: servers elevated individually; workstations collapsed into roll-up. `PUT /api/agents/:id/role-override`. Verified live vs WIN-TG2STMODJG8 ("Windows Server 2019 Standard Evaluation", auto-classified via `os_name`); zero false mass-offline against ~55 chronically-offline agents; restart guard held across deploys. Known gap: `os_product_type` on only 16/168 agents; `os_name` is workhorse. Commits f1cdf5d/30e4f23/21d63bd/3eedf91. Alert ignore/mute (perma-silence): migration 055 (`alert_mutes` table + `muted` status); `dedup_key`-keyed; reason required (400 if missing); gates `create_or_update_alert` + `create_check_alert` bypass; `POST /api/alerts/:id/mute` + `/unmute`; dashboard UI NOT started. Verified live (mute -> `muted`; 60s sweep did not re-fire; unmute -> `active`). Commits 29c405e/a120e71. MSP360 Provider API probed live: monitoring-only tier (Companies/Users/Monitoring GET; management paths 404); plan delete/run/storage remain console tasks; white-label agent has no `cbb.exe`. |
| 2026-06-07 | UI gap batch + enrollment audit (GURU-BEAST-ROG session). `get_agent` upgraded to `AgentWithDetails` with `client_id`/`client_name` JOIN. Six new/updated server endpoints: `GET /api/agents/:id/bsod-events`, `GET /api/agents/:id/version-history`, `GET /api/install-reports` + `/:id`, `GET /api/discovery/all-devices`, `DELETE /api/agents/:id/key` (redesigned as dual-layer revoke). Two new enrollment endpoints: `GET /api/sites/:id/enrolled-agents` (audit with duplicate-hostname annotation) + `POST /api/enrolled-agents/:id/revoke`. Dashboard: AgentDetail Crashes tab + version history, Dashboard.tsx fleet stats from `/agents/stats`, SiteDetail Revoke Key + Enrollment audit tab, new pages InstallReports + Discovery. Hotfix `6faa382`: `role_override` field missing from new `get_agent_with_details_by_id` SQL caused all `GET /api/agents/:id` to return 500 in production; fixed within same session. Enrollment `enrolled_agents` table (migration 012) required no new migration — gap was API exposure and UI only. WS auth: Mode 3 checks `enrolled_agents.agent_key_hash` first; Mode 1/2 falls back to `agents.api_key_hash`; dual-revoke now covers both layers. Key commits: `5854c63`, `6faa382`, `49a7109`, `00129af`. All UI gaps except Tunnel and documentation are now complete. SSH from BEAST to production broken (known_hosts issue) — workaround: JWT + curl. |
---
## Compilation Notes
- **2026-05-26 recompile:** Added the Capabilities / Feature Set section, synthesized from authoritative artifacts at live `main` (cd27a59) — API route modules, agent source tree, 46 migrations, roadmap, and the feat/perf commit log — NOT from session logs. This was prompted by the prior seeding missing the `user_session` command context entirely (it had only ever stated "runs as LocalSystem"). Corrected: command execution contexts, temperature monitoring (BUG-001 is resolved, not pending), Entra-only SSO, and added user-inventory/discovery/VM-detection/safe-rollout surfaces. **Changelogs are an unreliable capability source here** — committed changelogs stop at agent v0.6.22 while the fleet runs 0.6.39+; migrations + commit log are authoritative.
- Tunnel subsystem (verified against live main): agent side substantially built; server side is a dead-code skeleton (not declared in `api/mod.rs`, no routes, WS handler logs "not yet implemented"). Confirmed, not unverified.
- macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). `build-mac.sh` is a stub as of 2026-05-24 — unclear if automated pipeline includes macOS yet. [unverified]
- Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23) — likely still unfixed. [unverified]
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified]
- **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 -> 48.
- **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`.
- **2026-06-07 recompile:** Folded in backup-alert quality pass (commits `779f7f6` + `b82c010`, both on main). Updated Backup Integration capability section: added FU1/FU2 alert quality pass detail (false backup_failed 15->2; summarize_backup_error; create_or_update_alert refresh); documented backup_storage_low removal (structurally false DataCopied/TotalData signal; 5->0 false alerts; resolve_all_backup_storage_alerts); confirmed BACKUP_STALE evaluator correct (no new code); added key functions list and MSP360 PlanType exclusion map. Updated Repo Structure to include db/mspbackups.rs and mspbackups/ key functions. Updated Current Focus MSP360 line and added /backup-status endpoint shape gap. Updated Summary date and added backup-alert quality pass note. Active State date note updated. Added 2026-06-07 History row. Patterns and History existing rows preserved verbatim.
- **2026-06-07 recompile:** Updated for credential inheritance production deployment (server v0.3.45), clickable alert badges with client filtering, and SPEC-028 offboarding wizard specification. Added Recent Work section documenting 2026-06-07 session accomplishments. Updated Current Focus to reflect credential inheritance as deployed and offboarding wizard as spec-complete/implementation-pending. Updated Dashboard status to include credentials management with inheritance. Updated version numbers throughout (server 0.3.37 → 0.3.45). Added session-logs/2026-06-07-mike-gururmm-offboarding-spec.md to sources. Updated History Highlights with 2026-06-07 entry.
- **2026-06-07 recompile (second session):** Folded in role-aware offline alerting (server/src/alerts/offline.rs: offline_sweep + agent_role classifier + mass_offline site/fleet detection; migration 054 agents.role_override; dashboard triage + role-override control on agent detail; warm-up restart guard; four commits f1cdf5d/30e4f23/21d63bd/3eedf91) and alert ignore/mute perma-silence (alert_mutes table, muted status, is_dedup_muted/mute_condition/unmute_condition, two-choke-point gate at create_or_update_alert + create_check_alert bypass, mute/unmute API; migration 055; dashboard pending; two commits 29c405e/a120e71). Added MSP360 API scope finding (monitoring-only tier confirmed by live probe; management paths 404). Updated Alerting & Watchdog section with offline alerting and mute detail including new alert types (agent_offline, mass_offline_site, mass_offline_fleet) and new status (muted). Updated Repo Structure (alerts/ directory; db/alerts.rs key functions; dashboard/ entry with agentRole.ts). Updated Development / Current Focus with alert mute dashboard (Task 5 not started), offline classification gap, and MSP360 API scope item. Added alert mute UI to Dashboard incomplete list. Added second 2026-06-07 History row. Updated migration count to 55+ (054/055 confirmed). Added session log source. Patterns section and all existing History rows preserved verbatim.
- **2026-06-07 recompile (GURU-BEAST-ROG/discord-bot):** Folded in UI gap batch + enrollment audit session. Added new Recent Work entry (2026-06-07, UI gaps + enrollment). Added Enrollment Detail subsection under Identity / Multi-tenancy documenting `enrolled_agents` table schema, WS auth modes, dual-layer revoke design, and enrollment audit endpoint. Updated `get_agent` capability (now `AgentWithDetails` with client JOIN). Documented six new/updated endpoints (bsod-events, version-history, install-reports, discovery/all-devices, dual-revoke key, enrollment audit + revoke). Updated Dashboard complete list: added Crashes tab, version history, fleet stats, SiteDetail Revoke Key + Enrollment tab, InstallReports + Discovery pages. Updated Dashboard incomplete list: removed enrollment management (now complete); updated Tunnel status (server skeleton confirmed "not yet implemented"); added documentation gap. Updated Current Focus: BSOD Phase 2 partially complete (Crashes tab shipped; Alerts stream + dump upload still deferred); added Tunnel deferred detail, SSH from BEAST broken, dead code cleanup, enrollment key display gap. Added new 2026-06-07 History row (UI gap batch + enrollment audit). Added session log to sources. Updated compiled_by to GURU-BEAST-ROG/discord-bot. History and Patterns sections preserved verbatim.
## Backlinks
- [[clients/cascades-tucson]] — RECEPTIONIST-PC enrolled (site CascadesTucson)
- [[systems/gururmm-build]] — Linux VM at 172.16.3.30 on Jupiter; GuruRMM API 3001, ClaudeTools API 8001, Coord API, MariaDB, PostgreSQL, build pipeline; originally a container on Jupiter, migrated to own VM
- [[systems/jupiter]] — Unraid host at 172.16.3.20; virsh host for all VMs (GuruRMM VM, Unifi, OwnCloud, Pluto/Claude-Builder); Docker: Gitea port 3000, NPM, Seafile; iptables PREROUTING routes :443 to NPM (NPM proxy `rmm-api -> 172.16.3.20:3001` in credentials.md is STALE — actual GuruRMM API is on 172.16.3.30)
- [[systems/pluto]] — Windows build server (MSI, WiX) at 172.16.3.36