Files
claudetools/wiki/projects/gururmm.md

69 KiB

type, name, display_name, last_compiled, compiled_by, aliases, sources, backlinks
type name display_name last_compiled compiled_by aliases sources backlinks
project gururmm GuruRMM 2026-06-11 GURU-5070/claude-main
guru-rmm
gururmm@main: server/src/api/*.rs (REST API surface, ~37 route modules)
gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs, device_id.rs)
gururmm@main: server/migrations/*.sql (59 migrations — feature checkpoints through 059_command_delivery_attempts)
gururmm@main: server/migrations/048_bsod_events.sql
gururmm@main: server/migrations/056_audit_log.sql
gururmm@main: server/migrations/057_log_signatures.sql
gururmm@main: server/migrations/058_command_acked_at.sql
gururmm@main: server/migrations/059_command_delivery_attempts.sql
gururmm@main: agent/src/bsod.rs
gururmm@main: server/src/api/updates.rs (promote/rollback endpoints)
gururmm@main: deploy/build-pipeline/webhook-handler.py
gururmm@main: deploy/build-pipeline/build-server.sh
gururmm@main: commit 137dd85 (BUG-020 tray fix)
gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/
gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)
projects/msp-tools/guru-rmm/CONTEXT.md
projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md
projects/msp-tools/guru-rmm/docs/UI_GAPS.md
projects/msp-tools/guru-rmm/docs/ARCHITECTURE_DECISIONS.md
projects/msp-tools/guru-rmm/docs/tech-stack.md
projects/msp-tools/guru-rmm/docs/DESIGN.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/shape.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/plan.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/references.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/standards.md
.claude/memory/reference_gururmm_server.md
.claude/memory/reference_gururmm_api.md
.claude/memory/gururmm-development-principles.md
.claude/memory/feedback_gururmm_agent_parity.md
.claude/memory/reference_pluto_build_server.md
.claude/memory/project_mac_gururmm_setup_pending.md
.claude/memory/feedback_gururmm_build_channel_default.md
.claude/memory/reference_gururmm.md
credentials.md
session-logs/2025-12-15-session.md
session-logs/2025-12-20-session.md
session-logs/2026-04-19-session.md
session-logs/2026-04-21-session.md
session-logs/2026-04-29-session.md
session-logs/2026-05-12-guru-rmm-macos-agent-phase1.md
session-logs/2026-05-15-session.md
session-logs/2026-05-16-session.md
session-logs/2026-05-17-session.md
session-logs/2026-05-19-gururmm-backup-fixes.md
session-logs/2026-05-19-session.md
session-logs/2026-05-21-session.md
session-logs/2026-05-23-session.md
session-logs/2026-05-24-session.md
session-logs/2026-05-24-GURU-KALI-session.md
session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md
session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md
session-logs/2026-06-07-mike-gururmm-offboarding-spec.md
live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)
session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
session-logs/2026-06-07-mike-gururmm-ui-gaps-enrollment.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-host-migration-cutover.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-backfill-build-pipeline.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-promotion-ghost-identity.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability-phase1.md
clients/cascades-tucson
systems/gururmm-build
systems/jupiter
systems/pluto

GuruRMM

Summary

GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 215 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints.

Current version: agent 0.6.63 (stable) / server 0.3.68 as of 2026-06-11. Fleet converged to 0.6.63 (~168-182 typically online); ~20 stragglers on older versions auto-update on reconnect. Changelogs are stale (stop at agent v0.6.22) — migrations + commit log are the authoritative feature record.

Agent-comms-durability Phase 1 shipped 2026-06-11: Commands black-holed at NAT'd sites (e.g., UDR Ultra) no longer falsely fail. Agent sends a CommandAck on receipt; server reaper re-delivers un-acked commands (never fails them); pending commands re-offered on every heartbeat (rides the warm conntrack). Verified: PST-SERVER (Peaceful Spirit, behind UDR Ultra) returned a command with acked_at stamped and ack_latency 16 ms. Fleet of 168 agents converged to 0.6.63 in ~3 min. Migrations 058 (acked_at) + 059 (delivery_attempts) applied on server restart. See Architecture section for full detail.

Physical server migration completed 2026-06-11: The kitchen-sink Ubuntu VM at 172.16.3.30 was retired. A physical box took the same IP (172.16.3.30) running Ubuntu 26.04 + PostgreSQL 18. Production back online within 27 min (maintenance window). Old VM parked at 172.16.3.46 with data pristine as rollback anchor. Server binary moved from /usr/local/bin/gururmm-server (old VM) to /opt/gururmm/gururmm-server (new box). See also: gururmm-build.

Backup-alert quality pass shipped 2026-06-07: False backup_failed alerts reduced 15 -> 2 fleet-wide. backup_storage_low alert type removed entirely. See Backup Integration section.

Role-aware offline alerting + alert ignore/mute shipped 2026-06-07: Server-only agent_offline alerts; site/fleet mass-offline aggregation; alert_mutes perma-silence (dashboard UI still pending).

See also: wiki/projects/guru-rmm.md is a redirect tombstone pointing here (slug disambiguation).

Repo: azcomputerguru/gururmm on Gitea (internal: http://172.16.3.20:3000). The copy at D:\claudetools\projects\msp-tools\guru-rmm is a git submodule tracking the active repo. Development happens in the submodule working tree; commits/pushes go to Gitea from there (GURU-5070 cannot push to gururmm Gitea — use the server at 172.16.3.30 or 172.16.3.47).

Goal: Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system.


Recent Work

2026-06-11 — Physical Server Migration (Cutover + Backfill + Build Pipeline)

Cutover (morning ~06:53-07:20 MST):

  • Stopped all VM writers, fresh-restored gururmm/guruconnect (PG 18) and claudetools (MariaDB 11.8) on the new physical box via direct SSH to the VM.
  • VM released 172.16.3.30; new box took 172.16.3.30 + fresh management IP 172.16.3.47 via self-confirming detached netplan apply (auto-reverts if gateway unreachable — prevents a bricked network config).
  • All nine stack services active; 162/212 agents reconnected within 15 min. Fleet broadcast sent; 7 migration locks released.
  • Two production issues fixed in-window: (1) nginx had stale config missing /ws location — systemctl reload nginx fixed it and agents reconnected; (2) Prometheus failed due to WAL copied mid-write — moved wal/chunks_head aside, persisted blocks healthy (~2h of samples lost during downtime anyway).

Backfill + Build Pipeline (afternoon session):

  • 7-day whale data (metrics 1,189,924 rows + agent_logs 2,262,938 rows; ~3.4 GB total) backfilled VM->new box via direct SSH \copy stream in ~2.5 min. Load verified lossless; SSD sustained 186-214 MB/s writes with ZERO pool-timeouts while 212 live agents wrote concurrently.
  • Three build-env gaps found and fixed after first push-triggered build: (1) sccache not installed (copied v0.8.2 from VM); (2) root had no Pluto SSH key (copied guru's id_ed25519 to /root/.ssh); (3) /etc/gururmm-signing.env + jsign + Java missing (installed Java + jsign-7.1.jar; recreated jsign wrapper).
  • Parallel Windows build prototyped (7 Start-Job isolated builds, 3.51x speedup 28.9m -> 8.2m, 15 rustc vs 1). Integration into production build-windows.sh DEFERRED (design captured in runbook).
  • First clean build on new box: signed Windows agent v0.6.61 beta. Independently verified (Authenticode, Arizona Computer Guru LLC via Azure Trusted Signing, RFC3161 timestamp, sha256 OK).

2026-06-11 — Durable Agent Identity (Ghost Agent Spec + Phase 1 Task 1)

Ghost agent root cause:

  • device_id is a random UUIDv4 stored in ONE file (%ProgramData%\GuruRMM\.device-id / /var/lib/gururmm/.device-id). The agent's own cleanup.ps1 (Steps 6+7) deleted both ProgramData and HKLM\SOFTWARE\GuruRMM, wiping it. A past hardware->random scheme change also orphaned agents (11 ghosts with old win-/hw- device_ids). Server dedups ONLY on device_id at WS-connect; no reconciliation. 11 hostnames have duplicates (9 same-site true ghosts; 2 — MSI, SERVER — are different machines at different sites and must NOT be merged). 32 tables FK agent_id.
  • Context: channel pin regression discovered (GURU-5070 still on 0.6.59 after v0.6.61 promotion) because the beta override lived on the orphaned ghost enrollment. Fix: set update_channel='beta' at site "Mike's Car" (survives re-enrollment); resolve_agent_channel = agent.or(site).or(client).or('stable').

Spec + Phase 1 Task 1:

  • Multi-AI design review (Gemini + Grok, strong convergence): durable multi-location device_id (primary) + hardware fingerprint (reconciliation guard, not key). Identity model: agent/src/device_id.rs rewritten — HKLM registry + ProgramData on Windows; /etc + /var/lib on Linux; /Library + mirror on macOS — with self-heal, plus cleanup.ps1 whitelisting the identity files. Committed 0b81d33 (v0.6.62). Linux build green.
  • Spec: specs/durable-agent-identity/ (shape/plan/references/standards). Phase 1 (Task 1 shipped) stops ~99% of new ghosts. Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for 9 existing ghosts) per spec after soak.

2026-06-11 — Agent Comms Durability Phase 1 (Spec + Slices A/B/C + Deploy + Fleet)

Root cause (confirmed in code, ~95% AI-validated): NAT conntrack asymmetry. Commands are a server->agent PUSH over the persistent WS; agent->server heartbeats refresh the UDR conntrack and replies ride back. A server->agent command push in an idle gap hits a torn-down conntrack and is silently dropped (server sender.send() buffers without error for ~15 min). The server's 30s WS ping (server->agent) CANNOT revive a dead conntrack; only agent->server can. Server had no half-open detection (read loop ignores Pong). Agent already heartbeats every 30s — so the conntrack floor is sub-30s — meaning durable delivery nets are the correct fix, not keepalive alone.

Spec: specs/agent-comms-durability/ (shape/plan/references/standards). Multi-AI reviewed (Gemini 3.1-pro + Grok 4.3, 2 rounds, strong convergence). Phased: Phase 1 = durable delivery (shipped); Phase 2 = live TTY (planned); Phase 3 = AIMD keepalive + bulk->HTTPS + half-open eviction (planned).

Slice A (dormant server foundation — committed b93f2ef): Migration 058 adds commands.acked_at TIMESTAMPTZ. Server-side CommandAck handler arm and mark_command_acked() in db. Additive + dormant — no behavior change until agents send ACKs.

Slice B (agent — committed 45870b1, deployed 2026-06-11 as v0.6.63): AgentMessage::CommandAck { command_id } added. Agent sends CommandAck on RECEIPT of every dispatched command (before execution). CommandExecutor gained a FIFO-bounded recent-results cache (64 entries, 256 KB cap each): re-reports cached result for an already-completed command; ignores an in-flight duplicate; never double-runs. Result is cached BEFORE deregistering the running task so a re-delivery in the complete()/send() gap always finds it cached (re-report) rather than not-running (re-run).

Slice C (server — committed ca1657b, deployed 2026-06-11 as v0.3.68): Migration 059 adds delivery_attempts INTEGER NOT NULL DEFAULT 0 + partial index idx_commands_agent_acked ON commands(agent_id) WHERE acked_at IS NOT NULL (capability gate). db::requeue_undelivered_commands returns an un-acked command past a 60s ACK deadline to pending (re-delivery, never failed). db::fail_timed_out_commands rewritten into three safe cases — can no longer falsely-fail a deliverable command. mark_command_running increments delivery_attempts (cap 10). Reconnect block refactored into shared redispatch_pending_commands helper (send_to-based); called on EVERY heartbeat — so a black-holed push is re-delivered on the next agent beat (which just refreshed the NAT conntrack) without needing a reconnect.

Capability gate (version-independent): Reaper only re-delivers for agents that have demonstrably ACK'd at least once (EXISTS(... acked_at IS NOT NULL)). Old non-ACK agents keep the legacy fail-on-timeout path and can never double-run. Safe to deploy at any point during gradual agent rollout.

Deploy + fleet rollout:

  • Installer bug fixed (5c0d004): Start-Process -Wait -PassThru + exit code check (old & $stagingPath install 2>&1 | Out-Null swallowed output + surfaced misleading NativeCommandError). Agent CLI ANSI log garbage fixed (parse before logging init, file-only for one-shot CLI).
  • Server auto-deployed by webhook as v0.3.68 (commit 7376340). Migrations 058+059 applied. All durability code live server-side.
  • Build pipeline dubious-ownership fixed: git config --system --add safe.directory /home/guru/gururmm (/etc/gitconfig; webhook runs builds as root on guru-owned repo).
  • Canary: clients Goldstein + Peaceful Spirit set to update_channel='beta'; all 6 canary agents updated 0.6.61->0.6.62->0.6.63 within ~24 min. PST-SERVER (UDR Ultra, the original affected agent) verified: hostname; whoami returned exit 0, acked_at stamped, ack_latency 16 ms, delivery_attempts 1. Full ACK durability chain proven live.
  • Fleet promotion: POST /api/updates/rollouts/0.6.63/promote {"os":"windows","arch":"amd64"} -> 6 files re-tagged stable. 168 agents on 0.6.63 in ~3 min; online count held at 182 throughout (no mass dropout).

2026-06-07 — UI Gap Batch + Enrollment Audit (GURU-BEAST-ROG session)

API changes:

  • get_agent handler now returns AgentWithDetails — includes client_id and client_name via JOIN query
  • New endpoints: GET /api/agents/:id/bsod-events, GET /api/agents/:id/version-history
  • New endpoints: GET /api/install-reports, GET /api/install-reports/:id
  • New endpoint: GET /api/discovery/all-devices
  • DELETE /api/agents/:id/key redesigned as dual-revoke: nulls agents.api_key_hash (legacy Mode 1/2) AND sets enrolled_agents.revoked = TRUE (modern agk_ Mode 3)
  • New endpoint: GET /api/sites/:id/enrolled-agents — full enrollment audit with duplicate-hostname annotation
  • New endpoint: POST /api/enrolled-agents/:id/revoke — per-enrollment targeted revocation with cascade

Dashboard additions:

  • AgentDetail: Crashes tab (BSOD events), version history table in Updates tab
  • Dashboard.tsx: fleet stats wired to /agents/stats endpoint (was computed client-side)
  • SiteDetail: Revoke Key button + Enrollment tab with full audit table, status badges, duplicate-hostname warnings, per-row revoke
  • New pages: InstallReports (/install-reports), Discovery (/discovery)
  • FunctionRail: nav links for Install Reports + Fleet Discovery

Hotfix (production 500s): get_agent_with_details_by_id SQL was missing a.role_override (migration 054 added it as non-optional). All GET /api/agents/:id calls returned 500 until hotfix commit 6faa382.

Key commits: 5854c63 (UI gap batch), 6faa382 (hotfix), 49a7109 (enrollment audit batch), 00129af (fix route wiring)


2026-06-07 — Credential Inheritance Deployment & Offboarding Spec

  • Server v0.3.45 deployed with credential inheritance (Global → Client → Site, is_inheritable flag, de-duplication by (credential_type, label), /effective endpoints).
  • Dashboard: clickable alert severity badges with client filtering; offline badge scopes to client-specific agents.
  • SPEC-028 offboarding wizard specification created (835 lines) — 6-step site + 5-step client offboarding modals, data export (temp tokens, 1h expiry), typed confirmation, audit logging, cascade deletion. Implementation pending.
  • FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section.

Capabilities / Feature Set

Synthesized from authoritative artifacts (API routes, agent modules, 59 migrations, roadmap, commit log) — not from session logs. See Compilation Notes.

Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to interrupted. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).

Monitoring & Telemetry

  • Core metrics per interval (policy-tunable): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win GetLastInputInfo, Linux xprintidle), public/WAN IP (cached, multi-service fallback). Cross-platform via sysinfo.
  • Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Linux via /sys/class/thermal; sysinfo::Components fallback. Windows LibreHardwareMonitor removed 2026-05-27 (CVE-2020-14979); Windows thermal collection pending a safe replacement strategy.
  • Process drill-down: top 10 by CPU + top 10 by memory in every metrics payload (cross-platform).
  • Network state: per-interface IPv4/IPv6, MAC, derived CIDR subnets — sent on change only.
  • Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01): agent/src/bsod.rs polls C:\Windows\Minidump for new .dmp files (5-min filetime poll), parses kernel dump header at fixed DUMP_HEADER64/DUMP_HEADER32 offsets (bugcheck code @0x38, 4 parameters, FILETIME @0xFA8 — the minidump crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references System event log (WER 1001 / Kernel-Power 41) for Report Id + faulting driver, deduplicates via C:\ProgramData\GuruRMM\bsod-seen.json watermark. Sends AgentMessage::BsodEvent. Server: migration 048 + server/src/db/bsod_events.rs + ws handler — always-Critical alert deduplicated by (agent_id, dump_sha256). Verified against a real 0x116 VIDEO_TDR_FAILURE (nvlddmkm.sys). Dashboard Crashes tab shipped 2026-06-07. Phase 2/3 deferred: BSOD in Alerts stream, fetch_bsod_dump on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).

Remote Execution

  • Command types: shell, powershell, python, raw script{interpreter}, and claude_task. Options: timeout_seconds, elevated.
  • Execution context (041_add_command_context): system (default — Session 0 / service SYSTEM) or user_session (runs in the active logged-on user's desktop session via WTS token impersonation: WTSQueryUserToken + CreateProcessAsUserW + per-user env block). Windows-only; requires an active session.
  • Commands individually cancellable (POST /commands/:id/cancel) and aborted on disconnect. Auditable history; status running/completed/failed/timeout/interrupted.
  • Script library (017_scripts): stored scripts dispatched to agents with args/env/timeout/run_as_user; per-run history.
  • Agent Comms Durability (Phase 1, shipped 2026-06-11 — agent v0.6.63 / server v0.3.68):
    • Problem: server->agent command pushes black-holed at NAT'd sites (UDR Ultra) when conntrack gaps tear down the inbound direction. Agent heartbeats work (agent->server refreshes conntrack); reaper falsely failed black-holed commands after timeout.
    • Fix: agent sends CommandAck { command_id } on RECEIPT (migration 058 acked_at). Agent deduplicates by command_id — re-reports cached result, ignores in-flight duplicate, never double-runs. Server reaper re-delivers an un-acked command past 60s ACK deadline (returns to pending) instead of failing it. Migration 059 adds delivery_attempts counter (cap 10) + partial index for capability gate. Pending commands re-offered on (re)connect AND on every heartbeat. Verified at PST-SERVER: acked_at stamped, ack_latency 16 ms, delivery_attempts 1.
    • Capability gate: reaper only re-delivers for agents that have ACK'd at least once (idx_commands_agent_acked partial index). Old agents keep legacy fail-on-timeout path; safe mid-rollout.
    • Code anchors: server/src/db/commands.rs (requeue_undelivered_commands, fail_timed_out_commands rewrite, DELIVERY_ACK_DEADLINE_SECS=60, MAX_DELIVERY_ATTEMPTS=10), server/src/ws/mod.rs (redispatch_pending_commands, CommandAck handler), agent/src/transport/websocket.rs + agent/src/commands/mod.rs (ACK send + recent-results cache, RECENT_CAP=64, MAX_CACHED_RESULT_BYTES=256 KB).
    • Phase 2 (live TTY — rides warm WS, seq/resume, cadence switch) and Phase 3 (AIMD keepalive, bulk->HTTPS, half-open eviction sweeper) are PLANNED, not shipped.

Inventory & Discovery

  • Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
  • VM / hypervisor / container detection (032/033): is_virtual_machine, hypervisor_type, vm_uuid, is_hypervisor + hosted VM UUIDs, is_container, is_unraid.
  • User/group inventory (037-040): local + domain + Azure AD accounts (enabled, pw-never-expires, last-logon, is_admin, AD email/UPN/dept), domain-join classification (none/ad/aad/hybrid), domain name, M365 tenant ID, domain-controller detection (is_dc), group membership. Policy-scheduled (default 24h).
  • Network discovery: server-dispatched scans — TCP probes over configurable ranges/ports, ARP MAC, reverse DNS, basic OS fingerprint; devices stream back and are persisted/editable.

Patch / Agent Update Management

  • Self-updater: server sends version + URL + SHA256; agent downloads, verifies checksum, atomically swaps binary, restarts, and auto-rolls back to a kept backup if it fails to reconnect (~180s window).
  • Auto-update gated on effective policy auto_update (channel + defer_hours). Force via POST /agents/:id/update. Update channels at agent/site/client (026).
  • Channel model: new builds tag beta by default; agents default to stable channel; promotion to stable is a DELIBERATE re-tag of the .channel sidecar. scanner.rs get_latest_version: stable agents get latest STABLE-tagged binary; beta gets absolute-latest. resolve_agent_channel = agent.or(site).or(client).or('stable').
  • Promotion API: POST /api/updates/rollouts/:version/promote body {"os","arch"} — re-tags all .channel files for the version, records update_rollouts row, force-rescans. Rollback: POST /api/updates/rollouts/:version/rollback.
  • Safe-rollout (046): update_rollouts/health-metrics/events tables + /updates/rollouts promote/rollback. Health-gated automation is written-but-unwired (Phase 2); promotion is manual via API.

Policy & Configuration Management

  • Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via ConfigUpdate. Effective policy queryable per scope.
  • Checks engine (018/019): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win sc.exe, Linux systemctl). Policy-attached check templates (024) with push-to-agent sync. On-demand run-checks.
  • Remote registry (Windows, winreg): agent supports enumerate/read/write (typed)/create/delete. HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but are not routed yet [verify].

Alerting & Watchdog

  • Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (022) with effective resolution; per-client email settings (020). Maintenance mode (021) to suppress alerting per scope.
  • Watchdog: separate supervising process (polls GuruRMMAgent every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS.
  • Role-aware offline alerting (shipped 2026-06-07): Scheduled offline-sweep evaluator (60s tokio interval; server/src/alerts/offline.rs). Server-only agent_offline alerts. Classifier: os_product_type IN {2,3} -> server; else os_name/os_version ~/server/i -> server; else manual role_override (migration 054). Site rule (>=50% + >=3) -> mass_offline_site; fleet rule (>=10) -> mass_offline_fleet. Dashboard: servers elevated individually; workstations collapsed. PUT /api/agents/:id/role-override. Known gap: os_product_type populated on only ~16/168 agents; os_name is the workhorse classifier.
  • Alert ignore/mute — perma-silence (server only, shipped 2026-06-07; dashboard pending): alert_mutes table (migration 055) + muted alert status. Keyed on dedup_key. Permanent until un-ignored; reason required (400 if missing). Gate at create_or_update_alert AND create_check_alert bypass. POST /api/alerts/:id/mute + /unmute. Dashboard Ignore/Muted/Un-ignore NOT yet built.

Credentials Management

  • Encrypted credentials vault (016): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate /reveal decrypt endpoint. Known HIGH item: /reveal ownership-scope check — [verify current state].
  • Credential inheritance (deployed 2026-06-07): Opt-in hierarchical cascade with is_inheritable flag (Global → Client → Site). De-duplication by (credential_type, label), most-specific scope wins. /effective endpoints return inherited + direct credentials with inherited_from indicator.

Audit Log (Migration 056)

  • Append-only audit_log table (migration 056): (id, occurred_at, user_id, action, target_type, target_id, detail JSONB). Actor preserved with SET NULL on user delete. Seeded by credential reveals (credential.reveal); generic shape for future sensitive operations. Indexes on occurred_at DESC, user_id, (target_type, target_id).

Systemic Log Feedback Intelligence (Migration 057)

  • Per-line provenance + deterministic fingerprint added to agent_logs table (nullable; new rows populate, old rows stay NULL): agent_version, signature_hash BIGINT, normalizer_version SMALLINT. Fingerprint computed server-side at ingest (server/src/fingerprint.rs).
  • Durable aggregate log_signatures table: one row per distinct normalized error signature (occurrence_count, affected_agents, sample_message, first/last_seen, normalizer_version). Long-retention (raw logs short-retention); alerting + dashboards can read this.
  • log_signature_versions table: per-signature x agent_version rollup — answers "which versions does signature X appear on?" (version regression correlation).

Backup Integration (MSP360 / MSPBackups)

  • Multi-provider config (034/035) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (044) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
  • Alert quality pass (2026-06-07): Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from alerting/compliance. MSP360 message JSON decoded via summarize_backup_error. create_or_update_alert refreshes title/message/severity on re-trigger. Fleet: false backup_failed 15 -> 2; survivors (AD1, LAB-Becky) are genuine.
  • backup_storage_low alert type REMOVED (2026-06-07): DataCopied/TotalData measures backup-dataset completeness, not destination capacity. Produced 5 fleet-wide false alerts. resolve_all_backup_storage_alerts clears stragglers. Genuine destination-capacity alerting deferred.
  • MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded): 8, 13.
  • MSP360 API scope (confirmed 2026-06-07): monitoring-only tier; management paths (plan delete, manual run, storage config) return 404; plan management remains MSP360-console tasks.
  • Key functions: derive_backup_status, error_is_benign, summarize_backup_error, resolve_orphaned_backup_alerts, resolve_all_backup_storage_alerts, evaluate_plan (BACKUP_STALE).

Remote Access (Tunnel)

  • Agent side substantially built (TunnelManager state machine; Open/Close/Data). Server side is a dead-code skeleton — not declared in api/mod.rs, no /tunnel routes, WS handler logs "not yet implemented." Not production-ready. Live TTY design will supersede the skeleton (Phase 2 of agent-comms-durability spec).

Identity / Multi-tenancy / Security

  • Auth: JWT (login/register/me); agents auth over WS via per-agent API key + hardware device_id.
  • Microsoft Entra ID SSO (OAuth2/OIDC + PKCE), gated on server config. Google planned per SPEC-008 but not implemented [verify].
  • Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin user impersonation (/auth/impersonate/:id). Backend + dashboard UI shipped 2026-05-31.
  • Enrollment & keys (012): per-agent key issuance on first run, site API keys (regenerable), site-specific MSI with SITEKEY injected at download, public install-report ingestion. Legacy PowerShell agent path for Server 2008 R2.
  • Logs: agent log upload (periodic + on-demand), per-agent events (042), fleet log view, AI-assisted log analysis (/logs/analyze) — AI-optional per locked decision.

Enrollment Detail (migration 012 + 2026-06-07 audit)

The enrolled_agents table is the authoritative enrollment history log:

Field Notes
site_id Site the agent enrolled under
agent_id Set on WebSocket connect
hostname Hostname at enrollment time
agent_key_hash Hash of the agk_ enrollment key (not surfaced in API)
enrolled_at Timestamp of initial enrollment
last_seen Updated on each WS connection
ip_address IP at last connection
os_version OS at enrollment time
revoked Boolean; set TRUE by revocation endpoints

WS auth modes: Mode 3 (modern, agk_ keys): checks enrolled_agents.agent_key_hash first. Mode 1/2 (legacy): falls back to agents.api_key_hash.

Revocation — dual-layer (fixed 2026-06-07): DELETE /api/agents/:id/key (bulk) revokes BOTH layers. POST /api/enrolled-agents/:id/revoke (targeted) per-enrollment with cascade.

Platform coverage

  • Cross-platform (Win/Linux/macOS): metrics, network state, hardware/software/service inventory, user/group + DC/domain detection, checks (service checks Win+Linux), discovery, self-update, scripts/commands.
  • Windows-only: user_session command context (WTS), remote registry, tray-into-session, watchdog SCM supervision, BSOD detection.
  • macOS: agent deployed (Phase 1); tray is a stub; automated Mac build pipeline is an intentional stub (no build host) [verify before claiming CI Mac releases].

Architecture

Components

Component Location Tech State
Server 172.16.3.30:3001, systemd gururmm-server.service (User=root, binary /opt/gururmm/gururmm-server, EnvironmentFile /opt/gururmm/.env) Rust, Axum deployed, production (physical box, Ubuntu 26.04, PG 18; migrated 2026-06-11)
Dashboard https://rmm.azcomputerguru.com, nginx at /var/www/gururmm/dashboard/ React + TypeScript + Vite, shadcn/ui, Tailwind CSS v4 deployed, production
Agent (Windows) Endpoints, installed as GuruRMMAgent Windows service via WiX MSI Rust, Windows MSVC deployed; stable fleet on 0.6.63
Agent (Linux) Endpoints, systemd gururmm-agent, binary /usr/local/bin/gururmm-agent Rust, musl static deployed
Agent (macOS) Endpoints, LaunchDaemon com.azcomputerguru.gururmm-agent.plist Rust, aarch64/x86_64 Phase 1 deployed 2026-05-12; code signing issue on Apple Silicon
Tray (Windows) System tray, named pipe IPC Rust deployed
Tray (Linux) System tray, Unix socket IPC, libappindicator/GTK Rust, GTK deployed 2026-05-24 (PR #13+#14)
Tray (macOS) Menu bar Rust stub/TODO (issue #18)
PostgreSQL DB localhost:5432 on 172.16.3.30, database gururmm PostgreSQL 18 deployed (migrated from PG VM, backfilled 2026-06-11)
Coord API 172.16.3.30:8001/api/coord FastAPI (part of ClaudeTools API) deployed
Build pipeline 172.16.3.30:9000 webhook + /opt/gururmm/ scripts Python (webhook-handler.py), Bash deployed; builds agents AND server (server auto-deploy wired 2026-06-02)
Pluto (Windows build VM) 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid) Rust MSVC, WiX v4 operational; Xeon E5-2695 v3 8c/16t, sequential build ~23min (parallel prototype 3.51x speedup, deferred integration)
Old VM (rollback anchor) 172.16.3.46, powered on, data pristine Ubuntu (original) parked; do NOT delete until soak completes

Key Files & Repos

  • Active repo: azcomputerguru/gururmmhttp://172.16.3.20:3000/azcomputerguru/gururmm
  • Submodule working tree: D:\claudetools\projects\msp-tools\guru-rmm — tracks active repo; develop here and push to Gitea (push from 172.16.3.30 or 172.16.3.47 — GURU-5070 is not authorized)
  • Server binary: /opt/gururmm/gururmm-server on 172.16.3.30 (physical box; prior VM had /usr/local/bin/gururmm-server)
  • Server config: /opt/gururmm/.env (EnvironmentFile for systemd; DATABASE_URL here; root-only read)
  • Source checkout on server: /home/guru/gururmm (build-server.sh: git reset --hard origin/main -> cargo build -> deploy with backup+rollback)
  • Agent binary (Linux): /usr/local/bin/gururmm-agent
  • Agent config (Linux/macOS): /etc/gururmm/agent.toml (root, mode 600); macOS uses /usr/local/etc/gururmm/site.plist
  • Agent registry (Windows): HKLM\SOFTWARE\GuruRMM\SiteId (written by MSI); HKLM\SOFTWARE\GuruRMM\DeviceId (written by agent, added Phase 1 Task 1)
  • Windows service name: GuruRMMAgent (NOT gururmm-agent)
  • Downloads dir: /var/www/gururmm/downloads/ on 172.16.3.30 — base binaries gururmm-agent-<plat>-<ver>.exe + .channel + .sha256; -latest symlinks; per-site signed cache gururmm-agent-site-<site_id>-<plat>-<ver>.exe (built+jsign-signed on demand)
  • Webhook handler: /opt/gururmm/webhook-handler.py (port 9000 localhost; nginx :80 /webhook/ -> :9000; systemd gururmm-webhook)
  • Build scripts: /opt/gururmm/build-shared.sh, build-linux.sh, build-windows.sh, build-mac.sh (split 2026-05-24; build-agents.sh is compat wrapper)
  • Server build script: /opt/gururmm/build-server.sh (dispatched by webhook on server/ changes; change-gate last-built-commit-server; backup current binary; auto-rollback if fails is-active)
  • Per-platform SHA tracking: /opt/gururmm/last-built-commit-{linux,windows,mac,server}
  • Build log (Linux): /var/log/gururmm-build-linux.log
  • Build log (Windows): /var/log/gururmm-build-windows.log
  • Build log (Server): /var/log/gururmm-build-server.log
  • API (internal): http://172.16.3.30:3001
  • API (external): https://rmm-api.azcomputerguru.com (Cloudflare -> NPM on Jupiter .20 -> .30:80)
  • Dashboard: https://rmm.azcomputerguru.com
  • DB URL: postgres://gururmm:<pw>@localhost:5432/gururmm (pw in /opt/gururmm/.env; do not hardcode here)
  • Vault paths: infrastructure/gururmm-server.sops.yaml (API creds), infrastructure/gururmm-server-physical (SSH ed25519 key, sudo pw)
  • SSH to prod: ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 (ed25519, key-only; sudo pw in vault infrastructure/gururmm-server-physical)

Repo Structure

gururmm/
├── agent/          Rust agent (managed endpoints)
│   └── src/
│       ├── bsod.rs         Windows BSOD/kernel-crash detection
│       ├── commands/mod.rs CommandExecutor + recent-results cache (ACK dedup)
│       ├── device_id.rs    Durable multi-location device_id (Phase 1 Task 1)
│       ├── ipc.rs          Unix socket IPC (Linux); Windows named pipe
│       ├── transport/
│       │   ├── mod.rs      AgentMessage enum (incl. CommandAck)
│       │   └── websocket.rs WS transport; execute_command ACK + dedup
│       ├── tunnel/         TunnelManager state machine
│       ├── metrics/        sysinfo collection + temps
│       ├── registry_ops/   Windows registry read/write
│       ├── updater/        Self-update handler
│       └── main.rs         systemd unit template generation
├── server/         Rust/Axum API server
│   └── src/
│       ├── api/            REST handlers (~37 route modules; updates.rs: promote/rollback)
│       ├── alerts/         Alerting modules (offline.rs: offline_sweep + mass_offline)
│       ├── db/             Database layer (sqlx); commands.rs (requeue_undelivered, fail_timed_out rewrite)
│       ├── fingerprint.rs  Log signature normalization + hash
│       ├── ws/             WebSocket handler (CommandAck, redispatch_pending_commands)
│       └── mspbackups/     MSP360 backup integration
├── dashboard/      React/TypeScript UI
├── tray/           System tray binary
├── installer/      WiX v4 MSI (gururmm-agent.wxs); cleanup.ps1 (now preserves device_id)
├── deploy/
│   └── build-pipeline/ webhook-handler.py, build-*.sh, build-server.sh
├── scripts/        Build/ops scripts
└── docs/           FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md,
                    DESIGN.md, specs/, HOST_MIGRATION_RUNBOOK.md

Development

Current Focus

As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68):

  • Agent-comms-durability Phase 1 SHIPPED. Remaining Phase 1 items: (a) is_connected reads null fleet-wide — cosmetic dashboard bug; gap is in API DTO/dashboard layer, not ws/mod.rs; (b) keepalive AIMD shortening (Task 4; deferred as optimization — durability nets are the actual fix).
  • Phase 2 (live TTY, planned): Extend WS message enum with TTY stream frames (stdin/stdout/stderr, resize, seq). "Activate" cadence switch (agent goes HOT on next heartbeat after tech opens live window). Resume from seq on mid-session drop. Single-use time-bounded session token; max one interactive session per agent; full audit. Estimated 3-5 days separate effort.
  • Phase 3 (planned): Adaptive keepalive (AIMD, persist interval), bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper (last_inbound track + evict >~90s; treat missed Pong as close).
  • Durable agent identity Phase 1 Tasks 2-3 (pending): Task 2 = hardware-fingerprint capture (agent inventory.rs baseboard serial + primary MAC; server migration agents.hardware_fingerprint + hardware_signals JSONB + legacy_device_id). Task 3 = dashboard "probable duplicate" surfacing (read-only). Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for the 9 existing ghosts) pending soak.
  • Confirm Windows v0.6.62 build compiles + deploys (validates new registry cfg(windows) code from durable-identity Phase 1 Task 1). If it failed, fix and re-push.
  • Host migration soak: drop .47 (new box) + .46 (VM) management IPs after stability soak; decommission old VM at .46. GURU-5070 Ethernet repair.
  • SPEC-028 offboarding wizard (specification complete, 835 lines; implementation pending): Site + client offboarding workflows, data export, typed confirmation, audit logging.
  • BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open): Fix #3 (graceful Global\GuruRMM_TrayShutdown_{sid} event) is implemented but dormant — terminate_all has no caller. Coord todo 25fdf31a: wire into watchdog policy-disable/uninstall path.
  • BSOD detection Phase 2/3 (deferred): Dashboard Crashes tab shipped 2026-06-07. Remaining: BSOD in Alerts stream (issue #10), fetch_bsod_dump on-demand upload, full ~350-entry bugcheck name table.
  • Linux fleet unit drift: Auto-updater replaces binary but does NOT refresh systemd unit file. Pre-BUG-016-fix agents have new binary + old unit (missing StateDirectory=gururmm). Needs ops-script pass.
  • Watchdog alerts UI — backend complete but PUT /watchdog-alerts/:id/resolve and DELETE /watchdog-alerts/:id routes missing on server.
  • Alert mute dashboard (Task 5) -- NOT started: Ignore button + required-reason prompt; Muted filter; Un-ignore control. Server endpoints ready.
  • MSP360 backup Phase 2 (not started): Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint, monitoring-only API tier confirmed).
  • Parallel Windows build integration (DEFERRED, design captured in runbook): 7-job Start-Job parallel build achieves 3.51x speedup on Pluto. Blocked on Windows orchestration friction (non-interactive PS, git host-key trust). Run via new-box setsid when integrated.
  • Tray IPC + peer authorization — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
  • Security audit backlog: credentials/:id/reveal horizontal privilege escalation (HIGH), internal_err() raw DB errors at ~130 call sites (HIGH). update_rollouts.promoted_by UUID vs users.id int mismatch (schema quirk from migration 046).
  • /backup-status endpoint shape gap: Returns only one plan per agent; agents with a dead old plan + healthy current plan look stale-but-green. Not fixed — noted for future.

Patterns & Anti-Patterns

Anti-patterns — never repeat:

Pattern What Went Wrong
useMemo with stable deps for data-dependent values queryClient is stable, memo never recomputes after queries resolve. Use useQuery instead.
CSS variable text colors inside the sidebar Sidebar bg is hardcoded dark; CSS vars flip in light mode. Use text-white explicitly inside sidebar.
Deploying without stopping the server first "text file busy" kernel error. Always systemctl stop before cp.
Building without DATABASE_URL sqlx compile-time macros fail. DATABASE_URL is in /opt/gururmm/.env (or /home/guru/.cargo/env on dev).
DB migrations without inserting into _sqlx_migrations Server crashes on start. Must insert SHA-384 checksum manually.
WiX MSI builds on Linux WiX requires msi.dll. MSI must be built on Pluto (Windows).
Manual builds via SSH All builds go through webhook-handler.py. Never SSH and run cargo build + artifact copy manually.
TOML/config for agent endpoint or site_id Server URL compiled into binary, site_id baked into MSI. No runtime config files for these values.
path.find('\\') in #[cfg(windows)] files Compiles on Linux silently, fails on Pluto MSVC with unterminated char literal. Use '\\\\'.
STATUS_BADGE_CLASSES Record const Vite/Rollup may optimize away the lookup. Use explicit getStatusBadgeClass() if/else function.
SSH heredoc for TypeScript edits Shell strips double-quote characters. Edit locally in submodule, push to Gitea, pull on server.
Restart-Service GuruRMMAgent -Force in command scripts Kills agent before it can report result. Commands stay forever running. Use scheduled task with delay instead.
sudo -u guru git in systemd build context git rejects repo as dubious ownership when running as root on guru-owned repo. Use git config --system --add safe.directory instead (applies to all users/envs — fixed 2026-06-11).
Self-updating running bash script bash reads line-by-line from disk; replacing mid-execution silently skips remaining blocks.
+1.77 legacy builds without --ignore-rust-version Fail MSRV check after adding rust-version to Cargo.toml. Add --ignore-rust-version to legacy build lines only.
StrictHostKeyChecking=no for Pluto SSH Replaced with pinned known-hosts at /opt/gururmm/pluto_known_hosts. MITM would compromise build artifacts.
CRLF line endings in migration files sqlx SHA-384 checksum mismatch causes server crash on start. .gitattributes + core.autocrlf=false + pre-commit hook prevents this.
Dead WebSocket write half WS write fails, send task dies, receive loop keeps agent in ConnectedAgents with dead write half. Commands silently fail. Fix: tokio::select! monitoring both tasks.
Using the minidump crate for Windows kernel dumps The crate only parses Breakpad MDMP format, not Windows kernel PAGEDU64 dumps. Parse DUMP_HEADER64/DUMP_HEADER32 at fixed offsets directly (validated against real dumps).
Build classification defaulting to stable New agent builds should default to beta; promotion to stable is an explicit re-tag of the <binary>.channel sidecar. Defaulting stable races the auto-update fleet ahead before any beta soak.
Webhook dispatching only agent builds The webhook historically triggered agent builds but never build-server.sh; server changes silently went unbuilt until manual intervention. Fixed 2026-06-02 — server build is dispatched alongside agent builds, gated by last-built-commit-server.
Auto-update-on-connect + default-stable tagging racing the fleet If a build is tagged stable (even briefly by mistake), agents on the stable channel auto-update immediately. Once fleet has updated, rolling back requires a new build. Default beta, soak, promote explicitly.
Channel pin at agent-level for a beta canary Agent-level update_channel is keyed to agent_id and lost on re-enrollment. Use site-level or client-level override (survives re-enroll).
Installer using & $stagingPath install 2>&1 | Out-Null under $ErrorActionPreference='Stop' Swallows output; surfaces a misleading NativeCommandError on any non-zero exit. Use Start-Process -Wait -PassThru + check $proc.ExitCode. Fixed 2026-06-11 (commit 5c0d004).
Reaper flipping un-acked commands to failed Causes false-failure for commands black-holed in NAT conntrack gaps. The reaper must only fail commands that were ACK'd but overran their real execution timeout. Un-acked commands must be requeued to pending.

Good patterns:

  • Platform parity rule — any agent feature goes on Windows + Linux + macOS in the same commit. If a real implementation isn't feasible, add a working stub + // TODO(platform): <os> — <reason>. No silent no-ops.
  • Per-platform last-built-commit tracking — Linux builds succeed and record progress independently of Windows builds.
  • Holistic feature development — every feature ships backend + API + dashboard UI + docs together. Backend-only features are rejected.
  • sqlx offline mode — compile-time query validation requires DB reachable or offline cache present.
  • RuntimeDirectory=gururmm in systemd unit — systemd-native way to give agent writable /run/gururmm/ for IPC socket.
  • Registry-first path resolution — read HKLM:\SOFTWARE\GuruRMM for install dir, fall back to service PathName, then hardcoded default.
  • interrupt_running_commands() at reconnect — flips all status='running' commands for reconnecting agent to status='interrupted'.
  • Build change-gate + backup/rollback in build-server.sh — skips rebuild when server/ is unchanged; backs up previous binary; restores it if the new binary fails is-active.
  • Server's own root RMM agent for privileged ops — the server (172.16.3.30) runs the GuruRMM Linux agent as root (hostname gururmm); it can re-tag .channel sidecars and trigger build-server.sh without SSH or sshpass.
  • GURU-5070 site as permanent beta-channel canary — site "Mike's Car" has update_channel='beta' (survives agent re-enrollment). Gets every new beta build; stable fleet protected by explicit update_rollouts pin. (Channel pin on site, not agent — agent-level overrides are lost on re-enroll.)
  • Capability gate via partial index, not version-string compare — CI auto-bumps versions ([ci-version-bump]); pinning behavior to a fixed version string is fragile. Use a DB-observable capability marker (e.g., acked_at IS NOT NULL) that is set when the agent actually demonstrates the capability.
  • Two independently-deployable slices for cross-component features — agent slice and server slice can be deployed in any order; the server slice uses a capability gate so old agents keep legacy behavior.

Build & Deploy

CRITICAL: Never trigger builds manually via SSH. All builds go through the webhook pipeline.

Gitea push to main
  -> webhook-handler.py (172.16.3.30:9000 via nginx :80 /webhook/, parallel threads per platform)
    -> build-shared.sh   (auto-version bump, git sync — runs once)
    -> build-linux.sh    (cargo build on .30; log: /var/log/gururmm-build-linux.log)
    -> build-windows.sh  (SSH -> Pluto 172.16.3.36 via pinned known-hosts
                          cargo build --release x64 MSVC + i686 MSVC
                          +1.77 legacy builds with --ignore-rust-version
                          WiX MSI build for site-specific base
                          sign-windows.sh (jsign + Azure Trusted Signing, /etc/gururmm-signing.env)
                          SCP artifacts back; log: /var/log/gururmm-build-windows.log)
    -> build-mac.sh      (stub — no build machine configured yet)
    -> build-server.sh   (gated: skips if no server/ changes since last-built-commit-server;
                          backs up current binary; builds + deploys; auto-rollback if fails is-active;
                          log: /var/log/gururmm-build-server.log)
  -> artifacts -> /var/www/gururmm/downloads/ with sha256 + -latest symlinks + .channel (beta)
  -> per-platform last-built-commit files updated
  -> systemctl restart gururmm-agent (local agent on .30)

Auto-version: build-shared.sh diffs agent/, server/, dashboard/ against last built SHA. For each changed component, bumps patch version in Cargo.toml or package.json, commits [ci-version-bump], pushes. Webhook skips builds where all commits are version bumps.

Build channel classification: New agent builds are tagged beta by default. Promotion to stable is an explicit step via the API: POST /api/updates/rollouts/:version/promote {"os":"windows","arch":"amd64"} — re-tags all .channel files for that version, records update_rollouts row, triggers scanner rescan. Rollback: POST /api/updates/rollouts/:version/rollback. Agents on the stable channel only receive the latest stable-tagged binary.

Downloads layout:

  • Base binaries: gururmm-agent-<plat>-<ver>.exe + .channel (beta/stable) + .sha256
  • Latest symlinks: gururmm-agent-<plat>-latest.exe + .channel + .sha256
  • Per-site signed cache: gururmm-agent-site-<site_id>-<plat>-<ver>.exe (built+jsign-signed on demand at install-request time)

Dashboard channels — BETA-FIRST:

Channel URL Web root Updates
beta https://rmm-beta.azcomputerguru.com /var/www/gururmm/dashboard-beta auto on push — build-dashboard.sh (change-gated on last-built-commit-dashboard)
prod https://rmm.azcomputerguru.com /var/www/gururmm/dashboard explicit only — sudo /opt/gururmm/promote-dashboard.sh --confirm (backs up prod; --rollback restores)

Do NOT hand-rsync into the prod web root. One artifact serves both channels — the Vite build bakes in the absolute prod API URL; beta uses an nginx-layer sub_filter BETA banner.

DB migrations — manual; must insert SHA-384 checksum into _sqlx_migrations or server crashes on start. Migrations run automatically on server restart (sqlx), but the checksum must be correct.

Pluto (172.16.3.36):

  • Windows Server 2019 VM on Jupiter (Unraid)
  • SSH: ssh -o UserKnownHostsFile=/opt/gururmm/pluto_known_hosts Administrator@172.16.3.36
  • Rust stable 1.95.0 + 1.77 pinned for legacy builds
  • VS Build Tools (MSVC), sccache at C:\sccache, WiX v4, Gitea clone at C:\gururmm\
  • Xeon E5-2695 v3 @2.3GHz, 8c/16t KVM VM; sequential build ~23min; parallel prototype 3.51x (deferred)

Auto-update delivery:

  • Server scans every 300s; dispatches update command on agent heartbeat
  • Gated on effective policy auto_update (default on when policy is null)
  • Agent: downloads to PrivateTmp, verifies SHA-256, replaces binary, restarts service
  • Force-trigger: POST /api/agents/:id/update
  • As of agent-comms-durability Phase 1: update dispatches piggyback on heartbeat re-offer path (same redispatch_pending_commands), so NAT'd agents receive updates reliably

Active State

Fleet (as of 2026-06-11, post-0.6.63 rollout):

  • ~215 enrolled agents total (growing)
  • Stable channel: 0.6.63 windows/amd64 (promoted 2026-06-11); 168 agents on 0.6.63 during rollout, 182 online throughout (no mass dropout). ~20 stragglers on older versions auto-update on reconnect.
  • Beta channel: site "Mike's Car" (103c10b9-c1de-4dd8-b382-b8362ed3143e) has update_channel='beta' (persists across re-enrollment). All GURU-5070 machines are on this site.
  • GURU-5070 live agent: 819df0c8; ghost (orphaned prior enrollment): c043d9ac (offline, durable-identity fix in progress).

Enrolled clients/sites (2026-05-24 baseline; no removals since):

Client Type Sites Notable agents
AZ Computer Guru (internal) Internal DF Server Storage, Howard-VM, Main Office, Mike's Car, Mikes House Jupiter, PLUTO, gururmm, GURU-KALI, GURU-5070, Mikes-MacBook-Air.local, ACG-DC16, NEPTUNE, ix.azcomputerguru.com
BirthBiologic Corporate Main Office BB-SERVER
Cascades of Tucson Corporate CascadesTucson 27 agents — CS-SERVER, RECEPTIONIST-PC, ASSISTMAN-PC, MDIRECTOR-PC, MEMRECEPT-PC, and ~22 others
Dataforth Corp Corporate D1 AD2, DF-GAGETRAK
Goldstein (verify) (verify) Canary client for 0.6.62/0.6.63 beta soak
Grabb & Durando Law Office Corporate Main Office GND-SERVER
Instrumental Music Center Corporate IMCMain IMC1
Key, Paul Residential Home KEY-MEDIA
Peaceful Spirit Residential Bridgette Home, Country Club, Mara Home BridgettePSHomeComputer, PST-SERVER (UDR Ultra, comms-durability verification), PST-SERVER2, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew
Safesite Corporate Glendale DESKTOP-3USU20B (comms-durability affected)
Sombra Residential LLC Corporate main office DESKTOP-UQRN4K3, Server2013
Stamback Septic Corporate StambackSeptic DESKTOP-BTR2AM3, StambackLaptopNew
Swanson, Len Residential Home LAS-GAMER

API auth:

  • POST /api/auth/login -> JWT (~24h)
  • Creds: vault infrastructure/gururmm-server.sops.yaml -> credentials.gururmm-api.admin-email / admin-password; or via bash .claude/scripts/rmm-auth.sh
  • Key endpoints: GET /api/agents, POST /api/agents/:id/command, GET /api/commands/:id, POST /api/agents/:id/update, POST /api/updates/rollouts/:version/promote
  • Command fields: command_type (shell/powershell/python/script/claude_task), command (script text, JSON-encoded), optional contextsystem (default) or user_session (Windows WTS), plus timeout_seconds/elevated.

Dashboard — complete and working: Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO, Policies Dashboard (all tabs), Registry editor (read-only via HTTP), MSP360 backup status + mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance, AgentDetail Crashes tab + version history, fleet stats from /agents/stats, SiteDetail Revoke Key + Enrollment audit tab, Install Reports page, Fleet Discovery page.

Dashboard — incomplete (see UI_GAPS.md):

  • Watchdog alerts UI — blocked on 2 missing server routes
  • BSOD in Alerts stream (Phase 2 deferred)
  • Tunnel session management (server skeleton "not yet implemented"; Phase 2 live TTY design will supersede)
  • Offboarding wizard UI (SPEC-028 complete; implementation pending)
  • Alert mute UI — Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard NOT started)
  • is_connected reflects real connectivity (null fleet-wide cosmetic bug; gap in API/dashboard layer)
  • Documentation / user guide / inline help tooltips (P3, not started)

Open Gitea issues:

  • #10 — BSOD detection Phase 2/3 (dashboard Alerts stream + fetch_bsod_dump + full bugcheck table)
  • #15 — Pipeline tray build (publish tray binary to downloads)
  • #16 — Windows IPC peer authz
  • #17 — logind console user resolution
  • #18 — macOS tray
  • #19 — subscriber broadcast

BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):

  • Root cause: TrayLauncher tracked launches only in an in-memory HashMap<sid,HANDLE> that resets on watchdog restart; no single-instance guard in the tray; terminate_all hard-killed via TerminateProcess skipping NIM_DELETE (ghost).
  • Fix (commit 137dd85): (1) per-session Local\GuruRMM_Tray single-instance mutex; (2) launcher reconciliation via WTSEnumerateProcessesW; (3) graceful Global\GuruRMM_TrayShutdown_{sid} event -> 3s wait -> TerminateProcess fallback. Fix #3 is dormant pending terminate_all wiring (coord todo 25fdf31a).

Security backlog (HIGH):

  • credentials/:id/reveal — horizontal privilege escalation (no ownership scope check)
  • internal_err() — ~130 call sites returning raw DB errors to callers

Key Architecture Decisions (LOCKED)

These decisions are locked. Do not reverse without explicit user approval.

  1. Per-agent enrollment keys — MSI contains server URL + site_id only. Agent calls POST /api/enroll on first run; server issues unique per-agent key stored hashed. Enables revocation, clone detection, audit trail.
  2. Site-specific MSI generation — Universal base MSI from CI; dashboard endpoint generates site-specific MSI with site_id baked in via WiX property -> HKLM\SOFTWARE\GuruRMM\SiteId.
  3. No TOML/config for endpoints — Server URL compiled into binary. No runtime config files for server URL or site_id.
  4. Policy inheritance chain — global -> site -> client -> agent. Server computes merged effective policy and pushes via ConfigUpdate WebSocket message.
  5. Platform parity rule — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible.
  6. Watchdog as separate process — Main agent cannot reliably restart itself after a crash.
  7. Build pipeline is the only path to production — Enforces signing, checksum generation, consistent artifact layout.
  8. Multi-tenancy identity model (ADR-001) — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1.
  9. Holistic feature development (DESIGN.md) — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
  10. AI-optional operation — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
  11. Durability-first command delivery — The durable Postgres commands row is the source of truth; the WebSocket is only a delivery hint. An un-acked command is re-deliverable, never reaped. Idempotent end-to-end by command_id so re-delivery never double-executes. (Established 2026-06-11, agent-comms-durability spec.)

History Highlights

Date Event
2025-12-15 Project genesis: Windows service + Linux installer + site code auth + build server. DB migrated from Jupiter Docker to local PostgreSQL.
2026-04-19 Full drill-down navigation, auto-install on first run, Pluto build VM setup started.
2026-04-21 MSI build fix (missing WiX extension flag). DESIGN.md created (holistic development mandate). BirthBiologic onboarded.
2026-04-29 UI_GAPS.md created. Holistic development principle formalized.
2026-05-12 macOS agent Phase 1 deployed from Mikes-MacBook-Air. Code signing issue on Apple Silicon noted.
2026-05-15 Dead WebSocket write-half bug fixed. Temperature struct field name mismatch fixed.
2026-05-16 Watchdog bugs fixed (sc.exe fallback, suppress_until, hypervisor detection). /feature-request skill created.
2026-05-17 Syncro PSA Integration added to roadmap (P1) after Howard /feature-request. Office power failure recovery — all VMs recovered.
2026-05-18 Multi-tenancy architecture (ADR-001) decided. 5 SPEC documents created (SPEC-001 through SPEC-006).
2026-05-19 4-bug fix for AD2 crash loop. MSP360 backup integration completed (6 fixes). Clickable CPU/Memory gauge cards + process drill-down modal.
2026-05-23 /rmm-audit pass. Agent optimization Phases 1A-3. Auto-version bump mechanism. MSRV bumped to 1.85. Fleet at 0.6.29.
2026-05-24 Linux tray IPC + GTK (PR #13+#14) and peer-cred authz (PR #14) merged. PR #21 (ReadWritePaths fix) merged. Build pipeline split into per-platform scripts. Pluto known-hosts pinned. Fleet converged to 0.6.38.
2026-05-31 Roadmap reconciliation (17 corrections — roadmap understated built state). MSPBackups mapping/verify UI + dev-admin impersonation UI deployed (dashboard v0.2.32). BUG-008/013/014 status corrected to fixed. SPEC-021 (logged-in user domain detection) written after Howard feature request.
2026-06-01 BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with 0x116 VIDEO_TDR_FAILURE (nvlddmkm.sys); GuruConnect cleared on three grounds. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1/SF-2; merged to main (0ec55cf), agent versioned 0.6.51.
2026-06-02 Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh. Webhook wired to dispatch build-server.sh with change-gate + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed.
2026-06-04 Channel state confirmed via live Postgres query: GURU-5070 agents.update_channel = 'beta' (explicit per-agent override). Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux. BUG-020 (duplicate/ghost tray icons) fixed in commit 137dd85 to beta: per-session single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event (fix #3 dormant — coord todo 25fdf31a). Verified by Grok + Code Review Agent.
2026-06-07 Backup-alert quality pass shipped. FU1 (summarize_backup_error + create_or_update_alert refresh) + FU2 (exclude PlanTypes 8/13 from alerting/compliance): false backup_failed 15 -> 2 fleet-wide (commits 779f7f6 + b82c010). backup_storage_low alert type removed entirely (DataCopied/TotalData is not destination capacity; 5 -> 0 false alerts; resolve_all_backup_storage_alerts).
2026-06-07 Credential inheritance deployed to production (server v0.3.45). Hierarchical propagation (Global → Client → Site), is_inheritable, /effective endpoints. Dashboard: clickable alert severity badges + client filtering. SPEC-028 offboarding wizard specification created (835 lines). FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management."
2026-06-07 Role-aware offline alerting + alert ignore/mute shipped. Offline sweep evaluator (60s interval; offline.rs): server-only agent_offline; migration 054 agents.role_override; site (>=50%+>=3) -> mass_offline_site; fleet (>=10) -> mass_offline_fleet. Warm-up restart guard (code review caught + fixed last_seen < started_at spec defect). Alert ignore/mute (perma-silence): migration 055 alert_mutes + muted status; dedup_key-keyed; reason required; gates create_or_update_alert + create_check_alert bypass; dashboard UI pending.
2026-06-07 UI gap batch + enrollment audit (GURU-BEAST-ROG session). get_agent upgraded to AgentWithDetails. Six new/updated server endpoints. Enrollment audit endpoint + per-row revoke. Dashboard: AgentDetail Crashes tab + version history, SiteDetail Enrollment tab, InstallReports + Discovery pages. Hotfix 6faa382 (role_override missing from new SQL — all GET /api/agents/:id returned 500).
2026-06-11 Physical server migration executed. Ubuntu VM at 172.16.3.30 retired; physical box took the same IP. Production down <27 min (06:53-07:20 MST). 162/212 agents reconnected within 15 min. Old VM parked at .46 as rollback anchor. Server binary path changed from /usr/local/bin/gururmm-server to /opt/gururmm/gururmm-server. Ubuntu 26.04 + PostgreSQL 18. 7-day whale data backfilled (~3.4 GB, lossless, 0 pool-timeouts). Three build-env gaps fixed on new box (sccache, Pluto key, signing tools).
2026-06-11 Ghost agent root-caused. device_id storage fragile (single file, wiped by cleanup.ps1). 11 duplicate-hostname agents found (9 same-site ghosts; 2 cross-site non-merge). Durable agent identity spec created (specs/durable-agent-identity/; multi-AI reviewed). Phase 1 Task 1 shipped (agent v0.6.62): multi-location device_id (registry + ProgramData on Win; /etc + /var/lib on Linux; /Library + mirror on macOS) with self-heal; cleanup.ps1 whitelisted. Channel pin regression fixed: moved GURU-5070 beta override from agent-level to site "Mike's Car" (survives re-enrollment).
2026-06-11 Agent-comms-durability Phase 1 shipped (agent v0.6.63 / server v0.3.68). Spec created (specs/agent-comms-durability/; multi-AI reviewed, Gemini + Grok). Root cause: NAT conntrack asymmetry — server->agent pushes black-holed in idle conntrack gaps. Fix: agent CommandAck on receipt (migration 058 acked_at); agent dedup cache (re-report, never double-run); server reaper re-delivery (migration 059 delivery_attempts, cap 10; capability gate via partial index); pending commands re-offered on every heartbeat. Installer hardened (Start-Process). Build-pipeline dubious-ownership fixed. Canary (Goldstein + Peaceful Spirit): 6 agents verified, PST-SERVER ACK latency 16 ms. Fleet promoted 0.6.63 stable; 168 agents converged in ~3 min, 182 online throughout.

Compilation Notes

  • 2026-05-26 recompile: Added the Capabilities / Feature Set section, synthesized from authoritative artifacts at live main (cd27a59). Corrected: command execution contexts, temperature monitoring (BUG-001 is resolved, not pending), Entra-only SSO, added user-inventory/discovery/VM-detection/safe-rollout surfaces. Changelogs are an unreliable capability source — stop at agent v0.6.22; migrations + commit log are authoritative.
  • Tunnel subsystem (verified against live main): agent side substantially built; server side is a dead-code skeleton (not declared in api/mod.rs, no routes, WS handler logs "not yet implemented"). Confirmed, not unverified.
  • macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). build-mac.sh is a stub as of 2026-05-24. [unverified whether automated pipeline includes macOS]
  • Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23). [unverified — may still be unfixed on new box]
  • Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24. [now superseded by comms-durability Phase 1 — heartbeat re-offer path should resolve this]
  • 2026-06-02 recompile: Folded in BSOD detection, server build webhook wiring, build channel default beta, versions updated. Migration count updated 46 -> 48.
  • 2026-06-04 recompile: Corrected GURU-5070 channel state. Stable fleet pinned at 0.6.47. BUG-020 documented.
  • 2026-06-07 recompile: Folded in backup-alert quality pass, credential inheritance, offline alerting + mute, UI gap batch + enrollment audit. Updated migration count to 55+ (054/055 confirmed).
  • 2026-06-11 recompile (GURU-5070/claude-main): Full recompile. Added: (1) Physical server migration (Ubuntu 26.04, PG 18, binary at /opt/gururmm, old VM .46 rollback anchor). (2) Durable agent identity (ghost root cause, spec, Phase 1 Task 1 durable device_id, channel pin fix). (3) Agent comms durability Phase 1 full detail (spec, slices A/B/C, deployment, fleet rollout, canary verification). (4) New capabilities: Audit Log (migration 056), Systemic Log Feedback Intelligence (migration 057), comms durability architecture. (5) Updated fleet size (~215 enrolled, 168-182 online). (6) Updated agent/server versions to 0.6.63/0.3.68. (7) Build pipeline: server IS auto-deployed by webhook (correction to earlier assumption); parallel build prototype on Pluto (3.51x, deferred integration). (8) Channel/promotion model documented in detail. (9) Downloads layout documented. (10) Updated server binary path + SSH details for physical box. (11) Added new anti-patterns (installer & operator, reaper false-fail, agent-level channel pin). Added ADR-11 (durability-first command delivery). Migration count updated to 59. Patterns/History preserved verbatim except new entries added.
  • clients/cascades-tucson — RECEPTIONIST-PC enrolled (site CascadesTucson)
  • systems/gururmm-build — Physical server at 172.16.3.30 on Jupiter subnet; GuruRMM API 3001, ClaudeTools API 8001, Coord API, MariaDB, PostgreSQL, build pipeline; migrated from VM to physical box 2026-06-11
  • systems/jupiter — Unraid host at 172.16.3.20; virsh host for all VMs (Pluto/Claude-Builder, Unifi, OwnCloud); Docker: Gitea port 3000, NPM, Seafile; NPM is the public TLS terminator for rmm-api.azcomputerguru.com (forwards to .30:80)
  • systems/pluto — Windows build server (MSI, WiX) at 172.16.3.36