Files
claudetools/wiki/projects/gururmm.md

95 KiB

type, name, display_name, last_compiled, compiled_by, aliases, sources, backlinks
type name display_name last_compiled compiled_by aliases sources backlinks
project gururmm GuruRMM 2026-06-22 Howard-Home/claude-main
guru-rmm
gururmm@main: server/src/api/*.rs (REST API surface, ~37 route modules)
gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs, device_id.rs)
gururmm@main: server/migrations/*.sql (60 migrations — feature checkpoints through 060_alert_mutes_agent_id_index)
gururmm@main: server/migrations/048_bsod_events.sql
gururmm@main: server/migrations/056_audit_log.sql
gururmm@main: server/migrations/057_log_signatures.sql
gururmm@main: server/migrations/058_command_acked_at.sql
gururmm@main: server/migrations/059_command_delivery_attempts.sql
gururmm@main: server/migrations/060_alert_mutes_agent_id_index.sql
gururmm@main: agent/src/bsod.rs
gururmm@main: server/src/api/updates.rs (promote/rollback endpoints)
gururmm@main: server/src/api/mod.rs (route surface, 2026-06-22)
gururmm@main: deploy/build-pipeline/webhook-handler.py
gururmm@main: deploy/build-pipeline/build-server.sh
gururmm@main: docs/FEATURE_ROADMAP.md (BUG-018..022 + SPEC-021 status)
gururmm@main: docs/BUILD.md (build + pre-merge verification guide)
gururmm@main: git log origin/main -40 (2026-06-22)
gururmm@main: commit 1dce66d (BUG-021 dep-pin)
gururmm@main: commit cea87d4 (BUG-018 202+bg)
gururmm@main: commit 0fa65f5 (Event Log Watch UI)
gururmm@main: commit 3de9faf (command_type cmd alias + NAK)
gururmm@main: commit 4027c86 (enrollment modal UX)
gururmm@main: commit f0a4b7f (BSOD dedup key)
gururmm@main: commit 95ef901 (MSI EXDEV fix)
gururmm@main: commit 137dd85 (BUG-020 tray fix)
gururmm@main: docs/specs/, docs/FEATURE_ROADMAP.md, docs/BUILD.md
projects/msp-tools/guru-rmm/CONTEXT.md
projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md
projects/msp-tools/guru-rmm/docs/UI_GAPS.md
projects/msp-tools/guru-rmm/docs/ARCHITECTURE_DECISIONS.md
projects/msp-tools/guru-rmm/docs/tech-stack.md
projects/msp-tools/guru-rmm/docs/DESIGN.md
projects/msp-tools/guru-rmm/docs/BUILD.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/shape.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/plan.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/references.md
projects/msp-tools/guru-rmm/specs/agent-comms-durability/standards.md
.claude/memory/reference_gururmm_server.md
.claude/memory/reference_gururmm_api.md
.claude/memory/gururmm-development-principles.md
.claude/memory/feedback_gururmm_agent_parity.md
.claude/memory/reference_pluto_build_server.md
.claude/memory/project_mac_gururmm_setup_pending.md
.claude/memory/feedback_gururmm_build_channel_default.md
.claude/memory/reference_gururmm.md
.claude/memory/feedback_gururmm_build_verification.md
.claude/memory/rmm-dashboard-beta-before-main.md
credentials.md
session-logs/2025-12-15-session.md
session-logs/2025-12-20-session.md
session-logs/2026-04-19-session.md
session-logs/2026-04-21-session.md
session-logs/2026-04-29-session.md
session-logs/2026-05-12-guru-rmm-macos-agent-phase1.md
session-logs/2026-05-15-session.md
session-logs/2026-05-16-session.md
session-logs/2026-05-17-session.md
session-logs/2026-05-19-gururmm-backup-fixes.md
session-logs/2026-05-19-session.md
session-logs/2026-05-21-session.md
session-logs/2026-05-23-session.md
session-logs/2026-05-24-session.md
session-logs/2026-05-24-GURU-KALI-session.md
session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md
session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md
session-logs/2026-06-07-mike-gururmm-offboarding-spec.md
session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
session-logs/2026-06-07-mike-gururmm-ui-gaps-enrollment.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-host-migration-cutover.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-backfill-build-pipeline.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-promotion-ghost-identity.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability-phase1.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-12-mike-beast-parallel-windows-build.md
projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-12-mike-rmm-command-type-cmd-misdiagnosis.md
session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md
session-logs/2026-06/2026-06-18-mike-sync-submodule-autoheal-and-thoughts-rescue.md
session-logs/2026-06/2026-06-21-howard-gururmm-bug-018-019.md
session-logs/2026-06/2026-06-21-howard-rmm-bug018-eventlogwatch-ui.md
session-logs/2026-06/2026-06-21-howard-gururmm-features-audit-submodule-fix.md
session-logs/2026-06/2026-06-21-mike-rmm-build-skill-errorlog-lint-coord-purge.md
clients/cascades-tucson
systems/gururmm-build
systems/jupiter
systems/pluto

GuruRMM

Summary

GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 270 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints.

Current version: agent 0.6.67 (stable) / server v0.3.74+ as of 2026-06-22. Fleet at ~270 enrolled / ~178 typically online; all metrics flowing, alerts firing with 0 legacy null dedup_keys, Windows build green at v0.6.67 (marker 1dce66d on origin/main).

BUG-021 (FIXED, commit 1dce66d, 2026-06-22): The legacy wave (Rust 1.77 for Win7/Server 2008 R2) was resolving deps fresh and pulling edition2024 crates. Fixed by pinning getrandom 0.3.1 + zeroize 1.8.1 below edition2024 in agent Cargo.toml. NOT a toolchain bump — bumping would drop legacy support. Windows build went green at 02:19, v0.6.67.

BUG-018 (FIXED, commit cea87d4): DELETE /api/agents/:id now returns 202 Accepted, marks the agent deleting (synchronous existence-check), disconnects the live WS, and purges all cascade rows in a background tokio task. Fleet listings and stats exclude 'deleting' rows; the agent disappears from the UI immediately.

Event Log Watch management UI (SHIPPED, commit 0fa65f5): /event-log-watches CRUD is wired into the dashboard — list, create, edit, delete, and enable-toggle watches for global and per-agent scope via a new pages/EventLogWatches.tsx in the FunctionRail + omnibox.

Agent-comms-durability Phase 1 shipped 2026-06-11: Commands black-holed at NAT'd sites no longer falsely fail. See Architecture section for full detail.

Physical server migration completed 2026-06-11: New physical box at 172.16.3.30 running Ubuntu 26.04 + PostgreSQL 18. Old VM at .46 decommissioned 2026-06-12.

Backup-alert quality pass shipped 2026-06-07: False backup_failed alerts reduced 15 -> 2 fleet-wide.

Role-aware offline alerting + alert ignore/mute shipped 2026-06-07: Server-only agent_offline alerts; site/fleet mass-offline aggregation; alert_mutes perma-silence.

See also: wiki/projects/guru-rmm.md is a redirect tombstone pointing here (slug disambiguation).

Repo: azcomputerguru/gururmm on Gitea (internal: http://172.16.3.20:3000). The copy at projects/msp-tools/guru-rmm is a git submodule tracking the active repo; its pinned gitlink intentionally lags origin/main. Development uses git -C projects/msp-tools/guru-rmm or worktrees off origin/main to access live code. Commits/pushes go to Gitea from the server (.30) or via GIT_ASKPASS with the vaulted Gitea API token. Howard is authorized for merges/deploys (cleared 2026-06-21 by Mike).

Goal: Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system.


Recent Work

2026-06-22 — BUG-021 Fixed + v0.6.67 Stable; Fleet Verified (Howard-Home)

  • BUG-021: Legacy build (Rust 1.77 via $CARGO +1.77 --features legacy) was pulling wit-bindgen (edition2024) through fresh dep resolution. Fixed by pinning getrandom 0.3.1 + zeroize 1.8.1 in agent Cargo.toml below edition2024. Commit 1dce66d (current origin/main HEAD). Windows build went green at 2026-06-22 02:19, v0.6.67.
  • Fleet verified (same session): ~270 agents / ~178 online; metrics rows flowing (~2531/15 min); all alerts have non-null dedup_keys (0 legacy nulls); per-agent endpoints all HTTP 200 on a test agent.
  • BUG-022 filed + fixed (PR #45): WatchdogEvent WS variant was dead code — defined + server-handled but the watchdog process has no WS connection and never produced it. Removed the orphaned path (variant, payload, insert_watchdog_event, match arm, watchdog_events::* re-export). Server-compiled clean (48.9s). Empty watchdog_events table left in place (no DROP migration) to avoid collision with in-flight PR migration numbers. PR #45 (fix/bug-022-watchdog-event-deadcode, 4eb5054) + PR #46 (docs) await Howard/Mike merge. Merging #45 = fleet build+deploy.
  • Roadmap verified: BUG-018 (cea87d4) and Event Log Watch UI (0fa65f5) confirmed on main; FEATURE_ROADMAP updated to reflect Fixed status.

2026-06-21 — Feature Batch, Audit, BUG-018 202+bg, Event Log Watch UI, Submodule Fix (Howard-Home + Mike GURU-5070)

Howard-Home sessions (three spans):

  • BUG-019 FIXED (commit 66a7f4e, merged to main by Mike): Container agent no longer performs in-place binary self-update (silent downgrade on container recreate / rollback-artifact log noise). agent/src/updater/mod.rs perform_update() now early-returns inside a container (is_docker_container()), reporting UpdateStatus::Failed with a clear message. Structured update_method=image deferred to SPEC-023. Verified on Linux build (v0.6.67 beta, 0 errors, 54s).
  • BUG-018 FIXED (commit cea87d4, merged): Handler redesign to 202 + background purge (see Summary). Migration 061 (FK indexes on five previously-unindexed FK columns, speeds the bg purge) on PR #41, pending merge. The root cause of the original HTTP 000 was cascade-delete volume (millions of metrics/logs rows), not missing indexes — the HIGH-volume tables were already indexed on agent_id.
  • Event Log Watch management UI (SHIPPED, 0fa65f5): /event-log-watches CRUD page (list, create, edit, delete, enable-toggle, global + per-agent scope). Built as a standalone page (not an AgentDetail panel) to avoid concurrently-edited AgentDetail. tsc -b + vite build clean. Merged + now on origin/main.
  • Enrollment modal UX fix (SHIPPED to prod, 4027c86): Enrollment modals (EnrollmentModal + EnrollAgentModal) now have a top-right close button, click-outside/Esc dismissal, and trigger a list refresh on close. Built from a branch worktree -> beta -> tested on rmm-beta -> promoted to prod with promote-dashboard.sh --confirm.
  • SPEC-021 + BUG-020 follow-up (PR #40, mig 063, 9171f84): Logged-in-user domain + account-type detection (agent metrics + server + dashboard). BUG-020 watchdog tray-teardown wiring (calling terminate_all from policy-disable/uninstall path). Migration renumbered 060 -> 063 to clear collision with BUG-019's 060.
  • MSP360 "Open in MSP360" deep-link (PR #42, mig 062): Button on backup-tab plan card deep-linking to the MSP360 console for a specific agent's backup plan. Backend: backup_provider_console_id field (migration 062) + provider-links.ts mapping. PR awaits merge.
  • Event Log Watch policy-clobber HIGH fix (PR #43, 432b434): Assigning or unassigning any policy silently wiped the connected agent's Event Log Watch monitoring. Third instance of a config-push clobber class already fixed at two other sites. Fixed by reusing watch_rules_for_agent (retargeted from &AppState to &PgPool) — single source of truth across all three push sites. No migration. PR awaits merge.
  • Audit cleanup (PR #44): Low findings from the 2026-06-21 /rmm-audit pass (DoS cap_field batch, SSE revocation bypass + JWT-in-URL, console.log stubs, ContextTree.tsx missing isError). Awaiting merge.
  • sync.sh submodule-clobber root-cause fix: sync.sh line 339 git submodule update --init ran unconditionally on every sync, re-checking out the intentionally-lagging pinned gitlink in detached HEAD and discarding live work. Fixed with a git -C "$ppath" rev-parse --git-dir populate-only guard — only initializes genuinely-missing submodules. Committed to claudetools main; verified guru-rmm now stays on origin/main @ ed8cad3 through a full sync.
  • Dashboard beta-before-main rule established: Dashboard changes go to beta first, before main. Beta auto-builds from origin/main via the webhook; a feature branch is previewed with a manual git worktree build + rsync to /var/www/gururmm/dashboard-beta. Promotion to prod = promote-dashboard.sh --confirm. (Saved as memory rmm-dashboard-beta-before-main.)

Mike GURU-5070 session:

  • BUG-019 merged to main (fast-forward ed8cad3 -> 66a7f4e; CI bump 8b5e0dc). Linux build confirmed green (v0.6.67, 54s, 0 errors).
  • gururmm-build skill created: bash .claude/skills/gururmm-build/scripts/verify.sh server|agent|dashboard|migrations for local pre-merge verification. Does NOT trigger the prod build (webhook-on-merge is the trigger). docs/BUILD.md added to guru-rmm repo (build guide). Memory feedback_gururmm_build_verification captures the key rules.
  • Howard cleared for GuruRMM merges/deploys (Mike decision). Mike still owns architecture/direction; prepared + verified branches no longer bottleneck on Mike to land.

2026-06-15 — BSOD Dedup Key Fix, MSI EXDEV Fix, Dashboard Offline Dedup, sync.sh Auto-Heal (Mike GURU-5070 + GURU-BEAST-ROG)

  • BSOD alert dedup key fixed (commit f0a4b7f, server v0.3.73): Dedup key changed from bsod:<agent>:<dump_sha256> (unique per crash, so recurring BSODs spawned new alert rows and bypassed per-dump mutes) to bsod:<agent>:<bugcheck_code> (stable across recurrences). Now a single perma-mute silences all future recurrences of the same bugcheck. The live duplicate alerts on MSI were resolved via psql, and a stable-key mute was inserted for the recurring 0x116 VIDEO_TDR_FAILURE on agent a685af29.
  • MSI EXDEV fix (commit 95ef901, server v0.3.74): Site-MSI builder was staging the signed MSI in std::env::temp_dir() (/tmp, a tmpfs) then rename-ing it to /opt/gururmm/downloads (root LV) — a cross-device rename that fails with EXDEV (os error 18). Every site-specific MSI install request returned 500. Fixed by staging in downloads_dir (same filesystem as the final path), matching the pattern the signed-EXE path already used.
  • Duplicate offline server alerts removed (commit b6ed564): Dashboard Triage view was showing server agents twice in offline status — fixed.
  • sync.sh auto-heal (Phase 1, 2026-06-15): resolve_submodule_collisions() helper added to sync.sh. On git submodule update abort (untracked files would be overwritten), it enumerates the precise colliding paths (intersection of untracked files and paths tracked by the target gitlink commit), moves only those aside to <file>.synced-aside-<UTCstamp>, then retries once. Preserved 94 orphaned RMM_THOUGHTS lines (2026-06-08 re-grounding pass) that a blind --force would have discarded. Phase 2 (populate-only guard) added 2026-06-21.
  • logs/analyze switched from Ollama to Claude API (commit c869e4d): Log analysis endpoint now calls the Claude API instead of Ollama-on-Beast.
  • 500-error-leak fix (commit 58c1a96): Closed the remaining 17 sites that returned raw DB errors in 500 bodies, routing them through internal_err. No new DB detail is exposed to callers.

2026-06-12 — Beast Parallel Windows Build (Lever A) + command_type "cmd" Misdiagnosis Fix (Mike GURU-5070)

Beast parallel Windows build (lever A, commit cfbdb59):

  • Rewrote run_remote_build() in build-windows.sh to drive concurrent SSH sessions from bash instead of a serial cmd /c chain. Two waves: WAVE 1 = 5 concurrent stable builds (amd64, x86, debug, tray, cleanup); WAVE 2 = 2 concurrent legacy-1.77 builds (l_amd64, l_x86). MSI build overlaps Wave 2. Per-job exit-code aggregation; Pluto fallback preserved.
  • Result: Beast parallel+warm ~336s (cargo 319s); ~46% faster than Beast serial cold (~622s); ~3.8x faster than Pluto serial (~1269s).
  • Key fix: Per-variant --target-dir is mandatory. Both stable variants sharing target/ serialized on cargo's build-directory lock. Fixed: amd64=target/release, x86=target/x86, legacy-amd64=target/legacy-x64, legacy-x86=target/legacy-x86.
  • Dropped cargo +1.77 fetch pre-resolve entirely. It resolved the full dep graph and pulled wit-bindgen (edition2024, unsupported by Cargo 1.77.2) — rc=101 on both hosts. The legacy builds scope deps via --features legacy (excludes wit-bindgen) when run as actual builds; the pre-fetch was both unnecessary and the failure point.
  • Live verification: v0.6.64 (first live Beast build, 622s cold), v0.6.66 (parallel fix, 336s). All 8 artifacts signed OK.

command_type "cmd" fix (commit 3de9faf, agent v0.6.66 stable):

  • Root cause of "PST agents can't receive commands": CommandType enum had no cmd variant. Server-sent command_type:"cmd" failed serde parse; agent logged-and-silently-dropped the message (no ACK, no result) — indistinguishable from a NAT black-hole. 7-round adversarial multi-AI quorum + packet captures were used to isolate the network path before the discriminator was found: the one working command used powershell, all failures used cmd.
  • Fix 1: #[serde(alias = "cmd")] on CommandType::Shell — accepts "cmd" as an alias for Shell (runs cmd.exe, for endpoints without PowerShell).
  • Fix 2: NAK on unparseable command — handle_server_message now sends CommandAck + error CommandResult on whole-message parse failure (if the payload.id is recoverable), instead of silently dropping. Fails fast rather than black-holing.
  • Reverted: the earlier "evict non-delivering connections" server change (commit 80df458) — its motivating premise was this cmd phantom; with valid commands acking in ms it only added spurious reconnect churn.
  • Promoted to stable fleet-wide. wiki/clients/peaceful-spirit.md and .claude/commands/rmm.md corrected to remove the wrong "use command_type: cmd" advice.

2026-06-11 — Physical Server Migration (Cutover + Backfill + Build Pipeline)

Cutover (morning ~06:53-07:20 MST):

  • Stopped all VM writers, fresh-restored gururmm/guruconnect (PG 18) and claudetools (MariaDB 11.8) on the new physical box via direct SSH to the VM.
  • VM released 172.16.3.30; new box took 172.16.3.30 + fresh management IP 172.16.3.47 via self-confirming detached netplan apply (auto-reverts if gateway unreachable — prevents a bricked network config).
  • All nine stack services active; 162/212 agents reconnected within 15 min. Fleet broadcast sent; 7 migration locks released.
  • Two production issues fixed in-window: (1) nginx had stale config missing /ws location — systemctl reload nginx fixed it and agents reconnected; (2) Prometheus failed due to WAL copied mid-write — moved wal/chunks_head aside, persisted blocks healthy (~2h of samples lost during downtime anyway).

Backfill + Build Pipeline (afternoon session):

  • 7-day whale data (metrics 1,189,924 rows + agent_logs 2,262,938 rows; ~3.4 GB total) backfilled VM->new box via direct SSH \copy stream in ~2.5 min. Load verified lossless; SSD sustained 186-214 MB/s writes with ZERO pool-timeouts while 212 live agents wrote concurrently.
  • Three build-env gaps found and fixed after first push-triggered build: (1) sccache not installed (copied v0.8.2 from VM); (2) root had no Pluto SSH key (copied guru's id_ed25519 to /root/.ssh); (3) /etc/gururmm-signing.env + jsign + Java missing (installed Java + jsign-7.1.jar; recreated jsign wrapper).
  • Parallel Windows build prototyped (7 Start-Job isolated builds, 3.51x speedup 28.9m -> 8.2m, 15 rustc vs 1). Replaced by the bash-driven lever-A approach in the 2026-06-12 session.
  • First clean build on new box: signed Windows agent v0.6.61 beta. Independently verified (Authenticode, Arizona Computer Guru LLC via Azure Trusted Signing, RFC3161 timestamp, sha256 OK).
  • Old VM decommissioned 2026-06-12: virsh domain + disk removed after stability soak; no longer exists.

2026-06-11 — Durable Agent Identity (Ghost Agent Spec + Phase 1 Task 1)

Ghost agent root cause:

  • device_id is a random UUIDv4 stored in ONE file (%ProgramData%\GuruRMM\.device-id / /var/lib/gururmm/.device-id). The agent's own cleanup.ps1 (Steps 6+7) deleted both ProgramData and HKLM\SOFTWARE\GuruRMM, wiping it. A past hardware->random scheme change also orphaned agents (11 ghosts with old win-/hw- device_ids). Server dedups ONLY on device_id at WS-connect; no reconciliation. 11 hostnames have duplicates (9 same-site true ghosts; 2 — MSI, SERVER — are different machines at different sites and must NOT be merged). 32 tables FK agent_id.
  • Context: channel pin regression discovered (GURU-5070 still on 0.6.59 after v0.6.61 promotion) because the beta override lived on the orphaned ghost enrollment. Fix: set update_channel='beta' at site "Mike's Car" (survives re-enrollment); resolve_agent_channel = agent.or(site).or(client).or('stable').

Spec + Phase 1 Task 1:

  • Multi-AI design review (Gemini + Grok, strong convergence): durable multi-location device_id (primary) + hardware fingerprint (reconciliation guard, not key). Identity model: agent/src/device_id.rs rewritten — HKLM registry + ProgramData on Windows; /etc + /var/lib on Linux; /Library + mirror on macOS — with self-heal, plus cleanup.ps1 whitelisting the identity files. Committed 0b81d33 (v0.6.62). Linux build green.
  • Spec: specs/durable-agent-identity/ (shape/plan/references/standards). Phase 1 (Task 1 shipped) stops ~99% of new ghosts. Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for 9 existing ghosts) per spec after soak.

2026-06-11 — Agent Comms Durability Phase 1 (Spec + Slices A/B/C + Deploy + Fleet)

Root cause (confirmed in code, ~95% AI-validated): NAT conntrack asymmetry. Commands are a server->agent PUSH over the persistent WS; agent->server heartbeats refresh the UDR conntrack and replies ride back. A server->agent command push in an idle gap hits a torn-down conntrack and is silently dropped (server sender.send() buffers without error for ~15 min). The server's 30s WS ping (server->agent) CANNOT revive a dead conntrack; only agent->server can. Server had no half-open detection (read loop ignores Pong). Agent already heartbeats every 30s — so the conntrack floor is sub-30s — meaning durable delivery nets are the correct fix, not keepalive alone.

Spec: specs/agent-comms-durability/ (shape/plan/references/standards). Multi-AI reviewed (Gemini 3.1-pro + Grok 4.3, 2 rounds, strong convergence). Phased: Phase 1 = durable delivery (shipped); Phase 2 = live TTY (planned); Phase 3 = AIMD keepalive + bulk->HTTPS + half-open eviction (planned).

Slice A (dormant server foundation — committed b93f2ef): Migration 058 adds commands.acked_at TIMESTAMPTZ. Server-side CommandAck handler arm and mark_command_acked() in db. Additive + dormant — no behavior change until agents send ACKs.

Slice B (agent — committed 45870b1, deployed 2026-06-11 as v0.6.63): AgentMessage::CommandAck { command_id } added. Agent sends CommandAck on RECEIPT of every dispatched command (before execution). CommandExecutor gained a FIFO-bounded recent-results cache (64 entries, 256 KB cap each): re-reports cached result for an already-completed command; ignores an in-flight duplicate; never double-runs. Result is cached BEFORE deregistering the running task so a re-delivery in the complete()/send() gap always finds it cached (re-report) rather than not-running (re-run).

Slice C (server — committed ca1657b, deployed 2026-06-11 as v0.3.68): Migration 059 adds delivery_attempts INTEGER NOT NULL DEFAULT 0 + partial index idx_commands_agent_acked ON commands(agent_id) WHERE acked_at IS NOT NULL (capability gate). db::requeue_undelivered_commands returns an un-acked command past a 60s ACK deadline to pending (re-delivery, never failed). db::fail_timed_out_commands rewritten into three safe cases — can no longer falsely-fail a deliverable command. mark_command_running increments delivery_attempts (cap 10). Reconnect block refactored into shared redispatch_pending_commands helper (send_to-based); called on EVERY heartbeat — so a black-holed push is re-delivered on the next agent beat (which just refreshed the NAT conntrack) without needing a reconnect.

Capability gate (version-independent): Reaper only re-delivers for agents that have demonstrably ACK'd at least once (EXISTS(... acked_at IS NOT NULL)). Old non-ACK agents keep the legacy fail-on-timeout path and can never double-run. Safe to deploy at any point during gradual agent rollout.

Deploy + fleet rollout:

  • Installer bug fixed (5c0d004): Start-Process -Wait -PassThru + exit code check (old & $stagingPath install 2>&1 | Out-Null swallowed output + surfaced misleading NativeCommandError). Agent CLI ANSI log garbage fixed (parse before logging init, file-only for one-shot CLI).
  • Server auto-deployed by webhook as v0.3.68 (commit 7376340). Migrations 058+059 applied. All durability code live server-side.
  • Build pipeline dubious-ownership fixed: git config --system --add safe.directory /home/guru/gururmm (/etc/gitconfig; webhook runs builds as root on guru-owned repo).
  • Canary: clients Goldstein + Peaceful Spirit set to update_channel='beta'; all 6 canary agents updated 0.6.61->0.6.62->0.6.63 within ~24 min. PST-SERVER (UDR Ultra, the original affected agent) verified: hostname; whoami returned exit 0, acked_at stamped, ack_latency 16 ms, delivery_attempts 1. Full ACK durability chain proven live.
  • Fleet promotion: POST /api/updates/rollouts/0.6.63/promote {"os":"windows","arch":"amd64"} -> 6 files re-tagged stable. 168 agents on 0.6.63 in ~3 min; online count held at 182 throughout (no mass dropout).

2026-06-07 — UI Gap Batch + Enrollment Audit (GURU-BEAST-ROG session)

API changes:

  • get_agent handler now returns AgentWithDetails — includes client_id and client_name via JOIN query
  • New endpoints: GET /api/agents/:id/bsod-events, GET /api/agents/:id/version-history
  • New endpoints: GET /api/install-reports, GET /api/install-reports/:id
  • New endpoint: GET /api/discovery/all-devices
  • DELETE /api/agents/:id/key redesigned as dual-revoke: nulls agents.api_key_hash (legacy Mode 1/2) AND sets enrolled_agents.revoked = TRUE (modern agk_ Mode 3)
  • New endpoint: GET /api/sites/:id/enrolled-agents — full enrollment audit with duplicate-hostname annotation
  • New endpoint: POST /api/enrolled-agents/:id/revoke — per-enrollment targeted revocation with cascade

Dashboard additions:

  • AgentDetail: Crashes tab (BSOD events), version history table in Updates tab
  • Dashboard.tsx: fleet stats wired to /agents/stats endpoint (was computed client-side)
  • SiteDetail: Revoke Key button + Enrollment tab with full audit table, status badges, duplicate-hostname warnings, per-row revoke
  • New pages: InstallReports (/install-reports), Discovery (/discovery)
  • FunctionRail: nav links for Install Reports + Fleet Discovery

Hotfix (production 500s): get_agent_with_details_by_id SQL was missing a.role_override (migration 054 added it as non-optional). All GET /api/agents/:id calls returned 500 until hotfix commit 6faa382.

Key commits: 5854c63 (UI gap batch), 6faa382 (hotfix), 49a7109 (enrollment audit batch), 00129af (fix route wiring)


2026-06-07 — Credential Inheritance Deployment & Offboarding Spec

  • Server v0.3.45 deployed with credential inheritance (Global → Client → Site, is_inheritable flag, de-duplication by (credential_type, label), /effective endpoints).
  • Dashboard: clickable alert severity badges with client filtering; offline badge scopes to client-specific agents.
  • SPEC-028 offboarding wizard specification created (835 lines) — 6-step site + 5-step client offboarding modals, data export (temp tokens, 1h expiry), typed confirmation, audit logging, cascade deletion. Implementation pending.
  • FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section.

Capabilities / Feature Set

Synthesized from authoritative artifacts (API routes, agent modules, 60 migrations through migration 060, roadmap, commit log) — not from session logs alone. See Compilation Notes.

Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to interrupted. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).

Monitoring & Telemetry

  • Core metrics per interval (policy-tunable): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win GetLastInputInfo, Linux xprintidle), public/WAN IP (cached, multi-service fallback). Cross-platform via sysinfo.
  • Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Linux via /sys/class/thermal; sysinfo::Components fallback. Windows LibreHardwareMonitor removed 2026-05-27 (CVE-2020-14979); Windows thermal collection pending a safe replacement strategy.
  • Process drill-down: top 10 by CPU + top 10 by memory in every metrics payload (cross-platform).
  • Network state: per-interface IPv4/IPv6, MAC, derived CIDR subnets — sent on change only.
  • Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01): agent/src/bsod.rs polls C:\Windows\Minidump for new .dmp files (5-min filetime poll), parses kernel dump header at fixed DUMP_HEADER64/DUMP_HEADER32 offsets (bugcheck code @0x38, 4 parameters, FILETIME @0xFA8 — the minidump crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references System event log (WER 1001 / Kernel-Power 41) for Report Id + faulting driver, deduplicates via C:\ProgramData\GuruRMM\bsod-seen.json watermark. Sends AgentMessage::BsodEvent. Server: migration 048 + server/src/db/bsod_events.rs + ws handler — always-Critical alert deduplicated by (agent_id, bugcheck_code) (key changed 2026-06-15 from dump_sha256 to bugcheck_code so a mute silences all recurrences). Dashboard Crashes tab shipped 2026-06-07. Phase 2/3 deferred: BSOD in Alerts stream, fetch_bsod_dump on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).

Remote Execution

  • Command types: shell, powershell, python, raw script{interpreter}, and claude_task. Options: timeout_seconds, elevated. Also accepts cmd as an alias for shell (added 2026-06-12, commit 3de9faf). Unknown/unparseable commands now NAK immediately with an error result (fail-fast, not silent-drop).
  • Execution context (041_add_command_context): system (default — Session 0 / service SYSTEM) or user_session (runs in the active logged-on user's desktop session via WTS token impersonation: WTSQueryUserToken + CreateProcessAsUserW + per-user env block). Windows-only; requires an active session.
  • Commands individually cancellable (POST /commands/:id/cancel) and aborted on disconnect. Auditable history; status running/completed/failed/timeout/interrupted.
  • Script library (017_scripts): stored scripts dispatched to agents with args/env/timeout/run_as_user; per-run history.
  • Agent Comms Durability (Phase 1, shipped 2026-06-11 — agent v0.6.63 / server v0.3.68):
    • Problem: server->agent command pushes black-holed at NAT'd sites (UDR Ultra) when conntrack gaps tear down the inbound direction. Agent heartbeats work (agent->server refreshes conntrack); reaper falsely failed black-holed commands after timeout.
    • Fix: agent sends CommandAck { command_id } on RECEIPT (migration 058 acked_at). Agent deduplicates by command_id — re-reports cached result, ignores in-flight duplicate, never double-runs. Server reaper re-delivers an un-acked command past 60s ACK deadline (returns to pending) instead of failing it. Migration 059 adds delivery_attempts counter (cap 10) + partial index for capability gate. Pending commands re-offered on (re)connect AND on every heartbeat. Verified at PST-SERVER: acked_at stamped, ack_latency 16 ms, delivery_attempts 1.
    • Capability gate: reaper only re-delivers for agents that have ACK'd at least once (idx_commands_agent_acked partial index). Old agents keep legacy fail-on-timeout path; safe mid-rollout.
    • Code anchors: server/src/db/commands.rs (requeue_undelivered_commands, fail_timed_out_commands rewrite, DELIVERY_ACK_DEADLINE_SECS=60, MAX_DELIVERY_ATTEMPTS=10), server/src/ws/mod.rs (redispatch_pending_commands, CommandAck handler), agent/src/transport/websocket.rs + agent/src/commands/mod.rs (ACK send + recent-results cache, RECENT_CAP=64, MAX_CACHED_RESULT_BYTES=256 KB).
    • Phase 2 (live TTY — rides warm WS, seq/resume, cadence switch) and Phase 3 (Adaptive keepalive, bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper) are PLANNED, not shipped.

Inventory & Discovery

  • Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
  • VM / hypervisor / container detection (032/033): is_virtual_machine, hypervisor_type, vm_uuid, is_hypervisor + hosted VM UUIDs, is_container, is_unraid.
  • User/group inventory (037-040): local + domain + Azure AD accounts (enabled, pw-never-expires, last-logon, is_admin, AD email/UPN/dept), domain-join classification (none/ad/aad/hybrid), domain name, M365 tenant ID, domain-controller detection (is_dc), group membership. Policy-scheduled (default 24h).
  • Network discovery: server-dispatched scans — TCP probes over configurable ranges/ports, ARP MAC, reverse DNS, basic OS fingerprint; devices stream back and are persisted/editable.

Patch / Agent Update Management

  • Self-updater: server sends version + URL + SHA256; agent downloads, verifies checksum, atomically swaps binary, restarts, and auto-rolls back to a kept backup if it fails to reconnect (~180s window).
  • Auto-update gated on effective policy auto_update (channel + defer_hours). Force via POST /agents/:id/update. Update channels at agent/site/client (026).
  • Channel model: new builds tag beta by default; agents default to stable channel; promotion to stable is a DELIBERATE re-tag of the .channel sidecar. scanner.rs get_latest_version: stable agents get latest STABLE-tagged binary; beta gets absolute-latest. resolve_agent_channel = agent.or(site).or(client).or('stable').
  • Promotion API: POST /api/updates/rollouts/:version/promote body {"os","arch"} — re-tags all .channel files for the version, records update_rollouts row, force-rescans. Rollback: POST /api/updates/rollouts/:version/rollback.
  • Safe-rollout (046): update_rollouts/health-metrics/events tables + /updates/rollouts promote/rollback. Health-gated automation is written-but-unwired (Phase 2); promotion is manual via API.
  • Container guard (BUG-019, shipped 66a7f4e, 2026-06-21): Agent inside a container early-returns from perform_update() with UpdateStatus::Failed + clear message, skipping the in-place binary swap that caused silent downgrades on container recreate. Full image-update path (SPEC-023) is not yet implemented.
  • Legacy fleet build support (two-wave parallel build): Agent ships both a stable-toolchain (modern x86_64 + x86, debug, tray, cleanup) wave and a legacy wave (Rust 1.77, --features legacy, for Win7/Server 2008 R2 endpoints). The gururmm-build skill wraps pre-merge verification. IMPORTANT: the legacy wave resolves deps fresh on Rust 1.77 — dep pins to avoid edition2024 pulls are required (BUG-021 lesson: pin getrandom 0.3.1 + zeroize 1.8.1).

Policy & Configuration Management

  • Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via ConfigUpdate. Effective policy queryable per scope.
  • Checks engine (018/019): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win sc.exe, Linux systemctl). Policy-attached check templates (024) with push-to-agent sync. On-demand run-checks.
  • Remote registry (Windows, winreg): agent supports enumerate/read/write (typed)/create/delete. HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but are not routed yet [verify].
  • Event Log Watch (migration 047): Server-configured rules for monitoring Windows Event Log channels by event ID / level. Agent reports matching events. Dashboard management UI at /event-log-watches shipped 2026-06-21 (0fa65f5). CRUD + enable/disable for global and per-agent scope. HIGH fix (PR #43, pending merge): policy assign/unassign no longer wipes connected agent's Event Log Watch monitoring (fixed by reusing watch_rules_for_agent shaper as single source of truth across all config-push sites).

Alerting & Watchdog

  • Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (022) with effective resolution; per-client email settings (020). Maintenance mode (021) to suppress alerting per scope.
  • Watchdog: separate supervising process (polls GuruRMMAgent every 30s, restart backoff) + launches/reaps the tray into active user sessions via WTS. The watchdog reports to the server via the REST POST /watchdog-alert endpoint (fires after 3 consecutive failed restarts). This is the only supported watchdog alert path.
  • [CORRECTED 2026-06-22 per BUG-022]: The WatchdogEvent WebSocket variant was dead code — defined in the agent transport enum and fully handled server-side (insert_watchdog_event, watchdog_events table), but the watchdog is a separate companion process with no WebSocket connection and never had a producer. Granular per-restart WS events were never implemented and the path has been removed (PR #45, pending merge). The empty watchdog_events table remains in the DB (no DROP migration, to avoid migration-number collision). watchdog_events = 0 all-time in a healthy fleet is correct and expected. Granular REST-based watchdog visibility is a future RMM_THOUGHTS item.
  • BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; fix #3 wiring in PR #40, pending merge): Root cause: TrayLauncher tracked launches only in an in-memory HashMap<sid,HANDLE> that resets on watchdog restart; no single-instance guard in the tray; terminate_all hard-killed via TerminateProcess skipping NIM_DELETE (ghost). Fix (commit 137dd85): (1) per-session Local\GuruRMM_Tray single-instance mutex; (2) launcher reconciliation via WTSEnumerateProcessesW; (3) graceful Global\GuruRMM_TrayShutdown_{sid} event -> 3s wait -> TerminateProcess fallback. Fix #3 (wiring terminate_all into watchdog policy-disable/uninstall path) is in PR #40 (SPEC-021+BUG-020 branch).
  • Role-aware offline alerting (shipped 2026-06-07): Scheduled offline-sweep evaluator (60s tokio interval; server/src/alerts/offline.rs). Server-only agent_offline alerts. Classifier: os_product_type IN {2,3} -> server; else os_name/os_version ~/server/i -> server; else manual role_override (migration 054). Site rule (>=50% + >=3) -> mass_offline_site; fleet rule (>=10) -> mass_offline_fleet. Dashboard: servers elevated individually; workstations collapsed. PUT /api/agents/:id/role-override. Known gap: os_product_type populated on only ~16/168 agents; os_name is the workhorse classifier.
  • Alert ignore/mute — perma-silence (server only, shipped 2026-06-07; dashboard pending): alert_mutes table (migration 055) + muted alert status. Keyed on dedup_key. Permanent until un-ignored; reason required (400 if missing). Gate at create_or_update_alert AND create_check_alert bypass. POST /api/alerts/:id/mute + /unmute. Dashboard Ignore/Muted/Un-ignore NOT yet built.

Credentials Management

  • Encrypted credentials vault (016): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate /reveal decrypt endpoint. Known HIGH item: /reveal ownership-scope check — [verify current state].
  • Credential inheritance (deployed 2026-06-07): Opt-in hierarchical cascade with is_inheritable flag (Global → Client → Site). De-duplication by (credential_type, label), most-specific scope wins. /effective endpoints return inherited + direct credentials with inherited_from indicator.

Audit Log (Migration 056)

  • Append-only audit_log table (migration 056): (id, occurred_at, user_id, action, target_type, target_id, detail JSONB). Actor preserved with SET NULL on user delete. Seeded by credential reveals (credential.reveal); generic shape for future sensitive operations. Indexes on occurred_at DESC, user_id, (target_type, target_id).

Systemic Log Feedback Intelligence (Migration 057)

  • Per-line provenance + deterministic fingerprint added to agent_logs table (nullable; new rows populate, old rows stay NULL): agent_version, signature_hash BIGINT, normalizer_version SMALLINT. Fingerprint computed server-side at ingest (server/src/fingerprint.rs).
  • Durable aggregate log_signatures table: one row per distinct normalized error signature (occurrence_count, affected_agents, sample_message, first/last_seen, normalizer_version). Long-retention (raw logs short-retention); alerting + dashboards can read this.
  • log_signature_versions table: per-signature x agent_version rollup — answers "which versions does signature X appear on?" (version regression correlation).
  • Log analysis (logs/analyze): switched 2026-06-15 from Ollama-on-Beast to the Claude API (commit c869e4d).

Backup Integration (MSP360 / MSPBackups)

  • Multi-provider config (034/035) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (044) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
  • Alert quality pass (2026-06-07): Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from alerting/compliance. MSP360 message JSON decoded via summarize_backup_error. create_or_update_alert refreshes title/message/severity on re-trigger. Fleet: false backup_failed 15 -> 2; survivors (AD1, LAB-Becky) are genuine.
  • backup_storage_low alert type REMOVED (2026-06-07): DataCopied/TotalData measures backup-dataset completeness, not destination capacity. Produced 5 fleet-wide false alerts. resolve_all_backup_storage_alerts clears stragglers. Genuine destination-capacity alerting deferred.
  • MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded): 8, 13.
  • MSP360 API scope (confirmed 2026-06-07): monitoring-only tier; management paths (plan delete, manual run, storage config) return 404; plan management remains MSP360-console tasks.
  • MSP360 "Open in MSP360" deep-link (PR #42, migration 062, pending merge): "Open in MSP360" button on the backup-tab plan card deep-linking to the MSP360 console for the specific agent's backup plan. Backend: backup_provider_console_id field (migration 062) + lib/provider-links.ts mapping.
  • Key functions: derive_backup_status, error_is_benign, summarize_backup_error, resolve_orphaned_backup_alerts, resolve_all_backup_storage_alerts, evaluate_plan (BACKUP_STALE).

Remote Access (Tunnel)

  • Agent side substantially built (TunnelManager state machine; Open/Close/Data). Server side is a dead-code skeleton — not declared in api/mod.rs, no /tunnel routes, WS handler logs "not yet implemented." Not production-ready. Live TTY design will supersede the skeleton (Phase 2 of agent-comms-durability spec).

Identity / Multi-tenancy / Security

  • Auth: JWT (login/register/me); agents auth over WS via per-agent API key + hardware device_id.
  • Microsoft Entra ID SSO (OAuth2/OIDC + PKCE), gated on server config. Google planned per SPEC-008 but not implemented [verify].
  • Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin user impersonation (/auth/impersonate/:id). Backend + dashboard UI shipped 2026-05-31.
  • Enrollment & keys (012): per-agent key issuance on first run, site API keys (regenerable), site-specific MSI with SITEKEY injected at download, public install-report ingestion. Legacy PowerShell agent path for Server 2008 R2.
  • Logs: agent log upload (periodic + on-demand), per-agent events (042), fleet log view, AI-assisted log analysis (/logs/analyze) — AI-optional per locked decision.
  • 500-error-leak fixed (commit 58c1a96, 2026-06-15): All server error paths now route through internal_err; raw DB error strings are no longer returned to callers.

Enrollment Detail (migration 012 + 2026-06-07 audit)

The enrolled_agents table is the authoritative enrollment history log:

Field Notes
site_id Site the agent enrolled under
agent_id Set on WebSocket connect
hostname Hostname at enrollment time
agent_key_hash Hash of the agk_ enrollment key (not surfaced in API)
enrolled_at Timestamp of initial enrollment
last_seen Updated on each WS connection
ip_address IP at last connection
os_version OS at enrollment time
revoked Boolean; set TRUE by revocation endpoints

WS auth modes: Mode 3 (modern, agk_ keys): checks enrolled_agents.agent_key_hash first. Mode 1/2 (legacy): falls back to agents.api_key_hash.

Revocation — dual-layer (fixed 2026-06-07): DELETE /api/agents/:id/key (bulk) revokes BOTH layers. POST /api/enrolled-agents/:id/revoke (targeted) per-enrollment with cascade.

Enrollment modal UX (fixed prod 2026-06-21, commit 4027c86): "Add Devices" and "Enroll an Agent" modals now have a top-right close button, close on click-outside/Esc, and trigger a site/client list refresh on close.

Platform coverage

  • Cross-platform (Win/Linux/macOS): metrics, network state, hardware/software/service inventory, user/group + DC/domain detection, checks (service checks Win+Linux), discovery, self-update, scripts/commands.
  • Windows-only: user_session command context (WTS), remote registry, tray-into-session, watchdog SCM supervision, BSOD detection, legacy fleet (Win7/Server 2008 R2) agent via Rust 1.77 --features legacy.
  • macOS: agent deployed (Phase 1); tray is a stub; automated Mac build pipeline is an intentional stub (no build host) [verify before claiming CI Mac releases].

Architecture

Components

Component Location Tech State
Server 172.16.3.30:3001, systemd gururmm-server.service (User=root, binary /opt/gururmm/gururmm-server, EnvironmentFile /opt/gururmm/.env) Rust, Axum deployed, production (physical box, Ubuntu 26.04, PG 18; migrated 2026-06-11)
Dashboard https://rmm.azcomputerguru.com, nginx at /var/www/gururmm/dashboard/ React + TypeScript + Vite, shadcn/ui, Tailwind CSS v4 deployed, production
Agent (Windows) Endpoints, installed as GuruRMMAgent Windows service via WiX MSI Rust, Windows MSVC deployed; stable fleet on 0.6.67
Agent (Linux) Endpoints, systemd gururmm-agent, binary /usr/local/bin/gururmm-agent Rust, musl static deployed
Agent (macOS) Endpoints, LaunchDaemon com.azcomputerguru.gururmm-agent.plist Rust, aarch64/x86_64 Phase 1 deployed 2026-05-12; code signing issue on Apple Silicon
Tray (Windows) System tray, named pipe IPC Rust deployed
Tray (Linux) System tray, Unix socket IPC, libappindicator/GTK Rust, GTK deployed 2026-05-24 (PR #13+#14)
Tray (macOS) Menu bar Rust stub/TODO (issue #18)
PostgreSQL DB localhost:5432 on 172.16.3.30, database gururmm PostgreSQL 18 deployed (migrated from PG VM, backfilled 2026-06-11)
Coord API 172.16.3.30:8001/api/coord FastAPI (part of ClaudeTools API) deployed
Build pipeline 172.16.3.30:9000 webhook + /opt/gururmm/ scripts Python (webhook-handler.py), Bash deployed; builds agents AND server (server auto-deploy wired 2026-06-02)
Windows build — Beast (PRIMARY) GURU-BEAST-ROG, i9-14900K (24c/32t), RTX 4090; reached over Tailscale at tailnet IP 100.101.122.4 as user guru Rust MSVC, WiX 4.x primary Windows build host — build-windows.sh tries Beast first; parallel build ~336s
Windows build — Pluto (FALLBACK) 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid), Xeon E5-2695 v3 8c/16t Rust MSVC, WiX v4 fallback only — used if Beast is unreachable/down OR its build fails (`attempt_build beast
Old VM (decommissioned) was 172.16.3.46 Ubuntu (original) DELETED 2026-06-12 after stability soak — virsh domain + disk removed; no longer exists

Key Files & Repos

  • Active repo: azcomputerguru/gururmmhttp://172.16.3.20:3000/azcomputerguru/gururmm
  • Submodule working tree: projects/msp-tools/guru-rmm — gitlink intentionally lags origin/main; read live code with git -C projects/msp-tools/guru-rmm show origin/main:<path> or worktrees off origin/main. Push via GIT_ASKPASS with vaulted Gitea API token (services/gitea-howard.sops.yaml).
  • Server binary: /opt/gururmm/gururmm-server on 172.16.3.30 (physical box; prior VM had /usr/local/bin/gururmm-server)
  • Server config: /opt/gururmm/.env (EnvironmentFile for systemd; DATABASE_URL here; root-only read)
  • Source checkout on server: /home/guru/gururmm (build-server.sh: git reset --hard origin/main -> cargo build -> deploy with backup+rollback)
  • Agent binary (Linux): /usr/local/bin/gururmm-agent
  • Agent config (Linux/macOS): /etc/gururmm/agent.toml (root, mode 600); macOS uses /usr/local/etc/gururmm/site.plist
  • Agent registry (Windows): HKLM\SOFTWARE\GuruRMM\SiteId (written by MSI); HKLM\SOFTWARE\GuruRMM\DeviceId (written by agent, added Phase 1 Task 1)
  • Windows service name: GuruRMMAgent (NOT gururmm-agent)
  • Downloads dir: /var/www/gururmm/downloads/ on 172.16.3.30 — base binaries gururmm-agent-<plat>-<ver>.exe + .channel + .sha256; -latest symlinks; per-site signed cache gururmm-agent-site-<site_id>-<plat>-<ver>.exe (built+jsign-signed on demand)
  • Webhook handler: /opt/gururmm/webhook-handler.py (port 9000 localhost; nginx :80 /webhook/ -> :9000; systemd gururmm-webhook)
  • Build scripts: /opt/gururmm/build-shared.sh, build-linux.sh, build-windows.sh, build-mac.sh (split 2026-05-24; build-agents.sh is compat wrapper)
  • Server build script: /opt/gururmm/build-server.sh (dispatched by webhook on server/ changes; change-gate last-built-commit-server; backup current binary; auto-rollback if fails is-active)
  • Per-platform SHA tracking: /opt/gururmm/last-built-commit-{linux,windows,mac,server}
  • Build log (Linux): /var/log/gururmm-build-linux.log
  • Build log (Windows): /var/log/gururmm-build-windows.log
  • Build log (Server): /var/log/gururmm-build-server.log
  • API (internal): http://172.16.3.30:3001
  • API (external): https://rmm-api.azcomputerguru.com (grey-cloud, no Cloudflare — agents connect here; Cloudflare only proxies the dashboard rmm.azcomputerguru.com)
  • Agent ingress path: agent -> site gateway -> pfSense (172.16.0.1, DNAT wan:443 -> NPM .20:18443) -> NPM (Docker on Jupiter .20, TLS terminate) -> origin nginx (.30:80, /ws -> 127.0.0.1:3001) -> Rust server :3001
  • Dashboard: https://rmm.azcomputerguru.com
  • DB URL: postgres://gururmm:<pw>@localhost:5432/gururmm (pw in /opt/gururmm/.env; do not hardcode here)
  • Vault paths: infrastructure/gururmm-server.sops.yaml (API creds, DB password, sudo password), infrastructure/gururmm-server-physical (SSH ed25519 key)
  • SSH to prod: ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 (ed25519, key-only; sudo pw in vault infrastructure/gururmm-server.sops.yaml credentials.password)
  • Pre-merge verification skill: bash .claude/skills/gururmm-build/scripts/verify.sh server|agent|dashboard|migrations

Repo Structure

gururmm/
├── agent/          Rust agent (managed endpoints)
│   └── src/
│       ├── bsod.rs         Windows BSOD/kernel-crash detection
│       ├── commands/mod.rs CommandExecutor + recent-results cache (ACK dedup)
│       ├── device_id.rs    Durable multi-location device_id (Phase 1 Task 1)
│       ├── ipc.rs          Unix socket IPC (Linux); Windows named pipe
│       ├── transport/
│       │   ├── mod.rs      AgentMessage enum (incl. CommandAck; WatchdogEvent REMOVED)
│       │   └── websocket.rs WS transport; execute_command ACK + dedup; cmd-alias NAK
│       ├── tunnel/         TunnelManager state machine
│       ├── metrics/        sysinfo collection + temps
│       ├── registry_ops/   Windows registry read/write
│       ├── updater/        Self-update handler (container guard added BUG-019)
│       └── main.rs         systemd unit template generation
├── server/         Rust/Axum API server
│   └── src/
│       ├── api/            REST handlers (~37 route modules; updates.rs: promote/rollback)
│       ├── alerts/         Alerting modules (offline.rs: offline_sweep + mass_offline)
│       ├── db/             Database layer (sqlx); commands.rs (requeue_undelivered, fail_timed_out rewrite)
│       │                   watchdog_events.rs (doc-only stub — table exists, path removed per BUG-022 PR #45)
│       ├── fingerprint.rs  Log signature normalization + hash
│       ├── ws/             WebSocket handler (CommandAck, redispatch_pending_commands)
│       └── mspbackups/     MSP360 backup integration
├── dashboard/      React/TypeScript UI
│   └── src/pages/
│       └── EventLogWatches.tsx  CRUD management UI (shipped 0fa65f5)
├── tray/           System tray binary
├── installer/      WiX v4 MSI (gururmm-agent.wxs); cleanup.ps1 (preserves device_id)
├── deploy/
│   └── build-pipeline/ webhook-handler.py, build-*.sh, build-server.sh
├── scripts/        Build/ops scripts
└── docs/           FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md,
                    DESIGN.md, BUILD.md, specs/, HOST_MIGRATION_RUNBOOK.md, RMM_THOUGHTS.md

Development

Current Focus

As of 2026-06-22 (agent 0.6.67 stable / server v0.3.74+):

  • BUG-022 (PR #45, pending merge): Remove dead WatchdogEvent WS path. Merging PR #45 (fix/bug-022-watchdog-event-deadcode) = fleet build+deploy. Empty watchdog_events table stays until a future consolidated cleanup migration.
  • Open PRs awaiting merge (migration order matters — merge in order: 060 already on main; 061 -> 062 -> 063):
    • PR #40 SPEC-021 + BUG-020 (migration 063, renumbered from 060): logged-in-user domain/account-type detection + watchdog tray-teardown wiring. Also needs Pluto-signed-MSI agent build + fleet rollout.
    • PR #41 BUG-018 FK indexes (migration 061): FK indexes on five previously-unindexed cascade children (speeds the BUG-018 background purge). BUG-018 handler itself is already merged (cea87d4).
    • PR #42 MSP360 deep-link (migration 062): "Open in MSP360" button on backup tab plan card.
    • PR #43 Event Log Watch policy-clobber HIGH fix: no migration; merge anytime.
    • PR #44 Audit cleanup (low findings from 2026-06-21 audit).
    • PR #45 BUG-022 watchdog dead code.
    • PR #46 docs (BUG-021/022 roadmap status updates + RMM_THOUGHTS).
  • Agent-comms-durability Phase 2 (live TTY, planned): Extend WS message enum with TTY stream frames (stdin/stdout/stderr, resize, seq). "Activate" cadence switch. Resume from seq on mid-session drop. Single-use time-bounded session token; max one interactive session per agent; full audit. Estimated 3-5 days separate effort.
  • Agent-comms-durability Phase 3 (planned): Adaptive keepalive (AIMD, persist interval), bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper (last_inbound track + evict >~90s; treat missed Pong as close).
  • Durable agent identity Phase 1 Tasks 2-3 (pending): Task 2 = hardware-fingerprint capture. Task 3 = dashboard "probable duplicate" surfacing (read-only). Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for the 9 existing ghosts) pending soak.
  • SPEC-028 offboarding wizard (specification complete, 835 lines; implementation pending): Site + client offboarding workflows, data export, typed confirmation, audit logging.
  • BSOD detection Phase 2/3 (deferred): Dashboard Crashes tab shipped 2026-06-07. Remaining: BSOD in Alerts stream (issue #10), fetch_bsod_dump on-demand upload, full ~350-entry bugcheck name table.
  • Linux fleet unit drift: Auto-updater replaces binary but does NOT refresh systemd unit file. Pre-BUG-016-fix agents have new binary + old unit (missing StateDirectory=gururmm). Needs ops-script pass.
  • Alert mute dashboard (Task 5) -- NOT started: Ignore button + required-reason prompt; Muted filter; Un-ignore control. Server endpoints ready.
  • MSP360 backup Phase 2 (not started): Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint, monitoring-only API tier confirmed).
  • Tray IPC + peer authorization — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
  • Security audit backlog: credentials/:id/reveal horizontal privilege escalation (HIGH), update_rollouts.promoted_by UUID vs users.id int mismatch (schema quirk from migration 046).
  • SPEC-021 (PR #40): Logged-in-user domain + account-type detection (agent + server + dashboard). Pending merge.

Patterns & Anti-Patterns

Anti-patterns — never repeat:

Pattern What Went Wrong
useMemo with stable deps for data-dependent values queryClient is stable, memo never recomputes after queries resolve. Use useQuery instead.
CSS variable text colors inside the sidebar Sidebar bg is hardcoded dark; CSS vars flip in light mode. Use text-white explicitly inside sidebar.
Deploying without stopping the server first "text file busy" kernel error. Always systemctl stop before cp.
Building without DATABASE_URL sqlx compile-time macros fail. DATABASE_URL is in /opt/gururmm/.env (or /home/guru/.cargo/env on dev).
DB migrations without inserting into _sqlx_migrations Server crashes on start. Must insert SHA-384 checksum manually.
WiX MSI builds on Linux WiX requires msi.dll. MSI must be built on Pluto or Beast (Windows).
Manual builds via SSH All builds go through webhook-handler.py. Never SSH and run cargo build + artifact copy manually.
TOML/config for agent endpoint or site_id Server URL compiled into binary, site_id baked into MSI. No runtime config files for these values.
path.find('\\') in #[cfg(windows)] files Compiles on Linux silently, fails on Pluto MSVC with unterminated char literal. Use '\\\\'.
STATUS_BADGE_CLASSES Record const Vite/Rollup may optimize away the lookup. Use explicit getStatusBadgeClass() if/else function.
SSH heredoc for TypeScript edits Shell strips double-quote characters. Edit locally in submodule, push to Gitea, pull on server.
Restart-Service GuruRMMAgent -Force in command scripts Kills agent before it can report result. Commands stay forever running. Use scheduled task with delay instead.
sudo -u guru git in systemd build context git rejects repo as dubious ownership when running as root on guru-owned repo. Use git config --system --add safe.directory instead (applies to all users/envs — fixed 2026-06-11).
Self-updating running bash script bash reads line-by-line from disk; replacing mid-execution silently skips remaining blocks.
+1.77 legacy builds without --ignore-rust-version Fail MSRV check after adding rust-version to Cargo.toml. Add --ignore-rust-version to legacy build lines only.
StrictHostKeyChecking=no for Pluto SSH Replaced with pinned known-hosts at /opt/gururmm/pluto_known_hosts. MITM would compromise build artifacts.
CRLF line endings in migration files sqlx SHA-384 checksum mismatch causes server crash on start. .gitattributes + core.autocrlf=false + pre-commit hook prevents this.
Dead WebSocket write half WS write fails, send task dies, receive loop keeps agent in ConnectedAgents with dead write half. Commands silently fail. Fix: tokio::select! monitoring both tasks.
Using the minidump crate for Windows kernel dumps The crate only parses Breakpad MDMP format, not Windows kernel PAGEDU64 dumps. Parse DUMP_HEADER64/DUMP_HEADER32 at fixed offsets directly (validated against real dumps).
Build classification defaulting to stable New agent builds should default to beta; promotion to stable is an explicit re-tag of the <binary>.channel sidecar. Defaulting stable races the auto-update fleet ahead before any beta soak.
Webhook dispatching only agent builds The webhook historically triggered agent builds but never build-server.sh; server changes silently went unbuilt until manual intervention. Fixed 2026-06-02 — server build is dispatched alongside agent builds, gated by last-built-commit-server.
Auto-update-on-connect + default-stable tagging racing the fleet If a build is tagged stable (even briefly by mistake), agents on the stable channel auto-update immediately. Once fleet has updated, rolling back requires a new build. Default beta, soak, promote explicitly.
Channel pin at agent-level for a beta canary Agent-level update_channel is keyed to agent_id and lost on re-enrollment. Use site-level or client-level override (survives re-enroll).
Installer using & $stagingPath install 2>&1 | Out-Null under $ErrorActionPreference='Stop' Swallows output; surfaces a misleading NativeCommandError on any non-zero exit. Use Start-Process -Wait -PassThru + check $proc.ExitCode. Fixed 2026-06-11 (commit 5c0d004).
Reaper flipping un-acked commands to failed Causes false-failure for commands black-holed in NAT conntrack gaps. The reaper must only fail commands that were ACK'd but overran their real execution timeout. Un-acked commands must be requeued to pending.
command_type: "cmd" in API commands cmd was not a valid CommandType variant — agent silently dropped the whole message (no ACK, no result). Now aliased to Shell (commit 3de9faf), but check command_type first whenever a command appears un-acked. Valid types: shell, powershell, python, script, claude_task (+ cmd alias).
cargo +1.77 fetch as a parallel-build pre-resolve Resolves the full dep graph, which on Rust 1.77.2 pulls wit-bindgen (edition2024) and fails with rc=101 on both build hosts. Removed: the legacy builds scope deps via --features legacy and don't need a pre-resolve.
Concurrent parallel builds sharing one --target-dir Cargo's per-build-dir lock forces them to run serially. Each variant must have its own target dir (amd64=target/release, x86=target/x86, legacy-amd64=target/legacy-x64, legacy-x86=target/legacy-x86). Fixed 2026-06-12 (commit cfbdb59).
Pinning new legacy-wave deps without checking edition requirements BUG-021: getrandom 0.3.2 / zeroize 1.9.x require edition2024, which Cargo 1.77 cannot handle. Always verify the edition compatibility of any new dep version before adding it to the legacy build. Pin to getrandom 0.3.1 + zeroize 1.8.1 (or later confirmed-compatible versions).
Asserting build status from a point-in-time log snapshot Build logs are append-only; a prior FAILED line stays in the tail even after a later successful run. Check the LIVE last-built-commit-<platform> marker vs origin/main HEAD to confirm the current build status.
Config-push clobber at multiple independent sites The server has multiple code paths that push per-agent configuration (policy assign, policy unassign, reconnect, etc.). If the rule-shaper is not the single source of truth, adding a new push site will silently omit config (Event Log Watches were wiped by policy assign/unassign — third instance of this class). Always reuse the shared shaper; never duplicate the rule-mapping logic.
Reading the stale submodule working tree for code The submodule gitlink intentionally lags origin/main. Always use git -C projects/msp-tools/guru-rmm show origin/main:<path> or a worktree off origin/main to read current code. Reading the working tree gives stale results and led to false audit findings.

Good patterns:

  • Platform parity rule — any agent feature goes on Windows + Linux + macOS in the same commit. If a real implementation isn't feasible, add a working stub + // TODO(platform): <os> — <reason>. No silent no-ops.
  • Per-platform last-built-commit tracking — Linux builds succeed and record progress independently of Windows builds.
  • Holistic feature development — every feature ships backend + API + dashboard UI + docs together. Backend-only features are rejected.
  • sqlx offline mode — compile-time query validation requires DB reachable or offline cache present.
  • RuntimeDirectory=gururmm in systemd unit — systemd-native way to give agent writable /run/gururmm/ for IPC socket.
  • Registry-first path resolution — read HKLM:\SOFTWARE\GuruRMM for install dir, fall back to service PathName, then hardcoded default.
  • interrupt_running_commands() at reconnect — flips all status='running' commands for reconnecting agent to status='interrupted'.
  • Build change-gate + backup/rollback in build-server.sh — skips rebuild when server/ is unchanged; backs up previous binary; restores it if the new binary fails is-active.
  • Server's own root RMM agent for privileged ops — the server (172.16.3.30) runs the GuruRMM Linux agent as root (hostname gururmm); it can re-tag .channel sidecars and trigger build-server.sh without SSH or sshpass.
  • GURU-5070 site as permanent beta-channel canary — site "Mike's Car" has update_channel='beta' (survives agent re-enrollment). Gets every new beta build; stable fleet protected by explicit update_rollouts pin. (Channel pin on site, not agent — agent-level overrides are lost on re-enroll.)
  • Capability gate via partial index, not version-string compare — CI auto-bumps versions ([ci-version-bump]); pinning behavior to a fixed version string is fragile. Use a DB-observable capability marker (e.g., acked_at IS NOT NULL) that is set when the agent actually demonstrates the capability.
  • Two independently-deployable slices for cross-component features — agent slice and server slice can be deployed in any order; the server slice uses a capability gate so old agents keep legacy behavior.
  • Worktrees off origin/main for concurrent sessions — isolates each session's work from shared dirty checkout. Commit + push by explicit SHA (git push origin <sha>:refs/heads/<branch>) to avoid HEAD/branch-ref races. Standard pattern when multiple sessions touch the same submodule.
  • gururmm-build skill for pre-merge verificationbash .claude/skills/gururmm-build/scripts/verify.sh <component> runs the same compile checks the server will. Does NOT trigger production build. Merging to main is the only prod trigger.
  • Dashboard beta-before-main rule — dashboard changes go to beta (auto-built from main) first. Preview a feature branch without merging: git worktree add --detach <wt> origin/<branch> + npm install + npm run build + rsync dist/ /var/www/gururmm/dashboard-beta/. Promote with promote-dashboard.sh --confirm.

Build & Deploy

CRITICAL: Never trigger builds manually via SSH. All builds go through the webhook pipeline.

Gitea push to main
  -> webhook-handler.py (172.16.3.30:9000 via nginx :80 /webhook/, parallel threads per platform)
    -> build-shared.sh   (auto-version bump, git sync — runs once)
    -> build-linux.sh    (cargo build on .30; log: /var/log/gururmm-build-linux.log)
    -> build-windows.sh  (SSH -> Beast PRIMARY (guru@100.101.122.4 over Tailscale)
                          || fall back to Pluto Administrator@172.16.3.36;
                          each via its own pinned known-hosts
                          TWO-WAVE PARALLEL BUILD (see below)
                          sign-windows.sh (jsign + Azure Trusted Signing, /etc/gururmm-signing.env)
                          SCP artifacts back; log: /var/log/gururmm-build-windows.log)
    -> build-mac.sh      (stub — no build machine configured yet)
    -> build-server.sh   (gated: skips if no server/ changes since last-built-commit-server;
                          backs up current binary; builds + deploys; auto-rollback if fails is-active;
                          log: /var/log/gururmm-build-server.log)
  -> artifacts -> /var/www/gururmm/downloads/ with sha256 + -latest symlinks + .channel (beta)
  -> per-platform last-built-commit files updated
  -> systemctl restart gururmm-agent (local agent on .30)

Two-wave parallel Windows build (lever A, wired 2026-06-12, commit cfbdb59):

  • WAVE 1 (5 concurrent, stable toolchain): s_amd64 (target/release), s_x86 (target/x86), s_debug (target/debug-agent), s_tray (tray/target), s_cleanup (installer/cleanup/target)
  • WAVE 2 (2 concurrent, Rust 1.77 legacy, after Wave 1): l_amd64 (target/legacy-x64), l_x86 (target/legacy-x86) — uses $CARGO +1.77 --features legacy --ignore-rust-version
  • MSI build overlaps Wave 2 (depends only on Wave 1 artifacts)
  • Per-job exit-code aggregation; if Beast fails, Pluto is tried with the same structure
  • Beast timing: ~336s cargo phase (319s), ~46% faster than Beast serial cold (~622s), ~3.8x vs Pluto serial (~1269s)

[CRITICAL] Legacy wave dep-pin gotcha (BUG-021): The legacy wave resolves deps fresh with cargo +1.77 fetch... NO — cargo fetch is REMOVED from the pipeline (it was the failure point: Cargo 1.77.2 chokes on edition2024 deps like wit-bindgen when resolving the full dep graph). Legacy builds scope deps via --features legacy and do NOT need a pre-fetch. However: if any dep added to Cargo.toml ships a new patch that requires edition2024, the legacy build will fail. Pin known-problematic deps: getrandom = "0.3.1" (not 0.3.2+), zeroize = "1.8.1" (not 1.9.x). Verify before bumping any dep that the new version's MSRV is compatible with Rust 1.77.

Auto-version: build-shared.sh diffs agent/, server/, dashboard/ against last built SHA. For each changed component, bumps patch version in Cargo.toml or package.json, commits [ci-version-bump], pushes. Webhook skips builds where all commits are version bumps.

Build channel classification: New agent builds are tagged beta by default. Promotion to stable is an explicit step via the API: POST /api/updates/rollouts/:version/promote {"os":"windows","arch":"amd64"} — re-tags all .channel files for that version, records update_rollouts row, triggers scanner rescan. Rollback: POST /api/updates/rollouts/:version/rollback. Agents on the stable channel only receive the latest stable-tagged binary.

Downloads layout:

  • Base binaries: gururmm-agent-<plat>-<ver>.exe + .channel (beta/stable) + .sha256
  • Latest symlinks: gururmm-agent-<plat>-latest.exe + .channel + .sha256
  • Per-site signed cache: gururmm-agent-site-<site_id>-<plat>-<ver>.exe (built+jsign-signed on demand at install-request time; staged in downloads_dir, NOT /tmp — EXDEV fix 95ef901)

Dashboard channels — BETA-FIRST:

Channel URL Web root Updates
beta https://rmm-beta.azcomputerguru.com /var/www/gururmm/dashboard-beta auto on push — build-dashboard.sh (change-gated on last-built-commit-dashboard)
prod https://rmm.azcomputerguru.com /var/www/gururmm/dashboard explicit only — sudo /opt/gururmm/promote-dashboard.sh --confirm (backs up prod; --rollback restores)

Dashboard changes go to beta BEFORE main. To preview a feature branch without merging: build its dashboard/ in a git worktree add --detach <wt> origin/<branch>, npm install --no-audit --no-fund && npm run build, then rsync -a --delete dist/ /var/www/gururmm/dashboard-beta/. Do NOT touch /opt/gururmm/last-built-commit-dashboard. Do NOT hand-rsync into the prod web root.

DB migrations — applied automatically on server restart (sqlx), but must have correct SHA-384 checksum in _sqlx_migrations or server crashes. Migration number collisions across branches are a real hazard — renumber before merging if two branches both add NNN_*.sql. Migration numbers confirmed on origin/main: 001-060 (060_alert_mutes_agent_id_index). Pending branches: 061 (BUG-018 FK indexes), 062 (MSP360), 063 (SPEC-021).

Windows build hosts — Beast PRIMARY, Pluto FALLBACK. build-windows.sh runs attempt_build beast || attempt_build pluto: Beast is tried first; Pluto is used only if Beast is unreachable/down or its build fails. Both build the same C:\gururmm checkout; host keys at /opt/gururmm/{beast,pluto}_known_hosts.

Beast (GURU-BEAST-ROG) — PRIMARY:

  • Physical workstation, i9-14900K (24c/32t), RTX 4090, 127.8 GB RAM.
  • Reached from .30 over Tailscale (installed on .30) at tailnet IP 100.101.122.4, user guru.
  • SSH: ssh -o UserKnownHostsFile=/opt/gururmm/beast_known_hosts guru@100.101.122.4
  • Rust MSVC toolchain, WiX 4.x (WiX v6+ pulls OSMF and breaks the build), Gitea clone at C:\gururmm\.
  • On a different LAN (Wi-Fi 10.2.51.228) + tailnet — reachability depends on Beast being up on the tailnet.

Pluto (172.16.3.36) — FALLBACK:

  • Windows Server 2019 VM on Jupiter (Unraid)
  • SSH: ssh -o UserKnownHostsFile=/opt/gururmm/pluto_known_hosts Administrator@172.16.3.36
  • Rust stable 1.95.0 + 1.77 pinned for legacy builds
  • VS Build Tools (MSVC), sccache at C:\sccache, WiX v4, Gitea clone at C:\gururmm\
  • Xeon E5-2695 v3 @2.3GHz, 8c/16t KVM VM; serial build ~1269s

Auto-update delivery:

  • Server scans every 300s; dispatches update command on agent heartbeat
  • Gated on effective policy auto_update (default on when policy is null)
  • Agent: downloads to PrivateTmp, verifies SHA-256, replaces binary, restarts service
  • Force-trigger: POST /api/agents/:id/update
  • As of agent-comms-durability Phase 1: update dispatches piggyback on heartbeat re-offer path (same redispatch_pending_commands), so NAT'd agents receive updates reliably

Active State

Fleet (as of 2026-06-22):

  • ~270 enrolled agents total; ~178 typically online
  • Stable channel: 0.6.67 windows/amd64 (Windows build green 2026-06-22 02:19, marker 1dce66d)
  • Metrics flowing (~2531 rows/15 min); alerts firing with 0 legacy null dedup_keys; per-agent API endpoints all HTTP 200
  • Beta channel: site "Mike's Car" (103c10b9-c1de-4dd8-b382-b8362ed3143e) has update_channel='beta' (persists across re-enrollment). All GURU-5070 machines are on this site.

Enrolled clients/sites (2026-05-24 baseline; no removals since):

Client Type Sites Notable agents
AZ Computer Guru (internal) Internal DF Server Storage, Howard-VM, Main Office, Mike's Car, Mikes House Jupiter, PLUTO, gururmm, GURU-KALI, GURU-5070, Mikes-MacBook-Air.local, ACG-DC16, NEPTUNE, ix.azcomputerguru.com
BirthBiologic Corporate Main Office BB-SERVER
Cascades of Tucson Corporate CascadesTucson 27 agents — CS-SERVER, RECEPTIONIST-PC, ASSISTMAN-PC, MDIRECTOR-PC, MEMRECEPT-PC, and ~22 others
Dataforth Corp Corporate D1 AD2, DF-GAGETRAK
Goldstein (verify) (verify) Canary client for beta channel
Grabb & Durando Law Office Corporate Main Office GND-SERVER
Instrumental Music Center Corporate IMCMain IMC1
Key, Paul Residential Home KEY-MEDIA
Peaceful Spirit Residential Bridgette Home, Country Club, Mara Home BridgettePSHomeComputer, PST-SERVER (UDR Ultra, comms-durability verification), PST-SERVER2, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew
Safesite Corporate Glendale DESKTOP-3USU20B
Sombra Residential LLC Corporate main office DESKTOP-UQRN4K3, Server2013
Stamback Septic Corporate StambackSeptic DESKTOP-BTR2AM3, StambackLaptopNew
Swanson, Len Residential Home LAS-GAMER

API auth:

  • POST /api/auth/login -> JWT (~24h)
  • Creds: vault infrastructure/gururmm-server.sops.yaml -> credentials.gururmm-api.admin-email / admin-password; or via bash .claude/scripts/rmm-auth.sh
  • Key endpoints: GET /api/agents, POST /api/agents/:id/command, GET /api/commands/:id, POST /api/agents/:id/update, POST /api/updates/rollouts/:version/promote
  • Command fields: command_type (shell/powershell/python/script/claude_task/cmd alias), command (script text, JSON-encoded), optional contextsystem (default) or user_session (Windows WTS), plus timeout_seconds/elevated.

Dashboard — complete and working: Agents management (delete now returns 202 + background purge), Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis (Claude API), Alerts (clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO, Policies Dashboard (all tabs), Registry editor (read-only via HTTP), MSP360 backup status + mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance, AgentDetail Crashes tab + version history, fleet stats from /agents/stats, SiteDetail Revoke Key + Enrollment audit tab, Install Reports page, Fleet Discovery page, Event Log Watches management page (/event-log-watches, shipped 0fa65f5).

Dashboard — incomplete (see UI_GAPS.md):

  • Watchdog alerts UI — blocked on 2 missing server routes
  • BSOD in Alerts stream (Phase 2 deferred)
  • Tunnel session management (server skeleton "not yet implemented"; Phase 2 live TTY design will supersede)
  • Offboarding wizard UI (SPEC-028 complete; implementation pending)
  • Alert mute UI — Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard NOT started)
  • is_connected reflects real connectivity (null fleet-wide cosmetic bug; gap in API/dashboard layer)
  • Documentation / user guide / inline help tooltips (P3, not started)
  • MSP360 "Open in MSP360" deep-link (PR #42, dashboard portion pending merge)

Open Gitea issues:

  • #10 — BSOD detection Phase 2/3 (dashboard Alerts stream + fetch_bsod_dump + full bugcheck table)
  • #15 — Pipeline tray build (publish tray binary to downloads)
  • #16 — Windows IPC peer authz
  • #17 — logind console user resolution
  • #18 — macOS tray
  • #19 — subscriber broadcast

Security backlog (HIGH):

  • credentials/:id/reveal — horizontal privilege escalation (no ownership scope check)

Key Architecture Decisions (LOCKED)

These decisions are locked. Do not reverse without explicit user approval.

  1. Per-agent enrollment keys — MSI contains server URL + site_id only. Agent calls POST /api/enroll on first run; server issues unique per-agent key stored hashed. Enables revocation, clone detection, audit trail.
  2. Site-specific MSI generation — Universal base MSI from CI; dashboard endpoint generates site-specific MSI with site_id baked in via WiX property -> HKLM\SOFTWARE\GuruRMM\SiteId.
  3. No TOML/config for endpoints — Server URL compiled into binary. No runtime config files for server URL or site_id.
  4. Policy inheritance chain — global -> site -> client -> agent. Server computes merged effective policy and pushes via ConfigUpdate WebSocket message.
  5. Platform parity rule — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible.
  6. Watchdog as separate process — Main agent cannot reliably restart itself after a crash. The watchdog reports via REST (POST /watchdog-alert) — it has no WebSocket connection.
  7. Build pipeline is the only path to production — Enforces signing, checksum generation, consistent artifact layout.
  8. Multi-tenancy identity model (ADR-001) — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1.
  9. Holistic feature development (DESIGN.md) — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
  10. AI-optional operation — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
  11. Durability-first command delivery — The durable Postgres commands row is the source of truth; the WebSocket is only a delivery hint. An un-acked command is re-deliverable, never reaped. Idempotent end-to-end by command_id so re-delivery never double-executes. (Established 2026-06-11, agent-comms-durability spec.)

History Highlights

Date Event
2025-12-15 Project genesis: Windows service + Linux installer + site code auth + build server. DB migrated from Jupiter Docker to local PostgreSQL.
2026-04-19 Full drill-down navigation, auto-install on first run, Pluto build VM setup started.
2026-04-21 MSI build fix (missing WiX extension flag). DESIGN.md created (holistic development mandate). BirthBiologic onboarded.
2026-04-29 UI_GAPS.md created. Holistic development principle formalized.
2026-05-12 macOS agent Phase 1 deployed from Mikes-MacBook-Air. Code signing issue on Apple Silicon noted.
2026-05-15 Dead WebSocket write-half bug fixed. Temperature struct field name mismatch fixed.
2026-05-16 Watchdog bugs fixed (sc.exe fallback, suppress_until, hypervisor detection). /feature-request skill created.
2026-05-17 Syncro PSA Integration added to roadmap (P1) after Howard /feature-request. Office power failure recovery — all VMs recovered.
2026-05-18 Multi-tenancy architecture (ADR-001) decided. 5 SPEC documents created (SPEC-001 through SPEC-006).
2026-05-19 4-bug fix for AD2 crash loop. MSP360 backup integration completed (6 fixes). Clickable CPU/Memory gauge cards + process drill-down modal.
2026-05-23 /rmm-audit pass. Agent optimization Phases 1A-3. Auto-version bump mechanism. MSRV bumped to 1.85. Fleet at 0.6.29.
2026-05-24 Linux tray IPC + GTK (PR #13+#14) and peer-cred authz (PR #14) merged. PR #21 (ReadWritePaths fix) merged. Build pipeline split into per-platform scripts. Pluto known-hosts pinned. Fleet converged to 0.6.38.
2026-05-31 Roadmap reconciliation (17 corrections — roadmap understated built state). MSPBackups mapping/verify UI + dev-admin impersonation UI deployed (dashboard v0.2.32). BUG-008/013/014 status corrected to fixed. SPEC-021 (logged-in user domain detection) written after Howard feature request.
2026-06-01 BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with 0x116 VIDEO_TDR_FAILURE (nvlddmkm.sys); GuruConnect cleared on three grounds. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1/SF-2; merged to main (0ec55cf), agent versioned 0.6.51.
2026-06-02 Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh. Webhook wired to dispatch build-server.sh with change-gate + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed.
2026-06-04 Channel state confirmed via live Postgres query: GURU-5070 agents.update_channel = 'beta' (explicit per-agent override). Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux. BUG-020 (duplicate/ghost tray icons) fixed in commit 137dd85 to beta: per-session single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event (fix #3 dormant — now in PR #40).
2026-06-07 Backup-alert quality pass shipped. FU1 (summarize_backup_error + create_or_update_alert refresh) + FU2 (exclude PlanTypes 8/13 from alerting/compliance): false backup_failed 15 -> 2 fleet-wide (commits 779f7f6 + b82c010). backup_storage_low alert type removed entirely (DataCopied/TotalData is not destination capacity; 5 -> 0 false alerts; resolve_all_backup_storage_alerts).
2026-06-07 Credential inheritance deployed to production (server v0.3.45). Hierarchical propagation (Global → Client → Site), is_inheritable, /effective endpoints. Dashboard: clickable alert severity badges + client filtering. SPEC-028 offboarding wizard specification created (835 lines). FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management."
2026-06-07 Role-aware offline alerting + alert ignore/mute shipped. Offline sweep evaluator (60s interval; offline.rs): server-only agent_offline; migration 054 agents.role_override; site (>=50%+>=3) -> mass_offline_site; fleet (>=10) -> mass_offline_fleet. Warm-up restart guard (code review caught + fixed last_seen < started_at spec defect). Alert ignore/mute (perma-silence): migration 055 alert_mutes + muted status; dedup_key-keyed; reason required; gates create_or_update_alert + create_check_alert bypass; dashboard UI pending.
2026-06-07 UI gap batch + enrollment audit (GURU-BEAST-ROG session). get_agent upgraded to AgentWithDetails. Six new/updated server endpoints. Enrollment audit endpoint + per-row revoke. Dashboard: AgentDetail Crashes tab + version history, SiteDetail Enrollment tab, InstallReports + Discovery pages. Hotfix 6faa382 (role_override missing from new SQL — all GET /api/agents/:id returned 500).
2026-06-11 Physical server migration executed. Ubuntu VM at 172.16.3.30 retired; physical box took the same IP. Production down <27 min (06:53-07:20 MST). 162/212 agents reconnected within 15 min. Old VM at .46 as rollback anchor, decommissioned 2026-06-12. Server binary path changed from /usr/local/bin/gururmm-server to /opt/gururmm/gururmm-server. Ubuntu 26.04 + PostgreSQL 18. 7-day whale data backfilled (~3.4 GB, lossless, 0 pool-timeouts). Three build-env gaps fixed on new box (sccache, Pluto key, signing tools).
2026-06-11 Ghost agent root-caused. device_id storage fragile (single file, wiped by cleanup.ps1). 11 duplicate-hostname agents found (9 same-site ghosts; 2 cross-site non-merge). Durable agent identity spec created (specs/durable-agent-identity/; multi-AI reviewed). Phase 1 Task 1 shipped (agent v0.6.62): multi-location device_id (registry + ProgramData on Win; /etc + /var/lib on Linux; /Library + mirror on macOS) with self-heal; cleanup.ps1 whitelisted. Channel pin regression fixed: moved GURU-5070 beta override from agent-level to site "Mike's Car" (survives re-enrollment).
2026-06-11 Agent-comms-durability Phase 1 shipped (agent v0.6.63 / server v0.3.68). Spec created (specs/agent-comms-durability/; multi-AI reviewed, Gemini + Grok). Root cause: NAT conntrack asymmetry — server->agent pushes black-holed in idle conntrack gaps. Fix: agent CommandAck on receipt (migration 058 acked_at); agent dedup cache (re-report, never double-run); server reaper re-delivery (migration 059 delivery_attempts, cap 10; capability gate via partial index); pending commands re-offered on every heartbeat. Installer hardened (Start-Process). Build-pipeline dubious-ownership fixed. Canary (Goldstein + Peaceful Spirit): 6 agents verified, PST-SERVER ACK latency 16 ms. Fleet promoted 0.6.63 stable; 168 agents converged in ~3 min, 182 online throughout.
2026-06-12 Beast parallel Windows build (lever A) wired into production pipeline (commit cfbdb59). Two-wave parallel SSH: 5 concurrent stable + 2 concurrent legacy-1.77. Beast timing: ~336s (vs 622s serial, 1269s Pluto). Per-variant target-dir mandatory. Dropped cargo +1.77 fetch pre-resolve (edition2024 failure). v0.6.66 built clean on Beast in 336s. Old VM (.46) decommissioned (virsh domain + disk removed).
2026-06-12 command_type "cmd" root-caused as silent-drop (no such variant in CommandType enum). 7-round adversarial multi-AI quorum + packet captures. Fixed: #[serde(alias = "cmd")] on Shell + NAK on unparseable command (commit 3de9faf). Reverted spurious eviction change (80df458). v0.6.66 promoted to stable fleet-wide.
2026-06-15 BSOD dedup key changed from dump hash to bugcheck code (f0a4b7f, server v0.3.73) — mutes now silence all recurrences of same bugcheck, not just one dump. MSI EXDEV fix (95ef901, server v0.3.74) — site MSI builder staged in /tmp causing cross-device rename failure; fixed to stage in downloads_dir. Duplicate offline server alerts in Triage removed (b6ed564). sync.sh Phase 1 auto-heal (resolve_submodule_collisions) — rescued 94 orphaned RMM_THOUGHTS lines. logs/analyze switched to Claude API (c869e4d). 500-error-leak fully closed (58c1a96).
2026-06-18 sync.sh submodule auto-heal verified fleet-wide after earlier 2026-06-15 fix. RMM_THOUGHTS 2026-06-08 re-grounding pass + 8 Raw sections rescued and staged.
2026-06-21 BUG-019 (container self-update guard) fixed and merged to main (66a7f4e, v0.6.67 beta). BUG-018 (DELETE 202+bg, cea87d4) merged. Event Log Watch UI shipped (0fa65f5). Enrollment modal UX fix merged to prod (4027c86). sync.sh populate-only guard (Phase 2) fixed submodule-clobber root cause. Howard cleared for GuruRMM merges/deploys (Mike decision). gururmm-build skill + docs/BUILD.md created. Five PRs opened (#40-#44) covering SPEC-021, BUG-018 FK indexes, MSP360 deeplink, Event Log Watch policy-clobber HIGH fix, audit cleanup.
2026-06-22 BUG-021 (legacy build dep-pin getrandom+zeroize) fixed on main (1dce66d). Windows build green at v0.6.67, 2026-06-22 02:19. Fleet verified: ~270 agents / ~178 online, 0 legacy null dedup_keys. BUG-022 filed + fixed (PR #45): removed dead WatchdogEvent WS path (no producer — watchdog has no WS connection; REST watchdog-alert is the only real path).

Compilation Notes

  • 2026-05-26 recompile: Added the Capabilities / Feature Set section, synthesized from authoritative artifacts at live main (cd27a59). Corrected: command execution contexts, temperature monitoring (BUG-001 is resolved, not pending), Entra-only SSO, added user-inventory/discovery/VM-detection/safe-rollout surfaces. Changelogs are an unreliable capability source — stop at agent v0.6.22; migrations + commit log are authoritative.
  • Tunnel subsystem (verified against live main): agent side substantially built; server side is a dead-code skeleton (not declared in api/mod.rs, no routes, WS handler logs "not yet implemented"). Confirmed, not unverified.
  • macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). build-mac.sh is a stub as of 2026-05-24. [unverified whether automated pipeline includes macOS]
  • Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23). [unverified — may still be unfixed on new box]
  • Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24. [now superseded by comms-durability Phase 1 — heartbeat re-offer path should resolve this]
  • 2026-06-02 recompile: Folded in BSOD detection, server build webhook wiring, build channel default beta, versions updated. Migration count updated 46 -> 48.
  • 2026-06-04 recompile: Corrected GURU-5070 channel state. Stable fleet pinned at 0.6.47. BUG-020 documented.
  • 2026-06-07 recompile: Folded in backup-alert quality pass, credential inheritance, offline alerting + mute, UI gap batch + enrollment audit. Updated migration count to 55+ (054/055 confirmed).
  • 2026-06-11 recompile (GURU-5070/claude-main): Full recompile. Added: (1) Physical server migration (Ubuntu 26.04, PG 18, binary at /opt/gururmm, old VM .46 rollback anchor). (2) Durable agent identity (ghost root cause, spec, Phase 1 Task 1 durable device_id, channel pin fix). (3) Agent comms durability Phase 1 full detail (spec, slices A/B/C, deployment, fleet rollout, canary verification). (4) New capabilities: Audit Log (migration 056), Systemic Log Feedback Intelligence (migration 057), comms durability architecture. (5) Updated fleet size (~215 enrolled, 168-182 online). (6) Updated agent/server versions to 0.6.63/0.3.68. (7) Build pipeline: server IS auto-deployed by webhook (correction to earlier assumption); parallel build prototype on Pluto (3.51x, deferred integration). (8) Channel/promotion model documented in detail. (9) Downloads layout documented. (10) Updated server binary path + SSH details for physical box. (11) Added new anti-patterns (installer & operator, reaper false-fail, agent-level channel pin). Added ADR-11 (durability-first command delivery). Migration count updated to 59. Patterns/History preserved verbatim except new entries added.
  • 2026-06-22 recompile (Howard-Home/claude-main): Delta from 2026-06-11: (1) Fleet updated to ~270 enrolled / ~178 online, agent v0.6.67. (2) BUG-021 (legacy wave dep-pin, commit 1dce66d), BUG-018 (DELETE 202+bg, cea87d4), BUG-019 (container guard, 66a7f4e) all fixed on main. (3) Event Log Watch UI shipped (0fa65f5). (4) Enrollment modal UX fix to prod (4027c86). (5) Watchdog section corrected per BUG-022: WatchdogEvent WS path was dead code (no producer; watchdog has no WS connection); removed in PR #45. REST watchdog-alert is the only supported watchdog alert path. (6) Beast parallel two-wave Windows build documented (lever A, 336s, ~3.8x vs Pluto). (7) BUG-021 dep-pin gotcha documented (getrandom 0.3.1 + zeroize 1.8.1). (8) command_type "cmd" alias + NAK documented. (9) Event Log Watch policy-clobber HIGH fix (PR #43). (10) MSP360 deep-link (PR #42, mig 062). (11) SPEC-021 (PR #40, mig 063). (12) sync.sh submodule-clobber root-cause fix (populate-only guard). (13) BSOD dedup key changed to bugcheck code. (14) MSI EXDEV fix. (15) 500-error-leak fix. (16) logs/analyze switched to Claude API. (17) gururmm-build skill + docs/BUILD.md. (18) Howard cleared for merges/deploys. (19) Dashboard beta-before-main rule. (20) Migration count updated to 60 (origin/main), with 061-063 on pending branches. (21) Old VM confirmed deleted (2026-06-12). (22) Open PRs #40-#46 documented with merge-order. (23) New anti-patterns added (command_type cmd, cargo +1.77 fetch, target-dir concurrency, BUG-021 dep-pin, stale snapshot build status, config-push clobber, stale submodule working tree). (24) New good patterns added (worktrees for concurrency, gururmm-build skill, beta-before-main).
  • clients/cascades-tucson — RECEPTIONIST-PC enrolled (site CascadesTucson)
  • systems/gururmm-build — Physical server at 172.16.3.30 on Jupiter subnet; GuruRMM API 3001, ClaudeTools API 8001, Coord API, MariaDB, PostgreSQL, build pipeline; migrated from VM to physical box 2026-06-11
  • systems/jupiter — Unraid host at 172.16.3.20; virsh host for all VMs (Pluto/Claude-Builder, Unifi, OwnCloud); Docker: Gitea port 3000, NPM, Seafile; NPM is the public TLS terminator for rmm-api.azcomputerguru.com (forwards to .30:80)
  • systems/pluto — Windows build server (MSI, WiX) at 172.16.3.36