wiki: compile gururmm (full) — agent-comms-durability Phase 1, channel/promotion model, webhook auto-deploys server, fleet/version refresh (0.6.63/0.3.68)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2,27 +2,36 @@
|
||||
type: project
|
||||
name: gururmm
|
||||
display_name: GuruRMM
|
||||
last_compiled: 2026-06-07
|
||||
compiled_by: GURU-BEAST-ROG/discord-bot
|
||||
last_compiled: 2026-06-11
|
||||
compiled_by: GURU-5070/claude-main
|
||||
aliases:
|
||||
- guru-rmm
|
||||
sources:
|
||||
- "gururmm@main: server/src/api/*.rs (REST API surface, ~30 route modules)"
|
||||
- "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs)"
|
||||
- "gururmm@main: server/migrations/*.sql (55+ migrations — feature checkpoints, incl. 048_bsod_events, 054_agent_role_override, 055_alert_mutes)"
|
||||
- "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/"
|
||||
- "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)"
|
||||
- "gururmm@main: server/src/api/*.rs (REST API surface, ~37 route modules)"
|
||||
- "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs, device_id.rs)"
|
||||
- "gururmm@main: server/migrations/*.sql (59 migrations — feature checkpoints through 059_command_delivery_attempts)"
|
||||
- "gururmm@main: server/migrations/048_bsod_events.sql"
|
||||
- "gururmm@main: server/migrations/056_audit_log.sql"
|
||||
- "gururmm@main: server/migrations/057_log_signatures.sql"
|
||||
- "gururmm@main: server/migrations/058_command_acked_at.sql"
|
||||
- "gururmm@main: server/migrations/059_command_delivery_attempts.sql"
|
||||
- "gururmm@main: agent/src/bsod.rs"
|
||||
- "gururmm@main: server/src/api/updates.rs (promote/rollback endpoints)"
|
||||
- "gururmm@main: deploy/build-pipeline/webhook-handler.py"
|
||||
- "gururmm@main: deploy/build-pipeline/build-server.sh"
|
||||
- "gururmm@main: commit 137dd85 (BUG-020 tray fix: single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event)"
|
||||
- "gururmm@main: commit 137dd85 (BUG-020 tray fix)"
|
||||
- "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/"
|
||||
- "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)"
|
||||
- projects/msp-tools/guru-rmm/CONTEXT.md
|
||||
- projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md
|
||||
- projects/msp-tools/guru-rmm/docs/UI_GAPS.md
|
||||
- projects/msp-tools/guru-rmm/docs/ARCHITECTURE_DECISIONS.md
|
||||
- projects/msp-tools/guru-rmm/docs/tech-stack.md
|
||||
- projects/msp-tools/guru-rmm/docs/DESIGN.md
|
||||
- projects/msp-tools/guru-rmm/specs/agent-comms-durability/shape.md
|
||||
- projects/msp-tools/guru-rmm/specs/agent-comms-durability/plan.md
|
||||
- projects/msp-tools/guru-rmm/specs/agent-comms-durability/references.md
|
||||
- projects/msp-tools/guru-rmm/specs/agent-comms-durability/standards.md
|
||||
- .claude/memory/reference_gururmm_server.md
|
||||
- .claude/memory/reference_gururmm_api.md
|
||||
- .claude/memory/gururmm-development-principles.md
|
||||
@@ -54,6 +63,11 @@ sources:
|
||||
- session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md
|
||||
- session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md
|
||||
- session-logs/2026-06-07-mike-gururmm-ui-gaps-enrollment.md
|
||||
- projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-host-migration-cutover.md
|
||||
- projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-backfill-build-pipeline.md
|
||||
- projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-promotion-ghost-identity.md
|
||||
- projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability.md
|
||||
- projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability-phase1.md
|
||||
backlinks:
|
||||
- clients/cascades-tucson
|
||||
- systems/gururmm-build
|
||||
@@ -65,99 +79,136 @@ backlinks:
|
||||
|
||||
## Summary
|
||||
|
||||
GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 55 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints.
|
||||
GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 215 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints.
|
||||
|
||||
**Current version:** agent 0.6.54 (beta) / 0.6.47 (stable) / server 0.3.45 as of 2026-06-07. Fleet on stable target 0.6.47 (pinned 2026-05-28); GURU-5070 is the lone beta agent (explicit per-agent override), running 0.6.54 and auto-riding each new beta build. Note: committed changelogs are stale (stop at agent v0.6.22 / server v0.3.1) — migrations + commit log are the authoritative feature record, not changelogs.
|
||||
**Current version:** agent 0.6.63 (stable) / server 0.3.68 as of 2026-06-11. Fleet converged to 0.6.63 (~168-182 typically online); ~20 stragglers on older versions auto-update on reconnect. Changelogs are stale (stop at agent v0.6.22) — migrations + commit log are the authoritative feature record.
|
||||
|
||||
**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010` on main). `backup_storage_low` alert type removed entirely — the `DataCopied/TotalData` ratio measures backup-dataset completeness, not destination capacity, and produced 5 fleet-wide false alerts. See Backup Integration section for full detail.
|
||||
**Agent-comms-durability Phase 1 shipped 2026-06-11:** Commands black-holed at NAT'd sites (e.g., UDR Ultra) no longer falsely fail. Agent sends a `CommandAck` on receipt; server reaper re-delivers un-acked commands (never fails them); pending commands re-offered on every heartbeat (rides the warm conntrack). Verified: PST-SERVER (Peaceful Spirit, behind UDR Ultra) returned a command with `acked_at` stamped and `ack_latency 16 ms`. Fleet of 168 agents converged to 0.6.63 in ~3 min. Migrations 058 (`acked_at`) + 059 (`delivery_attempts`) applied on server restart. See Architecture section for full detail.
|
||||
|
||||
**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07 (second session):** Scheduled offline-sweep evaluator (60s tokio interval, server-only). `agent_offline` alerts for servers only; classifier: `os_product_type` 2/3, else `os_name`/`os_version` ~/server/i, else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to a representative offline agent. Warm-up restart guard. Dashboard: servers elevated in triage individually; workstations collapsed. Alert mute (perma-silence, migration 055 `alert_mutes` + `muted` status): keyed on `dedup_key`, permanent until un-ignored, reason required; gates both `create_or_update_alert` and `create_check_alert` bypass. Dashboard Ignore/Muted/Un-ignore UI NOT yet built. Commits f1cdf5d/30e4f23/21d63bd/3eedf91 (offline), 29c405e/a120e71 (mute). Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier.
|
||||
**Physical server migration completed 2026-06-11:** The kitchen-sink Ubuntu VM at 172.16.3.30 was retired. A physical box took the same IP (`172.16.3.30`) running Ubuntu 26.04 + PostgreSQL 18. Production back online within 27 min (maintenance window). Old VM parked at 172.16.3.46 with data pristine as rollback anchor. Server binary moved from `/usr/local/bin/gururmm-server` (old VM) to `/opt/gururmm/gururmm-server` (new box). See also: [[gururmm-build]].
|
||||
|
||||
**See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation: on-disk directory is `guru-rmm` hyphenated; wiki and Gitea repo use `gururmm` no-hyphen).
|
||||
**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide. `backup_storage_low` alert type removed entirely. See Backup Integration section.
|
||||
|
||||
**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active `azcomputerguru/gururmm` repo; the pinned pointer normally lags `main` (expected). Development happens in the submodule working tree and changes are committed and pushed to Gitea from there.
|
||||
**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07:** Server-only `agent_offline` alerts; site/fleet mass-offline aggregation; `alert_mutes` perma-silence (dashboard UI still pending).
|
||||
|
||||
**Goal:** Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system — not bolted-together products.
|
||||
**See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation).
|
||||
|
||||
**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active repo. Development happens in the submodule working tree; commits/pushes go to Gitea from there (GURU-5070 cannot push to gururmm Gitea — use the server at 172.16.3.30 or 172.16.3.47).
|
||||
|
||||
**Goal:** Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system.
|
||||
|
||||
---
|
||||
|
||||
## Recent Work
|
||||
|
||||
### 2026-06-11 — Physical Server Migration (Cutover + Backfill + Build Pipeline)
|
||||
|
||||
**Cutover (morning ~06:53-07:20 MST):**
|
||||
- Stopped all VM writers, fresh-restored gururmm/guruconnect (PG 18) and claudetools (MariaDB 11.8) on the new physical box via direct SSH to the VM.
|
||||
- VM released 172.16.3.30; new box took 172.16.3.30 + fresh management IP 172.16.3.47 via self-confirming detached `netplan apply` (auto-reverts if gateway unreachable — prevents a bricked network config).
|
||||
- All nine stack services active; 162/212 agents reconnected within 15 min. Fleet broadcast sent; 7 migration locks released.
|
||||
- Two production issues fixed in-window: (1) nginx had stale config missing `/ws` location — `systemctl reload nginx` fixed it and agents reconnected; (2) Prometheus failed due to WAL copied mid-write — moved `wal`/`chunks_head` aside, persisted blocks healthy (~2h of samples lost during downtime anyway).
|
||||
|
||||
**Backfill + Build Pipeline (afternoon session):**
|
||||
- 7-day whale data (metrics 1,189,924 rows + agent_logs 2,262,938 rows; ~3.4 GB total) backfilled VM->new box via direct SSH `\copy` stream in ~2.5 min. Load verified lossless; SSD sustained 186-214 MB/s writes with ZERO pool-timeouts while 212 live agents wrote concurrently.
|
||||
- Three build-env gaps found and fixed after first push-triggered build: (1) sccache not installed (copied v0.8.2 from VM); (2) root had no Pluto SSH key (copied guru's id_ed25519 to /root/.ssh); (3) `/etc/gururmm-signing.env` + jsign + Java missing (installed Java + jsign-7.1.jar; recreated jsign wrapper).
|
||||
- Parallel Windows build prototyped (7 Start-Job isolated builds, 3.51x speedup 28.9m -> 8.2m, 15 rustc vs 1). Integration into production `build-windows.sh` DEFERRED (design captured in runbook).
|
||||
- First clean build on new box: signed Windows agent v0.6.61 beta. Independently verified (Authenticode, Arizona Computer Guru LLC via Azure Trusted Signing, RFC3161 timestamp, sha256 OK).
|
||||
|
||||
---
|
||||
|
||||
### 2026-06-11 — Durable Agent Identity (Ghost Agent Spec + Phase 1 Task 1)
|
||||
|
||||
**Ghost agent root cause:**
|
||||
- `device_id` is a random UUIDv4 stored in ONE file (`%ProgramData%\GuruRMM\.device-id` / `/var/lib/gururmm/.device-id`). The agent's own `cleanup.ps1` (Steps 6+7) deleted both ProgramData and HKLM\SOFTWARE\GuruRMM, wiping it. A past hardware->random scheme change also orphaned agents (11 ghosts with old `win-`/`hw-` device_ids). Server dedups ONLY on device_id at WS-connect; no reconciliation. 11 hostnames have duplicates (9 same-site true ghosts; 2 — MSI, SERVER — are different machines at different sites and must NOT be merged). 32 tables FK agent_id.
|
||||
- Context: channel pin regression discovered (GURU-5070 still on 0.6.59 after v0.6.61 promotion) because the beta override lived on the orphaned ghost enrollment. Fix: set `update_channel='beta'` at site "Mike's Car" (survives re-enrollment); `resolve_agent_channel` = `agent.or(site).or(client).or('stable')`.
|
||||
|
||||
**Spec + Phase 1 Task 1:**
|
||||
- Multi-AI design review (Gemini + Grok, strong convergence): durable multi-location device_id (primary) + hardware fingerprint (reconciliation guard, not key). Identity model: `agent/src/device_id.rs` rewritten — HKLM registry + ProgramData on Windows; /etc + /var/lib on Linux; /Library + mirror on macOS — with self-heal, plus `cleanup.ps1` whitelisting the identity files. Committed 0b81d33 (v0.6.62). Linux build green.
|
||||
- Spec: `specs/durable-agent-identity/` (shape/plan/references/standards). Phase 1 (Task 1 shipped) stops ~99% of new ghosts. Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for 9 existing ghosts) per spec after soak.
|
||||
|
||||
---
|
||||
|
||||
### 2026-06-11 — Agent Comms Durability Phase 1 (Spec + Slices A/B/C + Deploy + Fleet)
|
||||
|
||||
**Root cause (confirmed in code, ~95% AI-validated):** NAT conntrack asymmetry. Commands are a server->agent PUSH over the persistent WS; agent->server heartbeats refresh the UDR conntrack and replies ride back. A server->agent command push in an idle gap hits a torn-down conntrack and is silently dropped (server `sender.send()` buffers without error for ~15 min). The server's 30s WS ping (server->agent) CANNOT revive a dead conntrack; only agent->server can. Server had no half-open detection (read loop ignores Pong). Agent already heartbeats every 30s — so the conntrack floor is sub-30s — meaning durable delivery nets are the correct fix, not keepalive alone.
|
||||
|
||||
**Spec:** `specs/agent-comms-durability/` (shape/plan/references/standards). Multi-AI reviewed (Gemini 3.1-pro + Grok 4.3, 2 rounds, strong convergence). Phased: Phase 1 = durable delivery (shipped); Phase 2 = live TTY (planned); Phase 3 = AIMD keepalive + bulk->HTTPS + half-open eviction (planned).
|
||||
|
||||
**Slice A (dormant server foundation — committed b93f2ef):** Migration 058 adds `commands.acked_at TIMESTAMPTZ`. Server-side `CommandAck` handler arm and `mark_command_acked()` in db. Additive + dormant — no behavior change until agents send ACKs.
|
||||
|
||||
**Slice B (agent — committed 45870b1, deployed 2026-06-11 as v0.6.63):** `AgentMessage::CommandAck { command_id }` added. Agent sends CommandAck on RECEIPT of every dispatched command (before execution). `CommandExecutor` gained a FIFO-bounded recent-results cache (64 entries, 256 KB cap each): re-reports cached result for an already-completed command; ignores an in-flight duplicate; never double-runs. Result is cached BEFORE deregistering the running task so a re-delivery in the complete()/send() gap always finds it cached (re-report) rather than not-running (re-run).
|
||||
|
||||
**Slice C (server — committed ca1657b, deployed 2026-06-11 as v0.3.68):** Migration 059 adds `delivery_attempts INTEGER NOT NULL DEFAULT 0` + partial index `idx_commands_agent_acked ON commands(agent_id) WHERE acked_at IS NOT NULL` (capability gate). `db::requeue_undelivered_commands` returns an un-acked command past a 60s ACK deadline to `pending` (re-delivery, never failed). `db::fail_timed_out_commands` rewritten into three safe cases — can no longer falsely-fail a deliverable command. `mark_command_running` increments `delivery_attempts` (cap 10). Reconnect block refactored into shared `redispatch_pending_commands` helper (send_to-based); called on EVERY heartbeat — so a black-holed push is re-delivered on the next agent beat (which just refreshed the NAT conntrack) without needing a reconnect.
|
||||
|
||||
**Capability gate (version-independent):** Reaper only re-delivers for agents that have demonstrably ACK'd at least once (`EXISTS(... acked_at IS NOT NULL)`). Old non-ACK agents keep the legacy fail-on-timeout path and can never double-run. Safe to deploy at any point during gradual agent rollout.
|
||||
|
||||
**Deploy + fleet rollout:**
|
||||
- Installer bug fixed (5c0d004): `Start-Process -Wait -PassThru` + exit code check (old `& $stagingPath install 2>&1 | Out-Null` swallowed output + surfaced misleading NativeCommandError). Agent CLI ANSI log garbage fixed (parse before logging init, file-only for one-shot CLI).
|
||||
- Server auto-deployed by webhook as v0.3.68 (commit 7376340). Migrations 058+059 applied. All durability code live server-side.
|
||||
- Build pipeline dubious-ownership fixed: `git config --system --add safe.directory /home/guru/gururmm` (/etc/gitconfig; webhook runs builds as root on guru-owned repo).
|
||||
- Canary: clients Goldstein + Peaceful Spirit set to `update_channel='beta'`; all 6 canary agents updated 0.6.61->0.6.62->0.6.63 within ~24 min. PST-SERVER (UDR Ultra, the original affected agent) verified: `hostname; whoami` returned exit 0, `acked_at` stamped, `ack_latency 16 ms`, `delivery_attempts 1`. Full ACK durability chain proven live.
|
||||
- Fleet promotion: `POST /api/updates/rollouts/0.6.63/promote {"os":"windows","arch":"amd64"}` -> 6 files re-tagged stable. 168 agents on 0.6.63 in ~3 min; online count held at 182 throughout (no mass dropout).
|
||||
|
||||
---
|
||||
|
||||
### 2026-06-07 — UI Gap Batch + Enrollment Audit (GURU-BEAST-ROG session)
|
||||
|
||||
**API changes:**
|
||||
- `get_agent` handler now returns `AgentWithDetails` — includes `client_id` and `client_name` via JOIN query (was previously missing client context)
|
||||
- `get_agent` handler now returns `AgentWithDetails` — includes `client_id` and `client_name` via JOIN query
|
||||
- New endpoints: `GET /api/agents/:id/bsod-events`, `GET /api/agents/:id/version-history`
|
||||
- New endpoints: `GET /api/install-reports`, `GET /api/install-reports/:id`
|
||||
- New endpoint: `GET /api/discovery/all-devices`
|
||||
- `DELETE /api/agents/:id/key` redesigned as dual-revoke: nulls `agents.api_key_hash` (legacy Mode 1/2) AND sets `enrolled_agents.revoked = TRUE` (modern agk_ Mode 3). Previous implementation only touched the legacy field, leaving modern agents able to reconnect.
|
||||
- New endpoint: `GET /api/sites/:id/enrolled-agents` — full enrollment audit with duplicate-hostname annotation (scoped to active/non-revoked rows)
|
||||
- New endpoint: `POST /api/enrolled-agents/:id/revoke` — per-enrollment targeted revocation with cascade to `agents.api_key_hash`
|
||||
- `DELETE /api/agents/:id/key` redesigned as dual-revoke: nulls `agents.api_key_hash` (legacy Mode 1/2) AND sets `enrolled_agents.revoked = TRUE` (modern agk_ Mode 3)
|
||||
- New endpoint: `GET /api/sites/:id/enrolled-agents` — full enrollment audit with duplicate-hostname annotation
|
||||
- New endpoint: `POST /api/enrolled-agents/:id/revoke` — per-enrollment targeted revocation with cascade
|
||||
|
||||
**Dashboard additions:**
|
||||
- AgentDetail: Crashes tab (BSOD events), version history table in Updates tab
|
||||
- Dashboard.tsx: fleet stats wired to `/agents/stats` endpoint (was computed client-side)
|
||||
- SiteDetail: Revoke Key button + Enrollment tab with full audit table, status badges, duplicate-hostname warnings, per-row revoke with confirm dialog
|
||||
- SiteDetail: Revoke Key button + Enrollment tab with full audit table, status badges, duplicate-hostname warnings, per-row revoke
|
||||
- New pages: InstallReports (`/install-reports`), Discovery (`/discovery`)
|
||||
- FunctionRail: nav links for Install Reports + Fleet Discovery
|
||||
|
||||
**Hotfix (production 500s):**
|
||||
- `get_agent_with_details_by_id` SQL was missing `a.role_override` — migration 054 added `role_override: String` (non-optional) to `AgentWithDetails` and the pre-054 SQL template was used for the new function. `sqlx::FromRow` panics at runtime when a required field has no column. All `GET /api/agents/:id` calls returned 500 until hotfix commit `6faa382`.
|
||||
**Hotfix (production 500s):** `get_agent_with_details_by_id` SQL was missing `a.role_override` (migration 054 added it as non-optional). All `GET /api/agents/:id` calls returned 500 until hotfix commit `6faa382`.
|
||||
|
||||
**Key commits:** `5854c63` (UI gap batch), `6faa382` (hotfix: role_override in SQL), `49a7109` (enrollment audit batch), `00129af` (fix: wire revoke_agent_key_handler to route)
|
||||
|
||||
**Tunnel status:** Server `tunnel.rs` is a dead-code skeleton logging "not yet implemented" — deferred; needs xterm.js design + WS protocol spec before implementation.
|
||||
**Key commits:** `5854c63` (UI gap batch), `6faa382` (hotfix), `49a7109` (enrollment audit batch), `00129af` (fix route wiring)
|
||||
|
||||
---
|
||||
|
||||
### 2026-06-07 — Credential Inheritance Deployment & Offboarding Spec
|
||||
|
||||
**Deployed to production:**
|
||||
- Server v0.3.45 with credential inheritance feature enabled
|
||||
- Credential inheritance allows hierarchical credential propagation (Global → Client → Site) with opt-in `is_inheritable` flag
|
||||
- De-duplication logic by (credential_type, label) with most-specific-scope-wins resolution
|
||||
- `/effective` endpoints validated for clients and sites showing proper inheritance and conflict resolution
|
||||
|
||||
**Dashboard UI enhancements:**
|
||||
- Clickable alert severity badges in ClientExceptionsBand component
|
||||
- Badge clicks filter /alerts page by severity + client_id for scoped alert viewing
|
||||
- Offline badge filters /agents page to show client-specific offline agents only
|
||||
- Deep-linking support via URL parameters for client filtering on Alerts and Agents pages
|
||||
|
||||
**Specification work:**
|
||||
- SPEC-028 offboarding wizard specification created (835 lines)
|
||||
- Covers site and client offboarding workflows with 6-step and 5-step modals respectively
|
||||
- Includes data export, dependency analysis, typed confirmation, audit logging, and cascading deletions
|
||||
- FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section covering offboarding/onboarding features
|
||||
|
||||
**Key features of offboarding spec:**
|
||||
- Multi-step modal workflow with clear progression
|
||||
- Pre-flight dependency checks (alerts, pending commands, active connections)
|
||||
- Comprehensive data export (credentials, policies, network devices, audit trail) with temp tokens and 1-hour expiry
|
||||
- Typed name confirmation for destructive final step
|
||||
- Immutable audit_logs table for compliance and traceability
|
||||
- Enforced cascade deletion validation for clients
|
||||
- Server v0.3.45 deployed with credential inheritance (Global → Client → Site, `is_inheritable` flag, de-duplication by (credential_type, label), `/effective` endpoints).
|
||||
- Dashboard: clickable alert severity badges with client filtering; offline badge scopes to client-specific agents.
|
||||
- SPEC-028 offboarding wizard specification created (835 lines) — 6-step site + 5-step client offboarding modals, data export (temp tokens, 1h expiry), typed confirmation, audit logging, cascade deletion. Implementation pending.
|
||||
- FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section.
|
||||
|
||||
---
|
||||
|
||||
## Capabilities / Feature Set
|
||||
|
||||
*Synthesized from authoritative artifacts (API routes, agent modules, 48 migrations, roadmap, commit log) at live `main` — not from session logs. See Compilation Notes.*
|
||||
*Synthesized from authoritative artifacts (API routes, agent modules, 59 migrations, roadmap, commit log) — not from session logs. See Compilation Notes.*
|
||||
|
||||
Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to `interrupted`. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible).
|
||||
|
||||
### Monitoring & Telemetry
|
||||
- Core metrics per interval (policy-tunable per section): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win `GetLastInputInfo`, Linux `xprintidle`), public/WAN IP (cached, multi-service fallback). Cross-platform via `sysinfo`.
|
||||
- Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Windows via bundled **LibreHardwareMonitor** + WMI (`ohw.rs`); Linux via `/sys/class/thermal`; `sysinfo::Components` fallback.
|
||||
- Core metrics per interval (policy-tunable): CPU %, memory %/bytes, disk %/bytes, network rx/tx deltas, uptime, logged-in user, user idle time (Win `GetLastInputInfo`, Linux `xprintidle`), public/WAN IP (cached, multi-service fallback). Cross-platform via `sysinfo`.
|
||||
- Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Linux via `/sys/class/thermal`; `sysinfo::Components` fallback. **Windows LibreHardwareMonitor removed 2026-05-27 (CVE-2020-14979); Windows thermal collection pending a safe replacement strategy.**
|
||||
- Process drill-down: top 10 by CPU + top 10 by memory in every metrics payload (cross-platform).
|
||||
- Network state: per-interface IPv4/IPv6, MAC, derived CIDR subnets — sent on change only.
|
||||
- **Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01):** Windows-only `agent/src/bsod.rs` polls `C:\Windows\Minidump` for new `.dmp` files (5-min filetime poll), parses the kernel dump header at fixed `DUMP_HEADER64`/`DUMP_HEADER32` offsets (bugcheck code @0x38, 4 parameters @0x40/48/50/58, FILETIME @0xFA8 — the `minidump` crate parses only Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references the System event log (WER 1001 / Kernel-Power 41) for Report Id and faulting driver, deduplicates via a `C:\ProgramData\GuruRMM\bsod-seen.json` watermark (first run baselines existing dumps as seen, alerts on none), and sends `AgentMessage::BsodEvent`. Server: migration `048_bsod_events.sql` + `server/src/db/bsod_events.rs` + `ws/mod.rs` handler inserts the row and raises an **always-Critical** alert, deduplicated by unique `(agent_id, dump_sha256)` + alert `dedup_key`. Verified end-to-end against a real `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys) on GURU-5070. Phase 2/3 deferred: dashboard "Crashes" tab + BSOD in Alerts stream, `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
||||
- **Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01):** `agent/src/bsod.rs` polls `C:\Windows\Minidump` for new `.dmp` files (5-min filetime poll), parses kernel dump header at fixed `DUMP_HEADER64`/`DUMP_HEADER32` offsets (bugcheck code @0x38, 4 parameters, FILETIME @0xFA8 — the `minidump` crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references System event log (WER 1001 / Kernel-Power 41) for Report Id + faulting driver, deduplicates via `C:\ProgramData\GuruRMM\bsod-seen.json` watermark. Sends `AgentMessage::BsodEvent`. Server: migration 048 + `server/src/db/bsod_events.rs` + ws handler — always-Critical alert deduplicated by `(agent_id, dump_sha256)`. Verified against a real `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys). Dashboard Crashes tab shipped 2026-06-07. Phase 2/3 deferred: BSOD in Alerts stream, `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
||||
|
||||
### Remote Execution
|
||||
- Command types: `shell`, `powershell`, `python`, raw `script{interpreter}`, and `claude_task`. Options: `timeout_seconds`, `elevated`.
|
||||
- **Execution context** (`041_add_command_context`): `system` (default — Session 0 / service SYSTEM) **or `user_session`** (runs in the active logged-on user's desktop session via WTS token impersonation: `WTSQueryUserToken` + `CreateProcessAsUserW` + per-user env block). Windows-only; requires an active session.
|
||||
- **Execution context** (`041_add_command_context`): `system` (default — Session 0 / service SYSTEM) or `user_session` (runs in the active logged-on user's desktop session via WTS token impersonation: `WTSQueryUserToken` + `CreateProcessAsUserW` + per-user env block). Windows-only; requires an active session.
|
||||
- Commands individually cancellable (`POST /commands/:id/cancel`) and aborted on disconnect. Auditable history; status running/completed/failed/timeout/interrupted.
|
||||
- Script library (`017_scripts`): stored scripts dispatched to agents with args/env/timeout/`run_as_user`; per-run history.
|
||||
- **Agent Comms Durability (Phase 1, shipped 2026-06-11 — agent v0.6.63 / server v0.3.68):**
|
||||
- Problem: server->agent command pushes black-holed at NAT'd sites (UDR Ultra) when conntrack gaps tear down the inbound direction. Agent heartbeats work (agent->server refreshes conntrack); reaper falsely failed black-holed commands after timeout.
|
||||
- Fix: agent sends `CommandAck { command_id }` on RECEIPT (migration 058 `acked_at`). Agent deduplicates by command_id — re-reports cached result, ignores in-flight duplicate, never double-runs. Server reaper re-delivers an un-acked command past 60s ACK deadline (returns to `pending`) instead of failing it. Migration 059 adds `delivery_attempts` counter (cap 10) + partial index for capability gate. Pending commands re-offered on (re)connect AND on every heartbeat. Verified at PST-SERVER: `acked_at` stamped, `ack_latency 16 ms`, `delivery_attempts 1`.
|
||||
- Capability gate: reaper only re-delivers for agents that have ACK'd at least once (`idx_commands_agent_acked` partial index). Old agents keep legacy fail-on-timeout path; safe mid-rollout.
|
||||
- Code anchors: `server/src/db/commands.rs` (`requeue_undelivered_commands`, `fail_timed_out_commands` rewrite, `DELIVERY_ACK_DEADLINE_SECS=60`, `MAX_DELIVERY_ATTEMPTS=10`), `server/src/ws/mod.rs` (`redispatch_pending_commands`, `CommandAck` handler), `agent/src/transport/websocket.rs` + `agent/src/commands/mod.rs` (ACK send + recent-results cache, `RECENT_CAP=64`, `MAX_CACHED_RESULT_BYTES=256 KB`).
|
||||
- Phase 2 (live TTY — rides warm WS, seq/resume, cadence switch) and Phase 3 (AIMD keepalive, bulk->HTTPS, half-open eviction sweeper) are PLANNED, not shipped.
|
||||
|
||||
### Inventory & Discovery
|
||||
- Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh.
|
||||
@@ -167,73 +218,75 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r
|
||||
|
||||
### Patch / Agent Update Management
|
||||
- Self-updater: server sends version + URL + SHA256; agent downloads, verifies checksum, atomically swaps binary, restarts, and **auto-rolls back** to a kept backup if it fails to reconnect (~180s window).
|
||||
- Auto-update gated on effective policy `auto_update` (channel + defer_hours; maintenance-window field received but not yet enforced [verify]). Force via `POST /agents/:id/update`. Update channels at agent/site/client (`026`).
|
||||
- Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Scaffolding only — promotion is manual; health-gated automation is written-but-unwired (Phase 2).**
|
||||
- Auto-update gated on effective policy `auto_update` (channel + defer_hours). Force via `POST /agents/:id/update`. Update channels at agent/site/client (`026`).
|
||||
- **Channel model:** new builds tag **beta** by default; agents default to **stable** channel; promotion to stable is a DELIBERATE re-tag of the `.channel` sidecar. `scanner.rs` `get_latest_version`: stable agents get latest STABLE-tagged binary; beta gets absolute-latest. `resolve_agent_channel` = `agent.or(site).or(client).or('stable')`.
|
||||
- **Promotion API:** `POST /api/updates/rollouts/:version/promote` body `{"os","arch"}` — re-tags all `.channel` files for the version, records `update_rollouts` row, force-rescans. Rollback: `POST /api/updates/rollouts/:version/rollback`.
|
||||
- Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Health-gated automation is written-but-unwired (Phase 2); promotion is manual via API.**
|
||||
|
||||
### Policy & Configuration Management
|
||||
- Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via `ConfigUpdate`. Effective policy queryable per scope.
|
||||
- Checks engine (`018`/`019`): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win `sc.exe`, Linux `systemctl`). Policy-attached check templates (`024`) with push-to-agent sync. On-demand `run-checks`.
|
||||
- Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but aren't routed yet [verify].**
|
||||
- Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but are not routed yet [verify].**
|
||||
|
||||
### Alerting & Watchdog
|
||||
- Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (`022`) with effective resolution; per-client email settings (`020`). Maintenance mode (`021`) to suppress alerting per scope.
|
||||
- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. Full alert CRUD + ack/resolve.
|
||||
- **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Generates server-only `agent_offline` alerts based on `agent_role` classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054, `agents.role_override`). Site rule (>=50% of a site's servers offline AND >=3 absolute) -> `mass_offline_site` aggregate; fleet rule (>=10 servers offline) -> `mass_offline_fleet` aggregate; aggregates pinned to a representative offline agent with site/fleet `dedup_key` (avoids making `alerts.agent_id` nullable). Restart guard = warm-up window after boot only (NOT `last_seen < started_at`, which is permanently false in steady state -- code review caught and fixed this spec defect). Dashboard: offline servers individual + elevated in triage; offline workstations collapsed into a "N workstations offline" roll-up; role-override control on agent detail page. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier; inventory-less offline servers (e.g., SIF-SERVER, Server2013) auto-classify as workstations until manually overridden.
|
||||
- **Alert ignore/mute -- perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Mute keyed on `dedup_key` (universal recurring-condition id, always set). Permanent until un-ignored; reason required (400 if missing). Gate inserted at the top of `create_or_update_alert` AND the `create_check_alert` bypass so muted conditions write `status='muted'` and never email -- the active alert path is byte-for-byte unchanged. Transactional `mute_condition`/`unmute_condition`. `POST /api/alerts/:id/mute` + `/unmute`. Distinct from ack/resolve (which quiet only the current cycle). Dashboard Ignore button, Muted filter, and Un-ignore NOT yet built. New alert types: `agent_offline`, `mass_offline_site`, `mass_offline_fleet`. New status: `muted`. Key functions: `is_dedup_muted`, `mute_condition`, `unmute_condition` (db/alerts.rs); `offline_sweep`, `agent_role` (alerts/offline.rs).
|
||||
- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS.
|
||||
- **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Server-only `agent_offline` alerts. Classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`. Dashboard: servers elevated individually; workstations collapsed. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier.
|
||||
- **Alert ignore/mute — perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Keyed on `dedup_key`. Permanent until un-ignored; reason required (400 if missing). Gate at `create_or_update_alert` AND `create_check_alert` bypass. `POST /api/alerts/:id/mute` + `/unmute`. Dashboard Ignore/Muted/Un-ignore NOT yet built.
|
||||
|
||||
### Credentials Management
|
||||
- Encrypted credentials vault (`016`): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate `/reveal` decrypt endpoint (known HIGH item: `/reveal` ownership-scope check — [verify current state]).
|
||||
- **Credential inheritance (deployed 2026-06-07):** Opt-in hierarchical cascade with `is_inheritable` flag allowing credentials to propagate from Global → Client → Site. De-duplication by (credential_type, label) with most-specific scope winning. `/effective` endpoints merge and return inherited + direct credentials with `inherited_from` indicator.
|
||||
- Encrypted credentials vault (`016`): scoped global/client/site, typed (password, SSH key, SNMP), metadata-only by default with separate `/reveal` decrypt endpoint. Known HIGH item: `/reveal` ownership-scope check — [verify current state].
|
||||
- **Credential inheritance (deployed 2026-06-07):** Opt-in hierarchical cascade with `is_inheritable` flag (Global → Client → Site). De-duplication by (credential_type, label), most-specific scope wins. `/effective` endpoints return inherited + direct credentials with `inherited_from` indicator.
|
||||
|
||||
### Audit Log (Migration 056)
|
||||
- Append-only `audit_log` table (migration 056): `(id, occurred_at, user_id, action, target_type, target_id, detail JSONB)`. Actor preserved with `SET NULL` on user delete. Seeded by credential reveals (`credential.reveal`); generic shape for future sensitive operations. Indexes on `occurred_at DESC`, `user_id`, `(target_type, target_id)`.
|
||||
|
||||
### Systemic Log Feedback Intelligence (Migration 057)
|
||||
- Per-line provenance + deterministic fingerprint added to `agent_logs` table (nullable; new rows populate, old rows stay NULL): `agent_version`, `signature_hash BIGINT`, `normalizer_version SMALLINT`. Fingerprint computed server-side at ingest (`server/src/fingerprint.rs`).
|
||||
- Durable aggregate `log_signatures` table: one row per distinct normalized error signature (occurrence_count, affected_agents, sample_message, first/last_seen, normalizer_version). Long-retention (raw logs short-retention); alerting + dashboards can read this.
|
||||
- `log_signature_versions` table: per-signature x agent_version rollup — answers "which versions does signature X appear on?" (version regression correlation).
|
||||
|
||||
### Backup Integration (MSP360 / MSPBackups)
|
||||
- Multi-provider config (`034`/`035`) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (`044`) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31.
|
||||
- **Alert quality pass (2026-06-07, commit `779f7f6`):** Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from backup alerting and compliance evaluation (FU2 guard). MSP360 message JSON decoded into readable alert text via `summarize_backup_error` (FU1). `create_or_update_alert` now refreshes `title`/`message`/`severity` on re-trigger, also fixing a latent severity-escalation freeze where re-triggered alerts kept stale severity. Fleet result: false `backup_failed` alerts 15 -> 2; survivors (AD1: retention warning + file skips, LAB-Becky: no storage account configured) are genuine and self-describing.
|
||||
- **`backup_storage_low` alert type REMOVED (2026-06-07, commit `b82c010`):** `check_storage_threshold` computed `usage_percent = DataCopied / TotalData * 100`, but those MSP360 fields describe how much of the source dataset was uploaded, not the cloud destination's remaining capacity. Image/full-backup plans are naturally near 100% by design; the check produced 5 fleet-wide false alerts (3 Critical), the clearest being DF-HYPERV-B "100% Full" on a 4 GB Hyper-V plan. MSP360 Managed Backup does not expose true destination capacity in those fields. Removed `check_storage_threshold` call + function; added `resolve_all_backup_storage_alerts` (type-scoped UPDATE, idempotent, once-per-sync-tick after prune) to clear the 5 stale alerts. Fleet: 5 -> 0 `backup_storage_low` alerts verified. Genuine destination-capacity alerting deferred (would need MSP360 storage-accounts endpoint — separate feature, not scoped).
|
||||
- **Backup staleness:** Handled by the existing `BACKUP_STALE` evaluator (cadence-derived window + 7-day backstop + unit tests). A plan reporting status=success can still be `non_compliant/BACKUP_STALE` if its last run is past the window (e.g. an abandoned plan on the same agent as a healthy current plan). No new code required.
|
||||
- **Key functions** (`server/src/mspbackups/` + `server/src/db/`): `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE), `create_or_update_alert`.
|
||||
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded from alerting/compliance): 8, 13.
|
||||
- **Alert quality pass (2026-06-07):** Non-backup MSP360 PlanTypes (8=Restore, 13=Consistency-check) excluded from alerting/compliance. MSP360 message JSON decoded via `summarize_backup_error`. `create_or_update_alert` refreshes title/message/severity on re-trigger. Fleet: false `backup_failed` 15 -> 2; survivors (AD1, LAB-Becky) are genuine.
|
||||
- **`backup_storage_low` alert type REMOVED (2026-06-07):** `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity. Produced 5 fleet-wide false alerts. `resolve_all_backup_storage_alerts` clears stragglers. Genuine destination-capacity alerting deferred.
|
||||
- MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded): 8, 13.
|
||||
- MSP360 API scope (confirmed 2026-06-07): monitoring-only tier; management paths (plan delete, manual run, storage config) return 404; plan management remains MSP360-console tasks.
|
||||
- Key functions: `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE).
|
||||
|
||||
### Remote Access (Tunnel)
|
||||
- Agent side substantially built (`TunnelManager` state machine; Open/Close/Data). **Server side is a dead-code skeleton — not declared in `api/mod.rs`, no `/tunnel` routes, WS handler logs "not yet implemented." Not production-ready.**
|
||||
- Agent side substantially built (`TunnelManager` state machine; Open/Close/Data). **Server side is a dead-code skeleton — not declared in `api/mod.rs`, no `/tunnel` routes, WS handler logs "not yet implemented." Not production-ready.** Live TTY design will supersede the skeleton (Phase 2 of agent-comms-durability spec).
|
||||
|
||||
### Identity / Multi-tenancy / Security
|
||||
- Auth: JWT (login/register/me); agents auth over WS via per-agent API key + hardware device_id.
|
||||
- **Microsoft Entra ID SSO** (OAuth2/OIDC + PKCE), gated on server config. Multi-provider incl. Google is spec'd (SPEC-008) but **Google not implemented [verify]**.
|
||||
- Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin **user impersonation** (`/auth/impersonate/:id`). Backend present; dashboard UI shipped 2026-05-31.
|
||||
- **Microsoft Entra ID SSO** (OAuth2/OIDC + PKCE), gated on server config. Google planned per SPEC-008 but not implemented [verify].
|
||||
- Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin **user impersonation** (`/auth/impersonate/:id`). Backend + dashboard UI shipped 2026-05-31.
|
||||
- Enrollment & keys (`012`): per-agent key issuance on first run, site API keys (regenerable), site-specific MSI with SITEKEY injected at download, public install-report ingestion. Legacy PowerShell agent path for Server 2008 R2.
|
||||
- Logs: agent log upload (periodic + on-demand), per-agent events (`042`), fleet log view, AI-assisted log analysis (`/logs/analyze`) — AI-optional per locked decision.
|
||||
|
||||
### Enrollment Detail (migration 012 + 2026-06-07 audit)
|
||||
|
||||
The `enrolled_agents` table (migration 012) is the authoritative enrollment history log:
|
||||
The `enrolled_agents` table is the authoritative enrollment history log:
|
||||
|
||||
| Field | Notes |
|
||||
|---|---|
|
||||
| `site_id` | Site the agent enrolled under |
|
||||
| `agent_id` | Set on WebSocket connect (links enrollment record to the agent row) |
|
||||
| `agent_id` | Set on WebSocket connect |
|
||||
| `hostname` | Hostname at enrollment time |
|
||||
| `agent_key_hash` | Hash of the `agk_` enrollment key (intentionally hidden; not surfaced in API) |
|
||||
| `agent_key_hash` | Hash of the `agk_` enrollment key (not surfaced in API) |
|
||||
| `enrolled_at` | Timestamp of initial enrollment |
|
||||
| `last_seen` | Updated on each WS connection |
|
||||
| `ip_address` | IP at last connection |
|
||||
| `os_version` | OS at enrollment time |
|
||||
| `revoked` | Boolean; set TRUE by revocation endpoints |
|
||||
|
||||
**WS auth modes:**
|
||||
- **Mode 3** (modern, `agk_` keys): checks `enrolled_agents.agent_key_hash` first
|
||||
- **Mode 1/2** (legacy): falls back to `agents.api_key_hash`
|
||||
**WS auth modes:** Mode 3 (modern, `agk_` keys): checks `enrolled_agents.agent_key_hash` first. Mode 1/2 (legacy): falls back to `agents.api_key_hash`.
|
||||
|
||||
**Revocation — dual-layer (fixed 2026-06-07):**
|
||||
- `DELETE /api/agents/:id/key` (bulk): revokes BOTH `agents.api_key_hash` (Mode 1/2) AND sets `enrolled_agents.revoked = TRUE` for all enrollment records (Mode 3). Previously only touched the legacy field.
|
||||
- `POST /api/enrolled-agents/:id/revoke` (targeted): per-enrollment revocation with cascade to `agents.api_key_hash`.
|
||||
|
||||
**Enrollment audit endpoint:** `GET /api/sites/:id/enrolled-agents` — returns full enrollment history with `is_duplicate_hostname` annotation. Duplicate flag is scoped to active (non-revoked) rows only; a revoked + active pair is not flagged.
|
||||
|
||||
**Dashboard:** Enrollment tab on SiteDetail — audit table with status badges, duplicate-hostname warnings, and per-row Revoke button with confirmation dialog (shipped 2026-06-07).
|
||||
**Revocation — dual-layer (fixed 2026-06-07):** `DELETE /api/agents/:id/key` (bulk) revokes BOTH layers. `POST /api/enrolled-agents/:id/revoke` (targeted) per-enrollment with cascade.
|
||||
|
||||
### Platform coverage
|
||||
- **Cross-platform (Win/Linux/macOS):** metrics, network state, hardware/software/service inventory, user/group + DC/domain detection, checks (service checks Win+Linux), discovery, self-update, scripts/commands.
|
||||
- **Windows-only:** `user_session` command context (WTS), LibreHardwareMonitor temps, remote registry, tray-into-session, watchdog SCM supervision, BSOD detection.
|
||||
- **Windows-only:** `user_session` command context (WTS), remote registry, tray-into-session, watchdog SCM supervision, BSOD detection.
|
||||
- **macOS:** agent deployed (Phase 1); tray is a stub; automated Mac build pipeline is an intentional stub (no build host) [verify before claiming CI Mac releases].
|
||||
|
||||
---
|
||||
@@ -244,43 +297,45 @@ The `enrolled_agents` table (migration 012) is the authoritative enrollment hist
|
||||
|
||||
| Component | Location | Tech | State |
|
||||
|---|---|---|---|
|
||||
| Server | 172.16.3.30:3001, systemd `gururmm-server`, binary `/usr/local/bin/gururmm-server` | Rust, Axum | deployed, production |
|
||||
| Server | 172.16.3.30:3001, systemd `gururmm-server.service` (User=root, binary `/opt/gururmm/gururmm-server`, EnvironmentFile `/opt/gururmm/.env`) | Rust, Axum | deployed, production (physical box, Ubuntu 26.04, PG 18; migrated 2026-06-11) |
|
||||
| Dashboard | https://rmm.azcomputerguru.com, nginx at `/var/www/gururmm/dashboard/` | React + TypeScript + Vite, shadcn/ui, Tailwind CSS v4 | deployed, production |
|
||||
| Agent (Windows) | Endpoints, installed as `GuruRMMAgent` Windows service via WiX MSI | Rust, Windows MSVC | deployed; stable fleet on 0.6.47; GURU-5070 (beta) on 0.6.54 |
|
||||
| Agent (Windows) | Endpoints, installed as `GuruRMMAgent` Windows service via WiX MSI | Rust, Windows MSVC | deployed; stable fleet on 0.6.63 |
|
||||
| Agent (Linux) | Endpoints, systemd `gururmm-agent`, binary `/usr/local/bin/gururmm-agent` | Rust, musl static | deployed |
|
||||
| Agent (macOS) | Endpoints, LaunchDaemon `com.azcomputerguru.gururmm-agent.plist` | Rust, aarch64/x86_64 | Phase 1 deployed 2026-05-12; code signing issue on Apple Silicon |
|
||||
| Tray (Windows) | System tray, named pipe IPC | Rust | deployed |
|
||||
| Tray (Linux) | System tray, Unix socket IPC, libappindicator/GTK | Rust, GTK | deployed 2026-05-24 (PR #13+#14 merged) |
|
||||
| Tray (Linux) | System tray, Unix socket IPC, libappindicator/GTK | Rust, GTK | deployed 2026-05-24 (PR #13+#14) |
|
||||
| Tray (macOS) | Menu bar | Rust | stub/TODO (issue #18) |
|
||||
| PostgreSQL DB | localhost:5432 on 172.16.3.30, database `gururmm` | PostgreSQL | deployed |
|
||||
| PostgreSQL DB | localhost:5432 on 172.16.3.30, database `gururmm` | PostgreSQL 18 | deployed (migrated from PG VM, backfilled 2026-06-11) |
|
||||
| Coord API | 172.16.3.30:8001/api/coord | FastAPI (part of ClaudeTools API) | deployed |
|
||||
| Build pipeline | 172.16.3.30:9000 webhook + `/opt/gururmm/` scripts | Python (webhook-handler.py), Bash | deployed; split into per-platform scripts 2026-05-24; server build wired into webhook 2026-06-02 |
|
||||
| Pluto (Windows build VM) | 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid) | Rust MSVC, WiX v4 | operational |
|
||||
| Build pipeline | 172.16.3.30:9000 webhook + `/opt/gururmm/` scripts | Python (webhook-handler.py), Bash | deployed; builds agents AND server (server auto-deploy wired 2026-06-02) |
|
||||
| Pluto (Windows build VM) | 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid) | Rust MSVC, WiX v4 | operational; Xeon E5-2695 v3 8c/16t, sequential build ~23min (parallel prototype 3.51x speedup, deferred integration) |
|
||||
| Old VM (rollback anchor) | 172.16.3.46, powered on, data pristine | Ubuntu (original) | parked; do NOT delete until soak completes |
|
||||
|
||||
### Key Files & Repos
|
||||
|
||||
- **Active repo:** `azcomputerguru/gururmm` — http://172.16.3.20:3000/azcomputerguru/gururmm
|
||||
- **Submodule working tree:** `D:\claudetools\projects\msp-tools\guru-rmm` — tracks active repo; develop here and push to Gitea
|
||||
- **Server binary:** `/usr/local/bin/gururmm-server` on 172.16.3.30
|
||||
- **Submodule working tree:** `D:\claudetools\projects\msp-tools\guru-rmm` — tracks active repo; develop here and push to Gitea (push from 172.16.3.30 or 172.16.3.47 — GURU-5070 is not authorized)
|
||||
- **Server binary:** `/opt/gururmm/gururmm-server` on 172.16.3.30 (physical box; prior VM had `/usr/local/bin/gururmm-server`)
|
||||
- **Server config:** `/opt/gururmm/.env` (EnvironmentFile for systemd; DATABASE_URL here; root-only read)
|
||||
- **Source checkout on server:** `/home/guru/gururmm` (build-server.sh: `git reset --hard origin/main` -> `cargo build` -> deploy with backup+rollback)
|
||||
- **Agent binary (Linux):** `/usr/local/bin/gururmm-agent`
|
||||
- **Agent config (Linux/macOS):** `/etc/gururmm/agent.toml` (root, mode 600); macOS uses `/usr/local/etc/gururmm/site.plist`
|
||||
- **Agent registry (Windows):** `HKLM\SOFTWARE\GuruRMM\SiteId` (written by MSI)
|
||||
- **Agent registry (Windows):** `HKLM\SOFTWARE\GuruRMM\SiteId` (written by MSI); `HKLM\SOFTWARE\GuruRMM\DeviceId` (written by agent, added Phase 1 Task 1)
|
||||
- **Windows service name:** `GuruRMMAgent` (NOT `gururmm-agent`)
|
||||
- **Downloads dir:** `/var/www/gururmm/downloads/` on 172.16.3.30
|
||||
- **Webhook handler:** `/opt/gururmm/webhook-handler.py` (port 9000, systemd `gururmm-webhook`)
|
||||
- **Build scripts:** `/opt/gururmm/build-shared.sh`, `build-linux.sh`, `build-windows.sh`, `build-mac.sh` (split 2026-05-24; `build-agents.sh` is now a compat wrapper)
|
||||
- **Server build script:** `/opt/gururmm/build-server.sh` (now dispatched by webhook on `server/` changes; has change-gate marker + binary backup + auto-rollback)
|
||||
- **Per-platform SHA tracking:** `/opt/gururmm/last-built-commit-{linux,windows,mac}`
|
||||
- **Server build SHA tracking:** `/opt/gururmm/last-built-commit-server`
|
||||
- **Pluto known-hosts:** `/opt/gururmm/pluto_known_hosts` (pinned SSH keys; installed 2026-05-24)
|
||||
- **Downloads dir:** `/var/www/gururmm/downloads/` on 172.16.3.30 — base binaries `gururmm-agent-<plat>-<ver>.exe` + `.channel` + `.sha256`; `-latest` symlinks; per-site signed cache `gururmm-agent-site-<site_id>-<plat>-<ver>.exe` (built+jsign-signed on demand)
|
||||
- **Webhook handler:** `/opt/gururmm/webhook-handler.py` (port 9000 localhost; nginx :80 /webhook/ -> :9000; systemd `gururmm-webhook`)
|
||||
- **Build scripts:** `/opt/gururmm/build-shared.sh`, `build-linux.sh`, `build-windows.sh`, `build-mac.sh` (split 2026-05-24; `build-agents.sh` is compat wrapper)
|
||||
- **Server build script:** `/opt/gururmm/build-server.sh` (dispatched by webhook on `server/` changes; change-gate `last-built-commit-server`; backup current binary; auto-rollback if fails `is-active`)
|
||||
- **Per-platform SHA tracking:** `/opt/gururmm/last-built-commit-{linux,windows,mac,server}`
|
||||
- **Build log (Linux):** `/var/log/gururmm-build-linux.log`
|
||||
- **Build log (Windows):** `/var/log/gururmm-build-windows.log`
|
||||
- **Build log (Server):** `/var/log/gururmm-build-server.log`
|
||||
- **API (internal):** http://172.16.3.30:3001
|
||||
- **API (external):** https://rmm-api.azcomputerguru.com (Cloudflare)
|
||||
- **API (external):** https://rmm-api.azcomputerguru.com (Cloudflare -> NPM on Jupiter .20 -> .30:80)
|
||||
- **Dashboard:** https://rmm.azcomputerguru.com
|
||||
- **DB URL:** `postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm`
|
||||
- **Vault path:** `infrastructure/gururmm-server.sops.yaml`
|
||||
- **DB URL:** `postgres://gururmm:<pw>@localhost:5432/gururmm` (pw in `/opt/gururmm/.env`; do not hardcode here)
|
||||
- **Vault paths:** `infrastructure/gururmm-server.sops.yaml` (API creds), `infrastructure/gururmm-server-physical` (SSH ed25519 key, sudo pw)
|
||||
- **SSH to prod:** `ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30` (ed25519, key-only; sudo pw in vault `infrastructure/gururmm-server-physical`)
|
||||
|
||||
### Repo Structure
|
||||
|
||||
@@ -288,27 +343,34 @@ The `enrolled_agents` table (migration 012) is the authoritative enrollment hist
|
||||
gururmm/
|
||||
├── agent/ Rust agent (managed endpoints)
|
||||
│ └── src/
|
||||
│ ├── bsod.rs Windows BSOD/kernel-crash detection (DUMP_HEADER64/32 offset parser)
|
||||
│ ├── ipc.rs Unix socket IPC (Linux); Windows named pipe
|
||||
│ ├── tunnel/ TunnelManager state machine
|
||||
│ ├── metrics/ sysinfo collection + temps (LibreHardwareMonitor/WMI on Win, /sys/class/thermal on Linux) — BUG-001 resolved
|
||||
│ ├── registry_ops/ Windows registry read/write
|
||||
│ ├── updater/ Self-update handler
|
||||
│ └── main.rs systemd unit template generation
|
||||
│ ├── bsod.rs Windows BSOD/kernel-crash detection
|
||||
│ ├── commands/mod.rs CommandExecutor + recent-results cache (ACK dedup)
|
||||
│ ├── device_id.rs Durable multi-location device_id (Phase 1 Task 1)
|
||||
│ ├── ipc.rs Unix socket IPC (Linux); Windows named pipe
|
||||
│ ├── transport/
|
||||
│ │ ├── mod.rs AgentMessage enum (incl. CommandAck)
|
||||
│ │ └── websocket.rs WS transport; execute_command ACK + dedup
|
||||
│ ├── tunnel/ TunnelManager state machine
|
||||
│ ├── metrics/ sysinfo collection + temps
|
||||
│ ├── registry_ops/ Windows registry read/write
|
||||
│ ├── updater/ Self-update handler
|
||||
│ └── main.rs systemd unit template generation
|
||||
├── server/ Rust/Axum API server
|
||||
│ └── src/
|
||||
│ ├── api/ REST handlers (alerts.rs: mute/unmute; agents.rs: role-override)
|
||||
│ ├── alerts/ Alerting modules (offline.rs: offline_sweep, mass_offline detection, agent_role classifier, warm-up restart guard)
|
||||
│ ├── db/ Database layer (sqlx); db/bsod_events.rs, db/mspbackups.rs, db/alerts.rs (is_dedup_muted, mute_condition, unmute_condition, AlertStatus::Muted)
|
||||
│ ├── ws/ WebSocket handler (BsodEvent dispatch)
|
||||
│ └── mspbackups/ MSP360 backup integration (sync.rs: derive_backup_status, summarize_backup_error, resolve_all_backup_storage_alerts; client.rs: BackupPlan.plan_type)
|
||||
├── dashboard/ React/TypeScript UI (lib/agentRole.ts: server classifier; components/ExceptionStream.tsx: offline triage with workstation roll-up; pages/AgentDetail.tsx: role-override control)
|
||||
│ ├── api/ REST handlers (~37 route modules; updates.rs: promote/rollback)
|
||||
│ ├── alerts/ Alerting modules (offline.rs: offline_sweep + mass_offline)
|
||||
│ ├── db/ Database layer (sqlx); commands.rs (requeue_undelivered, fail_timed_out rewrite)
|
||||
│ ├── fingerprint.rs Log signature normalization + hash
|
||||
│ ├── ws/ WebSocket handler (CommandAck, redispatch_pending_commands)
|
||||
│ └── mspbackups/ MSP360 backup integration
|
||||
├── dashboard/ React/TypeScript UI
|
||||
├── tray/ System tray binary
|
||||
├── installer/ WiX v4 MSI (gururmm-agent.wxs)
|
||||
├── installer/ WiX v4 MSI (gururmm-agent.wxs); cleanup.ps1 (now preserves device_id)
|
||||
├── deploy/
|
||||
│ └── build-pipeline/ webhook-handler.py, build-*.sh, build-server.sh
|
||||
├── scripts/ Build/ops scripts
|
||||
└── docs/ FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md, DESIGN.md, specs/
|
||||
└── docs/ FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md,
|
||||
DESIGN.md, specs/, HOST_MIGRATION_RUNBOOK.md
|
||||
```
|
||||
|
||||
---
|
||||
@@ -317,26 +379,25 @@ gururmm/
|
||||
|
||||
### Current Focus
|
||||
|
||||
As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68):
|
||||
|
||||
- **Credential inheritance (deployed 2026-06-07):** Production server running v0.3.45 with full credential inheritance and de-duplication. `/effective` endpoints validated. Dashboard clickable alert badges and client-scoped filtering implemented.
|
||||
- **SPEC-028 offboarding wizard (specification complete):** 835-line spec created for site and client offboarding workflows. Includes data export, dependency analysis, typed confirmation, and audit logging. Roadmap updated with "Client & Site Lifecycle Management" section. Implementation pending.
|
||||
- **BUG-020 — tray duplicate/ghost icons (fixed to beta, 2026-06-04):** Commit `137dd85` shipped to main -> beta. Fix #1: per-session `Local\GuruRMM_Tray` single-instance mutex in the tray binary. Fix #2: `TrayLauncher` reconciliation via `WTSEnumerateProcessesW` (idempotent across watchdog restarts). Fix #3: graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback (so `NIM_DELETE` fires and ghost icon is cleaned). [NOTE: Fix #3 is implemented but dormant — `terminate_all` has no caller in the agent yet. Tracked in coord todo `25fdf31a` to wire into the watchdog policy-disable/uninstall path.]
|
||||
- **BSOD detection Phase 2/3 (deferred):** Dashboard "Crashes" tab shipped 2026-06-07 (AgentDetail Crashes tab + version history in Updates tab). Remaining: BSOD in Alerts stream (issue #10); `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
|
||||
- **Linux fleet unit drift:** Auto-updater replaces the binary but does NOT refresh the systemd unit file. Pre-BUG-016-fix Linux agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs an ops-script pass via `/rmm` or organic at next reinstall.
|
||||
- **Agent-comms-durability Phase 1 SHIPPED.** Remaining Phase 1 items: (a) `is_connected` reads null fleet-wide — cosmetic dashboard bug; gap is in API DTO/dashboard layer, not ws/mod.rs; (b) keepalive AIMD shortening (Task 4; deferred as optimization — durability nets are the actual fix).
|
||||
- **Phase 2 (live TTY, planned):** Extend WS message enum with TTY stream frames (stdin/stdout/stderr, resize, seq). "Activate" cadence switch (agent goes HOT on next heartbeat after tech opens live window). Resume from seq on mid-session drop. Single-use time-bounded session token; max one interactive session per agent; full audit. Estimated 3-5 days separate effort.
|
||||
- **Phase 3 (planned):** Adaptive keepalive (AIMD, persist interval), bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper (`last_inbound` track + evict >~90s; treat missed Pong as close).
|
||||
- **Durable agent identity Phase 1 Tasks 2-3 (pending):** Task 2 = hardware-fingerprint capture (agent `inventory.rs` baseboard serial + primary MAC; server migration `agents.hardware_fingerprint` + `hardware_signals JSONB` + `legacy_device_id`). Task 3 = dashboard "probable duplicate" surfacing (read-only). Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for the 9 existing ghosts) pending soak.
|
||||
- **Confirm Windows v0.6.62 build compiles + deploys** (validates new registry cfg(windows) code from durable-identity Phase 1 Task 1). If it failed, fix and re-push.
|
||||
- **Host migration soak:** drop `.47` (new box) + `.46` (VM) management IPs after stability soak; decommission old VM at `.46`. GURU-5070 Ethernet repair.
|
||||
- **SPEC-028 offboarding wizard (specification complete, 835 lines; implementation pending):** Site + client offboarding workflows, data export, typed confirmation, audit logging.
|
||||
- **BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):** Fix #3 (graceful `Global\GuruRMM_TrayShutdown_{sid}` event) is implemented but dormant — `terminate_all` has no caller. Coord todo `25fdf31a`: wire into watchdog policy-disable/uninstall path.
|
||||
- **BSOD detection Phase 2/3 (deferred):** Dashboard Crashes tab shipped 2026-06-07. Remaining: BSOD in Alerts stream (issue #10), `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table.
|
||||
- **Linux fleet unit drift:** Auto-updater replaces binary but does NOT refresh systemd unit file. Pre-BUG-016-fix agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs ops-script pass.
|
||||
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server.
|
||||
- **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt; Muted filter; Un-ignore control. Server endpoints ready.
|
||||
- **MSP360 backup Phase 2 (not started):** Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint, monitoring-only API tier confirmed).
|
||||
- **Parallel Windows build integration (DEFERRED, design captured in runbook):** 7-job Start-Job parallel build achieves 3.51x speedup on Pluto. Blocked on Windows orchestration friction (non-interactive PS, git host-key trust). Run via new-box `setsid` when integrated.
|
||||
- **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19).
|
||||
- **Auto-update reliability** — BB-SERVER and RECEPTIONIST-PC (Cascades) miss dispatch windows due to flaky WebSockets. Re-querying pending updates on reconnect: incomplete as of 2026-05-24.
|
||||
- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server (found in 2026-05-23 audit).
|
||||
- **MSP360 backup integration** — Alert quality pass shipped 2026-06-07 (commits `779f7f6` + `b82c010`): false `backup_failed` alerts 15 -> 2; `backup_storage_low` removed (structurally false signal; 5 -> 0 false alerts). Dashboard UI shipped 2026-05-31. Phase 2 (management) not started; genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint).
|
||||
- **MSP360 API scope (confirmed 2026-06-07):** Provider API token (vault `msp-tools/msp360-api.sops.yaml`) is monitoring-only at our tier: Companies/Users/Monitoring GET endpoints; every management path (plan delete, manual run, storage config) returns 404. White-labeled MSP360 agent has no `cbb.exe` at any standard path. Plan delete/run/storage remain MSP360-console tasks.
|
||||
- **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt on alert row; Muted filter on alerts page + agent Alerts tab; Un-ignore control; muted rows show Un-ignore instead of Ack/Resolve. Server endpoints ready: `POST /api/alerts/:id/mute` + `/unmute`.
|
||||
- **Tunnel session management (P2, deferred):** Server `tunnel.rs` is a dead-code skeleton that logs "not yet implemented"; wiring the route would expose a non-functional endpoint. Needs: xterm.js design, WS protocol spec for dashboard client, proper Phase 2 implementation. Estimated 3-5 days separate effort.
|
||||
- **SSH from BEAST to production broken:** `known_hosts` issue prevents SSH key auth from GURU-BEAST-ROG to `172.16.3.30`. Workaround during incident: JWT token + curl for endpoint diagnosis. Needs production server host key added to BEAST's `~/.ssh/known_hosts`.
|
||||
- **`revoke_agent_key` dead code:** Old single-layer handler is dead code now that the route points to `revoke_agent_key_handler`. Should be removed in a cleanup pass.
|
||||
- **Enrollment key display gap:** `enrolled_agents` table doesn't expose a key prefix — key hash is intentionally hidden. UI shows hostname, dates, IPs but not a redacted key identifier. Could add first-8-chars of hash as a non-sensitive display field in a future pass.
|
||||
- **Offline alerting -- classification gap:** SIF-SERVER (SifOidak.local DC), Server2013 (Sombra), and other inventory-less offline servers currently auto-classify as workstations (no `os_product_type`; `os_name` doesn't match ~/server/i for those names). Needs manual `role_override` via `PUT /api/agents/:id/role-override` -- Mike held on tagging them.
|
||||
- **`/backup-status` endpoint shape gap:** Returns only one plan per agent (not all plans); makes agents with a dead old plan + healthy current plan look stale-but-green in the backup tab. Compliance domain evaluates all plans correctly. Not fixed this session — noted for future.
|
||||
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH).
|
||||
- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH). `update_rollouts.promoted_by` UUID vs `users.id` int mismatch (schema quirk from migration 046).
|
||||
- **`/backup-status` endpoint shape gap:** Returns only one plan per agent; agents with a dead old plan + healthy current plan look stale-but-green. Not fixed — noted for future.
|
||||
|
||||
### Patterns & Anti-Patterns
|
||||
|
||||
@@ -347,7 +408,7 @@ As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
| `useMemo` with stable deps for data-dependent values | queryClient is stable, memo never recomputes after queries resolve. Use `useQuery` instead. |
|
||||
| CSS variable text colors inside the sidebar | Sidebar bg is hardcoded dark; CSS vars flip in light mode. Use `text-white` explicitly inside sidebar. |
|
||||
| Deploying without stopping the server first | "text file busy" kernel error. Always `systemctl stop` before `cp`. |
|
||||
| Building without `DATABASE_URL` | sqlx compile-time macros fail. `DATABASE_URL` is in `/home/guru/.cargo/env`. |
|
||||
| Building without `DATABASE_URL` | sqlx compile-time macros fail. `DATABASE_URL` is in `/opt/gururmm/.env` (or `/home/guru/.cargo/env` on dev). |
|
||||
| DB migrations without inserting into `_sqlx_migrations` | Server crashes on start. Must insert SHA-384 checksum manually. |
|
||||
| WiX MSI builds on Linux | WiX requires `msi.dll`. MSI must be built on Pluto (Windows). |
|
||||
| Manual builds via SSH | All builds go through `webhook-handler.py`. Never SSH and run `cargo build` + artifact copy manually. |
|
||||
@@ -356,7 +417,7 @@ As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
| `STATUS_BADGE_CLASSES` Record const | Vite/Rollup may optimize away the lookup. Use explicit `getStatusBadgeClass()` if/else function. |
|
||||
| SSH heredoc for TypeScript edits | Shell strips double-quote characters. Edit locally in submodule, push to Gitea, pull on server. |
|
||||
| `Restart-Service GuruRMMAgent -Force` in command scripts | Kills agent before it can report result. Commands stay forever `running`. Use scheduled task with delay instead. |
|
||||
| `sudo -u guru git` in systemd build context | git rejects repo as dubious ownership when running as root on guru-owned repo. Use `safe.directory` config or `sudo -u guru git`. |
|
||||
| `sudo -u guru git` in systemd build context | git rejects repo as dubious ownership when running as root on guru-owned repo. Use `git config --system --add safe.directory` instead (applies to all users/envs — fixed 2026-06-11). |
|
||||
| Self-updating running bash script | bash reads line-by-line from disk; replacing mid-execution silently skips remaining blocks. |
|
||||
| `+1.77` legacy builds without `--ignore-rust-version` | Fail MSRV check after adding `rust-version` to Cargo.toml. Add `--ignore-rust-version` to legacy build lines only. |
|
||||
| `StrictHostKeyChecking=no` for Pluto SSH | Replaced with pinned known-hosts at `/opt/gururmm/pluto_known_hosts`. MITM would compromise build artifacts. |
|
||||
@@ -364,8 +425,11 @@ As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
| Dead WebSocket write half | WS write fails, send task dies, receive loop keeps agent in `ConnectedAgents` with dead write half. Commands silently fail. Fix: `tokio::select!` monitoring both tasks. |
|
||||
| Using the `minidump` crate for Windows kernel dumps | The crate only parses Breakpad MDMP format, not Windows kernel PAGEDU64 dumps. Parse `DUMP_HEADER64`/`DUMP_HEADER32` at fixed offsets directly (validated against real dumps). |
|
||||
| Build classification defaulting to stable | New agent builds should default to `beta`; promotion to `stable` is an explicit re-tag of the `<binary>.channel` sidecar. Defaulting stable races the auto-update fleet ahead before any beta soak. |
|
||||
| Webhook dispatching only agent builds | The webhook historically triggered `build-linux.sh`/`build-windows.sh`/`build-mac.sh` but never `build-server.sh`; server changes silently went unbuilt until manual intervention. Now fixed — server build is dispatched alongside agent builds, gated by `last-built-commit-server`. |
|
||||
| Auto-update-on-connect + default-stable tagging racing the fleet | If a build is tagged `stable` (even briefly by mistake), agents on the stable channel auto-update immediately on next heartbeat. Once fleet has updated, rolling back requires a new build — the prior binary is cleaned up. Default beta, soak, promote explicitly. |
|
||||
| Webhook dispatching only agent builds | The webhook historically triggered agent builds but never `build-server.sh`; server changes silently went unbuilt until manual intervention. Fixed 2026-06-02 — server build is dispatched alongside agent builds, gated by `last-built-commit-server`. |
|
||||
| Auto-update-on-connect + default-stable tagging racing the fleet | If a build is tagged `stable` (even briefly by mistake), agents on the stable channel auto-update immediately. Once fleet has updated, rolling back requires a new build. Default beta, soak, promote explicitly. |
|
||||
| Channel pin at agent-level for a beta canary | Agent-level `update_channel` is keyed to agent_id and lost on re-enrollment. Use site-level or client-level override (survives re-enroll). |
|
||||
| Installer using `& $stagingPath install 2>&1 \| Out-Null` under `$ErrorActionPreference='Stop'` | Swallows output; surfaces a misleading NativeCommandError on any non-zero exit. Use `Start-Process -Wait -PassThru` + check `$proc.ExitCode`. Fixed 2026-06-11 (commit 5c0d004). |
|
||||
| Reaper flipping un-acked commands to `failed` | Causes false-failure for commands black-holed in NAT conntrack gaps. The reaper must only fail commands that were ACK'd but overran their real execution timeout. Un-acked commands must be requeued to `pending`. |
|
||||
|
||||
**Good patterns:**
|
||||
|
||||
@@ -376,9 +440,11 @@ As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
- **`RuntimeDirectory=gururmm` in systemd unit** — systemd-native way to give agent writable `/run/gururmm/` for IPC socket.
|
||||
- **Registry-first path resolution** — read `HKLM:\SOFTWARE\GuruRMM` for install dir, fall back to service PathName, then hardcoded default.
|
||||
- **`interrupt_running_commands()` at reconnect** — flips all `status='running'` commands for reconnecting agent to `status='interrupted'`.
|
||||
- **Build change-gate + backup/rollback in `build-server.sh`** — skips rebuild when `server/` is unchanged (marker `last-built-commit-server`); backs up previous binary; restores it if the new binary fails `is-active`. Prevents unnecessary rebuilds and covers the BUG-003 no-rollback gap for server.
|
||||
- **Server's own root RMM agent for privileged ops** — the server (172.16.3.30) runs the GuruRMM Linux agent as root (hostname `gururmm`); it can read/write `/var/www/gururmm/downloads`, re-tag `.channel` sidecars, and trigger `build-server.sh` without SSH or `sshpass`.
|
||||
- **GURU-5070 as permanent beta-channel canary** — per-agent `update_channel = 'beta'` override (only agent in the fleet with an explicit channel; site/all-other-agents default to `NULL` = stable). Gets every new beta build immediately; stable fleet is protected by the explicit `update_rollouts` pin.
|
||||
- **Build change-gate + backup/rollback in `build-server.sh`** — skips rebuild when `server/` is unchanged; backs up previous binary; restores it if the new binary fails `is-active`.
|
||||
- **Server's own root RMM agent for privileged ops** — the server (172.16.3.30) runs the GuruRMM Linux agent as root (hostname `gururmm`); it can re-tag `.channel` sidecars and trigger `build-server.sh` without SSH or `sshpass`.
|
||||
- **GURU-5070 site as permanent beta-channel canary** — site "Mike's Car" has `update_channel='beta'` (survives agent re-enrollment). Gets every new beta build; stable fleet protected by explicit `update_rollouts` pin. (Channel pin on site, not agent — agent-level overrides are lost on re-enroll.)
|
||||
- **Capability gate via partial index, not version-string compare** — CI auto-bumps versions (`[ci-version-bump]`); pinning behavior to a fixed version string is fragile. Use a DB-observable capability marker (e.g., `acked_at IS NOT NULL`) that is set when the agent actually demonstrates the capability.
|
||||
- **Two independently-deployable slices for cross-component features** — agent slice and server slice can be deployed in any order; the server slice uses a capability gate so old agents keep legacy behavior.
|
||||
|
||||
### Build & Deploy
|
||||
|
||||
@@ -386,62 +452,69 @@ As of 2026-06-07 (agent 0.6.54 beta / 0.6.47 stable / server 0.3.45):
|
||||
|
||||
```
|
||||
Gitea push to main
|
||||
-> webhook-handler.py (172.16.3.30:9000, parallel threads per platform)
|
||||
-> webhook-handler.py (172.16.3.30:9000 via nginx :80 /webhook/, parallel threads per platform)
|
||||
-> build-shared.sh (auto-version bump, git sync — runs once)
|
||||
-> build-linux.sh (cargo build on .30; log: /var/log/gururmm-build-linux.log)
|
||||
-> build-windows.sh (SSH -> Pluto 172.16.3.36 via pinned known-hosts
|
||||
cargo build --release x64 MSVC + i686 MSVC
|
||||
+1.77 legacy builds with --ignore-rust-version
|
||||
WiX MSI build for site-specific base
|
||||
sign-windows.sh (jsign + Azure Trusted Signing)
|
||||
sign-windows.sh (jsign + Azure Trusted Signing, /etc/gururmm-signing.env)
|
||||
SCP artifacts back; log: /var/log/gururmm-build-windows.log)
|
||||
-> build-mac.sh (stub — no build machine configured yet)
|
||||
-> build-server.sh (gated: skips if no server/ changes since last-built-commit-server;
|
||||
backs up current binary; builds + deploys; auto-rollback if fails is-active;
|
||||
log: /var/log/gururmm-build-server.log)
|
||||
-> artifacts -> /var/www/gururmm/downloads/ with sha256 + -latest symlinks
|
||||
-> artifacts -> /var/www/gururmm/downloads/ with sha256 + -latest symlinks + .channel (beta)
|
||||
-> per-platform last-built-commit files updated
|
||||
-> systemctl restart gururmm-agent (local agent on .30)
|
||||
```
|
||||
|
||||
**Auto-version:** `build-shared.sh` diffs `agent/`, `server/`, `dashboard/` against last built SHA. For each changed component, bumps patch version in `Cargo.toml` or `package.json`, commits `[ci-version-bump]`, pushes. Webhook skips builds where all commits are version bumps.
|
||||
|
||||
**Build channel classification:** New agent builds are tagged `beta` by default (`build-windows.sh` and `build-linux.sh` fixed 2026-06-02; macOS already did this). Promotion to `stable` is an explicit step: `echo stable > /var/www/gururmm/downloads/<binary>.channel`. This is distinct from agents defaulting to the `stable` *channel* (correct and unchanged) — agents on the stable channel receive only the latest `stable`-tagged binary; beta agents receive the absolute-latest.
|
||||
**Build channel classification:** New agent builds are tagged `beta` by default. Promotion to `stable` is an explicit step via the API: `POST /api/updates/rollouts/:version/promote {"os":"windows","arch":"amd64"}` — re-tags all `.channel` files for that version, records `update_rollouts` row, triggers scanner rescan. Rollback: `POST /api/updates/rollouts/:version/rollback`. Agents on the stable channel only receive the latest `stable`-tagged binary.
|
||||
|
||||
**Dashboard channels — BETA-FIRST (2026-06-02):** mirrors the agent beta/stable model. Every push to `main` touching `dashboard/` auto-builds to **beta**; **prod is explicit-promote only**.
|
||||
**Downloads layout:**
|
||||
- Base binaries: `gururmm-agent-<plat>-<ver>.exe` + `.channel` (beta/stable) + `.sha256`
|
||||
- Latest symlinks: `gururmm-agent-<plat>-latest.exe` + `.channel` + `.sha256`
|
||||
- Per-site signed cache: `gururmm-agent-site-<site_id>-<plat>-<ver>.exe` (built+jsign-signed on demand at install-request time)
|
||||
|
||||
**Dashboard channels — BETA-FIRST:**
|
||||
|
||||
| Channel | URL | Web root | Updates |
|
||||
|---|---|---|---|
|
||||
| beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (now dispatched by the webhook alongside agent/server builds, change-gated on `last-built-commit-dashboard`) |
|
||||
| beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (change-gated on `last-built-commit-dashboard`) |
|
||||
| prod | https://rmm.azcomputerguru.com | `/var/www/gururmm/dashboard` | explicit only — `sudo /opt/gururmm/promote-dashboard.sh --confirm` (backs up prod; `--rollback` restores) |
|
||||
|
||||
**Do NOT hand-rsync into the prod web root** (the old `npm run build && rsync ... dashboard/` is superseded). One artifact serves both channels — the Vite build bakes in the absolute prod API URL (`rmm-api.azcomputerguru.com`), so beta uses shared prod data and is byte-identical to prod; beta is branded by an nginx-layer `sub_filter` BETA banner (`deploy/nginx/rmm-beta.conf`), so promotion is a plain rsync. **Serving/TLS:** second nginx vhost on `.30` (`server_name rmm-beta`, specific name beats prod `_`), Cloudflare `rmm-beta` A->``72.194.62.10` proxied (mirrors `rmm`), Jupiter NPM proxy host **id=11** -> `.30:80` presenting cert **id=10** (zone SSL mode is Full; if ever Full-Strict, beta needs its own SAN/cert).
|
||||
Do NOT hand-rsync into the prod web root. One artifact serves both channels — the Vite build bakes in the absolute prod API URL; beta uses an nginx-layer `sub_filter` BETA banner.
|
||||
|
||||
**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start.
|
||||
**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start. Migrations run automatically on server restart (sqlx), but the checksum must be correct.
|
||||
|
||||
**Pluto (172.16.3.36):**
|
||||
- Windows Server 2019 VM on Jupiter (Unraid)
|
||||
- SSH: `ssh -o UserKnownHostsFile=/opt/gururmm/pluto_known_hosts Administrator@172.16.3.36`
|
||||
- Rust stable 1.95.0 + 1.77 pinned for legacy builds
|
||||
- VS Build Tools (MSVC), sccache at `C:\sccache`, WiX v4, Gitea clone at `C:\gururmm\`
|
||||
- Xeon E5-2695 v3 @2.3GHz, 8c/16t KVM VM; sequential build ~23min; parallel prototype 3.51x (deferred)
|
||||
|
||||
**Auto-update delivery:**
|
||||
- Server scans every 300s; dispatches update command on agent heartbeat
|
||||
- Gated on effective policy `auto_update` (default on when policy is null)
|
||||
- Agent: downloads to PrivateTmp, verifies SHA-256, replaces binary, restarts service
|
||||
- Force-trigger: `POST /api/agents/:id/update`
|
||||
- **As of agent-comms-durability Phase 1:** update dispatches piggyback on heartbeat re-offer path (same `redispatch_pending_commands`), so NAT'd agents receive updates reliably
|
||||
|
||||
---
|
||||
|
||||
## Active State
|
||||
|
||||
**Fleet (as of 2026-06-04, live Postgres verified; no enrollment changes in 2026-06-07 session):**
|
||||
- 55 enrolled agents total
|
||||
- Stable channel: pinned at 0.6.47 windows/amd64 (promoted 2026-05-28); 0.6.46 linux. All 39 sites and 118 agents are on stable (channel NULL = stable default).
|
||||
- Beta channel: **GURU-5070 only** — per-agent `update_channel = 'beta'` override (site "Mike's Car" / `103c10b9-c1de-4dd8-b382-b8362ed3143e` has `update_channel = NULL`, so stable is the site default; GURU-5070 is the explicit per-agent exception). Beta has no `update_rollouts` pin — server dispatches the newest signed beta artifact straight from the build pipeline.
|
||||
- GURU-5070 running 0.6.54 (beta). Permanent canary; gets every new beta build immediately upon reconnect.
|
||||
**Fleet (as of 2026-06-11, post-0.6.63 rollout):**
|
||||
- ~215 enrolled agents total (growing)
|
||||
- Stable channel: 0.6.63 windows/amd64 (promoted 2026-06-11); 168 agents on 0.6.63 during rollout, 182 online throughout (no mass dropout). ~20 stragglers on older versions auto-update on reconnect.
|
||||
- Beta channel: site "Mike's Car" (`103c10b9-c1de-4dd8-b382-b8362ed3143e`) has `update_channel='beta'` (persists across re-enrollment). All GURU-5070 machines are on this site.
|
||||
- GURU-5070 live agent: `819df0c8`; ghost (orphaned prior enrollment): `c043d9ac` (offline, durable-identity fix in progress).
|
||||
|
||||
**Enrolled clients/sites (live API, 2026-05-24 baseline; no removals since):**
|
||||
**Enrolled clients/sites (2026-05-24 baseline; no removals since):**
|
||||
|
||||
| Client | Type | Sites | Notable agents |
|
||||
|---|---|---|---|
|
||||
@@ -449,35 +522,36 @@ Gitea push to main
|
||||
| BirthBiologic | Corporate | Main Office | BB-SERVER |
|
||||
| Cascades of Tucson | Corporate | CascadesTucson | 27 agents — CS-SERVER, RECEPTIONIST-PC, ASSISTMAN-PC, MDIRECTOR-PC, MEMRECEPT-PC, and ~22 others |
|
||||
| Dataforth Corp | Corporate | D1 | AD2, DF-GAGETRAK |
|
||||
| Goldstein | (verify) | (verify) | Canary client for 0.6.62/0.6.63 beta soak |
|
||||
| Grabb & Durando Law Office | Corporate | Main Office | GND-SERVER |
|
||||
| Instrumental Music Center | Corporate | IMCMain | IMC1 |
|
||||
| Key, Paul | Residential | Home | KEY-MEDIA |
|
||||
| Peaceful Spirit | Residential | Bridgette Home, Country Club, Mara Home | BridgettePSHomeComputer, PST-SERVER, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew |
|
||||
| Safesite | Corporate | Glendale | MSI |
|
||||
| Peaceful Spirit | Residential | Bridgette Home, Country Club, Mara Home | BridgettePSHomeComputer, PST-SERVER (UDR Ultra, comms-durability verification), PST-SERVER2, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew |
|
||||
| Safesite | Corporate | Glendale | DESKTOP-3USU20B (comms-durability affected) |
|
||||
| Sombra Residential LLC | Corporate | main office | DESKTOP-UQRN4K3, Server2013 |
|
||||
| Stamback Septic | Corporate | StambackSeptic | DESKTOP-BTR2AM3, StambackLaptopNew |
|
||||
| Swanson, Len | Residential | Home | LAS-GAMER |
|
||||
|
||||
**API auth:**
|
||||
- `POST /api/auth/login` -> JWT (~24h)
|
||||
- Creds: vault `infrastructure/gururmm-server.sops.yaml` -> `credentials.gururmm-api.admin-email` / `admin-password`
|
||||
- Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`
|
||||
- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context` — **`system`** (default; Session 0 / SYSTEM) or **`user_session`** (runs in the logged-on user's desktop session via WTS token impersonation; Windows-only, needs an active session), plus `timeout_seconds`/`elevated`. The agent does NOT run everything as LocalSystem — `user_session` is the per-user path (migration `041_add_command_context`, `agent/src/watchdog/wts.rs`).
|
||||
- Response: `stdout`, `stderr`, `exit_code`, `status` (running/completed/failed/timeout/interrupted)
|
||||
- Creds: vault `infrastructure/gururmm-server.sops.yaml` -> `credentials.gururmm-api.admin-email` / `admin-password`; or via `bash .claude/scripts/rmm-auth.sh`
|
||||
- Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`, `POST /api/updates/rollouts/:version/promote`
|
||||
- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context` — `system` (default) or `user_session` (Windows WTS), plus `timeout_seconds`/`elevated`.
|
||||
|
||||
**Dashboard — complete and working:**
|
||||
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (with clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO (Entra only — Google planned per SPEC-008, not implemented), Policies Dashboard (all tabs), Registry editor, MSP360 backup status card + agent<->backup mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance support, AgentDetail Crashes tab (BSOD events), AgentDetail version history in Updates tab, fleet stats from `/agents/stats`, SiteDetail Revoke Key button + Enrollment audit tab (with duplicate-hostname badge), Install Reports page, Fleet Discovery page.
|
||||
Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO, Policies Dashboard (all tabs), Registry editor (read-only via HTTP), MSP360 backup status + mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance, AgentDetail Crashes tab + version history, fleet stats from `/agents/stats`, SiteDetail Revoke Key + Enrollment audit tab, Install Reports page, Fleet Discovery page.
|
||||
|
||||
**Dashboard — incomplete (see UI_GAPS.md):**
|
||||
- Watchdog alerts UI — blocked on 2 missing server routes
|
||||
- BSOD in Alerts stream (Phase 2 deferred)
|
||||
- Tunnel session management (interactive terminal — server skeleton "not yet implemented"; needs xterm.js design + WS protocol spec)
|
||||
- Offboarding wizard UI (SPEC-028 complete, implementation pending)
|
||||
- Alert mute UI -- Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard Task 5 NOT started)
|
||||
- Tunnel session management (server skeleton "not yet implemented"; Phase 2 live TTY design will supersede)
|
||||
- Offboarding wizard UI (SPEC-028 complete; implementation pending)
|
||||
- Alert mute UI — Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard NOT started)
|
||||
- `is_connected` reflects real connectivity (null fleet-wide cosmetic bug; gap in API/dashboard layer)
|
||||
- Documentation / user guide / inline help tooltips (P3, not started)
|
||||
|
||||
**Open Gitea issues:**
|
||||
- #10 — BSOD detection Phase 2/3 (dashboard + fetch_bsod_dump + full bugcheck table)
|
||||
- #10 — BSOD detection Phase 2/3 (dashboard Alerts stream + fetch_bsod_dump + full bugcheck table)
|
||||
- #15 — Pipeline tray build (publish tray binary to downloads)
|
||||
- #16 — Windows IPC peer authz
|
||||
- #17 — logind console user resolution
|
||||
@@ -485,11 +559,8 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI
|
||||
- #19 — subscriber broadcast
|
||||
|
||||
**BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):**
|
||||
- Symptom: duplicate AND ghost `gururmm-tray.exe` tray icons. Live evidence: 5 stacked tray processes in Session 1 on GURU-5070 (one per watchdog restart over 6/1-6/2).
|
||||
- Root cause: `TrayLauncher` (`agent/src/watchdog/wts.rs`) tracked launches only in an in-memory `HashMap<sid,HANDLE>` that resets on watchdog restart (esp. agent auto-update), so it relaunched trays into sessions that already had one; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping the tray's `Drop` -> `NIM_DELETE` (ghost).
|
||||
- Fix (commit `137dd85`, gururmm@main -> beta): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW` (idempotent); (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> `TerminateProcess` fallback.
|
||||
- Verified: independent Grok review + Code Review Agent APPROVE.
|
||||
- Follow-up (coord todo `25fdf31a`): wire `terminate_all` graceful-shutdown into the watchdog policy-disable/uninstall path so fix #3 becomes active.
|
||||
- Root cause: `TrayLauncher` tracked launches only in an in-memory `HashMap<sid,HANDLE>` that resets on watchdog restart; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping `NIM_DELETE` (ghost).
|
||||
- Fix (commit `137dd85`): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW`; (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> TerminateProcess fallback. Fix #3 is dormant pending `terminate_all` wiring (coord todo `25fdf31a`).
|
||||
|
||||
**Security backlog (HIGH):**
|
||||
- `credentials/:id/reveal` — horizontal privilege escalation (no ownership scope check)
|
||||
@@ -511,6 +582,7 @@ These decisions are locked. Do not reverse without explicit user approval.
|
||||
8. **Multi-tenancy identity model (ADR-001)** — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1.
|
||||
9. **Holistic feature development (DESIGN.md)** — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected.
|
||||
10. **AI-optional operation** — GuruRMM must be fully functional without AI. AI features are enhancements, not requirements.
|
||||
11. **Durability-first command delivery** — The durable Postgres `commands` row is the source of truth; the WebSocket is only a delivery hint. An un-acked command is re-deliverable, never reaped. Idempotent end-to-end by `command_id` so re-delivery never double-executes. (Established 2026-06-11, agent-comms-durability spec.)
|
||||
|
||||
---
|
||||
|
||||
@@ -531,33 +603,34 @@ These decisions are locked. Do not reverse without explicit user approval.
|
||||
| 2026-05-23 | /rmm-audit pass. Agent optimization Phases 1A-3. Auto-version bump mechanism. MSRV bumped to 1.85. Fleet at 0.6.29. |
|
||||
| 2026-05-24 | Linux tray IPC + GTK (PR #13+#14) and peer-cred authz (PR #14) merged. PR #21 (ReadWritePaths fix) merged. Build pipeline split into per-platform scripts. Pluto known-hosts pinned. Fleet converged to 0.6.38. |
|
||||
| 2026-05-31 | Roadmap reconciliation (17 corrections — roadmap understated built state). MSPBackups mapping/verify UI + dev-admin impersonation UI deployed (dashboard v0.2.32). BUG-008/013/014 status corrected to fixed. SPEC-021 (logged-in user domain detection) written after Howard feature request. |
|
||||
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys, NVIDIA driver 32.0.15.9201 on RTX 5070 Ti Laptop GPU); GuruConnect cleared on three grounds; root cause one-off driver TDR. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1 (watermark before send) + SF-2 (non-atomic watermark write); merged to main (0ec55cf), agent versioned 0.6.51. |
|
||||
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh (macOS already correct). Webhook wired to dispatch build-server.sh with change-gate (last-built-commit-server) + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed, override removed, verified clean. [NOTE: the session log recorded "GURU-5070 promoted to stable" — contradicted by live DB; see 2026-06-04 entry.] |
|
||||
| 2026-06-04 | Channel correction confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Site "Mike's Car" and all 39 sites are `update_channel = NULL` (stable default); GURU-5070 is the only beta agent in the 119-agent fleet. Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux via `update_rollouts` (promoted 2026-05-28); beta channel has 0 `update_rollouts` rows (server dispatches newest signed beta artifact directly). GURU-5070 running 0.6.54. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + `WTSEnumerateProcessesW` reconciliation + graceful shutdown event (fix #3 dormant pending `terminate_all` wiring — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
|
||||
| 2026-06-07 | Backup-alert quality pass shipped. FU1 (`summarize_backup_error` decodes MSP360 message JSON; `create_or_update_alert` now refreshes title/message/severity on re-trigger, also fixes latent severity-escalation freeze) + FU2 (exclude non-backup PlanTypes 8=Restore/13=Consistency-check from alerting/compliance): false `backup_failed` alerts 15 -> 2 fleet-wide (survivors AD1, LAB-Becky are genuine and self-describing), commit `779f7f6`. `backup_storage_low` alert type removed entirely (commit `b82c010`): `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity — produced 5 fleet-wide false alerts including DF-HYPERV-B "100% Full" on a 4 GB plan; `resolve_all_backup_storage_alerts` (type-scoped, idempotent, once-per-tick) clears stragglers; 5 -> 0 verified after 17:21:41 UTC restart. Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint). `BACKUP_STALE` evaluator confirmed already correct — no new code. Both commits on main. Submodule pinned at `226ba9f` in parent. |
|
||||
| 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical credential propagation (Global → Client → Site) with `is_inheritable` flag and de-duplication by (credential_type, label). `/effective` endpoints validated. Dashboard UI: clickable alert severity badges with client filtering, offline badge now scopes to client-specific agents. SPEC-028 offboarding wizard specification created (835 lines) covering site and client offboarding workflows with data export, dependency analysis, typed confirmation, and audit logging. FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management" section. |
|
||||
| 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped (second session). Offline sweep evaluator (60s tokio interval, `server/src/alerts/offline.rs`): server-only `agent_offline` alerts (classifier: `os_product_type` 2/3, else `os_name` ~/server/i, else `role_override`); site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`; aggregates pinned to representative agent with site/fleet `dedup_key`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect — permanently false in steady state). Migration 054 (`agents.role_override`). Dashboard triage: servers elevated individually; workstations collapsed into roll-up. `PUT /api/agents/:id/role-override`. Verified live vs WIN-TG2STMODJG8 ("Windows Server 2019 Standard Evaluation", auto-classified via `os_name`); zero false mass-offline against ~55 chronically-offline agents; restart guard held across deploys. Known gap: `os_product_type` on only 16/168 agents; `os_name` is workhorse. Commits f1cdf5d/30e4f23/21d63bd/3eedf91. Alert ignore/mute (perma-silence): migration 055 (`alert_mutes` table + `muted` status); `dedup_key`-keyed; reason required (400 if missing); gates `create_or_update_alert` + `create_check_alert` bypass; `POST /api/alerts/:id/mute` + `/unmute`; dashboard UI NOT started. Verified live (mute -> `muted`; 60s sweep did not re-fire; unmute -> `active`). Commits 29c405e/a120e71. MSP360 Provider API probed live: monitoring-only tier (Companies/Users/Monitoring GET; management paths 404); plan delete/run/storage remain console tasks; white-label agent has no `cbb.exe`. |
|
||||
| 2026-06-07 | UI gap batch + enrollment audit (GURU-BEAST-ROG session). `get_agent` upgraded to `AgentWithDetails` with `client_id`/`client_name` JOIN. Six new/updated server endpoints: `GET /api/agents/:id/bsod-events`, `GET /api/agents/:id/version-history`, `GET /api/install-reports` + `/:id`, `GET /api/discovery/all-devices`, `DELETE /api/agents/:id/key` (redesigned as dual-layer revoke). Two new enrollment endpoints: `GET /api/sites/:id/enrolled-agents` (audit with duplicate-hostname annotation) + `POST /api/enrolled-agents/:id/revoke`. Dashboard: AgentDetail Crashes tab + version history, Dashboard.tsx fleet stats from `/agents/stats`, SiteDetail Revoke Key + Enrollment audit tab, new pages InstallReports + Discovery. Hotfix `6faa382`: `role_override` field missing from new `get_agent_with_details_by_id` SQL caused all `GET /api/agents/:id` to return 500 in production; fixed within same session. Enrollment `enrolled_agents` table (migration 012) required no new migration — gap was API exposure and UI only. WS auth: Mode 3 checks `enrolled_agents.agent_key_hash` first; Mode 1/2 falls back to `agents.api_key_hash`; dual-revoke now covers both layers. Key commits: `5854c63`, `6faa382`, `49a7109`, `00129af`. All UI gaps except Tunnel and documentation are now complete. SSH from BEAST to production broken (known_hosts issue) — workaround: JWT + curl. |
|
||||
| 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys); GuruConnect cleared on three grounds. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1/SF-2; merged to main (0ec55cf), agent versioned 0.6.51. |
|
||||
| 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh. Webhook wired to dispatch build-server.sh with change-gate + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed. |
|
||||
| 2026-06-04 | Channel state confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event (fix #3 dormant — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. |
|
||||
| 2026-06-07 | Backup-alert quality pass shipped. FU1 (summarize_backup_error + create_or_update_alert refresh) + FU2 (exclude PlanTypes 8/13 from alerting/compliance): false `backup_failed` 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010`). `backup_storage_low` alert type removed entirely (DataCopied/TotalData is not destination capacity; 5 -> 0 false alerts; `resolve_all_backup_storage_alerts`). |
|
||||
| 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical propagation (Global → Client → Site), `is_inheritable`, `/effective` endpoints. Dashboard: clickable alert severity badges + client filtering. SPEC-028 offboarding wizard specification created (835 lines). FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management." |
|
||||
| 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped. Offline sweep evaluator (60s interval; offline.rs): server-only `agent_offline`; migration 054 `agents.role_override`; site (>=50%+>=3) -> `mass_offline_site`; fleet (>=10) -> `mass_offline_fleet`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect). Alert ignore/mute (perma-silence): migration 055 `alert_mutes` + `muted` status; `dedup_key`-keyed; reason required; gates `create_or_update_alert` + `create_check_alert` bypass; dashboard UI pending. |
|
||||
| 2026-06-07 | UI gap batch + enrollment audit (GURU-BEAST-ROG session). `get_agent` upgraded to `AgentWithDetails`. Six new/updated server endpoints. Enrollment audit endpoint + per-row revoke. Dashboard: AgentDetail Crashes tab + version history, SiteDetail Enrollment tab, InstallReports + Discovery pages. Hotfix `6faa382` (role_override missing from new SQL — all GET /api/agents/:id returned 500). |
|
||||
| 2026-06-11 | Physical server migration executed. Ubuntu VM at 172.16.3.30 retired; physical box took the same IP. Production down <27 min (06:53-07:20 MST). 162/212 agents reconnected within 15 min. Old VM parked at .46 as rollback anchor. Server binary path changed from `/usr/local/bin/gururmm-server` to `/opt/gururmm/gururmm-server`. Ubuntu 26.04 + PostgreSQL 18. 7-day whale data backfilled (~3.4 GB, lossless, 0 pool-timeouts). Three build-env gaps fixed on new box (sccache, Pluto key, signing tools). |
|
||||
| 2026-06-11 | Ghost agent root-caused. `device_id` storage fragile (single file, wiped by cleanup.ps1). 11 duplicate-hostname agents found (9 same-site ghosts; 2 cross-site non-merge). Durable agent identity spec created (`specs/durable-agent-identity/`; multi-AI reviewed). Phase 1 Task 1 shipped (agent v0.6.62): multi-location device_id (registry + ProgramData on Win; /etc + /var/lib on Linux; /Library + mirror on macOS) with self-heal; cleanup.ps1 whitelisted. Channel pin regression fixed: moved GURU-5070 beta override from agent-level to site "Mike's Car" (survives re-enrollment). |
|
||||
| 2026-06-11 | Agent-comms-durability Phase 1 shipped (agent v0.6.63 / server v0.3.68). Spec created (specs/agent-comms-durability/; multi-AI reviewed, Gemini + Grok). Root cause: NAT conntrack asymmetry — server->agent pushes black-holed in idle conntrack gaps. Fix: agent CommandAck on receipt (migration 058 acked_at); agent dedup cache (re-report, never double-run); server reaper re-delivery (migration 059 delivery_attempts, cap 10; capability gate via partial index); pending commands re-offered on every heartbeat. Installer hardened (Start-Process). Build-pipeline dubious-ownership fixed. Canary (Goldstein + Peaceful Spirit): 6 agents verified, PST-SERVER ACK latency 16 ms. Fleet promoted 0.6.63 stable; 168 agents converged in ~3 min, 182 online throughout. |
|
||||
|
||||
---
|
||||
|
||||
## Compilation Notes
|
||||
|
||||
- **2026-05-26 recompile:** Added the Capabilities / Feature Set section, synthesized from authoritative artifacts at live `main` (cd27a59) — API route modules, agent source tree, 46 migrations, roadmap, and the feat/perf commit log — NOT from session logs. This was prompted by the prior seeding missing the `user_session` command context entirely (it had only ever stated "runs as LocalSystem"). Corrected: command execution contexts, temperature monitoring (BUG-001 is resolved, not pending), Entra-only SSO, and added user-inventory/discovery/VM-detection/safe-rollout surfaces. **Changelogs are an unreliable capability source here** — committed changelogs stop at agent v0.6.22 while the fleet runs 0.6.39+; migrations + commit log are authoritative.
|
||||
- **2026-05-26 recompile:** Added the Capabilities / Feature Set section, synthesized from authoritative artifacts at live `main` (cd27a59). Corrected: command execution contexts, temperature monitoring (BUG-001 is resolved, not pending), Entra-only SSO, added user-inventory/discovery/VM-detection/safe-rollout surfaces. Changelogs are an unreliable capability source — stop at agent v0.6.22; migrations + commit log are authoritative.
|
||||
- Tunnel subsystem (verified against live main): agent side substantially built; server side is a dead-code skeleton (not declared in `api/mod.rs`, no routes, WS handler logs "not yet implemented"). Confirmed, not unverified.
|
||||
- macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). `build-mac.sh` is a stub as of 2026-05-24 — unclear if automated pipeline includes macOS yet. [unverified]
|
||||
- Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23) — likely still unfixed. [unverified]
|
||||
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24 save. [unverified]
|
||||
- **2026-06-02 recompile:** Folded in BSOD detection feature (Phase 1 shipped — agent/src/bsod.rs, migration 048, ws handler, always-Critical alerts, verified against real 0x116 dump); server build now wired into webhook (change-gated + rollback); build channel default changed to beta (stable is explicit promote); versions updated to agent 0.6.51 / server 0.3.37; fleet converged. Corrected submodule framing (tracks active repo, develop here + push to Gitea — not "stale, do not develop"). Added build-server.sh change-gate marker and server build log to Key Files. Added server's root RMM agent as a good pattern. Updated Current Focus with BSOD Phase 2/3 and Linux fleet unit drift. Added four new anti-patterns (minidump crate, default-stable builds, webhook agent-only gap, auto-update race). Migration count updated 46 -> 48.
|
||||
- **2026-06-04 recompile:** Corrected GURU-5070 channel state — live Postgres confirms `update_channel = 'beta'` per-agent (not stable as the 2026-06-02 session log implied). Stable fleet pinned at 0.6.47 (not 0.6.51). GURU-5070 on 0.6.54 beta. Beta channel has no `update_rollouts` pin. Added BUG-020 (tray duplicate/ghost icons) — symptom, root cause, fix commit `137dd85`, dormant follow-up for fix #3 wiring. Updated Summary, Components table, Active State, Current Focus, History, Good Patterns, and Compilation Notes. Added sources entry for live Postgres query + commit 137dd85. Added `aliases: [guru-rmm]` frontmatter to cross-reference the tombstone at `wiki/projects/guru-rmm.md`.
|
||||
- **2026-06-07 recompile:** Folded in backup-alert quality pass (commits `779f7f6` + `b82c010`, both on main). Updated Backup Integration capability section: added FU1/FU2 alert quality pass detail (false backup_failed 15->2; summarize_backup_error; create_or_update_alert refresh); documented backup_storage_low removal (structurally false DataCopied/TotalData signal; 5->0 false alerts; resolve_all_backup_storage_alerts); confirmed BACKUP_STALE evaluator correct (no new code); added key functions list and MSP360 PlanType exclusion map. Updated Repo Structure to include db/mspbackups.rs and mspbackups/ key functions. Updated Current Focus MSP360 line and added /backup-status endpoint shape gap. Updated Summary date and added backup-alert quality pass note. Active State date note updated. Added 2026-06-07 History row. Patterns and History existing rows preserved verbatim.
|
||||
- **2026-06-07 recompile:** Updated for credential inheritance production deployment (server v0.3.45), clickable alert badges with client filtering, and SPEC-028 offboarding wizard specification. Added Recent Work section documenting 2026-06-07 session accomplishments. Updated Current Focus to reflect credential inheritance as deployed and offboarding wizard as spec-complete/implementation-pending. Updated Dashboard status to include credentials management with inheritance. Updated version numbers throughout (server 0.3.37 → 0.3.45). Added session-logs/2026-06-07-mike-gururmm-offboarding-spec.md to sources. Updated History Highlights with 2026-06-07 entry.
|
||||
- **2026-06-07 recompile (second session):** Folded in role-aware offline alerting (server/src/alerts/offline.rs: offline_sweep + agent_role classifier + mass_offline site/fleet detection; migration 054 agents.role_override; dashboard triage + role-override control on agent detail; warm-up restart guard; four commits f1cdf5d/30e4f23/21d63bd/3eedf91) and alert ignore/mute perma-silence (alert_mutes table, muted status, is_dedup_muted/mute_condition/unmute_condition, two-choke-point gate at create_or_update_alert + create_check_alert bypass, mute/unmute API; migration 055; dashboard pending; two commits 29c405e/a120e71). Added MSP360 API scope finding (monitoring-only tier confirmed by live probe; management paths 404). Updated Alerting & Watchdog section with offline alerting and mute detail including new alert types (agent_offline, mass_offline_site, mass_offline_fleet) and new status (muted). Updated Repo Structure (alerts/ directory; db/alerts.rs key functions; dashboard/ entry with agentRole.ts). Updated Development / Current Focus with alert mute dashboard (Task 5 not started), offline classification gap, and MSP360 API scope item. Added alert mute UI to Dashboard incomplete list. Added second 2026-06-07 History row. Updated migration count to 55+ (054/055 confirmed). Added session log source. Patterns section and all existing History rows preserved verbatim.
|
||||
- **2026-06-07 recompile (GURU-BEAST-ROG/discord-bot):** Folded in UI gap batch + enrollment audit session. Added new Recent Work entry (2026-06-07, UI gaps + enrollment). Added Enrollment Detail subsection under Identity / Multi-tenancy documenting `enrolled_agents` table schema, WS auth modes, dual-layer revoke design, and enrollment audit endpoint. Updated `get_agent` capability (now `AgentWithDetails` with client JOIN). Documented six new/updated endpoints (bsod-events, version-history, install-reports, discovery/all-devices, dual-revoke key, enrollment audit + revoke). Updated Dashboard complete list: added Crashes tab, version history, fleet stats, SiteDetail Revoke Key + Enrollment tab, InstallReports + Discovery pages. Updated Dashboard incomplete list: removed enrollment management (now complete); updated Tunnel status (server skeleton confirmed "not yet implemented"); added documentation gap. Updated Current Focus: BSOD Phase 2 partially complete (Crashes tab shipped; Alerts stream + dump upload still deferred); added Tunnel deferred detail, SSH from BEAST broken, dead code cleanup, enrollment key display gap. Added new 2026-06-07 History row (UI gap batch + enrollment audit). Added session log to sources. Updated compiled_by to GURU-BEAST-ROG/discord-bot. History and Patterns sections preserved verbatim.
|
||||
- macOS build status: Phase 1 was deployed manually from Mikes-MacBook-Air (2026-05-12). `build-mac.sh` is a stub as of 2026-05-24. [unverified whether automated pipeline includes macOS]
|
||||
- Pre-commit hook on 172.16.3.30 lacks execute bit (noted 2026-05-23). [unverified — may still be unfixed on new box]
|
||||
- Auto-update reliability fix for BB-SERVER and RECEPTIONIST-PC was incomplete at 2026-05-24. [now superseded by comms-durability Phase 1 — heartbeat re-offer path should resolve this]
|
||||
- **2026-06-02 recompile:** Folded in BSOD detection, server build webhook wiring, build channel default beta, versions updated. Migration count updated 46 -> 48.
|
||||
- **2026-06-04 recompile:** Corrected GURU-5070 channel state. Stable fleet pinned at 0.6.47. BUG-020 documented.
|
||||
- **2026-06-07 recompile:** Folded in backup-alert quality pass, credential inheritance, offline alerting + mute, UI gap batch + enrollment audit. Updated migration count to 55+ (054/055 confirmed).
|
||||
- **2026-06-11 recompile (GURU-5070/claude-main):** Full recompile. Added: (1) Physical server migration (Ubuntu 26.04, PG 18, binary at /opt/gururmm, old VM .46 rollback anchor). (2) Durable agent identity (ghost root cause, spec, Phase 1 Task 1 durable device_id, channel pin fix). (3) Agent comms durability Phase 1 full detail (spec, slices A/B/C, deployment, fleet rollout, canary verification). (4) New capabilities: Audit Log (migration 056), Systemic Log Feedback Intelligence (migration 057), comms durability architecture. (5) Updated fleet size (~215 enrolled, 168-182 online). (6) Updated agent/server versions to 0.6.63/0.3.68. (7) Build pipeline: server IS auto-deployed by webhook (correction to earlier assumption); parallel build prototype on Pluto (3.51x, deferred integration). (8) Channel/promotion model documented in detail. (9) Downloads layout documented. (10) Updated server binary path + SSH details for physical box. (11) Added new anti-patterns (installer `&` operator, reaper false-fail, agent-level channel pin). Added ADR-11 (durability-first command delivery). Migration count updated to 59. Patterns/History preserved verbatim except new entries added.
|
||||
|
||||
## Backlinks
|
||||
|
||||
- [[clients/cascades-tucson]] — RECEPTIONIST-PC enrolled (site CascadesTucson)
|
||||
- [[systems/gururmm-build]] — Linux VM at 172.16.3.30 on Jupiter; GuruRMM API 3001, ClaudeTools API 8001, Coord API, MariaDB, PostgreSQL, build pipeline; originally a container on Jupiter, migrated to own VM
|
||||
- [[systems/jupiter]] — Unraid host at 172.16.3.20; virsh host for all VMs (GuruRMM VM, Unifi, OwnCloud, Pluto/Claude-Builder); Docker: Gitea port 3000, NPM, Seafile; iptables PREROUTING routes :443 to NPM (NPM proxy `rmm-api -> 172.16.3.20:3001` in credentials.md is STALE — actual GuruRMM API is on 172.16.3.30)
|
||||
- [[systems/gururmm-build]] — Physical server at 172.16.3.30 on Jupiter subnet; GuruRMM API 3001, ClaudeTools API 8001, Coord API, MariaDB, PostgreSQL, build pipeline; migrated from VM to physical box 2026-06-11
|
||||
- [[systems/jupiter]] — Unraid host at 172.16.3.20; virsh host for all VMs (Pluto/Claude-Builder, Unifi, OwnCloud); Docker: Gitea port 3000, NPM, Seafile; NPM is the public TLS terminator for rmm-api.azcomputerguru.com (forwards to .30:80)
|
||||
- [[systems/pluto]] — Windows build server (MSI, WiX) at 172.16.3.36
|
||||
|
||||
Reference in New Issue
Block a user