diff --git a/wiki/index.md b/wiki/index.md index 7af9144c..97d81533 100644 --- a/wiki/index.md +++ b/wiki/index.md @@ -1,6 +1,6 @@ # Wiki Index -Last updated: 2026-06-21 +Last updated: 2026-06-22 Compiled by: HOWARD-HOME/claude-main This wiki is LLM-maintained. Do not edit articles manually — run `/wiki-compile` to update. @@ -67,7 +67,7 @@ Run `/wiki-lint` to check for stale entries and broken backlinks. | Article | Summary | Last Compiled | |---|---|---| -| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; **agent v0.6.63 stable / server v0.3.68**; ~215 enrolled (~168-182 online); 59 migrations. **Agent-comms-durability Phase 1 shipped 2026-06-11** — fixes commands black-holing at NAT'd sites (UDR Ultra): agent `CommandAck` on receipt + dedup, server reaper RE-DELIVERS un-acked commands instead of false-failing (migrations 058 acked_at/059 delivery_attempts), heartbeat re-offer; verified on PST-SERVER (16 ms ACK), 168 agents converged in ~3 min. Build/release: webhook now builds AND auto-deploys the server (build-server.sh); new builds tag **beta**, stable promotion is deliberate via `POST /api/updates/rollouts/:version/promote`. Prior: backup-alert quality pass; hierarchical credential inheritance; role-aware + mass-offline alerting (migration 054); alert mute (055); audit log (056); log-feedback (057); tray BUG-020 fix; durable-agent-identity (ghost) spec. Active development | 2026-06-11 | +| [GuruRMM](projects/gururmm.md) | RMM platform, Rust/Axum server + React dashboard + cross-platform agent; **agent v0.6.67**; ~270 enrolled (~178 online); 60 migrations on main (061-063 on pending branches). **Two-wave Windows build** — stable (modern x86_64) + legacy 1.77 wave (l_amd64/l_x86 for Win7/Server 2008 R2). Merging to main = build+deploy; artifacts hit **beta** first, `promote-dashboard.sh --confirm` for prod. Recent (2026-06-21/22): BUG-021 legacy-build dep-pin fix (`1dce66d`, Windows build green); BUG-018 reliable delete (202 + background purge, `cea87d4`); Event Log Watch management UI shipped (`0fa65f5`); BUG-022 orphaned WatchdogEvent WS dead-code removed (PR #45); SPEC-021 logged-in-user domain; MSP360 deep-link; enrollment-modal UX; event-log-watch policy-clobber HIGH fix; sync.sh submodule-clobber root-cause fix. Watchdog reports via REST `watchdog-alert` only (no WS). Prior: agent-comms-durability Phase 1 (2026-06-11), credential inheritance, alert mute/audit/log-feedback (055-057). Active development | 2026-06-22 | | [GuruConnect](projects/guruconnect.md) | ACG's proprietary Rust remote-access/remote-support tool (ScreenConnect-class) — Windows agent + web dashboard; **v0.3.0 production** at connect.azcomputerguru.com; versioned integration contract with GuruRMM; co-located on the physical .30 box (shares the PG cluster); next: SPEC-018 session broker / capture worker as SYSTEM | 2026-06-12 | | [Dataforth DOS — Test Datasheet Pipeline](projects/dataforth-dos.md) | DOS update system + TestDataDB pipeline (Node.js, PostgreSQL, Hoffman API); 469K records, 458.5K live on website; 2025 crypto attack recovery; security incident 2026-03-27; SCMVAS/SCMHVAS extension; email notifications via Graph API | 2026-05-24 | | [ClaudeTools Discord Bot](projects/discord-bot.md) | Claude Agent SDK bot in Discord; one persistent session per thread; Phase 1.5 complete (native tools, no hand-written tools); Phases 2-4 (API integration, remediation, UX) pending; runs as NSSM service on BEAST | 2026-05-24 | diff --git a/wiki/projects/gururmm.md b/wiki/projects/gururmm.md index 6177f82b..609bc3ec 100644 --- a/wiki/projects/gururmm.md +++ b/wiki/projects/gururmm.md @@ -2,32 +2,44 @@ type: project name: gururmm display_name: GuruRMM -last_compiled: 2026-06-11 -compiled_by: GURU-5070/claude-main +last_compiled: 2026-06-22 +compiled_by: Howard-Home/claude-main aliases: - guru-rmm sources: - "gururmm@main: server/src/api/*.rs (REST API surface, ~37 route modules)" - "gururmm@main: agent/src/ (agent capabilities; transport/CommandContext, ohw.rs, watchdog/wts.rs, bsod.rs, device_id.rs)" - - "gururmm@main: server/migrations/*.sql (59 migrations — feature checkpoints through 059_command_delivery_attempts)" + - "gururmm@main: server/migrations/*.sql (60 migrations — feature checkpoints through 060_alert_mutes_agent_id_index)" - "gururmm@main: server/migrations/048_bsod_events.sql" - "gururmm@main: server/migrations/056_audit_log.sql" - "gururmm@main: server/migrations/057_log_signatures.sql" - "gururmm@main: server/migrations/058_command_acked_at.sql" - "gururmm@main: server/migrations/059_command_delivery_attempts.sql" + - "gururmm@main: server/migrations/060_alert_mutes_agent_id_index.sql" - "gururmm@main: agent/src/bsod.rs" - "gururmm@main: server/src/api/updates.rs (promote/rollback endpoints)" + - "gururmm@main: server/src/api/mod.rs (route surface, 2026-06-22)" - "gururmm@main: deploy/build-pipeline/webhook-handler.py" - "gururmm@main: deploy/build-pipeline/build-server.sh" + - "gururmm@main: docs/FEATURE_ROADMAP.md (BUG-018..022 + SPEC-021 status)" + - "gururmm@main: docs/BUILD.md (build + pre-merge verification guide)" + - "gururmm@main: git log origin/main -40 (2026-06-22)" + - "gururmm@main: commit 1dce66d (BUG-021 dep-pin)" + - "gururmm@main: commit cea87d4 (BUG-018 202+bg)" + - "gururmm@main: commit 0fa65f5 (Event Log Watch UI)" + - "gururmm@main: commit 3de9faf (command_type cmd alias + NAK)" + - "gururmm@main: commit 4027c86 (enrollment modal UX)" + - "gururmm@main: commit f0a4b7f (BSOD dedup key)" + - "gururmm@main: commit 95ef901 (MSI EXDEV fix)" - "gururmm@main: commit 137dd85 (BUG-020 tray fix)" - - "gururmm@main: docs/FEATURE_ROADMAP.md, docs/specs/" - - "gururmm@main: git log feat/perf history (changelogs incomplete past v0.6.22)" + - "gururmm@main: docs/specs/, docs/FEATURE_ROADMAP.md, docs/BUILD.md" - projects/msp-tools/guru-rmm/CONTEXT.md - projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md - projects/msp-tools/guru-rmm/docs/UI_GAPS.md - projects/msp-tools/guru-rmm/docs/ARCHITECTURE_DECISIONS.md - projects/msp-tools/guru-rmm/docs/tech-stack.md - projects/msp-tools/guru-rmm/docs/DESIGN.md + - projects/msp-tools/guru-rmm/docs/BUILD.md - projects/msp-tools/guru-rmm/specs/agent-comms-durability/shape.md - projects/msp-tools/guru-rmm/specs/agent-comms-durability/plan.md - projects/msp-tools/guru-rmm/specs/agent-comms-durability/references.md @@ -40,6 +52,8 @@ sources: - .claude/memory/project_mac_gururmm_setup_pending.md - .claude/memory/feedback_gururmm_build_channel_default.md - .claude/memory/reference_gururmm.md + - .claude/memory/feedback_gururmm_build_verification.md + - .claude/memory/rmm-dashboard-beta-before-main.md - credentials.md - session-logs/2025-12-15-session.md - session-logs/2025-12-20-session.md @@ -59,7 +73,6 @@ sources: - session-logs/2026-05-31-howard-gururmm-roadmap-and-features.md - session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md - session-logs/2026-06-07-mike-gururmm-offboarding-spec.md - - "live GuruRMM Postgres query 2026-06-04: agents/sites/update_rollouts/agent_updates tables (channel verification)" - session-logs/2026-06-07-mike-gururmm-backup-alert-cleanup.md - session-logs/2026-06-07-mike-gururmm-offline-alerting-mute.md - session-logs/2026-06-07-mike-gururmm-ui-gaps-enrollment.md @@ -68,6 +81,14 @@ sources: - projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-promotion-ghost-identity.md - projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability.md - projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-11-mike-rmm-comms-durability-phase1.md + - projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-12-mike-beast-parallel-windows-build.md + - projects/msp-tools/guru-rmm/session-logs/2026-06/2026-06-12-mike-rmm-command-type-cmd-misdiagnosis.md + - session-logs/2026-06/2026-06-15-mike-unifi-wifi-skill-and-gururmm-fixes.md + - session-logs/2026-06/2026-06-18-mike-sync-submodule-autoheal-and-thoughts-rescue.md + - session-logs/2026-06/2026-06-21-howard-gururmm-bug-018-019.md + - session-logs/2026-06/2026-06-21-howard-rmm-bug018-eventlogwatch-ui.md + - session-logs/2026-06/2026-06-21-howard-gururmm-features-audit-submodule-fix.md + - session-logs/2026-06/2026-06-21-mike-rmm-build-skill-errorlog-lint-coord-purge.md backlinks: - clients/cascades-tucson - systems/gururmm-build @@ -79,21 +100,27 @@ backlinks: ## Summary -GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 215 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints. +GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer Guru LLC for internal MSP operations and eventual productization. The server (Rust/Axum) and dashboard (React/TypeScript) are production-deployed at https://rmm.azcomputerguru.com with approximately 270 enrolled agents across multiple client sites. The agent runs on managed Windows, Linux, and macOS endpoints. -**Current version:** agent 0.6.63 (stable) / server 0.3.68 as of 2026-06-11. Fleet converged to 0.6.63 (~168-182 typically online); ~20 stragglers on older versions auto-update on reconnect. Changelogs are stale (stop at agent v0.6.22) — migrations + commit log are the authoritative feature record. +**Current version:** agent 0.6.67 (stable) / server v0.3.74+ as of 2026-06-22. Fleet at ~270 enrolled / ~178 typically online; all metrics flowing, alerts firing with 0 legacy null dedup_keys, Windows build green at v0.6.67 (marker `1dce66d` on origin/main). -**Agent-comms-durability Phase 1 shipped 2026-06-11:** Commands black-holed at NAT'd sites (e.g., UDR Ultra) no longer falsely fail. Agent sends a `CommandAck` on receipt; server reaper re-delivers un-acked commands (never fails them); pending commands re-offered on every heartbeat (rides the warm conntrack). Verified: PST-SERVER (Peaceful Spirit, behind UDR Ultra) returned a command with `acked_at` stamped and `ack_latency 16 ms`. Fleet of 168 agents converged to 0.6.63 in ~3 min. Migrations 058 (`acked_at`) + 059 (`delivery_attempts`) applied on server restart. See Architecture section for full detail. +**BUG-021 (FIXED, commit `1dce66d`, 2026-06-22):** The legacy wave (Rust 1.77 for Win7/Server 2008 R2) was resolving deps fresh and pulling `edition2024` crates. Fixed by pinning `getrandom 0.3.1` + `zeroize 1.8.1` below edition2024 in agent Cargo.toml. NOT a toolchain bump — bumping would drop legacy support. Windows build went green at 02:19, v0.6.67. -**Physical server migration completed 2026-06-11:** The kitchen-sink Ubuntu VM at 172.16.3.30 was retired. A physical box took the same IP (`172.16.3.30`) running Ubuntu 26.04 + PostgreSQL 18. Production back online within 27 min (maintenance window). Old VM parked at 172.16.3.46 with data pristine as rollback anchor. Server binary moved from `/usr/local/bin/gururmm-server` (old VM) to `/opt/gururmm/gururmm-server` (new box). See also: [[gururmm-build]]. +**BUG-018 (FIXED, commit `cea87d4`):** `DELETE /api/agents/:id` now returns 202 Accepted, marks the agent `deleting` (synchronous existence-check), disconnects the live WS, and purges all cascade rows in a background tokio task. Fleet listings and stats exclude `'deleting'` rows; the agent disappears from the UI immediately. -**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide. `backup_storage_low` alert type removed entirely. See Backup Integration section. +**Event Log Watch management UI (SHIPPED, commit `0fa65f5`):** `/event-log-watches` CRUD is wired into the dashboard — list, create, edit, delete, and enable-toggle watches for global and per-agent scope via a new `pages/EventLogWatches.tsx` in the FunctionRail + omnibox. -**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07:** Server-only `agent_offline` alerts; site/fleet mass-offline aggregation; `alert_mutes` perma-silence (dashboard UI still pending). +**Agent-comms-durability Phase 1 shipped 2026-06-11:** Commands black-holed at NAT'd sites no longer falsely fail. See Architecture section for full detail. + +**Physical server migration completed 2026-06-11:** New physical box at 172.16.3.30 running Ubuntu 26.04 + PostgreSQL 18. Old VM at .46 decommissioned 2026-06-12. + +**Backup-alert quality pass shipped 2026-06-07:** False `backup_failed` alerts reduced 15 -> 2 fleet-wide. + +**Role-aware offline alerting + alert ignore/mute shipped 2026-06-07:** Server-only `agent_offline` alerts; site/fleet mass-offline aggregation; `alert_mutes` perma-silence. **See also:** `wiki/projects/guru-rmm.md` is a redirect tombstone pointing here (slug disambiguation). -**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `D:\claudetools\projects\msp-tools\guru-rmm` is a git submodule tracking the active repo. Development happens in the submodule working tree; commits/pushes go to Gitea from there (GURU-5070 cannot push to gururmm Gitea — use the server at 172.16.3.30 or 172.16.3.47). +**Repo:** `azcomputerguru/gururmm` on Gitea (internal: http://172.16.3.20:3000). The copy at `projects/msp-tools/guru-rmm` is a git submodule tracking the active repo; its pinned gitlink intentionally lags origin/main. Development uses `git -C projects/msp-tools/guru-rmm` or worktrees off `origin/main` to access live code. Commits/pushes go to Gitea from the server (.30) or via `GIT_ASKPASS` with the vaulted Gitea API token. Howard is authorized for merges/deploys (cleared 2026-06-21 by Mike). **Goal:** Full-featured MSP platform rivaling commercial RMMs, with a companion PSA (GuruPSA, separate future repo) designed as a truly integrated unified system. @@ -101,6 +128,67 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G ## Recent Work +### 2026-06-22 — BUG-021 Fixed + v0.6.67 Stable; Fleet Verified (Howard-Home) + +- **BUG-021:** Legacy build (Rust 1.77 via `$CARGO +1.77 --features legacy`) was pulling `wit-bindgen` (edition2024) through fresh dep resolution. Fixed by pinning `getrandom 0.3.1` + `zeroize 1.8.1` in agent Cargo.toml below edition2024. Commit `1dce66d` (current origin/main HEAD). Windows build went green at 2026-06-22 02:19, v0.6.67. +- **Fleet verified (same session):** ~270 agents / ~178 online; metrics rows flowing (~2531/15 min); all alerts have non-null dedup_keys (0 legacy nulls); per-agent endpoints all HTTP 200 on a test agent. +- **BUG-022 filed + fixed (PR #45):** `WatchdogEvent` WS variant was dead code — defined + server-handled but the watchdog process has no WS connection and never produced it. Removed the orphaned path (variant, payload, `insert_watchdog_event`, match arm, `watchdog_events::*` re-export). Server-compiled clean (48.9s). Empty `watchdog_events` table left in place (no DROP migration) to avoid collision with in-flight PR migration numbers. PR #45 (`fix/bug-022-watchdog-event-deadcode`, `4eb5054`) + PR #46 (docs) await Howard/Mike merge. Merging #45 = fleet build+deploy. +- **Roadmap verified:** BUG-018 (`cea87d4`) and Event Log Watch UI (`0fa65f5`) confirmed on main; FEATURE_ROADMAP updated to reflect Fixed status. + +--- + +### 2026-06-21 — Feature Batch, Audit, BUG-018 202+bg, Event Log Watch UI, Submodule Fix (Howard-Home + Mike GURU-5070) + +**Howard-Home sessions (three spans):** + +- **BUG-019 FIXED (commit `66a7f4e`, merged to main by Mike):** Container agent no longer performs in-place binary self-update (silent downgrade on container recreate / rollback-artifact log noise). `agent/src/updater/mod.rs` `perform_update()` now early-returns inside a container (`is_docker_container()`), reporting `UpdateStatus::Failed` with a clear message. Structured `update_method=image` deferred to SPEC-023. Verified on Linux build (v0.6.67 beta, 0 errors, 54s). +- **BUG-018 FIXED (commit `cea87d4`, merged):** Handler redesign to 202 + background purge (see Summary). Migration 061 (FK indexes on five previously-unindexed FK columns, speeds the bg purge) on PR #41, pending merge. The root cause of the original HTTP 000 was cascade-delete volume (millions of metrics/logs rows), not missing indexes — the HIGH-volume tables were already indexed on `agent_id`. +- **Event Log Watch management UI (SHIPPED, `0fa65f5`):** `/event-log-watches` CRUD page (list, create, edit, delete, enable-toggle, global + per-agent scope). Built as a standalone page (not an AgentDetail panel) to avoid concurrently-edited AgentDetail. `tsc -b + vite build` clean. Merged + now on origin/main. +- **Enrollment modal UX fix (SHIPPED to prod, `4027c86`):** Enrollment modals (EnrollmentModal + EnrollAgentModal) now have a top-right close button, click-outside/Esc dismissal, and trigger a list refresh on close. Built from a branch worktree -> beta -> tested on rmm-beta -> promoted to prod with `promote-dashboard.sh --confirm`. +- **SPEC-021 + BUG-020 follow-up (PR #40, mig 063, `9171f84`):** Logged-in-user domain + account-type detection (agent metrics + server + dashboard). BUG-020 watchdog tray-teardown wiring (calling `terminate_all` from policy-disable/uninstall path). Migration renumbered 060 -> 063 to clear collision with BUG-019's 060. +- **MSP360 "Open in MSP360" deep-link (PR #42, mig 062):** Button on backup-tab plan card deep-linking to the MSP360 console for a specific agent's backup plan. Backend: `backup_provider_console_id` field (migration 062) + `provider-links.ts` mapping. PR awaits merge. +- **Event Log Watch policy-clobber HIGH fix (PR #43, `432b434`):** Assigning or unassigning any policy silently wiped the connected agent's Event Log Watch monitoring. Third instance of a config-push clobber class already fixed at two other sites. Fixed by reusing `watch_rules_for_agent` (retargeted from `&AppState` to `&PgPool`) — single source of truth across all three push sites. No migration. PR awaits merge. +- **Audit cleanup (PR #44):** Low findings from the 2026-06-21 `/rmm-audit` pass (DoS `cap_field` batch, SSE revocation bypass + JWT-in-URL, `console.log` stubs, `ContextTree.tsx` missing `isError`). Awaiting merge. +- **sync.sh submodule-clobber root-cause fix:** `sync.sh` line 339 `git submodule update --init` ran unconditionally on every sync, re-checking out the intentionally-lagging pinned gitlink in detached HEAD and discarding live work. Fixed with a `git -C "$ppath" rev-parse --git-dir` populate-only guard — only initializes genuinely-missing submodules. Committed to claudetools main; verified guru-rmm now stays on `origin/main @ ed8cad3` through a full sync. +- **Dashboard beta-before-main rule established:** Dashboard changes go to beta first, before main. Beta auto-builds from `origin/main` via the webhook; a feature branch is previewed with a manual git worktree build + rsync to `/var/www/gururmm/dashboard-beta`. Promotion to prod = `promote-dashboard.sh --confirm`. (Saved as memory `rmm-dashboard-beta-before-main`.) + +**Mike GURU-5070 session:** + +- **BUG-019 merged to main** (fast-forward `ed8cad3 -> 66a7f4e`; CI bump `8b5e0dc`). Linux build confirmed green (v0.6.67, 54s, 0 errors). +- **`gururmm-build` skill created:** `bash .claude/skills/gururmm-build/scripts/verify.sh server|agent|dashboard|migrations` for local pre-merge verification. Does NOT trigger the prod build (webhook-on-merge is the trigger). `docs/BUILD.md` added to guru-rmm repo (build guide). Memory `feedback_gururmm_build_verification` captures the key rules. +- **Howard cleared for GuruRMM merges/deploys** (Mike decision). Mike still owns architecture/direction; prepared + verified branches no longer bottleneck on Mike to land. + +--- + +### 2026-06-15 — BSOD Dedup Key Fix, MSI EXDEV Fix, Dashboard Offline Dedup, sync.sh Auto-Heal (Mike GURU-5070 + GURU-BEAST-ROG) + +- **BSOD alert dedup key fixed (commit `f0a4b7f`, server v0.3.73):** Dedup key changed from `bsod::` (unique per crash, so recurring BSODs spawned new alert rows and bypassed per-dump mutes) to `bsod::` (stable across recurrences). Now a single perma-mute silences all future recurrences of the same bugcheck. The live duplicate alerts on MSI were resolved via psql, and a stable-key mute was inserted for the recurring `0x116` VIDEO_TDR_FAILURE on agent `a685af29`. +- **MSI EXDEV fix (commit `95ef901`, server v0.3.74):** Site-MSI builder was staging the signed MSI in `std::env::temp_dir()` (`/tmp`, a tmpfs) then `rename`-ing it to `/opt/gururmm/downloads` (root LV) — a cross-device rename that fails with `EXDEV` (os error 18). Every site-specific MSI install request returned 500. Fixed by staging in `downloads_dir` (same filesystem as the final path), matching the pattern the signed-EXE path already used. +- **Duplicate offline server alerts removed (commit `b6ed564`):** Dashboard Triage view was showing server agents twice in offline status — fixed. +- **sync.sh auto-heal (Phase 1, 2026-06-15):** `resolve_submodule_collisions()` helper added to `sync.sh`. On `git submodule update` abort (untracked files would be overwritten), it enumerates the precise colliding paths (intersection of untracked files and paths tracked by the target gitlink commit), moves only those aside to `.synced-aside-`, then retries once. Preserved 94 orphaned RMM_THOUGHTS lines (2026-06-08 re-grounding pass) that a blind `--force` would have discarded. Phase 2 (populate-only guard) added 2026-06-21. +- **`logs/analyze` switched from Ollama to Claude API (commit `c869e4d`):** Log analysis endpoint now calls the Claude API instead of Ollama-on-Beast. +- **500-error-leak fix (commit `58c1a96`):** Closed the remaining 17 sites that returned raw DB errors in 500 bodies, routing them through `internal_err`. No new DB detail is exposed to callers. + +--- + +### 2026-06-12 — Beast Parallel Windows Build (Lever A) + command_type "cmd" Misdiagnosis Fix (Mike GURU-5070) + +**Beast parallel Windows build (lever A, commit `cfbdb59`):** +- Rewrote `run_remote_build()` in `build-windows.sh` to drive concurrent SSH sessions from bash instead of a serial `cmd /c` chain. Two waves: WAVE 1 = 5 concurrent stable builds (amd64, x86, debug, tray, cleanup); WAVE 2 = 2 concurrent legacy-1.77 builds (l_amd64, l_x86). MSI build overlaps Wave 2. Per-job exit-code aggregation; Pluto fallback preserved. +- **Result:** Beast parallel+warm ~336s (cargo 319s); ~46% faster than Beast serial cold (~622s); ~3.8x faster than Pluto serial (~1269s). +- **Key fix:** Per-variant `--target-dir` is mandatory. Both stable variants sharing `target/` serialized on cargo's build-directory lock. Fixed: amd64=`target/release`, x86=`target/x86`, legacy-amd64=`target/legacy-x64`, legacy-x86=`target/legacy-x86`. +- **Dropped `cargo +1.77 fetch` pre-resolve entirely.** It resolved the full dep graph and pulled `wit-bindgen` (edition2024, unsupported by Cargo 1.77.2) — rc=101 on both hosts. The legacy builds scope deps via `--features legacy` (excludes wit-bindgen) when run as actual builds; the pre-fetch was both unnecessary and the failure point. +- **Live verification:** v0.6.64 (first live Beast build, 622s cold), v0.6.66 (parallel fix, 336s). All 8 artifacts signed OK. + +**command_type "cmd" fix (commit `3de9faf`, agent v0.6.66 stable):** +- Root cause of "PST agents can't receive commands": `CommandType` enum had no `cmd` variant. Server-sent `command_type:"cmd"` failed serde parse; agent logged-and-silently-dropped the message (no ACK, no result) — indistinguishable from a NAT black-hole. 7-round adversarial multi-AI quorum + packet captures were used to isolate the network path before the discriminator was found: the one working command used `powershell`, all failures used `cmd`. +- Fix 1: `#[serde(alias = "cmd")]` on `CommandType::Shell` — accepts `"cmd"` as an alias for Shell (runs cmd.exe, for endpoints without PowerShell). +- Fix 2: NAK on unparseable command — `handle_server_message` now sends `CommandAck` + error `CommandResult` on whole-message parse failure (if the `payload.id` is recoverable), instead of silently dropping. Fails fast rather than black-holing. +- **Reverted:** the earlier "evict non-delivering connections" server change (commit `80df458`) — its motivating premise was this `cmd` phantom; with valid commands acking in ms it only added spurious reconnect churn. +- Promoted to stable fleet-wide. `wiki/clients/peaceful-spirit.md` and `.claude/commands/rmm.md` corrected to remove the wrong "use command_type: cmd" advice. + +--- + ### 2026-06-11 — Physical Server Migration (Cutover + Backfill + Build Pipeline) **Cutover (morning ~06:53-07:20 MST):** @@ -112,8 +200,9 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G **Backfill + Build Pipeline (afternoon session):** - 7-day whale data (metrics 1,189,924 rows + agent_logs 2,262,938 rows; ~3.4 GB total) backfilled VM->new box via direct SSH `\copy` stream in ~2.5 min. Load verified lossless; SSD sustained 186-214 MB/s writes with ZERO pool-timeouts while 212 live agents wrote concurrently. - Three build-env gaps found and fixed after first push-triggered build: (1) sccache not installed (copied v0.8.2 from VM); (2) root had no Pluto SSH key (copied guru's id_ed25519 to /root/.ssh); (3) `/etc/gururmm-signing.env` + jsign + Java missing (installed Java + jsign-7.1.jar; recreated jsign wrapper). -- Parallel Windows build prototyped (7 Start-Job isolated builds, 3.51x speedup 28.9m -> 8.2m, 15 rustc vs 1). Integration into production `build-windows.sh` DEFERRED (design captured in runbook). +- Parallel Windows build prototyped (7 Start-Job isolated builds, 3.51x speedup 28.9m -> 8.2m, 15 rustc vs 1). Replaced by the bash-driven lever-A approach in the 2026-06-12 session. - First clean build on new box: signed Windows agent v0.6.61 beta. Independently verified (Authenticode, Arizona Computer Guru LLC via Azure Trusted Signing, RFC3161 timestamp, sha256 OK). +- **Old VM decommissioned 2026-06-12:** virsh domain + disk removed after stability soak; no longer exists. --- @@ -187,7 +276,7 @@ GuruRMM is a Remote Monitoring & Management platform built by Arizona Computer G ## Capabilities / Feature Set -*Synthesized from authoritative artifacts (API routes, agent modules, 59 migrations, roadmap, commit log) — not from session logs. See Compilation Notes.* +*Synthesized from authoritative artifacts (API routes, agent modules, 60 migrations through migration 060, roadmap, commit log) — not from session logs alone. See Compilation Notes.* Agent<->server communication is a persistent authenticated WebSocket with auto-reconnect + heartbeat; on reconnect, in-flight commands flip to `interrupted`. Platform-parity rule: agent features ship on Windows/Linux/macOS in the same change (stub + TODO where a real impl isn't yet feasible). @@ -196,10 +285,10 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - Hardware sensor telemetry: CPU/GPU temps + full sensor array (temperature, voltage, fan RPM, power). Linux via `/sys/class/thermal`; `sysinfo::Components` fallback. **Windows LibreHardwareMonitor removed 2026-05-27 (CVE-2020-14979); Windows thermal collection pending a safe replacement strategy.** - Process drill-down: top 10 by CPU + top 10 by memory in every metrics payload (cross-platform). - Network state: per-interface IPv4/IPv6, MAC, derived CIDR subnets — sent on change only. -- **Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01):** `agent/src/bsod.rs` polls `C:\Windows\Minidump` for new `.dmp` files (5-min filetime poll), parses kernel dump header at fixed `DUMP_HEADER64`/`DUMP_HEADER32` offsets (bugcheck code @0x38, 4 parameters, FILETIME @0xFA8 — the `minidump` crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references System event log (WER 1001 / Kernel-Power 41) for Report Id + faulting driver, deduplicates via `C:\ProgramData\GuruRMM\bsod-seen.json` watermark. Sends `AgentMessage::BsodEvent`. Server: migration 048 + `server/src/db/bsod_events.rs` + ws handler — always-Critical alert deduplicated by `(agent_id, dump_sha256)`. Verified against a real `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys). Dashboard Crashes tab shipped 2026-06-07. Phase 2/3 deferred: BSOD in Alerts stream, `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map). +- **Windows BSOD/kernel-crash detection (Phase 1, shipped 2026-06-01):** `agent/src/bsod.rs` polls `C:\Windows\Minidump` for new `.dmp` files (5-min filetime poll), parses kernel dump header at fixed `DUMP_HEADER64`/`DUMP_HEADER32` offsets (bugcheck code @0x38, 4 parameters, FILETIME @0xFA8 — the `minidump` crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps), cross-references System event log (WER 1001 / Kernel-Power 41) for Report Id + faulting driver, deduplicates via `C:\ProgramData\GuruRMM\bsod-seen.json` watermark. Sends `AgentMessage::BsodEvent`. Server: migration 048 + `server/src/db/bsod_events.rs` + ws handler — always-Critical alert deduplicated by `(agent_id, bugcheck_code)` (key changed 2026-06-15 from dump_sha256 to bugcheck_code so a mute silences all recurrences). Dashboard Crashes tab shipped 2026-06-07. Phase 2/3 deferred: BSOD in Alerts stream, `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table (Phase 1 ships a 10-code map). ### Remote Execution -- Command types: `shell`, `powershell`, `python`, raw `script{interpreter}`, and `claude_task`. Options: `timeout_seconds`, `elevated`. +- Command types: `shell`, `powershell`, `python`, raw `script{interpreter}`, and `claude_task`. Options: `timeout_seconds`, `elevated`. **Also accepts `cmd` as an alias for `shell` (added 2026-06-12, commit `3de9faf`). Unknown/unparseable commands now NAK immediately with an error result (fail-fast, not silent-drop).** - **Execution context** (`041_add_command_context`): `system` (default — Session 0 / service SYSTEM) or `user_session` (runs in the active logged-on user's desktop session via WTS token impersonation: `WTSQueryUserToken` + `CreateProcessAsUserW` + per-user env block). Windows-only; requires an active session. - Commands individually cancellable (`POST /commands/:id/cancel`) and aborted on disconnect. Auditable history; status running/completed/failed/timeout/interrupted. - Script library (`017_scripts`): stored scripts dispatched to agents with args/env/timeout/`run_as_user`; per-run history. @@ -208,7 +297,7 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - Fix: agent sends `CommandAck { command_id }` on RECEIPT (migration 058 `acked_at`). Agent deduplicates by command_id — re-reports cached result, ignores in-flight duplicate, never double-runs. Server reaper re-delivers an un-acked command past 60s ACK deadline (returns to `pending`) instead of failing it. Migration 059 adds `delivery_attempts` counter (cap 10) + partial index for capability gate. Pending commands re-offered on (re)connect AND on every heartbeat. Verified at PST-SERVER: `acked_at` stamped, `ack_latency 16 ms`, `delivery_attempts 1`. - Capability gate: reaper only re-delivers for agents that have ACK'd at least once (`idx_commands_agent_acked` partial index). Old agents keep legacy fail-on-timeout path; safe mid-rollout. - Code anchors: `server/src/db/commands.rs` (`requeue_undelivered_commands`, `fail_timed_out_commands` rewrite, `DELIVERY_ACK_DEADLINE_SECS=60`, `MAX_DELIVERY_ATTEMPTS=10`), `server/src/ws/mod.rs` (`redispatch_pending_commands`, `CommandAck` handler), `agent/src/transport/websocket.rs` + `agent/src/commands/mod.rs` (ACK send + recent-results cache, `RECENT_CAP=64`, `MAX_CACHED_RESULT_BYTES=256 KB`). - - Phase 2 (live TTY — rides warm WS, seq/resume, cadence switch) and Phase 3 (AIMD keepalive, bulk->HTTPS, half-open eviction sweeper) are PLANNED, not shipped. + - Phase 2 (live TTY — rides warm WS, seq/resume, cadence switch) and Phase 3 (Adaptive keepalive, bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper) are PLANNED, not shipped. ### Inventory & Discovery - Hardware inventory (mfr/model/serial/BIOS, CPU, memory, disks, NICs, OS), software inventory (installed apps), service inventory. On-demand refresh. @@ -222,15 +311,20 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - **Channel model:** new builds tag **beta** by default; agents default to **stable** channel; promotion to stable is a DELIBERATE re-tag of the `.channel` sidecar. `scanner.rs` `get_latest_version`: stable agents get latest STABLE-tagged binary; beta gets absolute-latest. `resolve_agent_channel` = `agent.or(site).or(client).or('stable')`. - **Promotion API:** `POST /api/updates/rollouts/:version/promote` body `{"os","arch"}` — re-tags all `.channel` files for the version, records `update_rollouts` row, force-rescans. Rollback: `POST /api/updates/rollouts/:version/rollback`. - Safe-rollout (`046`): `update_rollouts`/health-metrics/events tables + `/updates/rollouts` promote/rollback. **Health-gated automation is written-but-unwired (Phase 2); promotion is manual via API.** +- **Container guard (BUG-019, shipped `66a7f4e`, 2026-06-21):** Agent inside a container early-returns from `perform_update()` with `UpdateStatus::Failed` + clear message, skipping the in-place binary swap that caused silent downgrades on container recreate. Full image-update path (SPEC-023) is not yet implemented. +- **Legacy fleet build support (two-wave parallel build):** Agent ships both a stable-toolchain (modern x86_64 + x86, debug, tray, cleanup) wave and a legacy wave (Rust 1.77, `--features legacy`, for Win7/Server 2008 R2 endpoints). The `gururmm-build` skill wraps pre-merge verification. IMPORTANT: the legacy wave resolves deps fresh on Rust 1.77 — dep pins to avoid edition2024 pulls are required (BUG-021 lesson: pin `getrandom 0.3.1` + `zeroize 1.8.1`). ### Policy & Configuration Management - Inheritance chain global -> client -> site -> agent; server computes merged effective policy, pushes via `ConfigUpdate`. Effective policy queryable per scope. - Checks engine (`018`/`019`): cpu, memory, disk, ping, port, script, service (restart-if-stopped, pass-if-not-exist; Win `sc.exe`, Linux `systemctl`). Policy-attached check templates (`024`) with push-to-agent sync. On-demand `run-checks`. - Remote registry (Windows, `winreg`): agent supports enumerate/read/write (typed)/create/delete. **HTTP API currently exposes read-only (enumerate, read_value); write paths exist in the agent but are not routed yet [verify].** +- **Event Log Watch (migration 047):** Server-configured rules for monitoring Windows Event Log channels by event ID / level. Agent reports matching events. Dashboard management UI at `/event-log-watches` shipped 2026-06-21 (`0fa65f5`). CRUD + enable/disable for global and per-agent scope. **HIGH fix (PR #43, pending merge):** policy assign/unassign no longer wipes connected agent's Event Log Watch monitoring (fixed by reusing `watch_rules_for_agent` shaper as single source of truth across all config-push sites). ### Alerting & Watchdog - Threshold alerts (ack/resolve, per-agent + fleet summary, dashboard filter). Alert templates (`022`) with effective resolution; per-client email settings (`020`). Maintenance mode (`021`) to suppress alerting per scope. -- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff, alert after 3 fails) + launches/reaps the tray into active user sessions via WTS. +- Watchdog: **separate** supervising process (polls `GuruRMMAgent` every 30s, restart backoff) + launches/reaps the tray into active user sessions via WTS. The watchdog reports to the server via the **REST `POST /watchdog-alert` endpoint** (fires after 3 consecutive failed restarts). This is the only supported watchdog alert path. +- **[CORRECTED 2026-06-22 per BUG-022]:** The `WatchdogEvent` WebSocket variant was dead code — defined in the agent transport enum and fully handled server-side (`insert_watchdog_event`, `watchdog_events` table), but the watchdog is a **separate companion process with no WebSocket connection** and never had a producer. Granular per-restart WS events were never implemented and the path has been removed (PR #45, pending merge). The empty `watchdog_events` table remains in the DB (no DROP migration, to avoid migration-number collision). `watchdog_events = 0` all-time in a healthy fleet is correct and expected. Granular REST-based watchdog visibility is a future RMM_THOUGHTS item. +- **BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; fix #3 wiring in PR #40, pending merge):** Root cause: `TrayLauncher` tracked launches only in an in-memory `HashMap` that resets on watchdog restart; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping `NIM_DELETE` (ghost). Fix (commit `137dd85`): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW`; (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> TerminateProcess fallback. Fix #3 (wiring `terminate_all` into watchdog policy-disable/uninstall path) is in PR #40 (SPEC-021+BUG-020 branch). - **Role-aware offline alerting (shipped 2026-06-07):** Scheduled offline-sweep evaluator (60s tokio interval; `server/src/alerts/offline.rs`). Server-only `agent_offline` alerts. Classifier: `os_product_type` IN {2,3} -> server; else `os_name`/`os_version` ~/server/i -> server; else manual `role_override` (migration 054). Site rule (>=50% + >=3) -> `mass_offline_site`; fleet rule (>=10) -> `mass_offline_fleet`. Dashboard: servers elevated individually; workstations collapsed. `PUT /api/agents/:id/role-override`. Known gap: `os_product_type` populated on only ~16/168 agents; `os_name` is the workhorse classifier. - **Alert ignore/mute — perma-silence (server only, shipped 2026-06-07; dashboard pending):** `alert_mutes` table (migration 055) + `muted` alert status. Keyed on `dedup_key`. Permanent until un-ignored; reason required (400 if missing). Gate at `create_or_update_alert` AND `create_check_alert` bypass. `POST /api/alerts/:id/mute` + `/unmute`. Dashboard Ignore/Muted/Un-ignore NOT yet built. @@ -245,6 +339,7 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - Per-line provenance + deterministic fingerprint added to `agent_logs` table (nullable; new rows populate, old rows stay NULL): `agent_version`, `signature_hash BIGINT`, `normalizer_version SMALLINT`. Fingerprint computed server-side at ingest (`server/src/fingerprint.rs`). - Durable aggregate `log_signatures` table: one row per distinct normalized error signature (occurrence_count, affected_agents, sample_message, first/last_seen, normalizer_version). Long-retention (raw logs short-retention); alerting + dashboards can read this. - `log_signature_versions` table: per-signature x agent_version rollup — answers "which versions does signature X appear on?" (version regression correlation). +- **Log analysis (`logs/analyze`):** switched 2026-06-15 from Ollama-on-Beast to the Claude API (commit `c869e4d`). ### Backup Integration (MSP360 / MSPBackups) - Multi-provider config (`034`/`035`) with connection test, scheduled sync, per-agent + all-providers status, fleet coverage report, and agent<->MSP360 mapping (`044`) with confidence scoring + manual verification. Dashboard UI for mappings/verify shipped 2026-05-31. @@ -252,6 +347,7 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - **`backup_storage_low` alert type REMOVED (2026-06-07):** `DataCopied/TotalData` measures backup-dataset completeness, not destination capacity. Produced 5 fleet-wide false alerts. `resolve_all_backup_storage_alerts` clears stragglers. Genuine destination-capacity alerting deferred. - MSP360 PlanType map: 3=Files, 7=SQL, 8=Restore, 11=Image, 13=Consistency-check, 16=HyperV. Non-backup (excluded): 8, 13. - MSP360 API scope (confirmed 2026-06-07): monitoring-only tier; management paths (plan delete, manual run, storage config) return 404; plan management remains MSP360-console tasks. +- **MSP360 "Open in MSP360" deep-link (PR #42, migration 062, pending merge):** "Open in MSP360" button on the backup-tab plan card deep-linking to the MSP360 console for the specific agent's backup plan. Backend: `backup_provider_console_id` field (migration 062) + `lib/provider-links.ts` mapping. - Key functions: `derive_backup_status`, `error_is_benign`, `summarize_backup_error`, `resolve_orphaned_backup_alerts`, `resolve_all_backup_storage_alerts`, `evaluate_plan` (BACKUP_STALE). ### Remote Access (Tunnel) @@ -263,6 +359,7 @@ Agent<->server communication is a persistent authenticated WebSocket with auto-r - Organizations / multi-tenancy: org CRUD, per-org membership + roles, limits, dev-admin **user impersonation** (`/auth/impersonate/:id`). Backend + dashboard UI shipped 2026-05-31. - Enrollment & keys (`012`): per-agent key issuance on first run, site API keys (regenerable), site-specific MSI with SITEKEY injected at download, public install-report ingestion. Legacy PowerShell agent path for Server 2008 R2. - Logs: agent log upload (periodic + on-demand), per-agent events (`042`), fleet log view, AI-assisted log analysis (`/logs/analyze`) — AI-optional per locked decision. +- **500-error-leak fixed (commit `58c1a96`, 2026-06-15):** All server error paths now route through `internal_err`; raw DB error strings are no longer returned to callers. ### Enrollment Detail (migration 012 + 2026-06-07 audit) @@ -284,9 +381,11 @@ The `enrolled_agents` table is the authoritative enrollment history log: **Revocation — dual-layer (fixed 2026-06-07):** `DELETE /api/agents/:id/key` (bulk) revokes BOTH layers. `POST /api/enrolled-agents/:id/revoke` (targeted) per-enrollment with cascade. +**Enrollment modal UX (fixed prod 2026-06-21, commit `4027c86`):** "Add Devices" and "Enroll an Agent" modals now have a top-right close button, close on click-outside/Esc, and trigger a site/client list refresh on close. + ### Platform coverage - **Cross-platform (Win/Linux/macOS):** metrics, network state, hardware/software/service inventory, user/group + DC/domain detection, checks (service checks Win+Linux), discovery, self-update, scripts/commands. -- **Windows-only:** `user_session` command context (WTS), remote registry, tray-into-session, watchdog SCM supervision, BSOD detection. +- **Windows-only:** `user_session` command context (WTS), remote registry, tray-into-session, watchdog SCM supervision, BSOD detection, legacy fleet (Win7/Server 2008 R2) agent via Rust 1.77 `--features legacy`. - **macOS:** agent deployed (Phase 1); tray is a stub; automated Mac build pipeline is an intentional stub (no build host) [verify before claiming CI Mac releases]. --- @@ -299,7 +398,7 @@ The `enrolled_agents` table is the authoritative enrollment history log: |---|---|---|---| | Server | 172.16.3.30:3001, systemd `gururmm-server.service` (User=root, binary `/opt/gururmm/gururmm-server`, EnvironmentFile `/opt/gururmm/.env`) | Rust, Axum | deployed, production (physical box, Ubuntu 26.04, PG 18; migrated 2026-06-11) | | Dashboard | https://rmm.azcomputerguru.com, nginx at `/var/www/gururmm/dashboard/` | React + TypeScript + Vite, shadcn/ui, Tailwind CSS v4 | deployed, production | -| Agent (Windows) | Endpoints, installed as `GuruRMMAgent` Windows service via WiX MSI | Rust, Windows MSVC | deployed; stable fleet on 0.6.63 | +| Agent (Windows) | Endpoints, installed as `GuruRMMAgent` Windows service via WiX MSI | Rust, Windows MSVC | deployed; stable fleet on 0.6.67 | | Agent (Linux) | Endpoints, systemd `gururmm-agent`, binary `/usr/local/bin/gururmm-agent` | Rust, musl static | deployed | | Agent (macOS) | Endpoints, LaunchDaemon `com.azcomputerguru.gururmm-agent.plist` | Rust, aarch64/x86_64 | Phase 1 deployed 2026-05-12; code signing issue on Apple Silicon | | Tray (Windows) | System tray, named pipe IPC | Rust | deployed | @@ -308,14 +407,14 @@ The `enrolled_agents` table is the authoritative enrollment history log: | PostgreSQL DB | localhost:5432 on 172.16.3.30, database `gururmm` | PostgreSQL 18 | deployed (migrated from PG VM, backfilled 2026-06-11) | | Coord API | 172.16.3.30:8001/api/coord | FastAPI (part of ClaudeTools API) | deployed | | Build pipeline | 172.16.3.30:9000 webhook + `/opt/gururmm/` scripts | Python (webhook-handler.py), Bash | deployed; builds agents AND server (server auto-deploy wired 2026-06-02) | -| Windows build — **Beast (PRIMARY)** | GURU-BEAST-ROG, i9-14900K (24c/32t), RTX 4090; reached over Tailscale at tailnet IP 100.101.122.4 as user `guru` | Rust MSVC, WiX 4.x | **primary** Windows build host — `build-windows.sh` tries Beast first | -| Windows build — Pluto (FALLBACK) | 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid), Xeon E5-2695 v3 8c/16t | Rust MSVC, WiX v4 | **fallback only** — used if Beast is unreachable/down OR its build fails (`attempt_build beast \|\| attempt_build pluto`) | +| Windows build — **Beast (PRIMARY)** | GURU-BEAST-ROG, i9-14900K (24c/32t), RTX 4090; reached over Tailscale at tailnet IP 100.101.122.4 as user `guru` | Rust MSVC, WiX 4.x | **primary** Windows build host — `build-windows.sh` tries Beast first; parallel build ~336s | +| Windows build — Pluto (FALLBACK) | 172.16.3.36, Windows Server 2019 VM on Jupiter (Unraid), Xeon E5-2695 v3 8c/16t | Rust MSVC, WiX v4 | **fallback only** — used if Beast is unreachable/down OR its build fails (`attempt_build beast || attempt_build pluto`); serial ~1269s | | Old VM (decommissioned) | was 172.16.3.46 | Ubuntu (original) | DELETED 2026-06-12 after stability soak — virsh domain + disk removed; no longer exists | ### Key Files & Repos - **Active repo:** `azcomputerguru/gururmm` — http://172.16.3.20:3000/azcomputerguru/gururmm -- **Submodule working tree:** `D:\claudetools\projects\msp-tools\guru-rmm` — tracks active repo; develop here and push to Gitea (push from 172.16.3.30 or 172.16.3.47 — GURU-5070 is not authorized) +- **Submodule working tree:** `projects/msp-tools/guru-rmm` — gitlink intentionally lags origin/main; read live code with `git -C projects/msp-tools/guru-rmm show origin/main:` or worktrees off origin/main. Push via GIT_ASKPASS with vaulted Gitea API token (`services/gitea-howard.sops.yaml`). - **Server binary:** `/opt/gururmm/gururmm-server` on 172.16.3.30 (physical box; prior VM had `/usr/local/bin/gururmm-server`) - **Server config:** `/opt/gururmm/.env` (EnvironmentFile for systemd; DATABASE_URL here; root-only read) - **Source checkout on server:** `/home/guru/gururmm` (build-server.sh: `git reset --hard origin/main` -> `cargo build` -> deploy with backup+rollback) @@ -332,11 +431,13 @@ The `enrolled_agents` table is the authoritative enrollment history log: - **Build log (Windows):** `/var/log/gururmm-build-windows.log` - **Build log (Server):** `/var/log/gururmm-build-server.log` - **API (internal):** http://172.16.3.30:3001 -- **API (external):** https://rmm-api.azcomputerguru.com (Cloudflare -> NPM on Jupiter .20 -> .30:80) +- **API (external):** https://rmm-api.azcomputerguru.com (grey-cloud, no Cloudflare — agents connect here; Cloudflare only proxies the dashboard `rmm.azcomputerguru.com`) +- **Agent ingress path:** agent -> site gateway -> pfSense (172.16.0.1, DNAT wan:443 -> NPM .20:18443) -> NPM (Docker on Jupiter .20, TLS terminate) -> origin nginx (.30:80, /ws -> 127.0.0.1:3001) -> Rust server :3001 - **Dashboard:** https://rmm.azcomputerguru.com - **DB URL:** `postgres://gururmm:@localhost:5432/gururmm` (pw in `/opt/gururmm/.env`; do not hardcode here) -- **Vault paths:** `infrastructure/gururmm-server.sops.yaml` (API creds), `infrastructure/gururmm-server-physical` (SSH ed25519 key, sudo pw) -- **SSH to prod:** `ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30` (ed25519, key-only; sudo pw in vault `infrastructure/gururmm-server-physical`) +- **Vault paths:** `infrastructure/gururmm-server.sops.yaml` (API creds, DB password, sudo password), `infrastructure/gururmm-server-physical` (SSH ed25519 key) +- **SSH to prod:** `ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30` (ed25519, key-only; sudo pw in vault `infrastructure/gururmm-server.sops.yaml` `credentials.password`) +- **Pre-merge verification skill:** `bash .claude/skills/gururmm-build/scripts/verify.sh server|agent|dashboard|migrations` ### Repo Structure @@ -349,29 +450,32 @@ gururmm/ │ ├── device_id.rs Durable multi-location device_id (Phase 1 Task 1) │ ├── ipc.rs Unix socket IPC (Linux); Windows named pipe │ ├── transport/ -│ │ ├── mod.rs AgentMessage enum (incl. CommandAck) -│ │ └── websocket.rs WS transport; execute_command ACK + dedup +│ │ ├── mod.rs AgentMessage enum (incl. CommandAck; WatchdogEvent REMOVED) +│ │ └── websocket.rs WS transport; execute_command ACK + dedup; cmd-alias NAK │ ├── tunnel/ TunnelManager state machine │ ├── metrics/ sysinfo collection + temps │ ├── registry_ops/ Windows registry read/write -│ ├── updater/ Self-update handler +│ ├── updater/ Self-update handler (container guard added BUG-019) │ └── main.rs systemd unit template generation ├── server/ Rust/Axum API server │ └── src/ │ ├── api/ REST handlers (~37 route modules; updates.rs: promote/rollback) │ ├── alerts/ Alerting modules (offline.rs: offline_sweep + mass_offline) │ ├── db/ Database layer (sqlx); commands.rs (requeue_undelivered, fail_timed_out rewrite) +│ │ watchdog_events.rs (doc-only stub — table exists, path removed per BUG-022 PR #45) │ ├── fingerprint.rs Log signature normalization + hash │ ├── ws/ WebSocket handler (CommandAck, redispatch_pending_commands) │ └── mspbackups/ MSP360 backup integration ├── dashboard/ React/TypeScript UI +│ └── src/pages/ +│ └── EventLogWatches.tsx CRUD management UI (shipped 0fa65f5) ├── tray/ System tray binary -├── installer/ WiX v4 MSI (gururmm-agent.wxs); cleanup.ps1 (now preserves device_id) +├── installer/ WiX v4 MSI (gururmm-agent.wxs); cleanup.ps1 (preserves device_id) ├── deploy/ │ └── build-pipeline/ webhook-handler.py, build-*.sh, build-server.sh ├── scripts/ Build/ops scripts └── docs/ FEATURE_ROADMAP.md, UI_GAPS.md, ARCHITECTURE_DECISIONS.md, tech-stack.md, - DESIGN.md, specs/, HOST_MIGRATION_RUNBOOK.md + DESIGN.md, BUILD.md, specs/, HOST_MIGRATION_RUNBOOK.md, RMM_THOUGHTS.md ``` --- @@ -380,25 +484,28 @@ gururmm/ ### Current Focus -As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68): +As of 2026-06-22 (agent 0.6.67 stable / server v0.3.74+): -- **Agent-comms-durability Phase 1 SHIPPED.** Remaining Phase 1 items: (a) `is_connected` reads null fleet-wide — cosmetic dashboard bug; gap is in API DTO/dashboard layer, not ws/mod.rs; (b) keepalive AIMD shortening (Task 4; deferred as optimization — durability nets are the actual fix). -- **Phase 2 (live TTY, planned):** Extend WS message enum with TTY stream frames (stdin/stdout/stderr, resize, seq). "Activate" cadence switch (agent goes HOT on next heartbeat after tech opens live window). Resume from seq on mid-session drop. Single-use time-bounded session token; max one interactive session per agent; full audit. Estimated 3-5 days separate effort. -- **Phase 3 (planned):** Adaptive keepalive (AIMD, persist interval), bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper (`last_inbound` track + evict >~90s; treat missed Pong as close). -- **Durable agent identity Phase 1 Tasks 2-3 (pending):** Task 2 = hardware-fingerprint capture (agent `inventory.rs` baseboard serial + primary MAC; server migration `agents.hardware_fingerprint` + `hardware_signals JSONB` + `legacy_device_id`). Task 3 = dashboard "probable duplicate" surfacing (read-only). Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for the 9 existing ghosts) pending soak. -- **Confirm Windows v0.6.62 build compiles + deploys** (validates new registry cfg(windows) code from durable-identity Phase 1 Task 1). If it failed, fix and re-push. -- **Host migration soak:** drop `.47` (new box) + `.46` (VM) management IPs after stability soak; decommission old VM at `.46`. GURU-5070 Ethernet repair. +- **BUG-022 (PR #45, pending merge):** Remove dead WatchdogEvent WS path. Merging PR #45 (`fix/bug-022-watchdog-event-deadcode`) = fleet build+deploy. Empty `watchdog_events` table stays until a future consolidated cleanup migration. +- **Open PRs awaiting merge (migration order matters — merge in order: 060 already on main; 061 -> 062 -> 063):** + - PR #40 SPEC-021 + BUG-020 (migration 063, renumbered from 060): logged-in-user domain/account-type detection + watchdog tray-teardown wiring. Also needs Pluto-signed-MSI agent build + fleet rollout. + - PR #41 BUG-018 FK indexes (migration 061): FK indexes on five previously-unindexed cascade children (speeds the BUG-018 background purge). BUG-018 handler itself is already merged (`cea87d4`). + - PR #42 MSP360 deep-link (migration 062): "Open in MSP360" button on backup tab plan card. + - PR #43 Event Log Watch policy-clobber HIGH fix: no migration; merge anytime. + - PR #44 Audit cleanup (low findings from 2026-06-21 audit). + - PR #45 BUG-022 watchdog dead code. + - PR #46 docs (BUG-021/022 roadmap status updates + RMM_THOUGHTS). +- **Agent-comms-durability Phase 2 (live TTY, planned):** Extend WS message enum with TTY stream frames (stdin/stdout/stderr, resize, seq). "Activate" cadence switch. Resume from seq on mid-session drop. Single-use time-bounded session token; max one interactive session per agent; full audit. Estimated 3-5 days separate effort. +- **Agent-comms-durability Phase 3 (planned):** Adaptive keepalive (AIMD, persist interval), bulk file transfer -> short-lived HTTPS, server half-open eviction sweeper (`last_inbound` track + evict >~90s; treat missed Pong as close). +- **Durable agent identity Phase 1 Tasks 2-3 (pending):** Task 2 = hardware-fingerprint capture. Task 3 = dashboard "probable duplicate" surfacing (read-only). Phase 2 (guarded auto-reclaim) + Phase 3 (operator merge tool for the 9 existing ghosts) pending soak. - **SPEC-028 offboarding wizard (specification complete, 835 lines; implementation pending):** Site + client offboarding workflows, data export, typed confirmation, audit logging. -- **BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):** Fix #3 (graceful `Global\GuruRMM_TrayShutdown_{sid}` event) is implemented but dormant — `terminate_all` has no caller. Coord todo `25fdf31a`: wire into watchdog policy-disable/uninstall path. - **BSOD detection Phase 2/3 (deferred):** Dashboard Crashes tab shipped 2026-06-07. Remaining: BSOD in Alerts stream (issue #10), `fetch_bsod_dump` on-demand upload, full ~350-entry bugcheck name table. - **Linux fleet unit drift:** Auto-updater replaces binary but does NOT refresh systemd unit file. Pre-BUG-016-fix agents have new binary + old unit (missing `StateDirectory=gururmm`). Needs ops-script pass. -- **Watchdog alerts UI** — backend complete but `PUT /watchdog-alerts/:id/resolve` and `DELETE /watchdog-alerts/:id` routes missing on server. - **Alert mute dashboard (Task 5) -- NOT started:** Ignore button + required-reason prompt; Muted filter; Un-ignore control. Server endpoints ready. - **MSP360 backup Phase 2 (not started):** Genuine destination-capacity alerting deferred (needs MSP360 storage-accounts endpoint, monitoring-only API tier confirmed). -- **Parallel Windows build integration (DEFERRED, design captured in runbook):** 7-job Start-Job parallel build achieves 3.51x speedup on Pluto. Blocked on Windows orchestration friction (non-interactive PS, git host-key trust). Run via new-box `setsid` when integrated. - **Tray IPC + peer authorization** — Linux tray merged (PR #13+#14). Open: Windows peer authz (#16), logind console-user resolution (#17), macOS tray (#18), subscriber broadcast (#19). -- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `internal_err()` raw DB errors at ~130 call sites (HIGH). `update_rollouts.promoted_by` UUID vs `users.id` int mismatch (schema quirk from migration 046). -- **`/backup-status` endpoint shape gap:** Returns only one plan per agent; agents with a dead old plan + healthy current plan look stale-but-green. Not fixed — noted for future. +- **Security audit backlog:** `credentials/:id/reveal` horizontal privilege escalation (HIGH), `update_rollouts.promoted_by` UUID vs `users.id` int mismatch (schema quirk from migration 046). +- **SPEC-021 (PR #40):** Logged-in-user domain + account-type detection (agent + server + dashboard). Pending merge. ### Patterns & Anti-Patterns @@ -411,7 +518,7 @@ As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68): | Deploying without stopping the server first | "text file busy" kernel error. Always `systemctl stop` before `cp`. | | Building without `DATABASE_URL` | sqlx compile-time macros fail. `DATABASE_URL` is in `/opt/gururmm/.env` (or `/home/guru/.cargo/env` on dev). | | DB migrations without inserting into `_sqlx_migrations` | Server crashes on start. Must insert SHA-384 checksum manually. | -| WiX MSI builds on Linux | WiX requires `msi.dll`. MSI must be built on Pluto (Windows). | +| WiX MSI builds on Linux | WiX requires `msi.dll`. MSI must be built on Pluto or Beast (Windows). | | Manual builds via SSH | All builds go through `webhook-handler.py`. Never SSH and run `cargo build` + artifact copy manually. | | TOML/config for agent endpoint or site_id | Server URL compiled into binary, site_id baked into MSI. No runtime config files for these values. | | `path.find('\\')` in `#[cfg(windows)]` files | Compiles on Linux silently, fails on Pluto MSVC with unterminated char literal. Use `'\\\\'`. | @@ -431,6 +538,13 @@ As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68): | Channel pin at agent-level for a beta canary | Agent-level `update_channel` is keyed to agent_id and lost on re-enrollment. Use site-level or client-level override (survives re-enroll). | | Installer using `& $stagingPath install 2>&1 \| Out-Null` under `$ErrorActionPreference='Stop'` | Swallows output; surfaces a misleading NativeCommandError on any non-zero exit. Use `Start-Process -Wait -PassThru` + check `$proc.ExitCode`. Fixed 2026-06-11 (commit 5c0d004). | | Reaper flipping un-acked commands to `failed` | Causes false-failure for commands black-holed in NAT conntrack gaps. The reaper must only fail commands that were ACK'd but overran their real execution timeout. Un-acked commands must be requeued to `pending`. | +| `command_type: "cmd"` in API commands | `cmd` was not a valid `CommandType` variant — agent silently dropped the whole message (no ACK, no result). Now aliased to `Shell` (commit `3de9faf`), but check command_type first whenever a command appears un-acked. Valid types: `shell`, `powershell`, `python`, `script`, `claude_task` (+ `cmd` alias). | +| `cargo +1.77 fetch` as a parallel-build pre-resolve | Resolves the full dep graph, which on Rust 1.77.2 pulls `wit-bindgen` (edition2024) and fails with rc=101 on both build hosts. Removed: the legacy builds scope deps via `--features legacy` and don't need a pre-resolve. | +| Concurrent parallel builds sharing one `--target-dir` | Cargo's per-build-dir lock forces them to run serially. Each variant must have its own target dir (amd64=`target/release`, x86=`target/x86`, legacy-amd64=`target/legacy-x64`, legacy-x86=`target/legacy-x86`). Fixed 2026-06-12 (commit `cfbdb59`). | +| Pinning new legacy-wave deps without checking edition requirements | BUG-021: `getrandom 0.3.2` / `zeroize 1.9.x` require edition2024, which Cargo 1.77 cannot handle. Always verify the edition compatibility of any new dep version before adding it to the legacy build. Pin to `getrandom 0.3.1` + `zeroize 1.8.1` (or later confirmed-compatible versions). | +| Asserting build status from a point-in-time log snapshot | Build logs are append-only; a prior FAILED line stays in the tail even after a later successful run. Check the LIVE `last-built-commit-` marker vs `origin/main` HEAD to confirm the current build status. | +| Config-push clobber at multiple independent sites | The server has multiple code paths that push per-agent configuration (policy assign, policy unassign, reconnect, etc.). If the rule-shaper is not the single source of truth, adding a new push site will silently omit config (Event Log Watches were wiped by policy assign/unassign — third instance of this class). Always reuse the shared shaper; never duplicate the rule-mapping logic. | +| Reading the stale submodule working tree for code | The submodule gitlink intentionally lags `origin/main`. Always use `git -C projects/msp-tools/guru-rmm show origin/main:` or a worktree off `origin/main` to read current code. Reading the working tree gives stale results and led to false audit findings. | **Good patterns:** @@ -446,6 +560,9 @@ As of 2026-06-11 (agent 0.6.63 stable / server 0.3.68): - **GURU-5070 site as permanent beta-channel canary** — site "Mike's Car" has `update_channel='beta'` (survives agent re-enrollment). Gets every new beta build; stable fleet protected by explicit `update_rollouts` pin. (Channel pin on site, not agent — agent-level overrides are lost on re-enroll.) - **Capability gate via partial index, not version-string compare** — CI auto-bumps versions (`[ci-version-bump]`); pinning behavior to a fixed version string is fragile. Use a DB-observable capability marker (e.g., `acked_at IS NOT NULL`) that is set when the agent actually demonstrates the capability. - **Two independently-deployable slices for cross-component features** — agent slice and server slice can be deployed in any order; the server slice uses a capability gate so old agents keep legacy behavior. +- **Worktrees off origin/main for concurrent sessions** — isolates each session's work from shared dirty checkout. Commit + push by explicit SHA (`git push origin :refs/heads/`) to avoid HEAD/branch-ref races. Standard pattern when multiple sessions touch the same submodule. +- **`gururmm-build` skill for pre-merge verification** — `bash .claude/skills/gururmm-build/scripts/verify.sh ` runs the same compile checks the server will. Does NOT trigger production build. Merging to main is the only prod trigger. +- **Dashboard beta-before-main rule** — dashboard changes go to beta (auto-built from main) first. Preview a feature branch without merging: `git worktree add --detach origin/` + `npm install + npm run build` + `rsync dist/ /var/www/gururmm/dashboard-beta/`. Promote with `promote-dashboard.sh --confirm`. ### Build & Deploy @@ -459,9 +576,7 @@ Gitea push to main -> build-windows.sh (SSH -> Beast PRIMARY (guru@100.101.122.4 over Tailscale) || fall back to Pluto Administrator@172.16.3.36; each via its own pinned known-hosts - cargo build --release x64 MSVC + i686 MSVC - +1.77 legacy builds with --ignore-rust-version - WiX MSI build for site-specific base + TWO-WAVE PARALLEL BUILD (see below) sign-windows.sh (jsign + Azure Trusted Signing, /etc/gururmm-signing.env) SCP artifacts back; log: /var/log/gururmm-build-windows.log) -> build-mac.sh (stub — no build machine configured yet) @@ -473,6 +588,15 @@ Gitea push to main -> systemctl restart gururmm-agent (local agent on .30) ``` +**Two-wave parallel Windows build (lever A, wired 2026-06-12, commit `cfbdb59`):** +- WAVE 1 (5 concurrent, stable toolchain): `s_amd64` (`target/release`), `s_x86` (`target/x86`), `s_debug` (`target/debug-agent`), `s_tray` (`tray/target`), `s_cleanup` (`installer/cleanup/target`) +- WAVE 2 (2 concurrent, Rust 1.77 legacy, after Wave 1): `l_amd64` (`target/legacy-x64`), `l_x86` (`target/legacy-x86`) — uses `$CARGO +1.77 --features legacy --ignore-rust-version` +- MSI build overlaps Wave 2 (depends only on Wave 1 artifacts) +- Per-job exit-code aggregation; if Beast fails, Pluto is tried with the same structure +- Beast timing: ~336s cargo phase (319s), ~46% faster than Beast serial cold (~622s), ~3.8x vs Pluto serial (~1269s) + +**[CRITICAL] Legacy wave dep-pin gotcha (BUG-021):** The legacy wave resolves deps fresh with `cargo +1.77 fetch`... NO — `cargo fetch` is REMOVED from the pipeline (it was the failure point: Cargo 1.77.2 chokes on edition2024 deps like `wit-bindgen` when resolving the full dep graph). Legacy builds scope deps via `--features legacy` and do NOT need a pre-fetch. However: if any dep added to Cargo.toml ships a new patch that requires edition2024, the legacy build will fail. Pin known-problematic deps: `getrandom = "0.3.1"` (not 0.3.2+), `zeroize = "1.8.1"` (not 1.9.x). Verify before bumping any dep that the new version's MSRV is compatible with Rust 1.77. + **Auto-version:** `build-shared.sh` diffs `agent/`, `server/`, `dashboard/` against last built SHA. For each changed component, bumps patch version in `Cargo.toml` or `package.json`, commits `[ci-version-bump]`, pushes. Webhook skips builds where all commits are version bumps. **Build channel classification:** New agent builds are tagged `beta` by default. Promotion to `stable` is an explicit step via the API: `POST /api/updates/rollouts/:version/promote {"os":"windows","arch":"amd64"}` — re-tags all `.channel` files for that version, records `update_rollouts` row, triggers scanner rescan. Rollback: `POST /api/updates/rollouts/:version/rollback`. Agents on the stable channel only receive the latest `stable`-tagged binary. @@ -480,7 +604,7 @@ Gitea push to main **Downloads layout:** - Base binaries: `gururmm-agent--.exe` + `.channel` (beta/stable) + `.sha256` - Latest symlinks: `gururmm-agent--latest.exe` + `.channel` + `.sha256` -- Per-site signed cache: `gururmm-agent-site---.exe` (built+jsign-signed on demand at install-request time) +- Per-site signed cache: `gururmm-agent-site---.exe` (built+jsign-signed on demand at install-request time; staged in `downloads_dir`, NOT `/tmp` — EXDEV fix `95ef901`) **Dashboard channels — BETA-FIRST:** @@ -489,17 +613,14 @@ Gitea push to main | beta | https://rmm-beta.azcomputerguru.com | `/var/www/gururmm/dashboard-beta` | auto on push — `build-dashboard.sh` (change-gated on `last-built-commit-dashboard`) | | prod | https://rmm.azcomputerguru.com | `/var/www/gururmm/dashboard` | explicit only — `sudo /opt/gururmm/promote-dashboard.sh --confirm` (backs up prod; `--rollback` restores) | -Do NOT hand-rsync into the prod web root. One artifact serves both channels — the Vite build bakes in the absolute prod API URL; beta uses an nginx-layer `sub_filter` BETA banner. +Dashboard changes go to beta BEFORE main. To preview a feature branch without merging: build its `dashboard/` in a `git worktree add --detach origin/`, `npm install --no-audit --no-fund && npm run build`, then `rsync -a --delete dist/ /var/www/gururmm/dashboard-beta/`. Do NOT touch `/opt/gururmm/last-built-commit-dashboard`. Do NOT hand-rsync into the prod web root. -**DB migrations** — manual; must insert SHA-384 checksum into `_sqlx_migrations` or server crashes on start. Migrations run automatically on server restart (sqlx), but the checksum must be correct. +**DB migrations** — applied automatically on server restart (sqlx), but must have correct SHA-384 checksum in `_sqlx_migrations` or server crashes. Migration number collisions across branches are a real hazard — renumber before merging if two branches both add `NNN_*.sql`. Migration numbers confirmed on origin/main: 001-060 (`060_alert_mutes_agent_id_index`). Pending branches: 061 (BUG-018 FK indexes), 062 (MSP360), 063 (SPEC-021). -**Windows build hosts — Beast PRIMARY, Pluto FALLBACK.** `build-windows.sh` runs -`attempt_build beast || attempt_build pluto`: Beast is tried first; Pluto is used only if Beast -is unreachable/down or its build fails. Both build the same `C:\gururmm` checkout; host keys at -`/opt/gururmm/{beast,pluto}_known_hosts`. +**Windows build hosts — Beast PRIMARY, Pluto FALLBACK.** `build-windows.sh` runs `attempt_build beast || attempt_build pluto`: Beast is tried first; Pluto is used only if Beast is unreachable/down or its build fails. Both build the same `C:\gururmm` checkout; host keys at `/opt/gururmm/{beast,pluto}_known_hosts`. **Beast (GURU-BEAST-ROG) — PRIMARY:** -- Physical workstation, **i9-14900K (24c/32t)**, RTX 4090. Much faster per-core than Pluto. +- Physical workstation, **i9-14900K (24c/32t)**, RTX 4090, 127.8 GB RAM. - Reached from `.30` over **Tailscale** (installed on `.30`) at tailnet IP **100.101.122.4**, user `guru`. - SSH: `ssh -o UserKnownHostsFile=/opt/gururmm/beast_known_hosts guru@100.101.122.4` - Rust MSVC toolchain, **WiX 4.x** (WiX v6+ pulls OSMF and breaks the build), Gitea clone at `C:\gururmm\`. @@ -510,7 +631,7 @@ is unreachable/down or its build fails. Both build the same `C:\gururmm` checkou - SSH: `ssh -o UserKnownHostsFile=/opt/gururmm/pluto_known_hosts Administrator@172.16.3.36` - Rust stable 1.95.0 + 1.77 pinned for legacy builds - VS Build Tools (MSVC), sccache at `C:\sccache`, WiX v4, Gitea clone at `C:\gururmm\` -- Xeon E5-2695 v3 @2.3GHz, 8c/16t KVM VM; sequential build ~23min; parallel prototype 3.51x (deferred) +- Xeon E5-2695 v3 @2.3GHz, 8c/16t KVM VM; serial build ~1269s **Auto-update delivery:** - Server scans every 300s; dispatches update command on agent heartbeat @@ -523,11 +644,11 @@ is unreachable/down or its build fails. Both build the same `C:\gururmm` checkou ## Active State -**Fleet (as of 2026-06-11, post-0.6.63 rollout):** -- ~215 enrolled agents total (growing) -- Stable channel: 0.6.63 windows/amd64 (promoted 2026-06-11); 168 agents on 0.6.63 during rollout, 182 online throughout (no mass dropout). ~20 stragglers on older versions auto-update on reconnect. +**Fleet (as of 2026-06-22):** +- ~270 enrolled agents total; ~178 typically online +- Stable channel: 0.6.67 windows/amd64 (Windows build green 2026-06-22 02:19, marker `1dce66d`) +- Metrics flowing (~2531 rows/15 min); alerts firing with 0 legacy null dedup_keys; per-agent API endpoints all HTTP 200 - Beta channel: site "Mike's Car" (`103c10b9-c1de-4dd8-b382-b8362ed3143e`) has `update_channel='beta'` (persists across re-enrollment). All GURU-5070 machines are on this site. -- GURU-5070 live agent: `819df0c8`; ghost (orphaned prior enrollment): `c043d9ac` (offline, durable-identity fix in progress). **Enrolled clients/sites (2026-05-24 baseline; no removals since):** @@ -537,12 +658,12 @@ is unreachable/down or its build fails. Both build the same `C:\gururmm` checkou | BirthBiologic | Corporate | Main Office | BB-SERVER | | Cascades of Tucson | Corporate | CascadesTucson | 27 agents — CS-SERVER, RECEPTIONIST-PC, ASSISTMAN-PC, MDIRECTOR-PC, MEMRECEPT-PC, and ~22 others | | Dataforth Corp | Corporate | D1 | AD2, DF-GAGETRAK | -| Goldstein | (verify) | (verify) | Canary client for 0.6.62/0.6.63 beta soak | +| Goldstein | (verify) | (verify) | Canary client for beta channel | | Grabb & Durando Law Office | Corporate | Main Office | GND-SERVER | | Instrumental Music Center | Corporate | IMCMain | IMC1 | | Key, Paul | Residential | Home | KEY-MEDIA | | Peaceful Spirit | Residential | Bridgette Home, Country Club, Mara Home | BridgettePSHomeComputer, PST-SERVER (UDR Ultra, comms-durability verification), PST-SERVER2, PST-SURFACE, Maras-HP-Laptop, MaraHomeNew | -| Safesite | Corporate | Glendale | DESKTOP-3USU20B (comms-durability affected) | +| Safesite | Corporate | Glendale | DESKTOP-3USU20B | | Sombra Residential LLC | Corporate | main office | DESKTOP-UQRN4K3, Server2013 | | Stamback Septic | Corporate | StambackSeptic | DESKTOP-BTR2AM3, StambackLaptopNew | | Swanson, Len | Residential | Home | LAS-GAMER | @@ -551,10 +672,10 @@ is unreachable/down or its build fails. Both build the same `C:\gururmm` checkou - `POST /api/auth/login` -> JWT (~24h) - Creds: vault `infrastructure/gururmm-server.sops.yaml` -> `credentials.gururmm-api.admin-email` / `admin-password`; or via `bash .claude/scripts/rmm-auth.sh` - Key endpoints: `GET /api/agents`, `POST /api/agents/:id/command`, `GET /api/commands/:id`, `POST /api/agents/:id/update`, `POST /api/updates/rollouts/:version/promote` -- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`), `command` (script text, JSON-encoded), optional `context` — `system` (default) or `user_session` (Windows WTS), plus `timeout_seconds`/`elevated`. +- Command fields: `command_type` (`shell`/`powershell`/`python`/`script`/`claude_task`/`cmd` alias), `command` (script text, JSON-encoded), optional `context` — `system` (default) or `user_session` (Windows WTS), plus `timeout_seconds`/`elevated`. **Dashboard — complete and working:** -Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis, Alerts (clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO, Policies Dashboard (all tabs), Registry editor (read-only via HTTP), MSP360 backup status + mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance, AgentDetail Crashes tab + version history, fleet stats from `/agents/stats`, SiteDetail Revoke Key + Enrollment audit tab, Install Reports page, Fleet Discovery page. +Agents management (delete now returns 202 + background purge), Clients/Sites CRUD, Commands execution + terminal, Logs + AI analysis (Claude API), Alerts (clickable severity badges + client filtering), Metrics (CPU/RAM/disk/network, process drill-down modal), Auto-update triggering, Network state, Entra ID SSO, Policies Dashboard (all tabs), Registry editor (read-only via HTTP), MSP360 backup status + mappings/verify UI, Organizations management + dev-admin impersonation UI, Credentials management with inheritance, AgentDetail Crashes tab + version history, fleet stats from `/agents/stats`, SiteDetail Revoke Key + Enrollment audit tab, Install Reports page, Fleet Discovery page, **Event Log Watches management page** (`/event-log-watches`, shipped `0fa65f5`). **Dashboard — incomplete (see UI_GAPS.md):** - Watchdog alerts UI — blocked on 2 missing server routes @@ -564,6 +685,7 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI - Alert mute UI — Ignore button + required-reason prompt, Muted filter, Un-ignore control (server endpoints ready; dashboard NOT started) - `is_connected` reflects real connectivity (null fleet-wide cosmetic bug; gap in API/dashboard layer) - Documentation / user guide / inline help tooltips (P3, not started) +- MSP360 "Open in MSP360" deep-link (PR #42, dashboard portion pending merge) **Open Gitea issues:** - #10 — BSOD detection Phase 2/3 (dashboard Alerts stream + fetch_bsod_dump + full bugcheck table) @@ -573,13 +695,8 @@ Agents management, Clients/Sites CRUD, Commands execution + terminal, Logs + AI - #18 — macOS tray - #19 — subscriber broadcast -**BUG-020 — tray duplicate/ghost icons (fixed to beta 2026-06-04; dormant follow-up open):** -- Root cause: `TrayLauncher` tracked launches only in an in-memory `HashMap` that resets on watchdog restart; no single-instance guard in the tray; `terminate_all` hard-killed via `TerminateProcess` skipping `NIM_DELETE` (ghost). -- Fix (commit `137dd85`): (1) per-session `Local\GuruRMM_Tray` single-instance mutex; (2) launcher reconciliation via `WTSEnumerateProcessesW`; (3) graceful `Global\GuruRMM_TrayShutdown_{sid}` event -> 3s wait -> TerminateProcess fallback. Fix #3 is dormant pending `terminate_all` wiring (coord todo `25fdf31a`). - **Security backlog (HIGH):** - `credentials/:id/reveal` — horizontal privilege escalation (no ownership scope check) -- `internal_err()` — ~130 call sites returning raw DB errors to callers --- @@ -592,7 +709,7 @@ These decisions are locked. Do not reverse without explicit user approval. 3. **No TOML/config for endpoints** — Server URL compiled into binary. No runtime config files for server URL or site_id. 4. **Policy inheritance chain** — global -> site -> client -> agent. Server computes merged effective policy and pushes via `ConfigUpdate` WebSocket message. 5. **Platform parity rule** — Any agent feature ships on Windows, Linux, and macOS in the same change. Stub + TODO required if a real implementation is not yet feasible. -6. **Watchdog as separate process** — Main agent cannot reliably restart itself after a crash. +6. **Watchdog as separate process** — Main agent cannot reliably restart itself after a crash. The watchdog reports via REST (`POST /watchdog-alert`) — it has no WebSocket connection. 7. **Build pipeline is the only path to production** — Enforces signing, checksum generation, consistent artifact layout. 8. **Multi-tenancy identity model (ADR-001)** — Dev team with partner impersonation. Three levels: Dev -> Partner -> Client. Computer Guru is partner #1. 9. **Holistic feature development (DESIGN.md)** — Every feature requires backend + API + dashboard UI + documentation. Backend-only features are rejected. @@ -620,14 +737,20 @@ These decisions are locked. Do not reverse without explicit user approval. | 2026-05-31 | Roadmap reconciliation (17 corrections — roadmap understated built state). MSPBackups mapping/verify UI + dev-admin impersonation UI deployed (dashboard v0.2.32). BUG-008/013/014 status corrected to fixed. SPEC-021 (logged-in user domain detection) written after Howard feature request. | | 2026-06-01 | BUG-016 (Linux systemd missing StateDirectory=gururmm) + BUG-017 (device_id OnceLock cache) fixed (commit 30da053). GURU-KALI had 11 ghost agent rows from repeated UUID churn — fixed and verified. BSOD forensics: GURU-5070 bluescreened with `0x116 VIDEO_TDR_FAILURE` (nvlddmkm.sys); GuruConnect cleared on three grounds. BSOD detection feature (issue #10 Phase 1) implemented: bsod.rs + migration 048 + ws/mod.rs handler; code review caught and fixed SF-1/SF-2; merged to main (0ec55cf), agent versioned 0.6.51. | | 2026-06-02 | Server 0.3.37 + migration 048 deployed. Build channel default-beta fix applied to build-windows.sh + build-linux.sh. Webhook wired to dispatch build-server.sh with change-gate + backup/rollback. Fleet converged to 0.6.51. GURU-KALI BUG-016 unit file refreshed. | -| 2026-06-04 | Channel state confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event (fix #3 dormant — coord todo `25fdf31a`). Verified by Grok + Code Review Agent. | +| 2026-06-04 | Channel state confirmed via live Postgres query: GURU-5070 `agents.update_channel = 'beta'` (explicit per-agent override). Stable channel pinned at 0.6.47 windows/amd64 + 0.6.46 linux. BUG-020 (duplicate/ghost tray icons) fixed in commit `137dd85` to beta: per-session single-instance mutex + WTSEnumerateProcessesW reconciliation + graceful shutdown event (fix #3 dormant — now in PR #40). | | 2026-06-07 | Backup-alert quality pass shipped. FU1 (summarize_backup_error + create_or_update_alert refresh) + FU2 (exclude PlanTypes 8/13 from alerting/compliance): false `backup_failed` 15 -> 2 fleet-wide (commits `779f7f6` + `b82c010`). `backup_storage_low` alert type removed entirely (DataCopied/TotalData is not destination capacity; 5 -> 0 false alerts; `resolve_all_backup_storage_alerts`). | | 2026-06-07 | Credential inheritance deployed to production (server v0.3.45). Hierarchical propagation (Global → Client → Site), `is_inheritable`, `/effective` endpoints. Dashboard: clickable alert severity badges + client filtering. SPEC-028 offboarding wizard specification created (835 lines). FEATURE_ROADMAP.md updated with "Client & Site Lifecycle Management." | | 2026-06-07 | Role-aware offline alerting + alert ignore/mute shipped. Offline sweep evaluator (60s interval; offline.rs): server-only `agent_offline`; migration 054 `agents.role_override`; site (>=50%+>=3) -> `mass_offline_site`; fleet (>=10) -> `mass_offline_fleet`. Warm-up restart guard (code review caught + fixed `last_seen < started_at` spec defect). Alert ignore/mute (perma-silence): migration 055 `alert_mutes` + `muted` status; `dedup_key`-keyed; reason required; gates `create_or_update_alert` + `create_check_alert` bypass; dashboard UI pending. | | 2026-06-07 | UI gap batch + enrollment audit (GURU-BEAST-ROG session). `get_agent` upgraded to `AgentWithDetails`. Six new/updated server endpoints. Enrollment audit endpoint + per-row revoke. Dashboard: AgentDetail Crashes tab + version history, SiteDetail Enrollment tab, InstallReports + Discovery pages. Hotfix `6faa382` (role_override missing from new SQL — all GET /api/agents/:id returned 500). | -| 2026-06-11 | Physical server migration executed. Ubuntu VM at 172.16.3.30 retired; physical box took the same IP. Production down <27 min (06:53-07:20 MST). 162/212 agents reconnected within 15 min. Old VM parked at .46 as rollback anchor. Server binary path changed from `/usr/local/bin/gururmm-server` to `/opt/gururmm/gururmm-server`. Ubuntu 26.04 + PostgreSQL 18. 7-day whale data backfilled (~3.4 GB, lossless, 0 pool-timeouts). Three build-env gaps fixed on new box (sccache, Pluto key, signing tools). | +| 2026-06-11 | Physical server migration executed. Ubuntu VM at 172.16.3.30 retired; physical box took the same IP. Production down <27 min (06:53-07:20 MST). 162/212 agents reconnected within 15 min. Old VM at .46 as rollback anchor, decommissioned 2026-06-12. Server binary path changed from `/usr/local/bin/gururmm-server` to `/opt/gururmm/gururmm-server`. Ubuntu 26.04 + PostgreSQL 18. 7-day whale data backfilled (~3.4 GB, lossless, 0 pool-timeouts). Three build-env gaps fixed on new box (sccache, Pluto key, signing tools). | | 2026-06-11 | Ghost agent root-caused. `device_id` storage fragile (single file, wiped by cleanup.ps1). 11 duplicate-hostname agents found (9 same-site ghosts; 2 cross-site non-merge). Durable agent identity spec created (`specs/durable-agent-identity/`; multi-AI reviewed). Phase 1 Task 1 shipped (agent v0.6.62): multi-location device_id (registry + ProgramData on Win; /etc + /var/lib on Linux; /Library + mirror on macOS) with self-heal; cleanup.ps1 whitelisted. Channel pin regression fixed: moved GURU-5070 beta override from agent-level to site "Mike's Car" (survives re-enrollment). | | 2026-06-11 | Agent-comms-durability Phase 1 shipped (agent v0.6.63 / server v0.3.68). Spec created (specs/agent-comms-durability/; multi-AI reviewed, Gemini + Grok). Root cause: NAT conntrack asymmetry — server->agent pushes black-holed in idle conntrack gaps. Fix: agent CommandAck on receipt (migration 058 acked_at); agent dedup cache (re-report, never double-run); server reaper re-delivery (migration 059 delivery_attempts, cap 10; capability gate via partial index); pending commands re-offered on every heartbeat. Installer hardened (Start-Process). Build-pipeline dubious-ownership fixed. Canary (Goldstein + Peaceful Spirit): 6 agents verified, PST-SERVER ACK latency 16 ms. Fleet promoted 0.6.63 stable; 168 agents converged in ~3 min, 182 online throughout. | +| 2026-06-12 | Beast parallel Windows build (lever A) wired into production pipeline (commit `cfbdb59`). Two-wave parallel SSH: 5 concurrent stable + 2 concurrent legacy-1.77. Beast timing: ~336s (vs 622s serial, 1269s Pluto). Per-variant target-dir mandatory. Dropped `cargo +1.77 fetch` pre-resolve (edition2024 failure). v0.6.66 built clean on Beast in 336s. Old VM (.46) decommissioned (virsh domain + disk removed). | +| 2026-06-12 | command_type "cmd" root-caused as silent-drop (no such variant in CommandType enum). 7-round adversarial multi-AI quorum + packet captures. Fixed: `#[serde(alias = "cmd")]` on Shell + NAK on unparseable command (commit `3de9faf`). Reverted spurious eviction change (80df458). v0.6.66 promoted to stable fleet-wide. | +| 2026-06-15 | BSOD dedup key changed from dump hash to bugcheck code (f0a4b7f, server v0.3.73) — mutes now silence all recurrences of same bugcheck, not just one dump. MSI EXDEV fix (95ef901, server v0.3.74) — site MSI builder staged in /tmp causing cross-device rename failure; fixed to stage in downloads_dir. Duplicate offline server alerts in Triage removed (b6ed564). sync.sh Phase 1 auto-heal (resolve_submodule_collisions) — rescued 94 orphaned RMM_THOUGHTS lines. logs/analyze switched to Claude API (c869e4d). 500-error-leak fully closed (58c1a96). | +| 2026-06-18 | sync.sh submodule auto-heal verified fleet-wide after earlier 2026-06-15 fix. RMM_THOUGHTS 2026-06-08 re-grounding pass + 8 Raw sections rescued and staged. | +| 2026-06-21 | BUG-019 (container self-update guard) fixed and merged to main (66a7f4e, v0.6.67 beta). BUG-018 (DELETE 202+bg, cea87d4) merged. Event Log Watch UI shipped (0fa65f5). Enrollment modal UX fix merged to prod (4027c86). sync.sh populate-only guard (Phase 2) fixed submodule-clobber root cause. Howard cleared for GuruRMM merges/deploys (Mike decision). gururmm-build skill + docs/BUILD.md created. Five PRs opened (#40-#44) covering SPEC-021, BUG-018 FK indexes, MSP360 deeplink, Event Log Watch policy-clobber HIGH fix, audit cleanup. | +| 2026-06-22 | BUG-021 (legacy build dep-pin getrandom+zeroize) fixed on main (1dce66d). Windows build green at v0.6.67, 2026-06-22 02:19. Fleet verified: ~270 agents / ~178 online, 0 legacy null dedup_keys. BUG-022 filed + fixed (PR #45): removed dead WatchdogEvent WS path (no producer — watchdog has no WS connection; REST watchdog-alert is the only real path). | --- @@ -642,6 +765,7 @@ These decisions are locked. Do not reverse without explicit user approval. - **2026-06-04 recompile:** Corrected GURU-5070 channel state. Stable fleet pinned at 0.6.47. BUG-020 documented. - **2026-06-07 recompile:** Folded in backup-alert quality pass, credential inheritance, offline alerting + mute, UI gap batch + enrollment audit. Updated migration count to 55+ (054/055 confirmed). - **2026-06-11 recompile (GURU-5070/claude-main):** Full recompile. Added: (1) Physical server migration (Ubuntu 26.04, PG 18, binary at /opt/gururmm, old VM .46 rollback anchor). (2) Durable agent identity (ghost root cause, spec, Phase 1 Task 1 durable device_id, channel pin fix). (3) Agent comms durability Phase 1 full detail (spec, slices A/B/C, deployment, fleet rollout, canary verification). (4) New capabilities: Audit Log (migration 056), Systemic Log Feedback Intelligence (migration 057), comms durability architecture. (5) Updated fleet size (~215 enrolled, 168-182 online). (6) Updated agent/server versions to 0.6.63/0.3.68. (7) Build pipeline: server IS auto-deployed by webhook (correction to earlier assumption); parallel build prototype on Pluto (3.51x, deferred integration). (8) Channel/promotion model documented in detail. (9) Downloads layout documented. (10) Updated server binary path + SSH details for physical box. (11) Added new anti-patterns (installer `&` operator, reaper false-fail, agent-level channel pin). Added ADR-11 (durability-first command delivery). Migration count updated to 59. Patterns/History preserved verbatim except new entries added. +- **2026-06-22 recompile (Howard-Home/claude-main):** Delta from 2026-06-11: (1) Fleet updated to ~270 enrolled / ~178 online, agent v0.6.67. (2) BUG-021 (legacy wave dep-pin, commit 1dce66d), BUG-018 (DELETE 202+bg, cea87d4), BUG-019 (container guard, 66a7f4e) all fixed on main. (3) Event Log Watch UI shipped (0fa65f5). (4) Enrollment modal UX fix to prod (4027c86). (5) Watchdog section corrected per BUG-022: WatchdogEvent WS path was dead code (no producer; watchdog has no WS connection); removed in PR #45. REST `watchdog-alert` is the only supported watchdog alert path. (6) Beast parallel two-wave Windows build documented (lever A, 336s, ~3.8x vs Pluto). (7) BUG-021 dep-pin gotcha documented (getrandom 0.3.1 + zeroize 1.8.1). (8) command_type "cmd" alias + NAK documented. (9) Event Log Watch policy-clobber HIGH fix (PR #43). (10) MSP360 deep-link (PR #42, mig 062). (11) SPEC-021 (PR #40, mig 063). (12) sync.sh submodule-clobber root-cause fix (populate-only guard). (13) BSOD dedup key changed to bugcheck code. (14) MSI EXDEV fix. (15) 500-error-leak fix. (16) logs/analyze switched to Claude API. (17) gururmm-build skill + docs/BUILD.md. (18) Howard cleared for merges/deploys. (19) Dashboard beta-before-main rule. (20) Migration count updated to 60 (origin/main), with 061-063 on pending branches. (21) Old VM confirmed deleted (2026-06-12). (22) Open PRs #40-#46 documented with merge-order. (23) New anti-patterns added (command_type cmd, cargo +1.77 fetch, target-dir concurrency, BUG-021 dep-pin, stale snapshot build status, config-push clobber, stale submodule working tree). (24) New good patterns added (worktrees for concurrency, gururmm-build skill, beta-before-main). ## Backlinks