Dataforth UI push + dedup + refactor, GuruRMM roadmap evolution, Azure signing setup

Dataforth (projects/dataforth-dos/):
- UI feature: row coloring + PUSH/RE-PUSH buttons + Website Status filter
- Database dedup to one row per SN (2.89M -> 469K rows, UNIQUE constraint added)
- Import logic handles FAIL -> PASS retest transition
- Refactored upload-to-api.js to render datasheets in-memory (dropped For_Web filesystem dep)
- Bulk pushed 170,984 records to Hoffman API
- Statistical sanity check: 100/100 stamped SNs verified on Hoffman

GuruRMM (projects/msp-tools/guru-rmm/):
- ROADMAP.md: added Terminology (5-tier hierarchy), Tunnel Channels Phase 2,
  Logging/Audit/Observability, Multi-tenancy, Modular Architecture,
  Protocol Versioning, Certificates sections + Decisions Log
- CONTEXT.md: hierarchy table, new anti-patterns (bootstrap sacred,
  no cross-module imports), revised next-steps priorities

Session logs for both projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:39:32 -07:00
parent eae9d7f644
commit 733d87f20e
42 changed files with 9153 additions and 7 deletions


@@ -1,7 +1,7 @@
# GuruRMM - Project Context
**Last Updated:** 2026-04-14
**Status:** Active Development - Tunnel Phase 1 Complete
**Last Updated:** 2026-04-15
**Status:** Active Development - Tunnel Phase 1 Verified Live; Phase 2 Unblocked
## Quick Start - Infrastructure Overview
@@ -38,10 +38,26 @@
- SL-SERVER: **STUCK IN PENDING UPDATE** - requires manual service restart
### Recent Session Logs (MUST READ BEFORE CONTINUING WORK)
- **2026-04-15:** End-to-end tunnel lifecycle verified via public API. Three actionable findings — `session-logs/2026-04-15-session.md`
- **2026-04-14:** Tunnel API testing, authentication fix - `session-logs/2026-04-14-session.md`
- **2026-04-02:** Tunnel implementation, update bug fixes - See git history
- **2026-04-01:** Cloudflare Tunnel configuration - See credentials.md
### What To Do Next (priority order, revised 2026-04-15)
**Architectural pivot:** multi-tenancy is now a core requirement (product going to MSP market). Logging split into three tiers (agent OS-native / client event pull / tunnel audit to DB). Detailed breakdown in ROADMAP.md (sections: Logging & Audit, Multi-tenancy, Tunnel Channels).
1. **Fix `/api/v1/tunnel/status/{id}` 403 bug** in `server/src/db/tunnel.rs:94-103`. Small PR. Blocks Phase 2 integration tests. (Roadmap S8.)
2. **Agent self-logging via OS-native sinks** — Windows Event Log provider, Linux journald, macOS os_log. Ship before anything else touches Phase 2. (Roadmap L1.)
3. **Tech-side tunnel subscriber design** — browser needs a WS endpoint to receive tunnel data; `server/src/ws/mod.rs:808-825` currently discards `AgentMessage::TunnelData`. Decide pub-sub shape before implementing any channel. (Roadmap T5.)
4. **Multi-tenancy schema:** `tenant_id` on every table. Auth middleware filters by tenant. Do this before building more features because retroactive migration cost scales with schema size. (Roadmap M1-M2.)
5. **Terminal channel** — only after 1-4. `tokio::process::Command` in `agent/src/transport/websocket.rs:handle_tunnel_data()`. (Roadmap T1.)
6. **Client event pull (`client_events` table)** — 15-min delta + on-tunnel-open/close. Windows Get-WinEvent, Linux journalctl, macOS log show. (Roadmap L2-L4.)
**Housekeeping:**
- Update 1Password `Infrastructure/GuruRMM Server/Admin Password` to `GuruRMM2025` (stored value is stale and fails login).
- Add agent file logging (`C:\ProgramData\GuruRMM\agent.log`) as bridge until OS-native sinks land — lets Phase 2 work proceed with visibility.
## Anti-Patterns (DON'T DO THIS)
**DO NOT build on macOS** - Binaries won't run on Linux server. SSH to 172.16.3.30 and build natively.
@@ -62,6 +78,22 @@
**DO NOT use emojis** - ASCII markers only: [OK], [ERROR], [WARNING], [SUCCESS], [INFO]
**DO NOT make breaking changes to `/api/v1/bootstrap/hello`** - This is the anchor that lets long-offline agents reconnect and self-upgrade. Input and output schemas are **additive-only forever**. An agent from v0.1 must be able to hit this endpoint in 2030 and get a meaningful response telling it how to update. Every other endpoint/message is free to evolve; this one is not. See ROADMAP.md V1-V10.
**DO NOT cross module boundaries by importing another module's internals** - The product is architected modularly (core + PSA + backups + syslog + ...). Modules own their schema namespace and never touch another module's tables. Cross-module communication goes through the event bus or that module's exposed API only. Core and modules are separate Rust crates by design; enforce via `use` restrictions. Breaking this discipline once poisons the whole architecture. See ROADMAP.md X1-X12.
### Hierarchy Terminology (use these exact terms)
| Tier | Term | DB | Meaning |
|---|---|---|---|
| 1 | Platform | — | The software author (us, GuruRMM) |
| 2 | Partner | `tenant_id` | An MSP — a paying customer of the Platform |
| 3 | Client | `client_id` | A Partner's customer |
| 4 | Site | `site_id` | A location within a Client (physical or logical) |
| 5 | Agent | `agent_id` | An endpoint at a Site |
UI/API says "Partner"; DB column is `tenant_id`. Do not rename. Do not use "sub-tenant" or bare "customer". Full canonical definition + API path convention + event topic naming in ROADMAP.md Terminology section.
## Where to Find Things
### Codebase Structure
@@ -304,10 +336,15 @@ enum AgentMessage {
### Backlog
- [ ] Fix SL-SERVER stuck update (manual restart required)
- [ ] Investigate 4 duplicate agent records in database
- [ ] Investigate 4 duplicate agent records in database (2x SL-SERVER seen)
- [ ] Windows update system testing (scheduled task timing)
- [ ] Agent reconnection on network failure
- [ ] Multi-tenant access control audit
- [ ] **[2026-04-15] Status endpoint returns 403 for closed sessions** — should return `{status: closed}` with session record when caller owns it. See session log. (Tracked as Roadmap S8.)
- [ ] **[2026-04-15] Agent writes no logs** — add tracing+file appender to `agent/src/main.rs`; logs to `C:\ProgramData\GuruRMM\agent.log`. (Bridge to Roadmap L1 OS-native sinks.)
- [ ] **[2026-04-15] Logging redesign — three-tier architecture.** See ROADMAP.md "Logging, Audit & Observability" section (L1-L10).
- [ ] **[2026-04-15] Multi-tenancy schema refactor.** See ROADMAP.md "Multi-tenancy / MSP SaaS" section (M1-M7). Blocks scaling to other MSPs.
- [ ] **[2026-04-15] Tunnel Channels (Phase 2).** See ROADMAP.md "Tunnel Channels" section (T1-T8). T5 (tech-side subscriber) is the gating design decision.
## Useful Links


@@ -2,7 +2,34 @@
Tracked list of desired features, improvements, and changes. Used to evaluate whether the current codebase supports these goals or if a rewrite is needed.
**Last Updated:** 2026-04-01
**Last Updated:** 2026-04-15
---
## Terminology (canonical)
Decided 2026-04-15. Use these exact terms in code, UI, API, docs, and conversation. Don't invent synonyms.
| Tier | Term | DB column | Meaning | Example |
|---|---|---|---|---|
| 1 | **Platform** | — | The software author (us) | GuruRMM |
| 2 | **Partner** | `tenant_id` | An MSP — a paying customer of the Platform | "Acme IT Services" |
| 3 | **Client** | `client_id` | A Partner's customer | "Dataforth Corp" |
| 4 | **Site** | `site_id` | A location or logical grouping within a Client | "Dataforth Tucson HQ" |
| 5 | **Agent** | `agent_id` | An endpoint at a Site | AD2, SL-SERVER |
**Notes:**
- UI/API use "Partner"; DB uses `tenant_id` (industry-standard term for isolation). Do not rename `tenant_id` in code.
- "Client" may collide with HTTP-client terminology in context; when ambiguous, use "client org" or "client account".
- **Site** is not always a physical location — can be a DMZ, VLAN, cloud region, whatever grouping makes sense for that Client.
- **Do not use** "sub-tenant" or "customer" (ambiguous across tiers).
- User roles: Platform admin (us), Partner admin, Partner tech, Client contact (limited read access to their own data).
- Optional Department/OU tier inside a Site is deferred until a real customer asks for it.
- MSPs can label-override their UI via `partner_settings.label_overrides` JSONB (e.g., rename "Client"→"Customer" for their branded view) — supported without schema changes.
**API path convention:** `/api/public/v1/partners/{partner_id}/clients/{client_id}/sites/{site_id}/agents/{agent_id}`
**Event bus topic convention:** `agent.online`, `site.created`, `client.deleted`, `partner.upgraded`, etc.
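The two conventions above can be sketched as small helpers. This is an illustration only: `AgentPath`, `to_url_path`, and `topic` are hypothetical names, and plain string IDs are an assumption (the roadmap does not fix the ID type here).

```rust
/// Identifiers for the four addressable tiers below Platform.
/// String slices are an assumption; real IDs may well be UUIDs.
pub struct AgentPath<'a> {
    pub partner_id: &'a str,
    pub client_id: &'a str,
    pub site_id: &'a str,
    pub agent_id: &'a str,
}

impl AgentPath<'_> {
    /// Build the public REST path for one agent, following the
    /// partners/clients/sites/agents nesting of the convention.
    pub fn to_url_path(&self) -> String {
        format!(
            "/api/public/v1/partners/{}/clients/{}/sites/{}/agents/{}",
            self.partner_id, self.client_id, self.site_id, self.agent_id
        )
    }
}

/// Build an event bus topic of the form `<entity>.<event>`,
/// e.g. `agent.online` or `site.created`.
pub fn topic(entity: &str, event: &str) -> String {
    format!("{}.{}", entity, event)
}
```

Keeping path and topic construction in one place would make it harder for a synonym such as "customer" to leak into a URL or topic name.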
---
@@ -43,8 +70,63 @@ Tracked list of desired features, improvements, and changes. Used to evaluate wh
| S4 | Policy actions: custom script execution | High | Open | Policies can trigger scripts (PowerShell/bash) on matching agents. Scheduled or on-demand. |
| S5 | Customizable alerting system | High | Open | User-defined alert rules: offline detection, disk space thresholds, SMART errors, RAID degradation, bad sectors, CPU/RAM sustained high, temp thresholds. Configurable severity, notification channels, escalation. |
| S6 | Alert notification channels | Medium | Open | Email, webhook, Slack/Teams integration, push notifications. Per-alert-rule routing. |
| S7 | Real-time tunnel mechanism (separate from check-in) | High | Open | On-demand WebSocket tunnel between tech's browser and agent for interactive tools. Multiplexed channels for terminal, file browser, registry, services. Low latency, not tied to metrics interval. |
| S8 | | | | |
| S7 | Real-time tunnel mechanism (separate from check-in) | High | Phase 1 Done | Session lifecycle REST+WS+DB+agent state machine complete (2026-04-14 / verified 2026-04-15). Phase 2 (channels) tracked under Tunnel Channels section below. |
| S8 | Closed-session status endpoint returns 403 | Medium | Open | `GET /api/v1/tunnel/status/{id}` returns 403 for closed sessions (should return `{status: closed}`). Root cause: `verify_session_ownership()` applies `WHERE status='active'` before ownership check. Fix in `server/src/db/tunnel.rs:94-103`. |
| S9 | | | | |
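The S8 root cause is an ordering problem: the status filter runs before the ownership check, so a closed session looks the same as a session the caller does not own. A minimal in-memory sketch of the intended ordering follows. It is not the code in `server/src/db/tunnel.rs`; `TechSession`, `StatusLookup`, and `lookup_status` are hypothetical names, and the real fix would live in the SQL query.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Active,
    Closed,
}

/// Hypothetical stand-in for a row of the session table.
pub struct TechSession {
    pub id: u64,
    pub owner_id: u64,
    pub status: SessionStatus,
}

#[derive(Debug, PartialEq, Eq)]
pub enum StatusLookup {
    /// Caller owns the session; report its status even when closed.
    Found(SessionStatus),
    /// Session exists but belongs to someone else (maps to 403).
    Forbidden,
    /// No such session at all (maps to 404).
    NotFound,
}

/// Look up by ID first, check ownership second, and only then report
/// status. No `status = 'active'` filter is applied before ownership.
pub fn lookup_status(sessions: &[TechSession], session_id: u64, caller_id: u64) -> StatusLookup {
    match sessions.iter().find(|s| s.id == session_id) {
        None => StatusLookup::NotFound,
        Some(s) if s.owner_id != caller_id => StatusLookup::Forbidden,
        Some(s) => StatusLookup::Found(s.status),
    }
}
```

With this ordering, the owner of a closed session gets `Found(Closed)`, which the handler can render as `{status: closed}`.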
## Tunnel Channels (Phase 2)
On-demand capabilities layered on top of the tunnel session framework. Each channel is a typed WebSocket payload pair (request/response) routed by `channel_id` under an open `tech_session`. All channel operations are audited per Logging & Audit section.
| # | Feature | Priority | Status | Notes |
|---|---------|----------|--------|-------|
| T1 | Terminal channel (interactive shell) | High | Open | `TunnelDataPayload::Terminal { command }` → `TerminalOutput { stdout, stderr, exit_code }` (types exist in `server/src/ws/mod.rs:310-319`, agent stub at `agent/src/transport/websocket.rs:408-434`). Implement via `tokio::process::Command` with configurable timeout (default 30s). 80% of field use cases. Ship before other channels. |
| T2 | File channel (upload/download/rename/delete + tree browse) | High | Open | Covers D7. Stream file bytes in chunks over WS with progress. Path safety (no `..` traversal). Needs allowlist vs freeform decision. |
| T3 | Registry channel (Windows) | Medium | Open | Covers D8. Read/write/create/delete keys + values. Use `winreg` crate. Gate to tenant admins only. |
| T4 | Service channel (Windows services) | High | Open | Covers D9. List/start/stop/restart/change-startup-type. `windows-service` crate. |
| T5 | Tech-side tunnel subscriber | High | Open | **Blocks all channels.** Browser currently has no mechanism to receive tunnel data from server. Design: `GET /api/v1/tunnel/stream/{session_id}` WebSocket + in-memory `HashMap<session_id, mpsc::Sender<TunnelData>>` pub-sub. |
| T6 | Server-side forward path | High | Open | `server/src/ws/mod.rs:808-825` currently logs+drops incoming `AgentMessage::TunnelData`. Wire to T5 pub-sub + tunnel_audit INSERT. |
| T7 | Working directory / shell choice / elevation decisions | High | Open | Terminal channel design decisions: cwd allowlist vs free-form; PowerShell vs cmd on Windows; admin elevation gating by role. |
| T8 | Channel concurrency + rate limits | Medium | Open | Multiple channels in one session. Per-channel rate/quota. Output size cap (default 1 MB/command). |
| T9 | | | | |
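The T5/T6 pub-sub shape (`HashMap<session_id, mpsc::Sender<TunnelData>>`) can be sketched with the standard library alone. Assumptions, all hypothetical: `std::sync::mpsc` stands in for whatever async channel the server would actually use, `SessionId` is a plain integer, and `TunnelData` here is a simplified placeholder rather than the existing server type.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Assumption: a plain integer stands in for the real session ID type.
pub type SessionId = u64;

/// Simplified placeholder for a tunnel payload routed by `channel_id`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TunnelData {
    pub channel_id: u32,
    pub payload: Vec<u8>,
}

/// In-memory registry mapping an open session to its tech-side subscriber.
#[derive(Default)]
pub struct TunnelHub {
    subscribers: HashMap<SessionId, Sender<TunnelData>>,
}

impl TunnelHub {
    /// Called when the tech's browser opens the stream for a session.
    /// Replaces any earlier subscriber for the same session.
    pub fn subscribe(&mut self, session_id: SessionId) -> Receiver<TunnelData> {
        let (tx, rx) = channel();
        self.subscribers.insert(session_id, tx);
        rx
    }

    /// Called where the server currently logs and drops agent tunnel data.
    /// Returns false when nobody is listening; a subscriber whose receiver
    /// has gone away is pruned from the map.
    pub fn forward(&mut self, session_id: SessionId, data: TunnelData) -> bool {
        let delivered = match self.subscribers.get(&session_id) {
            Some(tx) => tx.send(data).is_ok(),
            None => return false,
        };
        if !delivered {
            self.subscribers.remove(&session_id);
        }
        delivered
    }
}
```

In a real server the map would sit behind shared state (a mutex or a concurrent map), and the `tunnel_audit` INSERT described in T6 would happen alongside `forward`.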
## Logging, Audit & Observability
Three-tier design decided 2026-04-15. Each tier has distinct purpose, storage, retention, and consumer.
**Design principles:**
- **Agent self-logging** uses OS-native mechanisms (no custom transport). Troubleshoot with familiar tools.
- **Client machine health** via OS event log pulls. Feeds dashboard and alerting.
- **Tunnel audit** captured directly to RMM DB. Non-negotiable, never scrubbed, designed for legal/compliance retention.
| # | Feature | Priority | Status | Notes |
|---|---------|----------|--------|-------|
| L1 | Agent self-logging via OS-native sinks | High | Open | Windows Event Log (custom `GuruRMM-Agent` provider registered at install), Linux systemd/journald (`tracing` → stdout when run as unit), macOS unified log (`os_log` crate). Verbosity per-tenant configurable. Default INFO. |
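| L2 | Client event log pull + summarize | High | Open | Agent polls OS event log on schedule; ships filtered events to server `client_events` table. Windows: `Get-WinEvent -FilterHashtable @{LogName='System'; Level=1,2} -MaxEvents N` (level filtering goes through `-FilterHashtable`, since `Get-WinEvent` has no `-Level` parameter; the log name shown is an example). Linux: `journalctl -p err --output json`. macOS: `log show --predicate 'messageType == error' --style json`. |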
| L3 | L2 cadence — default 15-min delta poll + on tunnel open/close | High | Open | Default 900s. On tunnel open: force delta pull so tech has fresh context. On tunnel close: force delta pull to capture anything tech's actions triggered. Configurable per-tenant in dashboard. |
| L4 | L2 levels — default Critical + Error + Warning | High | Open | Configurable per-tenant. Default: Critical(1), Error(2), Warning(3). Separate "noisy" bucket (Info/Debug/Audit/Notification) pulled every 4h default. |
| L5 | Tunnel audit — every tech action persisted | High | Open | Reuse existing `tunnel_audit` table (migration 010, unused today). Every command, file op, registry op, service op gets INSERT with session_id, channel_id, operation, details JSONB. No scrubbing — must retain sensitive input if a tech types it. |
| L6 | Retention config | High | Open | `client_events`: 90 days default, tenant configurable. `tunnel_audit` (live): 90 days default, tenant configurable. `tunnel_audit` (archive): indefinite, system-level rotation to object storage. Agent self-logs follow OS-native retention policy. |
| L7 | Tunnel audit archive rotation | High | Open | Monthly job: aged partitions of `tunnel_audit` → compressed JSONL or Parquet in S3/R2/MinIO. Naming: `tunnel_audit/tenant_id={uuid}/year={YYYY}/month={MM}.jsonl.gz`. Dashboard "deep search" endpoint queries archive on demand (Athena/DuckDB). |
| L8 | Agent config push | High | Open | On agent WS connect, server sends `ServerMessage::Config { tenant_settings }`. Real-time updates when tenant admin changes settings in dashboard. Agent adjusts poll cadence + event level filters live without restart. |
| L9 | Dashboard surfaces for L2 (client_events) | Medium | Open | Red-number badge on agent tile (count of unresolved errors last 24h). Time-sorted feed on agent detail page with filter/search. Acknowledge/dismiss individual events. |
| L10 | Sensitive-data-at-rest protection | High | Open | `tunnel_audit` may contain unscrubbed credentials. Postgres TDE (needs a distribution or extension that provides it; community PostgreSQL has none built in) or full-disk encryption on server. Access to audit tables strictly admin-role-gated. Meta-audit: log every `SELECT` on `tunnel_audit` to separate table. Document in tech SOP: "every tunnel keystroke is logged." |
| L11 | | | | |
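Two of the rules above are mechanical enough to sketch: the L3/L4 split of event levels into a 15-minute bucket and a 4-hour bucket, and the L7 archive object naming. Function and type names are hypothetical, the defaults are hard-coded where the roadmap makes them tenant-configurable, and the tenant ID is passed as a string for simplicity.

```rust
/// Which poll bucket an event level falls into (L3/L4 defaults).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bucket {
    /// Critical(1), Error(2), Warning(3): 15-minute delta poll.
    Primary,
    /// Everything else (info, debug, audit, notification): 4-hour poll.
    Noisy,
}

/// Map a numeric level to its default bucket. Levels outside 1..=3,
/// including 0, are treated as noisy in this sketch.
pub fn bucket_for_level(level: u8) -> Bucket {
    match level {
        1..=3 => Bucket::Primary,
        _ => Bucket::Noisy,
    }
}

/// Default poll interval per bucket, in seconds.
pub fn poll_interval_secs(bucket: Bucket) -> u64 {
    match bucket {
        Bucket::Primary => 900,
        Bucket::Noisy => 4 * 60 * 60,
    }
}

/// Build the L7 archive object key:
/// `tunnel_audit/tenant_id={uuid}/year={YYYY}/month={MM}.jsonl.gz`.
/// Returns None for a month outside 1..=12.
pub fn archive_key(tenant_id: &str, year: u16, month: u8) -> Option<String> {
    if !(1..=12).contains(&month) {
        return None;
    }
    Some(format!(
        "tunnel_audit/tenant_id={}/year={:04}/month={:02}.jsonl.gz",
        tenant_id, year, month
    ))
}
```

The `key=value` path segments follow the Hive-style partition layout that query engines such as Athena and DuckDB can use for partition pruning.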
## Multi-tenancy / MSP SaaS
Goal stated 2026-04-15: make this a marketable product for other MSPs. Multi-tenancy must be baked in from here on — adding `tenant_id` later would be a brutal migration.
| # | Feature | Priority | Status | Notes |
|---|---------|----------|--------|-------|
| M1 | Core tenancy schema | High | Open | New tables: `tenants` (id, name, plan, status, created_at), `tenant_settings` (tenant_id, key, value JSONB), `msp_users` (superadmins across tenants), `tenant_users` (tech ↔ tenant join with role). Add `tenant_id UUID` FK to: `agents`, `tech_sessions`, `tunnel_audit`, `client_events`, `commands`, any other per-customer table. |
| M2 | Tenant-scoped authorization | High | Open | JWT carries `tenant_id` + `role`. Every query must filter by tenant_id (middleware). Super-admin role bypasses for GuruRMM staff. Penalty for bugs here: data leakage across tenants. |
| M3 | Tenant admin dashboard | High | Open | UI for MSP admins to configure their tenant settings (L3/L4/L6 cadences, levels, retention). Super-admin can override across tenants. |
| M4 | Billing / licensing meter | Medium | Open | Per-agent-per-month is standard for RMM. Needs usage counter from day one. Consider Stripe Billing or manual invoicing to start. |
| M5 | Data residency options | Low | Open | Some MSPs require on-prem or regional hosting. Architectural impact: deployment model (single-tenant vs multi-tenant DB), encryption key management. Not required for MVP. |
| M6 | Tenant export API | Medium | Open | MSPs with SOC2/PCI customers will need to export their tenant's audit trail. `GET /api/v1/tenants/{id}/export` producing JSONL or Parquet. Self-service for portability. |
| M7 | Onboarding flow | High | Open | MSP signs up → tenant provisioned → first site created → install link generated → agent installs → first heartbeat → onboarding complete. End-to-end wizard. |
| M8 | | | | |
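The M2 rule (every query filters by `tenant_id`, super-admin bypasses) can be sketched as a single filter that all reads pass through. Everything here is hypothetical: `Claims`, `Role`, `Agent`, and `visible_agents` are illustrative names, `u128` stands in for a UUID, and a production version would push the filter into SQL rather than filter in memory.

```rust
/// Assumption: u128 stands in for the UUID tenant identifier.
pub type TenantId = u128;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    /// GuruRMM staff; bypasses tenant scoping.
    SuperAdmin,
    TenantAdmin,
    Tech,
}

/// The two fields M2 says the JWT carries.
pub struct Claims {
    pub tenant_id: TenantId,
    pub role: Role,
}

pub struct Agent {
    pub tenant_id: TenantId,
    pub hostname: String,
}

/// Return only the agents the caller may see. Tenant-bound roles get
/// rows for their own tenant; SuperAdmin gets every row.
pub fn visible_agents<'a>(claims: &Claims, agents: &'a [Agent]) -> Vec<&'a Agent> {
    agents
        .iter()
        .filter(|a| claims.role == Role::SuperAdmin || a.tenant_id == claims.tenant_id)
        .collect()
}
```

Routing every read through one function like this keeps the "penalty for bugs" surface small: a missing filter becomes a missing call, which is easier to catch in review than a missing `WHERE` clause.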
## Infrastructure / Operations
@@ -57,6 +139,108 @@ Tracked list of desired features, improvements, and changes. Used to evaluate wh
---
## Modular Architecture & Public APIs
Goal stated 2026-04-15: the product should be modular from inception. Future modules under consideration: PSA/CRM, remote syslog aggregation, backups, likely more. Both first-party (us) and eventually third-party (other developers, customers) should be able to build modules against stable, versioned interfaces. End users should also have API access to automate against their own data.
**Architectural principles:**
- **Core is thin + opinionated.** Tenants, agents, auth, audit, command dispatch, tunnel framework — that's the "kernel." Everything else is a module.
- **Modules own their data.** Each module owns a schema namespace (`psa_*`, `backup_*`, `syslog_*`) and never writes directly to another module's tables. Cross-module data access goes through module-exposed APIs.
- **Event bus for cross-cutting communication.** Agent.online, tunnel.opened, command.completed, client_event.received — core publishes, any module subscribes.
- **Public API is a first-class product surface**, not an afterthought. OpenAPI spec, semver-versioned, rate-limited, key-authenticated, documented.
- **Boundary discipline:** if it's tempting to reach across a module boundary, that's a signal to add an API there instead. Breaking this discipline once kills the modularity.
| # | Feature | Priority | Status | Notes |
|---|---------|----------|--------|-------|
| X1 | Core vs. module boundary definition | High | Open | Document what's "core" (tenants, agents, auth, audit, command dispatch, tunnel framework, bootstrap) vs. what's a module (everything else). Codify via separate crates / modules in the Rust workspace (`core/`, `modules/psa/`, `modules/backups/`, etc.). Enforce via build system — module code cannot `use` private core internals, only the exposed `core::api::*` surface. |
| X2 | Module manifest / registration | High | Open | Each module ships a `module.toml` declaring: name, version, provides (APIs exposed), consumes (events/APIs used), permissions required (read_agents, write_commands, read_audit, etc.). Loaded at server startup; dashboard reflects installed modules. |
| X3 | Event bus | High | Open | NATS JetStream or Redis Streams. Every significant core action emits a typed event (`agent.online`, `agent.offline`, `tunnel.opened`, `tunnel.closed`, `command.completed`, `client_event.received`, `tenant.created`). Modules subscribe via the bus, not via direct core calls. Decouples timing + enables async modules. |
| X4 | Module-to-core APIs | High | Open | Core exposes a stable in-process API for modules: `core::agents::list(tenant_id)`, `core::commands::enqueue(...)`, `core::audit::record(...)`. Versioned like `core_api_v1`, `core_api_v2`. Modules declare which version they require. |
| X5 | Module-to-module APIs | Medium | Open | Modules can expose their own APIs for other modules to consume. Example: PSA module exposes `psa::tickets::create()` which a Backups module could call when a backup fails. All via the module registry — no direct imports. |
| X6 | Public REST API (for end users + integrations) | High | Open | Versioned under `/api/public/v1/`. OpenAPI 3.1 spec auto-generated. Rate-limited per API key. Scoped API keys (read-only / write / admin). Separate from internal `/api/v1/` used by dashboard. Publish spec at `/api/public/v1/openapi.json`. |
| X7 | API key management | High | Open | Dashboard UI: tenants create/revoke/rotate API keys, scope per key, view last-used and usage stats. Keys carry tenant_id. JWT session tokens (for dashboard) are separate from API keys (for machines). |
| X8 | Public webhook subscriptions | High | Open | Tenants subscribe to events via webhook URL. Event bus (X3) feeds a delivery worker that signs payloads (HMAC), retries with backoff, tracks delivery status in DB. Lets customers integrate without polling. |
| X9 | Third-party module sandbox | Medium | Open | Future work. Options: (a) WebAssembly modules loaded at runtime with capability-based access to core APIs; (b) signed OCI container images run as sidecars with mTLS. (a) is better UX but maturity risk. (b) is ops-heavy but proven. Decide when third-party demand is real. |
| X10 | Module billing isolation | Medium | Open | Each module can have independent pricing (PSA seat-based, Backups GB-based, RMM per-agent). Core billing meter (M4) becomes per-module, aggregates to tenant invoice. Enable tenants to subscribe to some modules but not others. |
| X11 | Module upgrade independence | Medium | Open | Modules version independently of core. Core API versioning (X4) lets modules pin `core_api_v2` and survive core updates. Dashboard shows which modules need upgrades for a new core release. |
| X12 | Module discoverability / marketplace | Low | Open | Eventually: marketplace UI for MSPs to browse/install first- and third-party modules. Signed+reviewed entries only. Revenue share for third-party developers. A long way off; the design constraint for now is to avoid painting ourselves into a corner. |
| X13 | | | | |
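The publish/subscribe contract in X3 can be illustrated with an in-process stand-in. This is only a sketch of the shape: the roadmap names NATS JetStream or Redis Streams as the real transport, and `Event`, `EventBus`, and the string body are hypothetical simplifications.

```rust
use std::collections::HashMap;

/// Simplified event: a topic such as `agent.online`, the owning tenant,
/// and an opaque body (a real bus would carry a typed payload).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub topic: String,
    pub tenant_id: u128,
    pub body: String,
}

type Handler = Box<dyn Fn(&Event)>;

/// In-process bus: core publishes, modules subscribe by topic.
#[derive(Default)]
pub struct EventBus {
    handlers: HashMap<String, Vec<Handler>>,
}

impl EventBus {
    /// Register a module's handler for one exact topic.
    pub fn subscribe(&mut self, topic: &str, handler: impl Fn(&Event) + 'static) {
        self.handlers
            .entry(topic.to_string())
            .or_default()
            .push(Box::new(handler));
    }

    /// Deliver an event to every handler on its topic and return how
    /// many handlers ran. Publishing to a topic with no subscribers is
    /// not an error.
    pub fn publish(&self, event: &Event) -> usize {
        match self.handlers.get(&event.topic) {
            Some(handlers) => {
                for handler in handlers {
                    handler(event);
                }
                handlers.len()
            }
            None => 0,
        }
    }
}
```

The point of the shape is that core never names a module: a PSA module subscribing to `agent.online` needs no change in core, which is the decoupling X3 asks for.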
### Module candidates currently in mind
Capture these now so the core API design has concrete use cases to validate against:
- **PSA/CRM module** — tickets, time tracking, contracts, invoicing. Likely largest module, heaviest DB load. Consumes: `agent.online`, `client_event.received`, `command.completed`. Exposes: `psa::tickets::create|assign|close`, `psa::time::log`.
- **Remote Syslog module** — aggregates syslog/Windows Event Log from customer devices to a central searchable store. Consumes: `client_event.received`. Exposes: `syslog::query|subscribe`. Heavy ingest.
- **Backups module** — schedules, monitors, reports on backup jobs (Veeam, Datto, Acronis, Synology, etc.). Consumes: integrations with third-party backup products (pull). Exposes: `backups::status|history|alert`. Compliance-sensitive.
- **Patch management** — track OS + app patch levels, schedule installs, report compliance.
- **Documentation (IT Glue-style)** — customer environment docs, credential vault, runbooks. Deep integration with PSA (customer entity shared).
- **Remote access** — already covered by core tunnel framework; could grow into its own "pro" module with session recording, MFA-gated elevation, etc.
- **Network monitoring** — SNMP/ping monitoring of non-agent devices (switches, printers, UPSs).
## Protocol Versioning & Stale-Agent Recovery
Problem surfaced 2026-04-15: as the codebase evolves (multi-tenancy pivot, tunnel channels, new message types), long-offline agents will return to find the wire format they knew is gone. Without an upgrade lane, those agents become zombies — visible in the dashboard as "offline for 47 days," never self-heal, require manual intervention (RDP in, uninstall, reinstall).
Concrete example: Scileppi VP laptop offline for days. When it wakes up and tries to check in with v0.6.0 against a server that by then expects v0.9.x protocol, we need the server to say "I see you, you're old, here's how to update yourself" — and have the agent auto-comply.
**Design principle:** the bootstrap/hello path is sacred. It must never break, even across major protocol revisions. All other endpoints and message shapes are allowed to change. An agent that can still reach `/hello` can always recover.
| # | Feature | Priority | Status | Notes |
|---|---------|----------|--------|-------|
| V1 | Protocol version negotiation on connect | High | Open | Agent sends `{agent_version, protocol_version, os, arch}` as first message. Server responds with `{server_version, min_supported_protocol, latest_protocol, action}` where action ∈ {`proceed`, `upgrade_required`, `rejected`}. WebSocket subprotocol header is one delivery option; a dedicated HTTP hello endpoint is another. Pick one, then never change its shape. |
| V2 | Stable bootstrap endpoint | High | Open | `POST /api/v1/bootstrap/hello` that accepts the agent handshake forever. Contract: input schema is additive-only (new optional fields OK, never rename/remove), output shape is additive-only. Agents as old as v0.1 must be able to hit this and get a meaningful response. |
| V3 | Compat shim layer per old protocol version | High | Open | When an old agent checks in, server translates between the old wire format and current internal types. Shim lives in `server/src/compat/v{N}.rs`. Each shim documents: which protocol versions it supports, what adapters it provides, planned removal date. |
| V4 | Server-initiated forced upgrade instruction | High | Open | When handshake returns `action: upgrade_required`, response also includes `update_url`, `update_checksum`, `update_args`, and optional `restart_policy`. Agent treats this as highest-priority command, bypasses normal command queue, upgrades + relaunches itself. |
| V5 | Agent self-update atomic rename (verify) | High | Exists (hardening needed) | Already done per 2026-04-01 ADR. Audit against V4 flow: does current updater handle "tell me exactly which version to install" vs. "upgrade to latest"? May need parameterization. |
| V6 | Per-version support matrix + sunset policy | High | Open | Dashboard surface: table showing N agents per protocol version per tenant. Automated sunset: when a protocol version has 0 live agents for 60 days across all tenants, flag compat shim for removal in next release. Manual override to force-remove earlier. |
| V7 | Agent version pinning per tenant | Medium | Open | MSP can opt tenants into "stable" (N-1), "current" (latest), or "beta" (preview) update channels. Controls auto-update rollout pace across their fleet. |
| V8 | Late check-in handling: accept then command | High | Open | On stale-agent connect: (a) accept the handshake via compat shim, (b) record the connect event in audit, (c) immediately enqueue the upgrade command, (d) agent executes before any other work. Dashboard shows agent as "upgrading" briefly before "online". |
| V9 | Graceful protocol deprecation warnings | Medium | Open | When an agent connects on a deprecated (but still supported) protocol, server sends a warning field in every response. Agent logs it. Gives MSPs lead time to upgrade their fleet before hard-removal. |
| V10 | Rollback path for bad upgrades | High | Open | If a v0.N upgrade bricks agents, the bootstrap endpoint must let an operator mark v0.N `action: downgrade_required` (an additional action value beyond the three listed in V1) and ship an older binary. Requires keeping old binaries in `/var/www/gururmm/downloads/` with pinned checksums. |
| V11 | | | | |
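The V1 handshake decision can be sketched as a pure function. The roadmap names the three actions but not the exact rule for choosing between them, so the thresholds below are assumptions: integer protocol versions, `proceed` inside the supported window, `upgrade_required` below it, and `rejected` for an agent newer than the server understands.

```rust
/// The three actions listed in V1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Proceed,
    UpgradeRequired,
    Rejected,
}

/// Server-side view of supported protocol versions. Integer versions
/// are an assumption; the roadmap does not fix a version format.
pub struct ServerPolicy {
    pub min_supported_protocol: u32,
    pub latest_protocol: u32,
}

/// Decide how to answer an agent's hello.
/// Assumed rule (not confirmed by the roadmap):
/// - newer than the server's latest protocol: Rejected
/// - within [min_supported, latest]: Proceed
/// - older than min_supported: UpgradeRequired, so the stale agent is
///   accepted and told to update (V4/V8) rather than turned away
pub fn negotiate(agent_protocol: u32, policy: &ServerPolicy) -> Action {
    if agent_protocol > policy.latest_protocol {
        Action::Rejected
    } else if agent_protocol >= policy.min_supported_protocol {
        Action::Proceed
    } else {
        Action::UpgradeRequired
    }
}
```

Keeping the decision free of I/O means it can be unit-tested against every protocol version still present in the fleet before a compat shim is removed.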
## Certificates & Trust
Code signing and TLS/trust certificates required to ship + operate the product without install-time friction. Decisions 2026-04-15.
| # | Item | Priority | Status | Cost | Notes |
|---|------|----------|--------|------|-------|
| C1 | Azure Trusted Signing — Windows agent + installer | High | In progress (2026-04-15) | ~$9.99/mo + per-sig fee | Hosted signing service. Bypasses hardware-token requirement that took effect June 2023. Public Trust level requires 3+ yrs business existence; Private Trust available immediately but limited usefulness. Identity verification via Microsoft takes days. See setup steps in session-logs/2026-04-15. |
| C2 | Apple Developer Program — macOS agent notarization | High | Open | $99/yr | Developer ID Application + Installer certs; notarization via `xcrun notarytool`; Hardened Runtime entitlements; ticket stapling for offline installs. Enrollment can take days — start early. |
| C3 | GPG signing — Linux .deb / .rpm packages | High | Open | Free | Generate key pair, publish pubkey at a stable URL, sign packages with `debsign`/`rpmsign`, host signed apt/yum repo with proper `Release`/`repomd.xml`. |
| C4 | Timestamping — all signed artifacts | High | Open | Free | Use DigiCert or Sectigo public timestamp servers so signatures remain valid after cert rotation. Verify in CI that every signed binary has a valid timestamp. |
| C5 | TLS automation for own domains | High | Done | Free | Cloudflare + Let's Encrypt already in place for `rmm-api.azcomputerguru.com`. Wildcard for `*.gururmm.com` when that domain lights up. |
| C6 | Per-Partner white-label custom domains | Medium | Open | ~$7/mo/domain via CF-for-SaaS, or DIY with ACME DNS-01 | Partners want `rmm.theirbrand.com`. Decide: host certs ourselves via ACME DNS-01 + Cloudflare API, or use Cloudflare for SaaS. Defer until first Partner asks. |
| C7 | Agent-to-server mTLS (enterprise option) | Low | Open | Internal CA + time | Self-signed CA + per-agent client certs. Bootstrap enrolls agent and issues cert scoped to `agent_id`. Adds install complexity. Defer until an enterprise customer demands it. |
| C8 | SBOM + Sigstore/cosign provenance | Medium | Open | Free | Auto-generate CycloneDX or SPDX SBOM per release. `cosign` sign artifacts + container images. Important for SOC2-conscious MSPs evaluating supply chain. |
| C9 | Windows Defender / vendor FP submission runbook | Medium | Open | — | Despite valid signing, heuristic engines flag new binaries. Keep a runbook with submission portal links (Microsoft Security Intelligence, Malwarebytes, etc.). |
| C10 | Email sending trust: DKIM / SPF / DMARC | Medium | Open | Free | Required when PSA module sends ticket notifications. Set up on sending domain; per-Partner if white-labeled email is a feature. |
| C11 | WHQL driver signing | Deferred | Open | $$$ + weeks turnaround | Only if we ship a kernel driver. Avoid this path — use user-mode alternatives first. |
| C12 | | | | | |
## Decisions Log
Short record of why things are the shape they are. Append, don't edit.
**2026-04-15 — Tunnel Phase 1 verified live.** End-to-end test from off-LAN workstation via `rmm-api.azcomputerguru.com`. Open/status/close lifecycle works. Confirmed nginx proxies `/api/*` (not just `/downloads/`). See session-logs/2026-04-15-session.md.
**2026-04-15 — Logging split into three tiers.** Decided against a single custom log transport. Agent self-logging to OS-native sinks (Event Viewer / journald / os_log). Client machine health via OS event log pulls. Tunnel audit direct to RMM DB. Rationale: sysadmins can troubleshoot with familiar tools; only high-value audit data hits our DB.
**2026-04-15 — Tunnel audit is never scrubbed.** If a tech types a password during a session, it gets stored. Purpose is to audit tech behavior, and scrubbing would undermine that. Offsetting controls: encryption at rest, admin-role-gated access, meta-audit of log views, tech SOP documentation. See L10.
**2026-04-15 — Multi-tenancy from day one.** Target market is MSPs reselling this product. Adding `tenant_id` retroactively after feature growth is a brutal migration; baking it in now is cheap. Every new table gets `tenant_id` FK from here forward.
**2026-04-15 — Poll cadences.** 15-min delta + on-tunnel-open/close for critical+error+warning. 4h bulk for info/debug/audit/notification. All tenant-configurable.
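The cadence split above could be modeled roughly as follows. This is a minimal sketch: `PollConfig`, its field names, and the `Level` enum are hypothetical and not taken from the codebase; only the 15-minute and 4-hour defaults and the level grouping come from the decision itself.

```rust
use std::time::Duration;

/// Event severities named in the poll-cadence decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Critical,
    Error,
    Warning,
    Info,
    Debug,
    Audit,
    Notification,
}

/// Per-tenant poll settings (hypothetical struct; field names are illustrative).
#[derive(Debug, Clone)]
pub struct PollConfig {
    pub delta_interval: Duration,
    pub bulk_interval: Duration,
}

impl Default for PollConfig {
    /// Defaults from the decision: 15-minute delta, 4-hour bulk.
    fn default() -> Self {
        Self {
            delta_interval: Duration::from_secs(15 * 60),
            bulk_interval: Duration::from_secs(4 * 60 * 60),
        }
    }
}

impl PollConfig {
    /// True for levels on the fast delta path, which is also the set
    /// force-pulled on tunnel open/close.
    pub fn is_delta_level(level: Level) -> bool {
        matches!(level, Level::Critical | Level::Error | Level::Warning)
    }

    /// Poll interval that applies to a given level under this tenant's config.
    pub fn interval_for(&self, level: Level) -> Duration {
        if Self::is_delta_level(level) {
            self.delta_interval
        } else {
            self.bulk_interval
        }
    }
}
```

A tenant override would then be a `PollConfig` with different durations, pushed to the agent through whatever config channel is chosen.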
**2026-04-15 — Retention.** 90 days default for tenant-visible tables. Indefinite system-level for `tunnel_audit` with object-storage archive after the tenant-visible window. Legal/compliance contexts (HIPAA 6yr, PCI 1yr) handled by per-tenant extended retention configs.
**2026-04-15 — Hierarchy terminology locked.** Platform > Partner (MSP, DB: tenant_id) > Client > Site > Agent. API and UI say "Partner"; DB says `tenant_id`. No "sub-tenant", no ambiguous "customer". Department/OU tier deferred. MSPs can white-label labels via JSONB overrides. See Terminology section at top of this file.
**2026-04-15 — Modular architecture from day one.** Core = tenants + agents + auth + audit + commands + tunnel framework + bootstrap. Everything else = module. Modules own their schema namespace, never touch each other's tables, communicate via event bus (X3) and versioned module APIs (X4/X5). Public REST API (X6) separate from internal dashboard API. Webhook subscriptions (X8) for customer integrations. Third-party modules via WASM or signed containers — deferred but design-constrained now. Concrete module candidates: PSA/CRM, remote syslog, backups, patch management, IT-Glue-style docs, network monitoring. See X1-X12.
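To illustrate the "communicate via event bus" rule, here is a small in-process stand-in. It is not the planned bus (the roadmap names NATS JetStream or Redis Streams as candidates); `EventBus`, `subscribe`, and `publish` are hypothetical names used only to show modules reacting to topics without touching each other's tables.

```rust
use std::collections::HashMap;

/// A handler receives the event payload as text (illustrative simplification).
type Handler = Box<dyn Fn(&str)>;

/// In-process publish/subscribe stand-in for the real event bus.
#[derive(Default)]
pub struct EventBus {
    handlers: HashMap<String, Vec<Handler>>,
}

impl EventBus {
    /// Register a module's handler for one topic, e.g. "agent.online".
    pub fn subscribe(&mut self, topic: &str, handler: impl Fn(&str) + 'static) {
        self.handlers
            .entry(topic.to_string())
            .or_default()
            .push(Box::new(handler));
    }

    /// Deliver a payload to every handler on the topic.
    /// Returns the number of handlers notified.
    pub fn publish(&self, topic: &str, payload: &str) -> usize {
        match self.handlers.get(topic) {
            Some(list) => {
                for handler in list {
                    handler(payload);
                }
                list.len()
            }
            None => 0,
        }
    }
}
```

The point of the shape is that a publisher never names its consumers: a PSA module and a syslog module could both subscribe to `agent.online` without core knowing either exists.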
**2026-04-15 — Bootstrap endpoint is sacred.** Protocol version negotiation via a single `/api/v1/bootstrap/hello` endpoint whose input/output are additive-only forever. Every other endpoint/message is free to evolve. Enables late-arriving agents (Scileppi VP example: offline for days, wakes up to find a newer server protocol) to reconnect, get accepted, and receive an automatic upgrade instruction. Compat shim layer per old protocol version with automated sunset policy when fleet-wide usage hits zero. See V1-V10.
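One way the hello negotiation might look on the server side is sketched below. It assumes the protocol version is a single integer and that the server knows which old versions still have a compat shim. Only the `downgrade_required` action appears in the roadmap; `HelloAction`, `Server`, and the other variant names are hypothetical.

```rust
use std::cmp::Ordering;

/// Server's answer to a bootstrap hello. Variant names other than
/// `DowngradeRequired` are illustrative, not from the roadmap.
#[derive(Debug, PartialEq, Eq)]
pub enum HelloAction {
    /// Agent and server speak the same protocol version.
    Proceed,
    /// Agent is older than the server and should upgrade.
    UpgradeRequired { target_protocol: u32, shim_available: bool },
    /// Agent is newer than the server (rollback path).
    DowngradeRequired { target_protocol: u32 },
}

/// Minimal server-side state for the negotiation (hypothetical).
pub struct Server {
    pub protocol: u32,
    /// Old protocol versions that still have a compat shim.
    pub shims: Vec<u32>,
}

impl Server {
    /// Decide what to tell an agent that announces `agent_protocol`.
    pub fn hello(&self, agent_protocol: u32) -> HelloAction {
        match agent_protocol.cmp(&self.protocol) {
            Ordering::Equal => HelloAction::Proceed,
            Ordering::Less => HelloAction::UpgradeRequired {
                target_protocol: self.protocol,
                shim_available: self.shims.contains(&agent_protocol),
            },
            Ordering::Greater => HelloAction::DowngradeRequired {
                target_protocol: self.protocol,
            },
        }
    }
}
```

Under this sketch a laptop that wakes up on protocol 2 against a protocol 3 server still gets a well-formed answer, which is the property the additive-only rule is meant to protect.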
## Rewrite Assessment
**Criteria for rewrite:**
@@ -64,4 +248,4 @@ Tracked list of desired features, improvements, and changes. Used to evaluate wh
- If the tech stack is fundamentally wrong for the goals
- If accumulated tech debt makes changes unreasonably slow
**Current assessment:** TBD -- add features above first, then evaluate.
**Current assessment (2026-04-15):** The multi-tenancy pivot means a schema refactor is unavoidable (add `tenant_id` everywhere, tenancy-aware auth middleware). This is additive, not a rewrite. Rust + Axum + Postgres + WebSocket stack remains fit for purpose. Current code is a solid foundation. No rewrite planned; structural additions tracked above.


@@ -0,0 +1,162 @@
# GuruRMM Session Log — 2026-04-15
## Context
End-to-end test of the Tunnel Phase 1 lifecycle, triggered opportunistically
while troubleshooting SSH flakiness on AD2 (Dataforth project). No code
changes — exercised the production API from an off-LAN workstation via the
public Cloudflare endpoint (`rmm-api.azcomputerguru.com`).
## What worked
| Step | Endpoint | Result |
|---|---|---|
| Login | `POST /api/auth/login` | 200, token returned |
| List agents | `GET /api/agents` | 6 agents, AD2 and DESKTOP-0O8A1RL online on v0.6.0 |
| Open tunnel | `POST /api/v1/tunnel/open` (agent_id=AD2 `d28a1c90-47d7-448f-a287-197bc8892234`) | 200, `{session_id: 0682a80c-a899-403b-9473-aaaed50e4aba, status: active}` |
| Status while active | `GET /api/v1/tunnel/status/{id}` | 200, full session record (opened_at, last_activity, agent_id) |
| Close tunnel | `POST /api/v1/tunnel/close` | 200, `{status: closed}` |
## Findings (actionable)
### 1. Status endpoint returns 403 after close
`GET /api/v1/tunnel/status/{id}` against a just-closed session returns
`403 Forbidden — "Session not found or not owned by user"` instead of
`{status: closed}`. Root cause likely that the `WHERE status = 'active'`
filter (from `idx_tech_sessions_active` — see CONTEXT.md line 256) is applied
to the status lookup in addition to the ownership check, so closed sessions
fail ownership verification and fall through to the 403 branch.
**Fix:** separate the existence lookup from the ownership check. If the
session exists and belongs to the requesting tech, return the closed record
rather than masking it as a permission error.
Location to inspect: `server/src/api/tunnel.rs` (status handler) and/or
`server/src/db/tunnel.rs` (session fetch query).
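A minimal sketch of the two-step lookup, using an in-memory map in place
of the real query. All names here (`Session`, `StatusError`,
`session_status`) are hypothetical, and the 404/403 mapping in the
comments is an assumption, not the handler's current behavior.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Active,
    Closed,
}

/// Stand-in for a tunnel session row (illustrative fields only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Session {
    pub id: String,
    pub tech_id: u32,
    pub status: SessionStatus,
}

#[derive(Debug, PartialEq, Eq)]
pub enum StatusError {
    /// No session with that id at all (assumed to map to 404).
    NotFound,
    /// Session exists but belongs to another tech (assumed to map to 403).
    Forbidden,
}

/// Step 1 finds the session regardless of status; step 2 checks ownership.
/// A closed session owned by the caller is returned, not masked as an error.
pub fn session_status(
    sessions: &HashMap<String, Session>,
    session_id: &str,
    requesting_tech: u32,
) -> Result<SessionStatus, StatusError> {
    let session = sessions.get(session_id).ok_or(StatusError::NotFound)?;
    if session.tech_id != requesting_tech {
        return Err(StatusError::Forbidden);
    }
    Ok(session.status)
}
```

In the real handler the first step would be a query without the
`status = 'active'` filter; whether to distinguish 404 from 403 for
sessions owned by someone else is a separate policy choice.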
### 2. Agent writes no logs
`gururmm-agent.exe 0.6.0` on AD2 produces no files in
`C:\Program Files\GuruRMM\`, `C:\ProgramData\GuruRMM\`, nor any Windows
Application Event Log entries under provider `gururmm*`. This made it
impossible to confirm the agent-side state transition
(`Heartbeat → Tunnel`) or receipt of `TunnelReady` during the test.
**Fix:** add a log target in `agent/src/main.rs` (`env_logger`, or `tracing`
with a rolling file appender) writing to
`C:\ProgramData\GuruRMM\agent.log`. Optionally also emit critical events
(tunnel open/close, update success/failure) to the Windows Event Log via
the `eventlog` crate.
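As a placeholder for what the log target needs to do, here is a
standard-library-only sketch of appending timestamped lines to a file.
It is deliberately not `tracing` or `env_logger` and has no rotation;
`log_line` is a hypothetical helper, and the path is passed in by the
caller rather than hard-coded.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Append one line to the log file, creating the parent directory if needed.
/// Line format (illustrative): "<unix seconds> [<LEVEL>] <message>".
pub fn log_line(path: &Path, level: &str, message: &str) -> io::Result<()> {
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    // Fall back to 0 if the system clock is set before the Unix epoch.
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    // Open in append mode so restarts never truncate earlier entries.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{secs} [{level}] {message}")
}
```

Calling this at the `Heartbeat → Tunnel` transition and on `TunnelReady`
would have been enough to confirm agent-side state during this test; a
real implementation would likely want rotation and structured fields.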
### 3. Phase 2 gap confirmed against a real use case
Live need: run a couple of diagnostic commands on AD2 (sshd flapping
sporadically on port 22, no process crash in Event Log; want to investigate
firewall/Defender events from the server side). With no channels, the
tunnel's only utility today is proving the session layer works. The actual
remote-operate capability still depends on Phase 2.
**Priority order for Phase 2 channels** (based on what would have been useful
here):
1. **Terminal channel** first — unlocks 80% of field use cases (log tails,
`Get-Service`, `Restart-Service`, `Get-WinEvent`).
2. **Service channel** second — tight scope, high value for "restart sshd".
3. **File channel** third — needed but rarely urgent; SFTP already exists.
4. **Registry channel** last — niche, can defer.
## What Else We Observed
- The public tunnel chain `rmm-api.azcomputerguru.com` → Cloudflare → nginx
→ API (3001) proxies `/api/*` correctly. The docs in CONTEXT.md implied
nginx only served `/downloads/`; confirmed today that it also proxies API
paths, which is why off-LAN admin usage works.
- AD2 agent start time `2026-04-11 22:09` corresponds to last reboot of
AD2; the agent has not restarted since despite sshd port flaps (sshd PID
4012 also continuously running since same moment). Confirms the tunnel
infrastructure and the RMM agent are stable; the sshd flap is a separate
network-layer issue unrelated to GuruRMM.
## Credentials Used
- **Admin Email:** admin@azcomputerguru.com
- **Admin Password:** GuruRMM2025
- **Public API:** https://rmm-api.azcomputerguru.com
**Note:** `op read "op://Infrastructure/GuruRMM Server/Admin Password"`
returned a stale value (`ClaudeAPI2026!@#`) that fails login. The
2026-04-14 session log documents the current password as `GuruRMM2025`.
1Password entry should be updated to match.
## Next Steps
1. Update 1Password `Infrastructure/GuruRMM Server` entry — set
`Admin Password` field to `GuruRMM2025` to match what server accepts.
2. Fix `/api/v1/tunnel/status/{id}` for closed sessions (see Finding 1).
3. Add file/event-log output to agent (see Finding 2).
4. Begin Phase 2 — Terminal channel first.
---
## Update (evening session): Roadmap evolution + Azure Trusted Signing setup
Substantial architectural planning session. Product direction shifted from "single-tenant RMM tool" to "multi-tenant SaaS for MSPs." Roadmap updated significantly to reflect this.
### Roadmap additions to ROADMAP.md
1. **Terminology (canonical)** — locked in the 5-tier hierarchy: Platform → Partner (DB: tenant_id) → Client → Site → Agent. API/UI says "Partner"; DB column is `tenant_id`. API path convention `/api/public/v1/partners/{pid}/clients/{cid}/sites/{sid}/agents/{aid}`. Event topics like `agent.online`, `partner.upgraded`. Full table + rules at top of ROADMAP.md.
2. **Tunnel Channels (Phase 2)** — T1-T8 tracking Terminal/File/Registry/Service channels + tech-side subscriber (T5 is gating dep — browser currently has no way to receive tunnel data, `server/src/ws/mod.rs:808-825` discards incoming `AgentMessage::TunnelData`).
3. **Logging, Audit & Observability** — L1-L10 three-tier design:
- Agent self-logging via OS-native sinks (Windows Event Log custom provider, Linux journald, macOS os_log)
- Client machine health via OS event log pulls — default 15-min delta + force-pull on tunnel open/close; default levels Critical+Error+Warning for delta, 4h bulk for Info/Debug/Audit/Notification; all tenant-configurable
- Tunnel audit direct to DB table `tunnel_audit` (already exists, unused) — no scrubbing, sensitive input captured intentionally for tech-behavior audit; 90-day tenant-visible retention default; indefinite system archive to object storage
- Agent config push via `ServerMessage::Config` on connect + real-time when tenant admin changes settings
4. **Multi-tenancy / MSP SaaS (M1-M7)** — tenant_id on every table from now forward, tenancy-aware auth middleware, tenant admin dashboard, per-agent/month billing meter, data residency options, tenant export API, onboarding wizard.
5. **Modular Architecture & Public APIs (X1-X12)** — core vs. module boundary, event bus (NATS JetStream or Redis Streams), module manifest, module-to-core + module-to-module versioned APIs, public REST API `/api/public/v1/` with OpenAPI spec + scoped API keys, webhook subscriptions, WASM or OCI sandbox for third-party modules (deferred), per-module billing. Concrete module candidates documented: PSA/CRM, Remote Syslog, Backups, Patch Mgmt, IT-Glue-style Docs, Network Monitoring.
6. **Protocol Versioning & Stale-Agent Recovery (V1-V10)** — `/api/v1/bootstrap/hello` declared **sacred** (additive-only forever). Compat shim layer per old protocol version at `server/src/compat/v{N}.rs`. Server-initiated forced-upgrade instruction. Per-tenant update channels (stable/current/beta). Auto-sunset policy when old version fleet hits zero. Rollback path via `action: downgrade_required`. Concrete motivating example: Scileppi VP laptop offline for days — must be able to reconnect, get accepted, auto-upgrade.
7. **Certificates & Trust (C1-C11)** — full cost + priority matrix. C1: Azure Trusted Signing for Windows (Public Trust). C2: Apple Developer Program. C3: GPG for Linux. C4-C11: TLS automation, mTLS, SBOM, FP submissions, DKIM.
8. **Decisions Log** — appended rationale entries for every 2026-04-15 decision so future sessions don't re-litigate.
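
The public path convention from item 1 can be sketched as a small builder. `AgentRef` and its field names are hypothetical; only the path template itself comes from the roadmap.

```rust
/// Identifiers for one agent in the Partner > Client > Site > Agent hierarchy.
/// The struct and field names are illustrative, not from the codebase.
pub struct AgentRef<'a> {
    pub partner_id: &'a str,
    pub client_id: &'a str,
    pub site_id: &'a str,
    pub agent_id: &'a str,
}

impl AgentRef<'_> {
    /// Build the public REST path for this agent. The API says "partners"
    /// even though the database column is `tenant_id`.
    pub fn public_path(&self) -> String {
        format!(
            "/api/public/v1/partners/{}/clients/{}/sites/{}/agents/{}",
            self.partner_id, self.client_id, self.site_id, self.agent_id
        )
    }
}
```

Keeping the template in one place like this is one way to stop "tenant" from leaking into URLs as handlers are added.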
### CONTEXT.md anti-patterns added
- "DO NOT make breaking changes to `/api/v1/bootstrap/hello`" — additive-only forever
- "DO NOT cross module boundaries by importing another module's internals" — event bus or exposed APIs only
- Hierarchy terminology table added to anti-patterns block (canonical reference)
### Azure Trusted Signing — provisioned and IV submitted
**Business identity confirmed** via D&B profile lookup: `Arizona Computer Guru LLC` (D-U-N-S `00-566-1506` / `005661506`), 7437 E 22ND St, Tucson AZ 85710, (520) 304-8300, mike@azcomputerguru.com. 25+ years operating history → Public Trust eligible (>3yr threshold).
**Provisioned in subscription `Basic` (`e507e953-2ce9-4887-ba96-9b654f7d3267`):**
- Resource group: `gururmm-signing-rg` (westus2)
- Trusted Signing Account: `gururmm-signing`
- Account URI: `https://wus2.codesigning.azure.net/`
- SKU: Basic (~$9.99/mo billing started 2026-04-16 00:16 UTC)
**RBAC granted:**
- `mike@azcomputerguru.com` → role `Artifact Signing Identity Verifier` at account scope
**Identity Validation submitted:**
- IV ID: `03028768-f611-4904-aa58-c755020f436a`
- Status: `In Progress` (Microsoft review, 1-5 business days typical)
- Submitted name: `Arizona Computer Guru LLC` (state filing); D&B record has older `COMPUTER GURU` Corporation — may need to update D&B profile for consistency
- Primary email: mike@; Secondary: admin@azcomputerguru.com
- Microsoft may call 520-304-8300 — voicemail should identify Computer Guru
**Pending (blocks on IV approval):**
- Certificate Profile creation: `az trustedsigning certificate-profile create --resource-group gururmm-signing-rg --account-name gururmm-signing --profile-name gururmm-public-trust --profile-type PublicTrust --identity-validation-id 03028768-f611-4904-aa58-c755020f436a`
- Signing role assignment: `Trusted Signing Certificate Profile Signer` to CI build principal
- Local tooling install: Windows SDK (for signtool.exe), Microsoft.Trusted.Signing.Client NuGet package
**All details persisted to vault:** `D:\vault\services\azure-trusted-signing.sops.yaml` (encrypted).
### Action items for next session
1. Check IV status — portal → Trusted Signing Accounts → gururmm-signing → Identity Validation
2. If approved → run the cert profile create command (already staged in vault)
3. If Microsoft flags legal name mismatch: reply with AZ Corp Commission LLC Articles; update D&B record
4. Start signtool.exe + dlib integration in a local scratch project
5. Meanwhile, fix the two backlog items (tunnel status 403 bug, agent logging) — they're both independent of the Azure work and small PRs