
Mike Swanson 733d87f20e Dataforth UI push + dedup + refactor, GuruRMM roadmap evolution, Azure signing setup

Dataforth (projects/dataforth-dos/):
- UI feature: row coloring + PUSH/RE-PUSH buttons + Website Status filter
- Database dedup to one row per SN (2.89M -> 469K rows, UNIQUE constraint added)
- Import logic handles FAIL -> PASS retest transition
- Refactored upload-to-api.js to render datasheets in-memory (dropped For_Web filesystem dep)
- Bulk pushed 170,984 records to Hoffman API
- Statistical sanity check: 100/100 stamped SNs verified on Hoffman

GuruRMM (projects/msp-tools/guru-rmm/):
- ROADMAP.md: added Terminology (5-tier hierarchy), Tunnel Channels Phase 2,
  Logging/Audit/Observability, Multi-tenancy, Modular Architecture,
  Protocol Versioning, Certificates sections + Decisions Log
- CONTEXT.md: hierarchy table, new anti-patterns (bootstrap sacred,
  no cross-module imports), revised next-steps priorities

Session logs for both projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-15 17:39:32 -07:00


GuruRMM - Feature Roadmap & Change Requests

Tracked list of desired features, improvements, and changes. Used to evaluate whether the current codebase supports these goals or if a rewrite is needed.

Last Updated: 2026-04-15

Terminology (canonical)

Decided 2026-04-15. Use these exact terms in code, UI, API, docs, and conversation. Don't invent synonyms.

Tier	Term	DB column	Meaning	Example
1	Platform	—	The software author (us)	GuruRMM
2	Partner	`tenant_id`	An MSP — a paying customer of the Platform	"Acme IT Services"
3	Client	`client_id`	A Partner's customer	"Dataforth Corp"
4	Site	`site_id`	A location or logical grouping within a Client	"Dataforth Tucson HQ"
5	Agent	`agent_id`	An endpoint at a Site	AD2, SL-SERVER

Notes:

UI/API use "Partner"; DB uses tenant_id (industry-standard term for isolation). Do not rename tenant_id in code.
"Client" may collide with HTTP-client terminology in context; when ambiguous, use "client org" or "client account".
Site is not always a physical location — can be a DMZ, VLAN, cloud region, whatever grouping makes sense for that Client.
Do not use "sub-tenant" or "customer" (ambiguous across tiers).
User roles: Platform admin (us), Partner admin, Partner tech, Client contact (limited read access to their own data).
Optional Department/OU tier inside a Site is deferred until a real customer asks for it.
MSPs can label-override their UI via partner_settings.label_overrides JSONB (e.g., rename "Client"→"Customer" for their branded view) — supported without schema changes.

API path convention: /api/public/v1/partners/{partner_id}/clients/{client_id}/sites/{site_id}/agents/{agent_id}

Event bus topic convention: agent.online, site.created, client.deleted, partner.upgraded, etc.
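
A minimal sketch of the two conventions above, assuming IDs are already URL-safe strings. `AgentRef`, `agent_path`, and `topic` are hypothetical names used only for illustration; the field names follow the DB columns in the tier table.

```rust
/// Identifiers for one Agent, named after the DB columns in the tier table.
/// Plain strings here; the real schema may well use UUIDs (assumption).
pub struct AgentRef<'a> {
    pub tenant_id: &'a str, // Partner (tier 2)
    pub client_id: &'a str, // Client (tier 3)
    pub site_id: &'a str,   // Site (tier 4)
    pub agent_id: &'a str,  // Agent (tier 5)
}

/// Build the public API path for an Agent following the documented convention.
/// No percent-encoding is applied; IDs are assumed to be URL-safe.
pub fn agent_path(r: &AgentRef) -> String {
    format!(
        "/api/public/v1/partners/{}/clients/{}/sites/{}/agents/{}",
        r.tenant_id, r.client_id, r.site_id, r.agent_id
    )
}

/// Build an event bus topic as `<entity>.<verb>`, lowercased, e.g. `agent.online`.
pub fn topic(entity: &str, verb: &str) -> String {
    format!("{}.{}", entity.to_lowercase(), verb.to_lowercase())
}
```

Note that the path segment says `partners` while the value comes from `tenant_id`, which mirrors the UI/API versus DB split described in the notes.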

Dashboard / UI

#	Feature	Priority	Status	Notes
D1	All metrics clickable to relevant content	High	Done	Stat cards link to filtered agent views
D2	Dark theme with branded sidebar	High	Done	JetBrains Mono + Plus Jakarta Sans, GURURMM MISSION CONTROL branding
D3	Command cancel/delete/clear history	Medium	Done	Cancel pending/running, delete any, bulk clear finished
D4	Global search across all agent details	High	Open	Search by hostname, MAC, IP, OS, version -- any agent field. Dashboard main page.
D5	Clickable metric cards on agent detail -> drill-down views	High	Open	CPU card -> process list sorted by CPU%. Memory card -> process list sorted by RAM. Disk card -> drive/folder usage breakdown. Sortable tables.
D6	Real-time terminal (PS/cmd) via WebSocket tunnel	High	Open	Interactive shell session relayed through server. Separate from check-in process. Spawns on demand, full bidirectional I/O.
D7	Remote file system browser	High	Open	Browse, upload, download, rename, delete files on agent. Tree view + detail pane. Via real-time tunnel.
D8	Remote registry editor (Windows)	Medium	Open	Browse/edit/create/delete registry keys and values. Tree view like regedit. Via real-time tunnel.
D9	Remote services manager	High	Open	List all services with status. Start/stop/restart/disable/enable/edit startup type. Sortable, searchable. Via real-time tunnel.
D10

Agent / Installer

#	Feature	Priority	Status	Notes
A1	Site-code-based installers (no API keys)	High	Done	/install/:site_code/* endpoints, binary with embedded config
A2	Public shareable install links per client	High	Done	Landing page at /install/:site_code with OS detection
A3	Capture full OS detail (distro/version)	High	Open	Linux agents just report "linux" -- should capture distro name and version (e.g., Ubuntu 22.04, Debian 12). Agent-side change to collect, server-side to store/display.
A4	Reliable CPU/GPU temperature collection	High	Open	Not working on any machine currently. Windows: WMI/OpenHardwareMonitor/LibreHardwareMonitor. Linux: lm-sensors/sysfs thermal zones. Need fallback chain.
A5	Process list collection (CPU%, RAM, disk I/O)	High	Open	Needed for D5 drill-downs. Agent collects top processes, sends on demand or as part of extended state.
A6	Disk usage detail (per-drive, large folders)	Medium	Open	Needed for D5 disk drill-down. Per-partition usage + optional large folder scan.
A7
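
For A3, one possible agent-side approach on Linux is to read the standard os-release file (normally `/etc/os-release`) and pull out `NAME` and `VERSION_ID`. The sketch below parses the file contents only; `DistroInfo` and `parse_os_release` are hypothetical names, and how the result is shipped to the server is left open.

```rust
/// Distro name and version as read from os-release style `KEY=value` text.
#[derive(Debug, PartialEq)]
pub struct DistroInfo {
    pub name: String,
    pub version: String,
}

/// Parse the contents of an os-release file.
/// Returns None when `NAME` is missing. `VERSION_ID` is optional in the
/// os-release format, so it defaults to an empty string.
pub fn parse_os_release(contents: &str) -> Option<DistroInfo> {
    let mut name: Option<String> = None;
    let mut version = String::new();
    for line in contents.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blanks and comments
        }
        if let Some((key, raw)) = line.split_once('=') {
            // Values may be wrapped in single or double quotes.
            let value = raw
                .trim()
                .trim_matches(|c: char| c == '"' || c == '\'')
                .to_string();
            match key {
                "NAME" => name = Some(value),
                "VERSION_ID" => version = value,
                _ => {} // PRETTY_NAME, ID, etc. are ignored in this sketch
            }
        }
    }
    name.map(|name| DistroInfo { name, version })
}
```

Typical usage would pass in the result of `std::fs::read_to_string("/etc/os-release")` and fall back to the current "linux" string when parsing returns `None`.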

Server / API

#	Feature	Priority	Status	Notes
S1	Claude Code integration (claude_task command type)	Medium	Planned	gururmm-agent project has the Rust module, not yet integrated
S2	Stackable/inheritable policy system	High	Open	Policies at Client > Site > Agent levels (canonical terms; previously written as Company > Site > Machine). Lower level overrides higher. Merge behavior for non-conflicting settings.
S3	Dynamic groups based on agent attributes	High	Open	Rule-based groups (e.g., RAM <= 8GB, OS = Windows 10, disk > 90%). Policies can target dynamic groups.
S4	Policy actions: custom script execution	High	Open	Policies can trigger scripts (PowerShell/bash) on matching agents. Scheduled or on-demand.
S5	Customizable alerting system	High	Open	User-defined alert rules: offline detection, disk space thresholds, SMART errors, RAID degradation, bad sectors, CPU/RAM sustained high, temp thresholds. Configurable severity, notification channels, escalation.
S6	Alert notification channels	Medium	Open	Email, webhook, Slack/Teams integration, push notifications. Per-alert-rule routing.
S7	Real-time tunnel mechanism (separate from check-in)	High	Phase 1 Done	Session lifecycle REST+WS+DB+agent state machine complete (2026-04-14 / verified 2026-04-15). Phase 2 (channels) tracked under Tunnel Channels section below.
S8	Closed-session status endpoint returns 403	Medium	Open	`GET /api/v1/tunnel/status/{id}` returns 403 for closed sessions (should return `{status: closed}`). Root cause: `verify_session_ownership()` applies `WHERE status='active'` before ownership check. Fix in `server/src/db/tunnel.rs:94-103`.
S9
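
The override-and-merge rule in S2 could look roughly like the following, assuming each level's policy is a flat map of setting name to value. `Policy` and `effective_policy` are hypothetical names, and real policies would likely carry typed values rather than strings.

```rust
use std::collections::BTreeMap;

/// A policy layer: setting name -> value. Strings keep the sketch simple.
pub type Policy = BTreeMap<String, String>;

/// Merge layers ordered from highest level to lowest.
/// A lower level overrides a higher one on conflict; settings that only one
/// layer defines pass through unchanged.
pub fn effective_policy(layers: &[&Policy]) -> Policy {
    let mut merged = Policy::new();
    for layer in layers {
        for (key, value) in layer.iter() {
            // Later (lower-level) layers overwrite earlier entries.
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}
```

With this shape, a dynamic group (S3) could be treated as one more layer in the slice, though where it sits in the ordering is a decision the roadmap has not made yet.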

Tunnel Channels (Phase 2)

On-demand capabilities layered on top of the tunnel session framework. Each channel is a typed WebSocket payload pair (request/response) routed by channel_id under an open tech_session. All channel operations are audited per the Logging, Audit & Observability section.
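
As a rough illustration of routing by channel_id inside one session, the sketch below keeps a per-session table of open channels and refuses frames for unknown ids. `TechSession`, `ChannelKind`, `open_channel`, and `route` are hypothetical names; the existing types in `server/src/ws/mod.rs` may be shaped differently.

```rust
use std::collections::HashMap;

/// Channel types, mirroring the terminal, file, registry, and service channels.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ChannelKind {
    Terminal,
    File,
    Registry,
    Service,
}

/// One open tech session and the channels multiplexed over it.
#[derive(Default)]
pub struct TechSession {
    channels: HashMap<u32, ChannelKind>,
    next_channel_id: u32,
}

impl TechSession {
    pub fn new() -> Self {
        Self::default()
    }

    /// Open a channel and return the id that later frames must carry.
    pub fn open_channel(&mut self, kind: ChannelKind) -> u32 {
        let id = self.next_channel_id;
        self.next_channel_id += 1;
        self.channels.insert(id, kind);
        id
    }

    /// Look up which handler a frame belongs to; unknown ids are refused.
    pub fn route(&self, channel_id: u32) -> Result<ChannelKind, String> {
        self.channels
            .get(&channel_id)
            .copied()
            .ok_or_else(|| format!("unknown channel_id {}", channel_id))
    }
}
```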

#	Feature	Priority	Status	Notes
T1	Terminal channel (interactive shell)	High	Open	`TunnelDataPayload::Terminal { command }` → `TerminalOutput { stdout, stderr, exit_code }` (types exist in `server/src/ws/mod.rs:310-319`, agent stub at `agent/src/transport/websocket.rs:408-434`). Implement via `tokio::process::Command` with configurable timeout (default 30s). 80% of field use cases. Ship before other channels.
T2	File channel (upload/download/rename/delete + tree browse)	High	Open	Covers D7. Stream file bytes in chunks over WS with progress. Path safety (no `..` traversal). Needs allowlist vs freeform decision.
T3	Registry channel (Windows)	Medium	Open	Covers D8. Read/write/create/delete keys + values. Use `winreg` crate. Gate to tenant admins only.
T4	Service channel (Windows services)	High	Open	Covers D9. List/start/stop/restart/change-startup-type. `windows-service` crate.
T5	Tech-side tunnel subscriber	High	Open	Blocks all channels. Browser currently has no mechanism to receive tunnel data from server. Design: `GET /api/v1/tunnel/stream/{session_id}` WebSocket + in-memory `HashMap<session_id, mpsc::Sender<TunnelData>>` pub-sub.
T6	Server-side forward path	High	Open	`server/src/ws/mod.rs:808-825` currently logs+drops incoming `AgentMessage::TunnelData`. Wire to T5 pub-sub + tunnel_audit INSERT.
T7	Working directory / shell choice / elevation decisions	High	Open	Terminal channel design decisions: cwd allowlist vs free-form; PowerShell vs cmd on Windows; admin elevation gating by role.
T8	Channel concurrency + rate limits	Medium	Open	Multiple channels in one session. Per-channel rate/quota. Output size cap (default 1 MB/command).
T9
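
For the path-safety requirement in T2, a lexical check along these lines rejects `..` traversal before any file operation runs. It assumes the file channel is confined to a root directory, which is one possible outcome of the allowlist-versus-freeform decision, not a settled one. `safe_join` is a hypothetical helper.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve `requested` under `root`, rejecting anything that could escape it.
/// Lexical check only: it does not touch the filesystem, so symlinks are not
/// resolved here and would need separate handling in a real implementation.
pub fn safe_join(root: &Path, requested: &str) -> Option<PathBuf> {
    let mut out = root.to_path_buf();
    for part in Path::new(requested).components() {
        match part {
            Component::Normal(segment) => out.push(segment),
            Component::CurDir => {} // "." is harmless, skip it
            // "..", absolute roots, and Windows drive prefixes are refused.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(out)
}
```

Refusing every `..` component, even ones that would stay inside the root, keeps the rule simple to audit at the cost of rejecting a few legitimate paths.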

Logging, Audit & Observability

Three-tier design decided 2026-04-15. Each tier has distinct purpose, storage, retention, and consumer.

Design principles:

Agent self-logging uses OS-native mechanisms (no custom transport). Troubleshoot with familiar tools.
Client machine health via OS event log pulls. Feeds dashboard and alerting.
Tunnel audit captured directly to RMM DB. Non-negotiable, never scrubbed, designed for legal/compliance retention.

#	Feature	Priority	Status	Notes
L1	Agent self-logging via OS-native sinks	High	Open	Windows Event Log (custom `GuruRMM-Agent` provider registered at install), Linux systemd/journald (`tracing` → stdout when run as unit), macOS unified log (`os_log` crate). Verbosity per-tenant configurable. Default INFO.
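L2	Client event log pull + summarize	High	Open	Agent polls OS event log on schedule; ships filtered events to server `client_events` table. Windows: `Get-WinEvent -FilterHashtable @{LogName='System','Application'; Level=1,2} -MaxEvents N` (`Get-WinEvent` has no `-Level` parameter, so level filtering goes through `-FilterHashtable`; log names shown are illustrative). Linux: `journalctl -p err --output json`. macOS: `log show --predicate 'messageType == error' --style json`.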
L3	L2 cadence — default 15-min delta poll + on tunnel open/close	High	Open	Default 900s. On tunnel open: force delta pull so tech has fresh context. On tunnel close: force delta pull to capture anything tech's actions triggered. Configurable per-tenant in dashboard.
L4	L2 levels — default Critical + Error + Warning	High	Open	Configurable per-tenant. Default: Critical(1), Error(2), Warning(3). Separate "noisy" bucket (Info/Debug/Audit/Notification) pulled every 4h default.
L5	Tunnel audit — every tech action persisted	High	Open	Reuse existing `tunnel_audit` table (migration 010, unused today). Every command, file op, registry op, service op gets INSERT with session_id, channel_id, operation, details JSONB. No scrubbing — must retain sensitive input if a tech types it.
L6	Retention config	High	Open	`client_events`: 90 days default, tenant configurable. `tunnel_audit` (live): 90 days default, tenant configurable. `tunnel_audit` (archive): indefinite, system-level rotation to object storage. Agent self-logs follow OS-native retention policy.
L7	Tunnel audit archive rotation	High	Open	Monthly job: aged partitions of `tunnel_audit` → compressed JSONL or Parquet in S3/R2/MinIO. Naming: `tunnel_audit/tenant_id={uuid}/year={YYYY}/month={MM}.jsonl.gz`. Dashboard "deep search" endpoint queries archive on demand (Athena/DuckDB).
L8	Agent config push	High	Open	On agent WS connect, server sends `ServerMessage::Config { tenant_settings }`. Real-time updates when tenant admin changes settings in dashboard. Agent adjusts poll cadence + event level filters live without restart.
L9	Dashboard surfaces for L2 (client_events)	Medium	Open	Red-number badge on agent tile (count of unresolved errors last 24h). Time-sorted feed on agent detail page with filter/search. Acknowledge/dismiss individual events.
L10	Sensitive-data-at-rest protection	High	Open	`tunnel_audit` may contain unscrubbed credentials. Postgres TDE or full-disk encryption on server. Access to audit tables strictly admin-role-gated. Meta-audit: log every `SELECT` on `tunnel_audit` to separate table. Document in tech SOP: "every tunnel keystroke is logged."
L11
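
The L3 cadence rule (timer-driven delta poll plus forced pulls on tunnel open and close) reduces to a small decision function. The sketch below is one way to express it; `Trigger` and `should_pull` are hypothetical names, and the per-tenant interval is assumed to arrive via the config push described in L8.

```rust
use std::time::Duration;

/// What prompted the agent to consider a delta pull.
pub enum Trigger {
    Timer,
    TunnelOpened,
    TunnelClosed,
}

/// Default delta-poll interval from L3 (900 s).
pub const DEFAULT_POLL_INTERVAL: Duration = Duration::from_secs(900);

/// Decide whether the agent should run a delta pull now.
/// `since_last_pull` is the time elapsed since the previous pull finished.
pub fn should_pull(trigger: Trigger, since_last_pull: Duration, interval: Duration) -> bool {
    match trigger {
        // Tunnel open/close always force a pull, regardless of the timer.
        Trigger::TunnelOpened | Trigger::TunnelClosed => true,
        // Otherwise pull only once the configured interval has elapsed.
        Trigger::Timer => since_last_pull >= interval,
    }
}
```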

Multi-tenancy / MSP SaaS

Goal stated 2026-04-15: make this a marketable product for other MSPs. Multi-tenancy must be baked in from here on — adding tenant_id later would be a brutal migration.

#	Feature	Priority	Status	Notes
M1	Core tenancy schema	High	Open	New tables: `tenants` (id, name, plan, status, created_at), `tenant_settings` (tenant_id, key, value JSONB), `msp_users` (superadmins across tenants), `tenant_users` (tech ↔ tenant join with role). Add `tenant_id UUID` FK to: `agents`, `tech_sessions`, `tunnel_audit`, `client_events`, `commands`, any other per-customer table.
M2	Tenant-scoped authorization	High	Open	JWT carries `tenant_id` + `role`. Every query must filter by tenant_id (middleware). Super-admin role bypasses for GuruRMM staff. Penalty for bugs here: data leakage across tenants.
M3	Tenant admin dashboard	High	Open	UI for MSP admins to configure their tenant settings (L3/L4/L6 cadences, levels, retention). Super-admin can override across tenants.
M4	Billing / licensing meter	Medium	Open	Per-agent-per-month is standard for RMM. Needs usage counter from day one. Consider Stripe Billing or manual invoicing to start.
M5	Data residency options	Low	Open	Some MSPs require on-prem or regional hosting. Architectural impact: deployment model (single-tenant vs multi-tenant DB), encryption key management. Not required for MVP.
M6	Tenant export API	Medium	Open	MSPs with SOC2/PCI customers will need to export their tenant's audit trail. `GET /api/v1/tenants/{id}/export` producing JSONL or Parquet. Self-service for portability.
M7	Onboarding flow	High	Open	MSP signs up → tenant provisioned → first site created → install link generated → agent installs → first heartbeat → onboarding complete. End-to-end wizard.
M8
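
The tenant-scoping rule in M2 can be sketched as a single function that turns JWT claims into a scope, which every query path then has to apply. The names below (`Claims`, `Role`, `TenantScope`, `visible_agents`) are hypothetical, and `tenant_id` is a `u64` only to keep the example free of external crates; the roadmap specifies UUID.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    SuperAdmin, // GuruRMM staff, bypasses tenant filtering
    PartnerAdmin,
    PartnerTech,
}

/// The parts of the JWT that matter for scoping.
pub struct Claims {
    pub tenant_id: u64,
    pub role: Role,
}

/// Which tenants a request is allowed to see.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TenantScope {
    All,
    Only(u64),
}

pub fn scope_for(claims: &Claims) -> TenantScope {
    match claims.role {
        Role::SuperAdmin => TenantScope::All,
        _ => TenantScope::Only(claims.tenant_id),
    }
}

pub struct AgentRow {
    pub tenant_id: u64,
    pub hostname: String,
}

/// In-memory stand-in for a `WHERE tenant_id = $1` filter.
pub fn visible_agents<'a>(rows: &'a [AgentRow], claims: &Claims) -> Vec<&'a AgentRow> {
    let scope = scope_for(claims);
    rows.iter()
        .filter(|row| match &scope {
            TenantScope::All => true,
            TenantScope::Only(tenant) => row.tenant_id == *tenant,
        })
        .collect()
}
```

Deriving the scope in one place means a forgotten filter shows up as a missing `TenantScope` argument rather than as a silent cross-tenant read.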

Infrastructure / Operations

#	Feature	Priority	Status	Notes
I1	Automate dark class injection in deploy	Low	Open	Vite strips class="dark" -- need Vite plugin or build script
I2	Resolve stashed local changes on server	Medium	Open	git stash on 172.16.3.30 has divergent dev work
I3	CI/CD webhook auto-builds on push	Low	Exists	webhook at /webhook/build, build-agents.sh -- needs dashboard build added
I4

Modular Architecture & Public APIs

Goal stated 2026-04-15: the product should be modular from inception. Future modules under consideration: PSA/CRM, remote syslog aggregation, backups, likely more. Both first-party (us) and eventually third-party (other developers, customers) should be able to build modules against stable, versioned interfaces. End users should also have API access to automate against their own data.

Architectural principles:

Core is thin + opinionated. Tenants, agents, auth, audit, command dispatch, tunnel framework — that's the "kernel." Everything else is a module.
Modules own their data. Each module owns a schema namespace (psa_*, backup_*, syslog_*) and never writes directly to another module's tables. Cross-module data access goes through module-exposed APIs.
Event bus for cross-cutting communication. Core publishes events such as agent.online, tunnel.opened, command.completed, and client_event.received; any module can subscribe.
Public API is a first-class product surface, not an afterthought. OpenAPI spec, semver-versioned, rate-limited, key-authenticated, documented.
Boundary discipline: if it's tempting to reach across a module boundary, that's a signal to add an API there instead. Breaking this discipline once kills the modularity.
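
To make the publish/subscribe principle concrete, here is an in-process stand-in built on standard-library channels. It is not a proposal to replace a real broker; it only shows the contract that core publishes by topic and modules receive without calling into core. `EventBus` and `Event` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub topic: String,
    pub payload: String, // a real bus would carry a typed or JSON payload
}

#[derive(Default)]
pub struct EventBus {
    subscribers: HashMap<String, Vec<Sender<Event>>>,
}

impl EventBus {
    pub fn new() -> Self {
        Self::default()
    }

    /// A module subscribes to one topic and gets its own receiving end.
    pub fn subscribe(&mut self, topic: &str) -> Receiver<Event> {
        let (tx, rx) = channel();
        self.subscribers.entry(topic.to_string()).or_default().push(tx);
        rx
    }

    /// Core publishes; returns how many live subscribers were reached.
    pub fn publish(&mut self, topic: &str, payload: &str) -> usize {
        let event = Event { topic: topic.to_string(), payload: payload.to_string() };
        match self.subscribers.get_mut(topic) {
            Some(subs) => {
                // Drop subscribers whose receiver has gone away.
                subs.retain(|tx| tx.send(event.clone()).is_ok());
                subs.len()
            }
            None => 0,
        }
    }
}
```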

#	Feature	Priority	Status	Notes
X1	Core vs. module boundary definition	High	Open	Document what's "core" (tenants, agents, auth, audit, command dispatch, tunnel framework, bootstrap) vs. what's a module (everything else). Codify via separate crates / modules in the Rust workspace (`core/`, `modules/psa/`, `modules/backups/`, etc.). Enforce via build system — module code cannot `use` private core internals, only the exposed `core::api::*` surface.
X2	Module manifest / registration	High	Open	Each module ships a `module.toml` declaring: name, version, provides (APIs exposed), consumes (events/APIs used), permissions required (read_agents, write_commands, read_audit, etc.). Loaded at server startup; dashboard reflects installed modules.
X3	Event bus	High	Open	NATS JetStream or Redis Streams. Every significant core action emits a typed event (`agent.online`, `agent.offline`, `tunnel.opened`, `tunnel.closed`, `command.completed`, `client_event.received`, `tenant.created`). Modules subscribe via the bus, not via direct core calls. Decouples timing + enables async modules.
X4	Module-to-core APIs	High	Open	Core exposes a stable in-process API for modules: `core::agents::list(tenant_id)`, `core::commands::enqueue(...)`, `core::audit::record(...)`. Versioned like `core_api_v1`, `core_api_v2`. Modules declare which version they require.
X5	Module-to-module APIs	Medium	Open	Modules can expose their own APIs for other modules to consume. Example: PSA module exposes `psa::tickets::create()` which a Backups module could call when a backup fails. All via the module registry — no direct imports.
X6	Public REST API (for end users + integrations)	High	Open	Versioned under `/api/public/v1/`. OpenAPI 3.1 spec auto-generated. Rate-limited per API key. Scoped API keys (read-only / write / admin). Separate from internal `/api/v1/` used by dashboard. Publish spec at `/api/public/v1/openapi.json`.
X7	API key management	High	Open	Dashboard UI: tenants create/revoke/rotate API keys, scope per key, view last-used and usage stats. Keys carry tenant_id. JWT session tokens (for dashboard) are separate from API keys (for machines).
X8	Public webhook subscriptions	High	Open	Tenants subscribe to events via webhook URL. Event bus (X3) feeds a delivery worker that signs payloads (HMAC), retries with backoff, tracks delivery status in DB. Lets customers integrate without polling.
X9	Third-party module sandbox	Medium	Open	Future work. Options: (a) WebAssembly modules loaded at runtime with capability-based access to core APIs; (b) signed OCI container images run as sidecars with mTLS. (a) is better UX but maturity risk. (b) is ops-heavy but proven. Decide when third-party demand is real.
X10	Module billing isolation	Medium	Open	Each module can have independent pricing (PSA seat-based, Backups GB-based, RMM per-agent). Core billing meter (M4) becomes per-module, aggregates to tenant invoice. Enable tenants to subscribe to some modules but not others.
X11	Module upgrade independence	Medium	Open	Modules version independently of core. Core API versioning (X4) lets modules pin `core_api_v2` and survive core updates. Dashboard shows which modules need upgrades for a new core release.
X12	Module discoverability / marketplace	Low	Open	Eventually: marketplace UI for MSPs to browse/install first- and third-party modules. Signed+reviewed entries only. Revenue share for third-party developers. Far off; the only design constraint for now is not to paint ourselves into a corner.
X13
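
X8 calls for webhook retries with backoff but does not fix a schedule. One common choice, shown here purely as an assumption, is capped exponential backoff; `backoff_delay` is a hypothetical helper and the base and cap values would be configuration.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1),
/// never exceeding `cap`. Jitter is omitted to keep the sketch deterministic.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Clamp the exponent so the shift below cannot overflow a u32.
    let exponent = attempt.saturating_sub(1).min(31);
    let factor = 1u32 << exponent;
    // checked_mul returns None on overflow; treat that as "use the cap".
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

A delivery worker would typically stop retrying after a fixed number of attempts and record the final status in the DB, as X8 describes.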

Module candidates currently in mind

Capture these now so the core API design has concrete use cases to validate against:

PSA/CRM module — tickets, time tracking, contracts, invoicing. Likely largest module, heaviest DB load. Consumes: agent.online, client_event.received, command.completed. Exposes: psa::tickets::create|assign|close, psa::time::log.
Remote Syslog module — aggregates syslog/Windows Event Log from customer devices to a central searchable store. Consumes: client_event.received. Exposes: syslog::query|subscribe. Heavy ingest.
Backups module — schedules, monitors, reports on backup jobs (Veeam, Datto, Acronis, Synology, etc.). Consumes: integrations with third-party backup products (pull). Exposes: backups::status|history|alert. Compliance-sensitive.
Patch management — track OS + app patch levels, schedule installs, report compliance.
Documentation (IT Glue-style) — customer environment docs, credential vault, runbooks. Deep integration with PSA (customer entity shared).
Remote access — already covered by core tunnel framework; could grow into its own "pro" module with session recording, MFA-gated elevation, etc.
Network monitoring — SNMP/ping monitoring of non-agent devices (switches, printers, UPSs).

Protocol Versioning & Stale-Agent Recovery

Problem surfaced 2026-04-15: as the codebase evolves (multi-tenancy pivot, tunnel channels, new message types), long-offline agents will return to find the wire format they knew is gone. Without an upgrade lane, those agents become zombies — visible in the dashboard as "offline for 47 days," never self-heal, require manual intervention (RDP in, uninstall, reinstall).

Concrete example: Scileppi VP laptop offline for days. When it wakes up and tries to check in with v0.6.0 against a server that by then expects v0.9.x protocol, we need the server to say "I see you, you're old, here's how to update yourself" — and have the agent auto-comply.

Design principle: the bootstrap/hello path is sacred. It must never break, even across major protocol revisions. All other endpoints and message shapes are allowed to change. An agent that can still reach /hello can always recover.
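
A minimal sketch of the server-side decision behind that hello path, using the `min_supported_protocol` / `latest_protocol` / `action` names from row V1. Representing protocol versions as plain integers is an assumption, and so is the condition that triggers `rejected` (an agent newer than the server knows about); the roadmap does not define when `rejected` applies.

```rust
#[derive(Debug, PartialEq)]
pub enum Action {
    Proceed,
    UpgradeRequired,
    Rejected,
}

/// The server's current view of which protocol versions it can speak.
pub struct ServerPolicy {
    pub min_supported_protocol: u32,
    pub latest_protocol: u32,
}

/// Decide how to answer an agent's hello.
pub fn negotiate(agent_protocol: u32, policy: &ServerPolicy) -> Action {
    if agent_protocol > policy.latest_protocol {
        // Assumption: an agent from the future cannot be served safely.
        Action::Rejected
    } else if agent_protocol < policy.min_supported_protocol {
        // Stale agent: still answered, but told to upgrade itself.
        Action::UpgradeRequired
    } else {
        Action::Proceed
    }
}
```

The important property is that every branch returns a well-formed answer, so an old agent always learns what to do next instead of hitting a parse error.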

#	Feature	Priority	Status	Notes
V1	Protocol version negotiation on connect	High	Open	Agent sends `{agent_version, protocol_version, os, arch}` as first message. Server responds with `{server_version, min_supported_protocol, latest_protocol, action}` where action ∈ {`proceed`, `upgrade_required`, `rejected`}. WebSocket subprotocol header is one delivery option; a dedicated HTTP hello endpoint is another. Pick one, then never change its shape.
V2	Stable bootstrap endpoint	High	Open	`POST /api/v1/bootstrap/hello` that accepts the agent handshake forever. Contract: input schema is additive-only (new optional fields OK, never rename/remove), output shape is additive-only. Agents as old as v0.1 must be able to hit this and get a meaningful response.
V3	Compat shim layer per old protocol version	High	Open	When an old agent checks in, server translates between the old wire format and current internal types. Shim lives in `server/src/compat/v{N}.rs`. Each shim documents: which protocol versions it supports, what adapters it provides, planned removal date.
V4	Server-initiated forced upgrade instruction	High	Open	When handshake returns `action: upgrade_required`, response also includes `update_url`, `update_checksum`, `update_args`, and optional `restart_policy`. Agent treats this as highest-priority command, bypasses normal command queue, upgrades + relaunches itself.
V5	Agent self-update atomic rename (verify)	High	Exists (hardening needed)	Already done per 2026-04-01 ADR. Audit against V4 flow: does current updater handle "tell me exactly which version to install" vs. "upgrade to latest"? May need parameterization.
V6	Per-version support matrix + sunset policy	High	Open	Dashboard surface: table showing N agents per protocol version per tenant. Automated sunset: when a protocol version has 0 live agents for 60 days across all tenants, flag compat shim for removal in next release. Manual override to force-remove earlier.
V7	Agent version pinning per tenant	Medium	Open	MSP can opt tenants into "stable" (N-1), "current" (latest), or "beta" (preview) update channels. Controls auto-update rollout pace across their fleet.
V8	Late check-in handling: accept then command	High	Open	On stale-agent connect: (a) accept the handshake via compat shim, (b) record the connect event in audit, (c) immediately enqueue the upgrade command, (d) agent executes before any other work. Dashboard shows agent as "upgrading" briefly before "online".
V9	Graceful protocol deprecation warnings	Medium	Open	When an agent connects on a deprecated (but still supported) protocol, server sends a warning field in every response. Agent logs it. Gives MSPs lead time to upgrade their fleet before hard-removal.
V10	Rollback path for bad upgrades	High	Open	If v0.N upgrade bricks agents, bootstrap endpoint must let an operator mark v0.N `action: downgrade_required` and ship an older binary. Requires keeping old binaries in `/var/www/gururmm/downloads/` with pinned checksums.
V11
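
The automated sunset rule in V6 (zero live agents for 60 days across all tenants) could be checked with something like the following. `ProtocolUsage` and `shims_to_flag` are hypothetical names, and the usage figures are assumed to be aggregated across tenants before this function sees them.

```rust
use std::time::Duration;

/// 60 days, the quiet period from V6.
pub const SUNSET_QUIET_PERIOD: Duration = Duration::from_secs(60 * 24 * 60 * 60);

/// Fleet-wide usage of one protocol version.
pub struct ProtocolUsage {
    pub protocol_version: u32,
    pub live_agents: u32,
    /// Time since an agent last connected on this protocol version.
    pub since_last_live_agent: Duration,
}

/// Return the protocol versions whose compat shim can be flagged for removal.
pub fn shims_to_flag(usage: &[ProtocolUsage]) -> Vec<u32> {
    usage
        .iter()
        .filter(|u| u.live_agents == 0 && u.since_last_live_agent >= SUNSET_QUIET_PERIOD)
        .map(|u| u.protocol_version)
        .collect()
}
```

Flagging rather than deleting keeps the manual override in V6 meaningful: an operator still confirms removal in the next release.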

Certificates & Trust

Code signing and TLS/trust certificates required to ship + operate the product without install-time friction. Decisions 2026-04-15.

#	Item	Priority	Status	Cost	Notes
C1	Azure Trusted Signing — Windows agent + installer	High	In progress (2026-04-15)	~$9.99/mo + per-sig fee	Hosted signing service. Bypasses hardware-token requirement that took effect June 2023. Public Trust level requires 3+ yrs business existence; Private Trust available immediately but limited usefulness. Identity verification via Microsoft takes days. See setup steps in session-logs/2026-04-15.
C2	Apple Developer Program — macOS agent notarization	High	Open	$99/yr	Developer ID Application + Installer certs; notarization via `xcrun notarytool`; Hardened Runtime entitlements; ticket stapling for offline installs. Enrollment can take days — start early.
C3	GPG signing — Linux .deb / .rpm packages	High	Open	Free	Generate key pair, publish pubkey at a stable URL, sign packages with `debsign`/`rpmsign`, host signed apt/yum repo with proper `Release`/`repomd.xml`.
C4	Timestamping — all signed artifacts	High	Open	Free	Use DigiCert or Sectigo public timestamp servers so signatures remain valid after cert rotation. Verify in CI that every signed binary has a valid timestamp.
C5	TLS automation for own domains	High	Done	Free	Cloudflare + Let's Encrypt already in place for `rmm-api.azcomputerguru.com`. Wildcard for `*.gururmm.com` when that domain lights up.
C6	Per-Partner white-label custom domains	Medium	Open	~$7/mo/domain via CF-for-SaaS, or DIY with ACME DNS-01	Partners want `rmm.theirbrand.com`. Decide: host certs ourselves via ACME DNS-01 + Cloudflare API, or use Cloudflare for SaaS. Defer until first Partner asks.
C7	Agent-to-server mTLS (enterprise option)	Low	Open	Internal CA + time	Self-signed CA + per-agent client certs. Bootstrap enrolls agent and issues cert scoped to `agent_id`. Adds install complexity. Defer until an enterprise customer demands it.
C8	SBOM + Sigstore/cosign provenance	Medium	Open	Free	Auto-generate CycloneDX or SPDX SBOM per release. `cosign` sign artifacts + container images. Important for SOC2-conscious MSPs evaluating supply chain.
C9	Windows Defender / vendor FP submission runbook	Medium	Open	—	Despite valid signing, heuristic engines flag new binaries. Keep a runbook with submission portal links (Microsoft Security Intelligence, Malwarebytes, etc.).
C10	Email sending trust: DKIM / SPF / DMARC	Medium	Open	Free	Required when PSA module sends ticket notifications. Set up on sending domain; per-Partner if white-labeled email is a feature.
C11	WHQL driver signing	Deferred	Open	$$$ + weeks turnaround	Only if we ship a kernel driver. Avoid this path — use user-mode alternatives first.
C12

Decisions Log

Short record of why things are the shape they are. Append, don't edit.

2026-04-15 — Tunnel Phase 1 verified live. End-to-end test from off-LAN workstation via rmm-api.azcomputerguru.com. Open/status/close lifecycle works. Confirmed nginx proxies /api/* (not just /downloads/). See session-logs/2026-04-15-session.md.

2026-04-15 — Logging split into three tiers. Decided against a single custom log transport. Agent self-logging to OS-native sinks (Event Viewer / journald / os_log). Client machine health via OS event log pulls. Tunnel audit direct to RMM DB. Rationale: sysadmins can troubleshoot with familiar tools; only high-value audit data hits our DB.

2026-04-15 — Tunnel audit is never scrubbed. If a tech types a password during a session, it gets stored. Purpose is to audit tech behavior, and scrubbing would undermine that. Offsetting controls: encryption at rest, admin-role-gated access, meta-audit of log views, tech SOP documentation. See L10.

2026-04-15 — Multi-tenancy from day one. Target market is MSPs reselling this product. Adding tenant_id retroactively after feature growth is a brutal migration; baking it in now is cheap. Every new table gets tenant_id FK from here forward.

2026-04-15 — Poll cadences. 15-min delta + on-tunnel-open/close for critical+error+warning. 4h bulk for info/debug/audit/notification. All tenant-configurable.

2026-04-15 — Retention. 90 days default for tenant-visible tables. Indefinite system-level for tunnel_audit with object-storage archive after the tenant-visible window. Legal/compliance contexts (HIPAA 6yr, PCI 1yr) handled by per-tenant extended retention configs.

2026-04-15 — Hierarchy terminology locked. Platform > Partner (MSP, DB: tenant_id) > Client > Site > Agent. API and UI say "Partner"; DB says tenant_id. No "sub-tenant", no ambiguous "customer". Department/OU tier deferred. MSPs can white-label labels via JSONB overrides. See Terminology section at top of this file.

2026-04-15 — Modular architecture from day one. Core = tenants + agents + auth + audit + commands + tunnel framework + bootstrap. Everything else = module. Modules own their schema namespace, never touch each other's tables, communicate via event bus (X3) and versioned module APIs (X4/X5). Public REST API (X6) separate from internal dashboard API. Webhook subscriptions (X8) for customer integrations. Third-party modules via WASM or signed containers — deferred but design-constrained now. Concrete module candidates: PSA/CRM, remote syslog, backups, patch management, IT-Glue-style docs, network monitoring. See X1-X12.

2026-04-15 — Bootstrap endpoint is sacred. Protocol version negotiation via a single /api/v1/bootstrap/hello endpoint whose input/output are additive-only forever. Every other endpoint/message is free to evolve. Enables late-arriving agents (Scileppi VP example: offline for days, wakes up to find a newer server protocol) to reconnect, get accepted, and receive an automatic upgrade instruction. Compat shim layer per old protocol version with automated sunset policy when fleet-wide usage hits zero. See V1-V10.

Rewrite Assessment

Criteria for rewrite:

If >50% of planned features require fighting the current architecture
If the tech stack is fundamentally wrong for the goals
If accumulated tech debt makes changes unreasonably slow

Current assessment (2026-04-15): The multi-tenancy pivot means a schema refactor is unavoidable (add tenant_id everywhere, tenancy-aware auth middleware). This is additive, not a rewrite. Rust + Axum + Postgres + WebSocket stack remains fit for purpose. Current code is a solid foundation. No rewrite planned; structural additions tracked above.
