spec: add SPEC-016 zero-touch per-site agent enrollment

ScreenConnect-class managed enrollment: one signed installer per site,
machines self-register on first run and the server mints a per-machine
cak_ key bound to a deterministic machine_uid (dedups re-installs).
Per-site rotatable enrollment key (long secret + vN (XXXX) fingerprint);
rotating blocks new enrollments from old installers, leaves enrolled
agents untouched. Auto-approve + new-enrollment/site-move alert.

Resolves SPEC-007's signature-vs-appended-config open question:
sign the base agent once in CI + per-site signed wrapper that writes
site config around the signed bytes (never appended into the PE).

Deferred (room reserved): enrollment policy + per-seat licensing,
--enroll-key/--site-code/--reassign flag overrides, technician-assisted
interactive install. Tracking todo dbfe6a56.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:13:59 -07:00
parent 3b9e4068c9
commit 18429f6fe3
2 changed files with 212 additions and 1 deletions


@@ -87,10 +87,11 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
- [x] Sessions / machines / support-codes / events
- [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). **[→ v2 Phase 2]** ([SPEC-003](specs/SPEC-003-machine-inventory.md))
- [ ] **Stable machine identity + session lifecycle reaping + operator removal** — P1 — give the agent a deterministic machine-derived `machine_uid` (Windows `MachineGuid`-based) so the same box can't register duplicates (root cause: `agent_id` is a config-file random UUID that a portable/misconfigured run regenerates each launch); key registration on it; add TTL reaping + same-machine supersede as defense-in-depth; and admin-gated per-row + multi-select bulk removal of stale sessions/units. Identity must be bound to the per-machine agent key (spoof guard). Fixes ghost-session accumulation seen on the live console (15 sessions / 0 live, ~10 orphans for one machine). **[→ v2 Phase 1]** ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
- [ ] **Zero-touch per-site agent enrollment** — P1 — ScreenConnect-class managed enrollment: one signed installer per site, machines self-register on first run and the server mints a per-machine `cak_` bound to a deterministic `machine_uid` (dedups re-installs). Per-site **rotatable** enrollment key (long secret + `vN (XXXX)` fingerprint) — rotating blocks new enrollments from old installers, leaves enrolled agents untouched. Auto-approve + new-enrollment/site-move alert. **Sign base agent once (CI, shipped) + per-site signed wrapper that writes site config around the signed bytes — resolves SPEC-007's signature-vs-appended-config question.** Anticipated/deferred: enrollment policy + licensing, `--enroll-key`/`--reassign` flag overrides, technician-assisted interactive install. **[→ v2 Phase 1]** ([SPEC-016](specs/SPEC-016-zero-touch-enrollment.md))
- [ ] **Machines list view — dual connection indicators + rich rows** — P2 — ScreenConnect "Access"-list parity: per-row Host/Guest two-segment connection bar (Guest=agent online, Host=viewer connected, with names + durations) and rich inline metadata (company, site, device type, tags, logged-on user + idle, client version in red when outdated). Server-enriches `/api/machines` with live session state + SPEC-003 inventory. **[→ v2 Phase 2]** ([SPEC-005](specs/SPEC-005-machines-list-view-parity.md))
- [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005.
- [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). **[→ v2 Phase 2]** ([SPEC-006](specs/SPEC-006-universal-machine-search.md))
- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. **[→ v2 Phase 2]** ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; the signature-vs-appended-config question is resolved by SPEC-016 (sign-once base + per-site signed wrapper, no PE append). **[→ v2 Phase 2]** ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
- [ ] **Valuable error messages (structured errors + no silent swallows)** — P2 — one structured API error envelope with stable codes + a correlation id that also lands in the logs; contextual tracing on server/agent; sweep the 37 `let _ =` swallows (the pattern that hid the migration-005 bug); dashboard surfaces the real cause + id instead of a generic line. **[→ v2 Phase 0/1 conventions]** ([SPEC-008](specs/SPEC-008-valuable-error-messages.md))
- [ ] **Feature-rich, fully-documented management API** — P2 — everything the console can do, callable by API: OpenAPI 3.x generated from code (utoipa) + browsable docs at `/api/docs`, long-lived revocable scoped API tokens (PAT-style, distinct from the 24h JWT + agent keys), an API-completeness gap audit, and consistent pagination/error conventions. Distinct from the ADR-001 RMM integration contract. **[→ v2 Phase 3]** ([SPEC-009](specs/SPEC-009-feature-rich-documented-api.md))
- [ ] **Branding and white-label configuration** — P2 — Allow MSPs to customize logo, colors, and product name for white-labeled remote support. Dashboard admin settings page with logo upload (PNG/SVG, max 2MB), brand hue slider (OKLCH 0-360°, default 184=cyan), product name override, company name, and favicon. Agent tray tooltip uses custom product name from registry. Singleton database table with public GET endpoint for unauthenticated rendering. CSS variables (`--brand-hue`, `--accent`, `--panel`) for dynamic theming. **[→ v2 Phase 2]** ([SPEC-014](specs/SPEC-014-branding-whitelabel.md))


@@ -0,0 +1,210 @@
# SPEC-016: Zero-Touch Per-Site Agent Enrollment
**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-06-02)
**Estimated Effort:** X-Large
## Overview
Give GuruConnect a ScreenConnect-class managed-agent enrollment flow: a technician runs
**one signed installer per site** on every machine at that site — no per-machine key
minting, no flags, no typing — and each machine **self-registers** on first run, the
server minting it a per-machine `cak_` key bound to a stable, machine-derived
`machine_uid`. Each site installer carries a **rotatable per-site enrollment key** (a long
server-generated secret) plus a short human-readable **fingerprint** (`vN (XXXX)`) so an
operator can tell at a glance whether an installer is current. Rotating a site's key blocks
*new* enrollments from old installers while leaving already-enrolled machines untouched
(they hold their own `cak_`).
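
As a rough illustration of the `vN (XXXX)` fingerprint described above, the sketch below renders and parses that form. Everything in it is an assumption made for illustration: the `Fingerprint` type, the choice to derive the code from the stored key hash rather than the secret, and the FNV-1a hash, which stands in for whatever derivation is eventually chosen (a truncated cryptographic hash is the more likely real choice).

```rust
/// Non-secret fingerprint of a per-site enrollment key: version plus short code.
/// Hypothetical type, not taken from the codebase.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fingerprint {
    pub version: u32,
    pub code: u16,
}

/// 64-bit FNV-1a. ASSUMPTION: a dependency-free stand-in only; it is not a
/// cryptographic hash and is not what the spec prescribes.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

impl Fingerprint {
    /// Derive the short code from the stored key hash (never the plaintext secret).
    pub fn derive(version: u32, key_hash: &str) -> Self {
        let digest = fnv1a64(key_hash.as_bytes());
        Fingerprint { version, code: (digest & 0xFFFF) as u16 }
    }

    /// Render as `vN (XXXX)`, with the code as four uppercase hex digits.
    pub fn render(&self) -> String {
        format!("v{} ({:04X})", self.version, self.code)
    }

    /// Parse the rendered form back, e.g. from an installer filename fragment.
    pub fn parse(text: &str) -> Option<Self> {
        let rest = text.strip_prefix('v')?;
        let (version, code) = rest.split_once(" (")?;
        let code = code.strip_suffix(')')?;
        if code.len() != 4 || !code.chars().all(|c| c.is_ascii_hexdigit()) {
            return None;
        }
        Some(Fingerprint {
            version: version.parse::<u32>().ok()?,
            code: u16::from_str_radix(code, 16).ok()?,
        })
    }
}
```

Because the fingerprint is built only from non-secret inputs, it can be shown in the dashboard and baked into a filename without weakening the enrollment key itself.
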
This is the missing piece that turns the v2 secure-session-core (SPEC-004 per-agent keys +
`machine_uid`) into a real product workflow, and it **resolves SPEC-007's open
signature-vs-appended-config question**: the agent binary is signed **once** in CI
(already shipped via `release.yml`), and per-site customization rides in a thin **signed
wrapper** that writes site config to the endpoint at install time — never appended into the
signed PE.
**Success criteria:**
1. A tech installs one site installer on N machines; all N appear in the console under the
correct company/site, each as a distinct, deduplicated machine — zero per-machine setup.
2. Re-installing / re-imaging the same hardware **reuses** the existing machine row (no
ghost duplicates — the failure mode SPEC-004 documents).
3. Rotating a site's enrollment key makes old installers unable to enroll new machines,
while every already-enrolled agent keeps working.
4. Every distributed installer is **validly Authenticode-signed** (SmartScreen/WDAC clean).
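
Success criterion 3 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the server design: the `Site` type and its methods are hypothetical, and the enrollment key is compared as plaintext purely for brevity, whereas the spec stores it hashed with Argon2id.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory model of one site's enrollment state.
pub struct Site {
    version: u32,
    /// ASSUMPTION: plaintext stand-in for the Argon2id hash the spec calls for.
    enrollment_key: String,
    /// Per-machine `cak_` keys already issued; rotation never touches these.
    issued_caks: HashSet<String>,
}

impl Site {
    pub fn new(initial_key: &str) -> Self {
        Site {
            version: 1,
            enrollment_key: initial_key.to_string(),
            issued_caks: HashSet::new(),
        }
    }

    /// Replace the enrollment key and bump the version; returns the new version.
    pub fn rotate(&mut self, new_key: &str) -> u32 {
        self.enrollment_key = new_key.to_string();
        self.version += 1;
        self.version
    }

    /// A new enrollment succeeds only with the current key.
    pub fn enroll(&mut self, presented_key: &str, cak: &str) -> bool {
        if self.enrollment_key != presented_key {
            return false;
        }
        self.issued_caks.insert(cak.to_string());
        true
    }

    /// Already-enrolled agents authenticate with their own `cak_`, not the site key.
    pub fn can_connect(&self, cak: &str) -> bool {
        self.issued_caks.contains(cak)
    }
}
```

In this model an installer carrying the old key fails `enroll` after `rotate`, while a machine enrolled before the rotation still passes `can_connect`.
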
## Background — what exists today (confirmed in code)
- **Embedded config is append-based and breaks signing.** `server/src/api/downloads.rs`
(`download_agent`, ~`:152`) reads `static/downloads/guruconnect.exe` and **appends**
`MAGIC_MARKER` + `len:u32` + JSON (`:196`) to the end of the PE. The agent reads it back
in `agent/src/config.rs` (`read_embedded_config`, `:223`). Appending bytes after a signed
PE invalidates the Authenticode signature — so the current customization path and the
newly-shipped CI signing are mutually exclusive.
- **No self-registration exists.** Per-agent `cak_` keys are minted **admin-only** in
`server/src/api/machine_keys.rs` (`create_key`, `:119`; "Admin issued a per-agent key",
`:146`). There is no endpoint where an agent, on first run, can exchange an enrollment credential
for its own key.
- **Relay already accepts per-agent keys.** `server/src/relay/mod.rs`
(`validate_agent_api_key`, `:417`) calls `crate::auth::agent_keys::verify_agent_key`
(`:422`) — the `cak_` path — then falls back to the **deprecated** shared `AGENT_API_KEY`
(`:444`, logs a "migrate to per-agent `cak_`" warning).
- **Key primitives exist.** `server/src/auth/agent_keys.rs`: `generate_agent_key` mints a
`cak_`-prefixed high-entropy key (`:36`/`:46`); `verify_agent_key` (`:71`).
`server/src/db/agent_keys.rs` already inserts into `connect_agent_keys (machine_id,
key_hash, tenant_id)` (`:47`) — the v2 tenancy column is present (migration
`004_v2_secure_session_core.sql`).
- **Identity is a random config UUID, not machine-derived** — the root cause of duplicates
per SPEC-004 (`agent/src/config.rs` `generate_agent_id`, `:90`).
- **Agent mode dispatch:** `agent/src/main.rs` `Commands::Install` (`:160`) → `run_install`;
`agent/src/config.rs` `detect_run_mode` (`:162`) returns `RunMode::PermanentAgent` when
embedded config is present.
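
For readers unfamiliar with the append path in the first bullet, here is a minimal sketch of a `MAGIC_MARKER` + `len:u32` + JSON trailer. It is not the code in `downloads.rs` or `config.rs`: the marker value, the little-endian length, and the search-from-the-end lookup are all assumptions, and the function names are hypothetical.

```rust
/// ASSUMPTION: placeholder marker bytes; the real `MAGIC_MARKER` value is not
/// given in the spec.
const MAGIC_MARKER: &[u8] = b"EXAMPLE_CONFIG_MARKER";

/// Append marker + u32 length + JSON after the executable bytes.
/// The output differs from the signed input, which is the conflict the spec describes.
pub fn append_config(signed_pe: &[u8], json: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(signed_pe.len() + MAGIC_MARKER.len() + 4 + json.len());
    out.extend_from_slice(signed_pe);
    out.extend_from_slice(MAGIC_MARKER);
    // ASSUMPTION: little-endian length; the spec only says `len:u32`.
    out.extend_from_slice(&(json.len() as u32).to_le_bytes());
    out.extend_from_slice(json.as_bytes());
    out
}

/// Locate the last marker in the file and read back the JSON that follows it.
pub fn read_config(file: &[u8]) -> Option<String> {
    let marker_at = file
        .windows(MAGIC_MARKER.len())
        .rposition(|window| window == MAGIC_MARKER)?;
    let len_at = marker_at + MAGIC_MARKER.len();
    let len_bytes = file.get(len_at..len_at + 4)?;
    let len = u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]) as usize;
    let json = file.get(len_at + 4..len_at + 4 + len)?;
    String::from_utf8(json.to_vec()).ok()
}
```

The point of the sketch is only that customization changes the bytes of the distributed file after signing; the wrapper approach in this spec avoids that by leaving the signed agent untouched.
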
## Scope
### Included in v1 (CORE)
1. **`machine_uid` — deterministic machine identity.** Derive a stable id from the Windows
`MachineGuid` (`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`), independent of the
config-file `agent_id`. (Shared root with SPEC-004; whichever lands first owns the impl,
the other consumes it.) Used as the dedup key for register/move.
2. **Per-site enrollment key + fingerprint.**
- Long (≥256-bit) server-generated secret per site, stored **hashed** (Argon2id, same
as `cak_`/passwords), never recoverable in plaintext after issue.
- A non-secret **fingerprint** = monotonic version + short derived code, rendered
`vN (XXXX)` (e.g. `v3 (7F2A)`), shown in the dashboard, baked into the installer
filename, and reported by the agent at enrollment.
- **Rotate** regenerates the secret and bumps the version; old installers are rejected
for *new* enrollments; existing agents (holding `cak_`) are unaffected.
3. **Self-registration endpoint.** New `POST /api/enroll` (public: no JWT is required, and the
enrollment key is the gate) accepting `{ site_code, enrollment_key, machine_uid,
hostname, labels{company,site,department,device_type,tags} }`:
- Verify `(site_code, enrollment_key)` against the current per-site key.
- **Dedup by `machine_uid`** within the site: if the machine exists, reuse the row and
rotate its `cak_`; else create the machine row.
- Mint a `cak_` (reuse `generate_agent_key`), store hashed via `db::agent_keys` bound to
`machine_id` (+ `tenant_id` from the site), return the plaintext `cak_` **once**.
- Emit an audit event + **new-enrollment alert** (and a **site-move** alert when an
existing `machine_uid` enrolls under a different site).
- **Rate-limit + lockout** per `(site_code, source-IP)` as defense-in-depth (the key is
long, so this is belt-and-suspenders, not load-bearing).
4. **Agent first-run enrollment.** On `RunMode::PermanentAgent` with no stored `cak_`:
read site config → call `/api/enroll` with `machine_uid` → persist the returned `cak_`
to a SYSTEM-only protected store (HKLM under a SYSTEM-only ACL, or DPAPI-machine) →
connect to `wss://connect.azcomputerguru.com/ws/agent` using the `cak_`. On subsequent
runs, use the stored `cak_` directly (no re-enroll).
5. **Sign-once base + per-site signed wrapper (resolves SPEC-007 open question).**
- The base agent is signed once in CI (`release.yml`, already shipped) and stays
byte-identical for everyone.
- Per-site customization (labels + enrollment key + fingerprint) is delivered to the
endpoint **at install time** via a signing-safe channel — NOT appended to the signed
PE. v1 mechanism: a small **signed wrapper/bootstrapper** (or signed MSI) that carries
the site config, lays down the signed agent, and writes the site config to the
protected config location. The wrapper-exe vs. MSI choice is to be settled in planning
(see Open questions).
- **Deprecate the append path** in `downloads.rs` for managed installs (keep only for
attended/support-code if still needed), eliminating the signature-invalidation defect.
6. **Auto-approve posture.** A self-registered machine is live and controllable
immediately (ScreenConnect parity). The new-enrollment alert is the tripwire.
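
The agent first-run sequence in item 4 (use a stored `cak_` if present, otherwise enroll once and persist the result) might look roughly like the sketch below. The trait and function names (`KeyStore`, `EnrollClient`, `obtain_cak`) are hypothetical, and the in-memory doubles stand in for the SYSTEM-only store and the HTTP call to `/api/enroll`.

```rust
/// Where the per-machine `cak_` lives on the endpoint.
/// ASSUMPTION: the real store would be the HKLM or DPAPI-machine option from the spec.
pub trait KeyStore {
    fn load(&self) -> Option<String>;
    fn save(&mut self, cak: &str);
}

/// Client side of `POST /api/enroll`; returns the plaintext `cak_` on success.
pub trait EnrollClient {
    fn enroll(&mut self, site_code: &str, enrollment_key: &str, machine_uid: &str)
        -> Result<String, String>;
}

/// Site config as written to disk by the wrapper (fields reduced for the sketch).
pub struct SiteConfig {
    pub site_code: String,
    pub enrollment_key: String,
}

/// Return the stored `cak_`, enrolling first only when none is stored.
pub fn obtain_cak<S: KeyStore, C: EnrollClient>(
    store: &mut S,
    client: &mut C,
    site: &SiteConfig,
    machine_uid: &str,
) -> Result<String, String> {
    if let Some(cak) = store.load() {
        return Ok(cak); // subsequent runs: no re-enroll
    }
    let cak = client.enroll(&site.site_code, &site.enrollment_key, machine_uid)?;
    store.save(&cak);
    Ok(cak)
}

/// In-memory store used only to exercise the flow.
#[derive(Default)]
pub struct MemoryStore {
    cak: Option<String>,
}

impl KeyStore for MemoryStore {
    fn load(&self) -> Option<String> {
        self.cak.clone()
    }
    fn save(&mut self, cak: &str) {
        self.cak = Some(cak.to_string());
    }
}

/// Fake enrollment client that counts calls and fabricates a key.
#[derive(Default)]
pub struct CountingClient {
    pub calls: u32,
}

impl EnrollClient for CountingClient {
    fn enroll(&mut self, _site_code: &str, _enrollment_key: &str, machine_uid: &str)
        -> Result<String, String> {
        self.calls += 1;
        Ok(format!("cak_example_{}_{}", machine_uid, self.calls))
    }
}
```

Keeping the store and the client behind traits lets the "enroll exactly once" behavior be unit-tested without a registry or a network.
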
### Explicitly out of scope (ANTICIPATED — reserve room, do NOT build in v1)
The v1 data model and agent mode-dispatch must leave room for these without building them:
- **Per-site enrollment POLICY** — a `sites.enrollment_policy` field (default
`auto-approve`; future `pending-approval`) plus per-seat/per-endpoint licensing controls.
Commercial, multi-tenant (the `tenant_id` column already exists). Its own future SPEC.
- **Flag overrides** — `--enroll-key` / `--site-code` (generic installer, key supplied on
the command line) and `--reassign` (move an existing machine to a new site, gated by
possession of the destination site's key, with an **explicit accidental-move guard**:
a different-site re-run refuses unless `--reassign` is passed) + cross-client move policy.
Backend (`machine_uid` + authorized site + `cak_`) is designed to support it; CLI surface
is deferred.
- **Technician-assisted interactive install** — `--technician` on a generic installer:
prompts for the tech's own server credentials, and on auth presents a **validated**
Company/Site/tags picker from the live authorized list (authz-by-identity, full audit
trail). Heaviest path (interactive UI + auth/list callback); deferred.
All three converge on the **same backend operation** delivered in v1: `machine_uid` +
authorized site + issued `cak_`. v1 ships only the per-site embedded-key path.
## Architecture
- **Agent** (`agent/`): compute `machine_uid`; first-run enroll → store `cak_`; use stored
`cak_` thereafter; read site config from the wrapper-written location instead of an
appended PE blob. Touches `config.rs` (`EmbeddedConfig`/`detect_run_mode`/storage),
`main.rs` (`Install`/run-mode), a new `enroll` client module, transport auth.
- **Relay-server** (`server/`): new `POST /api/enroll`; per-site key issue/rotate/verify;
`machine_uid` dedup + site-move on register; audit + alert emission; rate-limit/lockout.
Touches `api/` (new `enroll.rs`, `sites` key endpoints), `auth/agent_keys.rs`,
`db/agent_keys.rs`, `relay/mod.rs` (enrollment vs. connect), `main.rs` routes.
- **Dashboard**: per-site enrollment-key display (fingerprint `vN (XXXX)`), **Rotate**
action, "current installer" download wired to the signed wrapper build. (Builder UI is
SPEC-007; this spec supplies the key/fingerprint/rotation it consumes.)
- **DB migration:** `site_enrollment_keys` (or columns on the site): `site_id`,
`key_hash`, `version`, `fingerprint`, `created_at`, `rotated_at`, `active`. Reserve
`sites.enrollment_policy` (nullable, default `auto-approve`) for the anticipated policy
work. `connect_machines` gains `machine_uid` (unique per tenant or per site; see Open question 4 on scoping).
- **Protobuf** (`proto/guruconnect.proto`): no wire change required for enrollment if
`/api/enroll` is REST; `AgentStatus` label fields per SPEC-007 (`department`,
`device_type`) ride along if landed together.
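
The relay-server's register decision (new vs. reuse vs. site-move) can be sketched with an in-memory registry. This is illustrative only: the `Registry` and `EnrollOutcome` names are hypothetical, keys are compared as plaintext instead of being verified against an Argon2id hash, `machine_uid` is treated as unique across the whole registry (a single tenant), and the sketch assumes a different-site enrollment moves the row and reports it so the caller can raise the alert.

```rust
use std::collections::HashMap;

/// Result of one enrollment attempt; the caller turns these into audit events and alerts.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EnrollOutcome {
    Created { machine_id: u64 },
    Reused { machine_id: u64 },
    Moved { machine_id: u64, from_site: String },
    Rejected,
}

/// Hypothetical single-tenant registry.
pub struct Registry {
    /// site_code -> current enrollment key (ASSUMPTION: plaintext stand-in for a hash).
    site_keys: HashMap<String, String>,
    /// machine_uid -> (machine_id, site_code)
    machines: HashMap<String, (u64, String)>,
    next_id: u64,
}

/// Compare without exiting early on the first differing byte.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

impl Registry {
    pub fn new() -> Self {
        Registry { site_keys: HashMap::new(), machines: HashMap::new(), next_id: 1 }
    }

    pub fn set_site_key(&mut self, site_code: &str, key: &str) {
        self.site_keys.insert(site_code.to_string(), key.to_string());
    }

    pub fn enroll(&mut self, site_code: &str, enrollment_key: &str, machine_uid: &str) -> EnrollOutcome {
        let key_ok = self
            .site_keys
            .get(site_code)
            .map_or(false, |k| constant_time_eq(k.as_bytes(), enrollment_key.as_bytes()));
        if !key_ok {
            return EnrollOutcome::Rejected;
        }
        match self.machines.get(machine_uid).cloned() {
            // Same machine, same site: reuse the row (the real flow also rotates its `cak_`).
            Some((machine_id, site)) if site == site_code => EnrollOutcome::Reused { machine_id },
            // Same machine, different site: move the row and report where it came from.
            Some((machine_id, from_site)) => {
                self.machines
                    .insert(machine_uid.to_string(), (machine_id, site_code.to_string()));
                EnrollOutcome::Moved { machine_id, from_site }
            }
            // Unknown machine: create a new row.
            None => {
                let machine_id = self.next_id;
                self.next_id += 1;
                self.machines
                    .insert(machine_uid.to_string(), (machine_id, site_code.to_string()));
                EnrollOutcome::Created { machine_id }
            }
        }
    }
}
```

Returning an outcome enum, rather than acting inside the lookup, keeps the dedup decision testable on its own and leaves audit and alert emission to the caller.
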
## Security considerations
- **Two-tier credential model:** low-sensitivity **enrollment key** (gates "may register",
shared per site, rotatable) vs. high-sensitivity **per-machine `cak_`** (operating
credential, per-machine revocation). Compromise of an enrollment key is recovered by
rotating one site — no fleet-wide re-key.
- **Enrollment keys stored hashed** (Argon2id); plaintext shown once at issue/rotate.
- **`cak_` at rest on the endpoint** must be SYSTEM-only (HKLM SYSTEM ACL or DPAPI-machine)
so a non-admin user can't read it.
- **`machine_uid` binding** is the spoof-guard SPEC-004 wants: a `cak_` is bound to a
`machine_uid`; a different box presenting another box's `cak_` is detectable.
- **Authorization model** for moves/enrolls is possession-of-destination-key in v1
(identity-based authz deferred to the technician-assisted path).
- **Open registration risk** is mitigated by requiring `(site_code + long key)` and
rate-limit/lockout; auto-approve is acceptable because the enrollment key is the gate and
every enrollment/site-move fires an alert.
- **Audit events:** enroll, re-enroll/reuse, site-move, key-rotate — all logged with
`machine_uid`, site, and source IP.
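
The rate-limit and lockout per `(site_code, source-IP)` mentioned above could be modeled along these lines. The `EnrollLimiter` type, its thresholds, and the fixed-window counting are assumptions chosen for the sketch; the spec does not fix an algorithm. Time is passed in explicitly so the behavior can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

struct Entry {
    failures: u32,
    window_start: Instant,
    locked_until: Option<Instant>,
}

/// Hypothetical failed-enrollment limiter keyed by (site_code, source IP).
pub struct EnrollLimiter {
    max_failures: u32,
    window: Duration,
    lockout: Duration,
    state: HashMap<(String, IpAddr), Entry>,
}

impl EnrollLimiter {
    pub fn new(max_failures: u32, window: Duration, lockout: Duration) -> Self {
        EnrollLimiter { max_failures, window, lockout, state: HashMap::new() }
    }

    /// An attempt is allowed unless an unexpired lockout exists for this pair.
    pub fn is_allowed(&self, site_code: &str, ip: IpAddr, now: Instant) -> bool {
        match self.state.get(&(site_code.to_string(), ip)) {
            Some(Entry { locked_until: Some(until), .. }) => now >= *until,
            _ => true,
        }
    }

    /// Count a failed key check; lock the pair once the threshold is reached.
    pub fn record_failure(&mut self, site_code: &str, ip: IpAddr, now: Instant) {
        let entry = self
            .state
            .entry((site_code.to_string(), ip))
            .or_insert(Entry { failures: 0, window_start: now, locked_until: None });
        // Start a fresh counting window once the old one has elapsed.
        if now.saturating_duration_since(entry.window_start) > self.window {
            entry.failures = 0;
            entry.window_start = now;
        }
        entry.failures += 1;
        if entry.failures >= self.max_failures {
            entry.locked_until = Some(now + self.lockout);
            entry.failures = 0;
        }
    }

    /// A successful enrollment clears the pair's history.
    pub fn record_success(&mut self, site_code: &str, ip: IpAddr) {
        self.state.remove(&(site_code.to_string(), ip));
    }
}
```

Keying on the pair rather than the IP alone means one noisy source cannot lock out a whole site, and a lockout on one site does not block the same address enrolling at another.
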
## Testing strategy
- **Unit:** `machine_uid` derivation stability; enrollment-key verify/rotate; fingerprint
derivation; `cak_` mint/hash/verify; dedup decision (new vs. reuse vs. move).
- **Integration:** enroll new → row + `cak_` issued; re-enroll same `machine_uid` → reuse,
no duplicate; enroll with rotated (old) key → rejected; old `cak_` still connects after
rotation; rate-limit/lockout trips; site-move emits alert.
- **Manual:** build a site wrapper installer → run on a clean VM → appears in console under
correct site, immediately controllable; re-image VM → same row reused; `signtool verify
/pa` passes on the distributed wrapper and the laid-down agent.
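
The "`machine_uid` derivation stability" unit case could be exercised against something like the function below. It is a sketch under assumptions: the function name is hypothetical, reading `MachineGuid` from the registry is left out (the raw string is passed in), and the derivation is reduced to canonicalizing the GUID text, since the spec does not say whether any hashing or namespacing is applied.

```rust
/// True when `text` is five hyphen-separated hex groups of 8-4-4-4-12 characters.
fn is_canonical_guid(text: &str) -> bool {
    let expected = [8usize, 4, 4, 4, 12];
    let mut groups = text.split('-');
    for len in expected.iter() {
        match groups.next() {
            Some(g) if g.len() == *len && g.chars().all(|c| c.is_ascii_hexdigit()) => {}
            _ => return false,
        }
    }
    groups.next().is_none()
}

/// Normalize a raw `MachineGuid` string into a stable `machine_uid`.
/// ASSUMPTION: canonical lowercase GUID text is the uid; no hashing is applied.
pub fn machine_uid_from_guid(raw: &str) -> Option<String> {
    let trimmed = raw
        .trim()
        .trim_start_matches('{')
        .trim_end_matches('}');
    let lower = trimmed.to_ascii_lowercase();
    if is_canonical_guid(&lower) {
        Some(lower)
    } else {
        None
    }
}
```

Normalizing case, braces, and surrounding whitespace means the same registry value always maps to the same `machine_uid`, however it was read or formatted.
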
## Effort estimate & dependencies
- **Size:** X-Large (agent + relay + DB migration + CI build/sign wrapper + dashboard
key/rotation surface).
- **Depends on:** SPEC-004 `machine_uid` (shared root); the CI signing already shipped
(SPEC-001 §2 / `release.yml`).
- **Unblocks:** SPEC-007 (installer builder gets a real per-site key + the signing
resolution), and the parked managed-agent test deployment on the internal beta machines.
- **Relationship to v2 phases:** sits with the Phase-1 secure-session-core (per-agent keys
+ identity) and feeds Phase-2 dashboard work.
## Open questions
1. **Wrapper shape:** signed standalone bootstrapper `.exe` vs. signed **MSI** for the
per-site installer. MSI gives clean install/uninstall + GPO/Intune deploy; bootstrapper
is lighter. To be decided in planning.
2. **`cak_` storage:** HKLM SYSTEM-ACL registry value vs. DPAPI-machine-protected file —
pick one for the protected store.
3. **Fingerprint code style:** raw hex (`7F2A`) vs. the RMM-house word style
(`GREEN-FALCON`). Cosmetic; pick for operator readability.
4. **Cross-tenant `machine_uid` collisions** (same hardware imaged across tenants) — scope
`machine_uid` uniqueness per tenant, not globally.
5. **Attended (support-code) path:** confirm whether the append-based `download_support`
path is retained as-is or also migrated off appending.