guru-connect/docs/specs/SPEC-016-zero-touch-enrollment.md
Mike Swanson c286a29b9d
spec: SPEC-016 resolve all 5 open questions (enrollment design decisions)
Fold the 2026-06-02 interview decisions into SPEC-016:
- Installer wrapper: ship BOTH signed .exe and signed MSI per site
- cak_ at-rest storage: DPAPI-machine-encrypted blob in a SYSTEM-ACL'd location
- Fingerprint: hex (7F2A), deliberately unlike RMM word-codes
- machine_uid: per-tenant scope + hardware-derived salt (survives re-image,
  separates distinct boxes) + collision-gated activation (template-cloned VMs
  sharing a hardware UUID drop to pending + alert, need dashboard confirm)
- Attended support-code path: unchanged (filename-based, already signing-safe)

Open Questions section -> Resolved decisions + a short Remaining-for-planning
list (exact hardware salt signal set, WiX/MSI authoring approach).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:54:19 -07:00


# SPEC-016: Zero-Touch Per-Site Agent Enrollment
**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-06-02)
**Estimated Effort:** X-Large
## Overview
Give GuruConnect a ScreenConnect-class managed-agent enrollment flow: a technician runs
**one signed installer per site** on every machine at that site — no per-machine key
minting, no flags, no typing — and each machine **self-registers** on first run, the
server minting it a per-machine `cak_` key bound to a stable, machine-derived
`machine_uid`. Each site installer carries a **rotatable per-site enrollment key** (a long
server-generated secret) plus a short human-readable **fingerprint** (`vN (XXXX)`) so an
operator can tell at a glance whether an installer is current. Rotating a site's key blocks
*new* enrollments from old installers while leaving already-enrolled machines untouched
(they hold their own `cak_`).
This is the missing piece that turns the v2 secure-session-core (SPEC-004 per-agent keys +
`machine_uid`) into a real product workflow, and it **resolves SPEC-007's open
signature-vs-appended-config question**: the agent binary is signed **once** in CI
(already shipped via `release.yml`), and per-site customization rides in a thin **signed
wrapper** that writes site config to the endpoint at install time — never appended into the
signed PE.
**Success criteria:**
1. A tech installs one site installer on N machines; all N appear in the console under the
correct company/site, each as a distinct, deduplicated machine — zero per-machine setup.
2. Re-installing / re-imaging the same hardware **reuses** the existing machine row (no
ghost duplicates — the failure mode SPEC-004 documents).
3. Rotating a site's enrollment key makes old installers unable to enroll new machines,
while every already-enrolled agent keeps working.
4. Every distributed installer is **validly Authenticode-signed** (SmartScreen/WDAC clean).
## Background — what exists today (confirmed in code)
- **Embedded config is append-based and breaks signing.** `server/src/api/downloads.rs`
(`download_agent`, ~`:152`) reads `static/downloads/guruconnect.exe` and **appends**
`MAGIC_MARKER` + `len:u32` + JSON (`:196`) to the end of the PE. The agent reads it back
in `agent/src/config.rs` (`read_embedded_config`, `:223`). Appending bytes after a signed
PE invalidates the Authenticode signature, so the current customization path and the
newly shipped CI signing are mutually exclusive.
- **No self-registration exists.** Per-agent `cak_` keys are minted **admin-only** in
`server/src/api/machine_keys.rs` (`create_key`, `:119`; "Admin issued a per-agent key",
`:146`). There is no endpoint where an agent first-run exchanges an enrollment credential
for its own key.
- **Relay already accepts per-agent keys.** `server/src/relay/mod.rs`
(`validate_agent_api_key`, `:417`) calls `crate::auth::agent_keys::verify_agent_key`
(`:422`) — the `cak_` path — then falls back to the **deprecated** shared `AGENT_API_KEY`
(`:444`, logs a "migrate to per-agent `cak_`" warning).
- **Key primitives exist.** `server/src/auth/agent_keys.rs`: `generate_agent_key` mints a
`cak_`-prefixed high-entropy key (`:36`/`:46`); `verify_agent_key` (`:71`).
`server/src/db/agent_keys.rs` already inserts into `connect_agent_keys (machine_id,
key_hash, tenant_id)` (`:47`) — the v2 tenancy column is present (migration
`004_v2_secure_session_core.sql`).
- **Identity is a random config UUID, not machine-derived** — the root cause of duplicates
per SPEC-004 (`agent/src/config.rs` `generate_agent_id`, `:90`).
- **Agent mode dispatch:** `agent/src/main.rs` `Commands::Install` (`:160`) → `run_install`;
`agent/src/config.rs` `detect_run_mode` (`:162`) returns `RunMode::PermanentAgent` when
embedded config is present.
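
To make the append layout in the first bullet concrete, here is a minimal sketch of writing and reading such a trailer. The marker bytes, the little-endian length, and the names `append_config` and `read_trailer` are assumptions for illustration; they are not the values or functions used in `downloads.rs` or `config.rs`.

```rust
/// Hypothetical marker bytes; the real `MAGIC_MARKER` is defined in the server code.
const MAGIC_MARKER: &[u8] = b"GCCONFIG";

/// Build `pe || MAGIC_MARKER || len:u32 || json`, the append layout described above.
/// The little-endian length encoding is an assumption of this sketch.
fn append_config(pe: &[u8], json: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(pe.len() + MAGIC_MARKER.len() + 4 + json.len());
    out.extend_from_slice(pe);
    out.extend_from_slice(MAGIC_MARKER);
    out.extend_from_slice(&(json.len() as u32).to_le_bytes());
    out.extend_from_slice(json.as_bytes());
    out
}

/// Recover the JSON trailer, or `None` when the file carries no marker.
fn read_trailer(file: &[u8]) -> Option<String> {
    // Search from the end so marker-like bytes earlier in the image are ignored.
    let start = file
        .windows(MAGIC_MARKER.len())
        .rposition(|w| w == MAGIC_MARKER)?;
    let len_at = start + MAGIC_MARKER.len();
    let b = file.get(len_at..len_at + 4)?;
    let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
    let json = file.get(len_at + 4..len_at + 4 + len)?;
    String::from_utf8(json.to_vec()).ok()
}
```

The output is always longer than the signed input, so the file that reaches the endpoint is not the file CI signed.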
## Scope
### Included in v1 (CORE)
1. **`machine_uid` — deterministic machine identity (hardware-salted, per-tenant).** Derive
a stable id from the Windows `MachineGuid`
(`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`) **salted with stable hardware
signals** (SMBIOS UUID / motherboard + disk serial), independent of the config-file
`agent_id`. Hardware-derived salt is deliberate: it **survives an OS reinstall/re-image
on the same hardware** (so the row is reused — the re-image dedup goal) while keeping
distinct physical boxes distinct (a per-install *random* salt would break re-image dedup
and is rejected). Uniqueness is scoped **per-tenant** — dedup key `(tenant_id,
machine_uid)` — so the same hardware legitimately present in two tenants stays two
independent rows. (Shared root with SPEC-004; whichever lands first owns the impl, the
other consumes it.) Used as the dedup key for register/move.
**Collision-gated activation.** The residual collision case is VMs/templates that share a
hardware UUID (some hypervisors clone the SMBIOS UUID). When the server detects a
`machine_uid` collision (a seemingly different endpoint resolving to an existing uid), the
endpoint does **not** auto-activate: it drops to a **pending** state, fires an alert, and
an operator must confirm in the dashboard that the collided endpoint may activate. This is
the one deliberate exception to auto-approve (see item 6).
2. **Per-site enrollment key + fingerprint.**
   - Long (≥256-bit) server-generated secret per site, stored **hashed** (Argon2id, same
     as `cak_`/passwords), never recoverable in plaintext after issue.
   - A non-secret **fingerprint** = monotonic version + short derived code in **hex**,
     rendered `vN (XXXX)` (e.g. `v3 (7F2A)`), shown in the dashboard, baked into the
     installer filename, and reported by the agent at enrollment. Hex is deliberate,
     **not** the RMM word-style code (`GREEN-FALCON`), so GuruConnect and GuruRMM
     artifacts are never visually conflated.
   - **Rotate** regenerates the secret and bumps the version; old installers are rejected
     for *new* enrollments; existing agents (holding `cak_`) are unaffected.
3. **Self-registration endpoint.** New `POST /api/enroll` (public: no JWT is required, and
   the enrollment key is the gate) accepting `{ site_code, enrollment_key, machine_uid,
   hostname, labels{company,site,department,device_type,tags} }`:
   - Verify `(site_code, enrollment_key)` against the current per-site key.
   - **Dedup by `(tenant_id, machine_uid)`** (item 1), with the tenant taken from the
     site: if the machine exists, reuse the row and rotate its `cak_`; else create the
     machine row.
   - Mint a `cak_` (reuse `generate_agent_key`), store hashed via `db::agent_keys` bound to
     `machine_id` (+ `tenant_id` from the site), return the plaintext `cak_` **once**.
   - Emit an audit event + **new-enrollment alert** (and a **site-move** alert when an
     existing `machine_uid` enrolls under a different site).
   - **Rate-limit + lockout** per `(site_code, source-IP)` as defense-in-depth (the key is
     long, so this is belt-and-suspenders, not load-bearing).
4. **Agent first-run enrollment.** On `RunMode::PermanentAgent` with no stored `cak_`:
   read site config → call `/api/enroll` with `machine_uid` → persist the returned `cak_`
   as a DPAPI-machine-encrypted blob in a SYSTEM-ACL'd location (HKLM value or
   `ProgramData` file; see Security considerations) → connect to
   `wss://connect.azcomputerguru.com/ws/agent` using the `cak_`. On subsequent
   runs, use the stored `cak_` directly (no re-enroll).
5. **Sign-once base + per-site signed wrapper (resolves SPEC-007 open question).**
   - The base agent is signed once in CI (`release.yml`, already shipped) and stays
     byte-identical for everyone.
   - Per-site customization (labels + enrollment key + fingerprint) is delivered to the
     endpoint **at install time** via a signing-safe channel, NOT appended to the signed
     PE. **v1 produces BOTH a signed bootstrapper `.exe` and a signed MSI per site**
     (ScreenConnect parity: manual installs grab the `.exe`, GPO/Intune fleet pushes take
     the MSI), both wrapping the same sign-once agent and writing the site config to the
     protected config location. The two differ only in packaging (bootstrapper stub vs.
     WiX-authored MSI package); both are signed.
   - **Retire the append path** in `downloads.rs` (`download_agent`) for managed installs,
     eliminating the signature-invalidation defect. The attended support-code path
     (`download_support`) is filename-based and stays as is.
6. **Auto-approve posture (with collision-gate exception).** A self-registered machine is
live and controllable immediately (ScreenConnect parity); the new-enrollment alert is the
tripwire. The **one** exception is a detected `machine_uid` collision (item 1), which
gates the endpoint to **pending** until an operator confirms it in the dashboard.
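
The register-time decision in items 1, 3, and 6 (create, reuse, site-move, or collision-gated pending) can be sketched as below. This is a minimal in-memory model, not the server implementation: the type names, the integer ids, and especially the `looks_distinct` flag are hypothetical. The spec does not yet say which signal marks an endpoint as "seemingly different", so the sketch takes that verdict as an input.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum EnrollOutcome {
    Created,
    Reused,
    SiteMoved { from_site: u32, to_site: u32 },
    PendingCollision,
}

struct MachineRow {
    site_id: u32,
}

/// Rows keyed by `(tenant_id, machine_uid)`, the dedup key from item 1.
#[derive(Default)]
struct Machines {
    rows: HashMap<(u32, String), MachineRow>,
}

impl Machines {
    /// `looks_distinct` is a placeholder for whatever collision signal planning settles on.
    fn enroll(
        &mut self,
        tenant_id: u32,
        site_id: u32,
        machine_uid: &str,
        looks_distinct: bool,
    ) -> EnrollOutcome {
        let key = (tenant_id, machine_uid.to_string());
        if let Some(row) = self.rows.get_mut(&key) {
            if looks_distinct {
                // Same uid, apparently another box: leave the existing row alone
                // and hold the newcomer for operator confirmation.
                return EnrollOutcome::PendingCollision;
            }
            if row.site_id != site_id {
                let from_site = row.site_id;
                row.site_id = site_id;
                return EnrollOutcome::SiteMoved { from_site, to_site: site_id };
            }
            return EnrollOutcome::Reused;
        }
        self.rows.insert(key, MachineRow { site_id });
        EnrollOutcome::Created
    }
}
```

Keying on the tenant as well as the uid is what lets the same hardware appear as two independent rows in two tenants, while a second enrollment inside one tenant always resolves to the existing row.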
### Explicitly out of scope (ANTICIPATED — reserve room, do NOT build in v1)
The v1 data model and agent mode-dispatch must leave room for these without building them:
- **Per-site enrollment POLICY** — a `sites.enrollment_policy` field (default
`auto-approve`; future `pending-approval`) plus per-seat/per-endpoint licensing controls.
Commercial, multi-tenant (the `tenant_id` column already exists). Its own future SPEC.
- **Flag overrides** — `--enroll-key` / `--site-code` (generic installer, key supplied on
the command line) and `--reassign` (move an existing machine to a new site, gated by
possession of the destination site's key, with an **explicit accidental-move guard**:
a different-site re-run refuses unless `--reassign` is passed) + cross-client move policy.
Backend (`machine_uid` + authorized site + `cak_`) is designed to support it; CLI surface
is deferred.
- **Technician-assisted interactive install** — `--technician` on a generic installer:
prompts for the tech's own server credentials, and on auth presents a **validated**
Company/Site/tags picker from the live authorized list (authz-by-identity, full audit
trail). Heaviest path (interactive UI + auth/list callback); deferred.
All three converge on the **same backend operation** delivered in v1: `machine_uid` +
authorized site + issued `cak_`. v1 only ships the per-site-embedded-key door.
## Architecture
- **Agent** (`agent/`): compute `machine_uid`; first-run enroll → store `cak_`; use stored
`cak_` thereafter; read site config from the wrapper-written location instead of an
appended PE blob. Touches `config.rs` (`EmbeddedConfig`/`detect_run_mode`/storage),
`main.rs` (`Install`/run-mode), a new `enroll` client module, transport auth.
- **Relay-server** (`server/`): new `POST /api/enroll`; per-site key issue/rotate/verify;
`machine_uid` dedup + site-move on register; audit + alert emission; rate-limit/lockout.
Touches `api/` (new `enroll.rs`, `sites` key endpoints), `auth/agent_keys.rs`,
`db/agent_keys.rs`, `relay/mod.rs` (enrollment vs. connect), `main.rs` routes.
- **Dashboard**: per-site enrollment-key display (fingerprint `vN (XXXX)`), **Rotate**
action, "current installer" download wired to the signed wrapper build. (Builder UI is
SPEC-007; this spec supplies the key/fingerprint/rotation it consumes.)
- **DB migration:** `site_enrollment_keys` (or columns on the site): `site_id`,
`key_hash`, `version`, `fingerprint`, `created_at`, `rotated_at`, `active`. Reserve
`sites.enrollment_policy` (nullable, default `auto-approve`) for the anticipated policy
work. `connect_machines` gains `machine_uid`, unique per tenant: `(tenant_id, machine_uid)`.
- **Protobuf** (`proto/guruconnect.proto`): no wire change required for enrollment if
`/api/enroll` is REST; `AgentStatus` label fields per SPEC-007 (`department`,
`device_type`) ride along if landed together.
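
The agent-side flow in the first bullet (enroll on first run, then reuse the stored `cak_`) reduces to a small decision, sketched here under stated assumptions. The traits `KeyStore` and `Enroller`, the function `obtain_cak`, and the in-memory doubles are all hypothetical names; the real agent would back them with the protected store and an HTTP client for `/api/enroll`.

```rust
/// Stands in for the protected on-disk `cak_` store.
trait KeyStore {
    fn load(&self) -> Option<String>;
    fn save(&mut self, cak: &str);
}

/// Stands in for the `POST /api/enroll` client call.
trait Enroller {
    fn enroll(&mut self, machine_uid: &str) -> Result<String, String>;
}

/// Return the stored key if present; otherwise enroll once and persist the result.
fn obtain_cak<S: KeyStore, E: Enroller>(
    store: &mut S,
    enroller: &mut E,
    machine_uid: &str,
) -> Result<String, String> {
    if let Some(cak) = store.load() {
        return Ok(cak); // subsequent runs: no re-enroll
    }
    let cak = enroller.enroll(machine_uid)?;
    store.save(&cak);
    Ok(cak)
}

/// In-memory double for the key store (illustration only).
struct MemoryStore(Option<String>);

impl KeyStore for MemoryStore {
    fn load(&self) -> Option<String> {
        self.0.clone()
    }
    fn save(&mut self, cak: &str) {
        self.0 = Some(cak.to_string());
    }
}

/// Fake server that counts enroll calls and returns a made-up key.
struct FakeServer {
    calls: u32,
}

impl Enroller for FakeServer {
    fn enroll(&mut self, machine_uid: &str) -> Result<String, String> {
        self.calls += 1;
        Ok(format!("cak_demo_{}", machine_uid))
    }
}
```

Calling `obtain_cak` twice against the same store should reach the fake server only once, which is the "no re-enroll" property the agent bullet describes.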
## Security considerations
- **Two-tier credential model:** low-sensitivity **enrollment key** (gates "may register",
shared per site, rotatable) vs. high-sensitivity **per-machine `cak_`** (operating
credential, per-machine revocation). Compromise of an enrollment key is recovered by
rotating one site — no fleet-wide re-key.
- **Enrollment keys stored hashed** (Argon2id); plaintext shown once at issue/rotate.
- **`cak_` at rest on the endpoint** is stored as a **DPAPI-machine-encrypted blob inside a
SYSTEM-ACL'd location** (HKLM value or `ProgramData` file) — both layers: the SYSTEM ACL
stops non-admin users reading it, and DPAPI-machine encryption makes a copied file/export
inert off the box. (Local admin/SYSTEM can always recover it; that is accepted, because the
blast radius of one leaked `cak_` is a single, independently revocable machine.)
- **`machine_uid` binding** is the spoof-guard SPEC-004 wants: a `cak_` is bound to a
`machine_uid`; a different box presenting another box's `cak_` is detectable.
- **Authorization model** for moves/enrolls is possession-of-destination-key in v1
(identity-based authz deferred to the technician-assisted path).
- **Open registration risk** is mitigated by requiring `(site_code + long key)` and
rate-limit/lockout; auto-approve is acceptable because the enrollment key is the gate and
every enrollment/site-move fires an alert.
- **Audit events:** enroll, re-enroll/reuse, site-move, key-rotate — all logged with
`machine_uid`, site, and source IP.
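
As one possible shape for the rate-limit/lockout mitigation, the sketch below counts failed enrollment attempts per `(site_code, source IP)` and locks that pair out for a fixed period. The thresholds, the type name `EnrollLimiter`, and the choice to pass time in as plain seconds are assumptions made for illustration; the spec does not fix any of them.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

const MAX_FAILURES: u32 = 5; // hypothetical threshold
const LOCKOUT_SECS: u64 = 900; // hypothetical lockout length

#[derive(Default)]
struct EnrollLimiter {
    // (site_code, source IP) -> (failures so far, locked-until timestamp in seconds)
    state: HashMap<(String, IpAddr), (u32, u64)>,
}

impl EnrollLimiter {
    /// True while the pair is inside its lockout period.
    fn is_locked(&self, site_code: &str, ip: IpAddr, now: u64) -> bool {
        match self.state.get(&(site_code.to_string(), ip)) {
            Some(&(_, until)) => now < until,
            None => false,
        }
    }

    /// Count a failed key check; start a lockout when the threshold is reached.
    fn record_failure(&mut self, site_code: &str, ip: IpAddr, now: u64) {
        let entry = self
            .state
            .entry((site_code.to_string(), ip))
            .or_insert((0, 0));
        entry.0 += 1;
        if entry.0 >= MAX_FAILURES {
            entry.0 = 0;
            entry.1 = now + LOCKOUT_SECS;
        }
    }

    /// A successful enrollment clears the counter for that pair.
    fn record_success(&mut self, site_code: &str, ip: IpAddr) {
        self.state.remove(&(site_code.to_string(), ip));
    }
}
```

The handler would call `is_locked` before verifying the key and reject early when it returns true. Keying on the pair rather than on the IP alone keeps one misconfigured installer at a shared address from locking out other sites behind the same NAT.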
## Testing strategy
- **Unit:** `machine_uid` derivation stability; enrollment-key verify/rotate; fingerprint
derivation; `cak_` mint/hash/verify; dedup decision (new vs. reuse vs. move).
- **Integration:** enroll new → row + `cak_` issued; re-enroll same `machine_uid` → reuse,
no duplicate; enroll with rotated (old) key → rejected; old `cak_` still connects after
rotation; rate-limit/lockout trips; site-move emits alert.
- **Manual:** build a site wrapper installer → run on a clean VM → appears in console under
correct site, immediately controllable; re-image VM → same row reused; `signtool verify
/pa` passes on the distributed wrapper and the laid-down agent.
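
The key-rotation and fingerprint cases in the unit and integration lists can be pinned down with a small model like the one below. Two loud assumptions: `toy_hash` is a non-cryptographic FNV-1a stand-in used only to keep the sketch dependency-free (the spec calls for Argon2id), and deriving the four hex digits from the low 16 bits of the stored hash is a guess, since the spec does not say what the code is derived from. `SiteKey` and its methods are hypothetical names.

```rust
/// Placeholder digest (64-bit FNV-1a). NOT suitable for secrets; the spec calls for Argon2id.
fn toy_hash(secret: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in secret.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

struct SiteKey {
    key_hash: u64,
    version: u32,
}

impl SiteKey {
    /// First issue for a site starts at version 1.
    fn issue(secret: &str) -> SiteKey {
        SiteKey { key_hash: toy_hash(secret), version: 1 }
    }

    /// Rotate: replace the stored hash and bump the version.
    fn rotate(&mut self, new_secret: &str) {
        self.key_hash = toy_hash(new_secret);
        self.version += 1;
    }

    /// Only the current secret verifies; anything issued before a rotation fails.
    fn verify(&self, presented: &str) -> bool {
        toy_hash(presented) == self.key_hash
    }

    /// Render `vN (XXXX)`; using the low 16 bits of the hash is an assumption.
    fn fingerprint(&self) -> String {
        format!("v{} ({:04X})", self.version, self.key_hash & 0xFFFF)
    }
}
```

Per-machine `cak_` keys are deliberately absent from this model: they live in `connect_agent_keys`, so rotating a `SiteKey` cannot affect them, which is the "old `cak_` still connects after rotation" case.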
## Effort estimate & dependencies
- **Size:** X-Large (agent + relay + DB migration + CI build/sign wrapper + dashboard
key/rotation surface).
- **Depends on:** SPEC-004 `machine_uid` (shared root); the CI signing already shipped
(SPEC-001 §2 / `release.yml`).
- **Unblocks:** SPEC-007 (installer builder gets a real per-site key + the signing
resolution), and the parked managed-agent test deployment on the internal beta machines.
- **Relationship to v2 phases:** sits with the Phase-1 secure-session-core (per-agent keys
+ identity) and feeds Phase-2 dashboard work.
## Resolved decisions (2026-06-02, Mike)
1. **Wrapper shape — BOTH.** v1 ships a signed bootstrapper `.exe` *and* a signed MSI per
site (ScreenConnect offers both; manual installs use the `.exe`, GPO/Intune fleet pushes
use the MSI). Same sign-once agent inside each.
2. **`cak_` storage — BOTH layers.** DPAPI-machine-encrypted blob stored in a SYSTEM-ACL'd
location. Non-admins can't read it; a stolen copy is inert off the box.
3. **Fingerprint — hex (`7F2A`).** Deliberately *not* the RMM word-code style, so the two
products' artifacts are never visually conflated.
4. **`machine_uid` — per-tenant scope, hardware-derived salt, collision-gated.** Dedup key
`(tenant_id, machine_uid)`; salt from stable hardware signals (survives same-hardware
re-image, separates distinct boxes); detected collisions (e.g. template-cloned VMs
sharing a hardware UUID) drop to pending + alert and require dashboard confirmation to
activate.
5. **Attended (support-code) path — unchanged.** `download_support` is filename-based
(`GuruConnect-<code>.exe`), not append-based, so renaming never breaks the signature —
it is already signing-safe. Only the managed `download_agent` append path is retired.
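
Decision 5 rests on the support code travelling in the file name rather than in the file contents. A minimal sketch of recovering it, assuming the agent reads the code from its own file name in the `GuruConnect-<code>.exe` pattern; the function name and the example code value are hypothetical.

```rust
/// Extract `<code>` from a `GuruConnect-<code>.exe` file name, or `None` if the
/// name does not follow that pattern. The match is case-sensitive in this sketch.
fn support_code_from_filename(name: &str) -> Option<&str> {
    const PREFIX: &str = "GuruConnect-";
    const SUFFIX: &str = ".exe";
    if !name.starts_with(PREFIX)
        || !name.ends_with(SUFFIX)
        || name.len() <= PREFIX.len() + SUFFIX.len()
    {
        return None;
    }
    // PREFIX and SUFFIX are ASCII, so these byte offsets are valid char boundaries.
    Some(&name[PREFIX.len()..name.len() - SUFFIX.len()])
}
```

Nothing here touches the file's bytes, so the same signed binary can be served under any number of names.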
## Remaining for planning
- Exact stable-hardware signal set for the salt (SMBIOS UUID alone vs. + motherboard/disk
serial) and hypervisor behavior matrix (which hypervisors duplicate the SMBIOS UUID on
clone → exercise the collision-gate).
- MSI authoring approach (WiX) and whether per-site config rides as a per-site MSI vs. a
base MSI + property/transform.