# SPEC-016: Zero-Touch Per-Site Agent Enrollment

**Status:** Proposed  
**Priority:** P1  
**Requested By:** Mike (2026-06-02)  
**Estimated Effort:** X-Large

## Overview

Give GuruConnect a ScreenConnect-class managed-agent enrollment flow: a technician runs
**one signed installer per site** on every machine at that site (no per-machine key
minting, no flags, no typing), and each machine **self-registers** on first run. The
server then mints it a per-machine `cak_` key bound to a stable, machine-derived
`machine_uid`. Each site installer carries a **rotatable per-site enrollment key** (a long
server-generated secret) plus a short human-readable **fingerprint** (`vN (XXXX)`) so an
operator can tell at a glance whether an installer is current. Rotating a site's key blocks
*new* enrollments from old installers while leaving already-enrolled machines untouched
(they hold their own `cak_`).

This is the missing piece that turns the v2 secure-session-core (SPEC-004 per-agent keys +
`machine_uid`) into a real product workflow, and it **resolves SPEC-007's open
signature-vs-appended-config question**: the agent binary is signed **once** in CI
(already shipped via `release.yml`), and per-site customization rides in a thin **signed
wrapper** that writes site config to the endpoint at install time. Site config is never
appended to the signed PE.

**Success criteria:**

1. A tech installs one site installer on N machines; all N appear in the console under the
   correct company/site, each as a distinct, deduplicated machine, with zero per-machine
   setup.
2. Re-installing / re-imaging the same hardware **reuses** the existing machine row (no
   ghost duplicates, which is the failure mode SPEC-004 documents).
3. Rotating a site's enrollment key makes old installers unable to enroll new machines,
   while every already-enrolled agent keeps working.
4. Every distributed installer is **validly Authenticode-signed** (SmartScreen/WDAC clean).

## Background: what exists today (confirmed in code)

- **Embedded config is append-based and breaks signing.** `server/src/api/downloads.rs`
  (`download_agent`, ~`:152`) reads `static/downloads/guruconnect.exe` and **appends**
  `MAGIC_MARKER` + `len:u32` + JSON (`:196`) to the end of the PE. The agent reads it back
  in `agent/src/config.rs` (`read_embedded_config`, `:223`). Appending bytes after a signed
  PE invalidates the Authenticode signature, so the current customization path and the
  newly-shipped CI signing are mutually exclusive.
- **No self-registration exists.** Per-agent `cak_` keys are minted **admin-only** in
  `server/src/api/machine_keys.rs` (`create_key`, `:119`; "Admin issued a per-agent key",
  `:146`). There is no endpoint where an agent first-run exchanges an enrollment credential
  for its own key.
- **Relay already accepts per-agent keys.** `server/src/relay/mod.rs`
  (`validate_agent_api_key`, `:417`) calls `crate::auth::agent_keys::verify_agent_key`
  (`:422`), which is the `cak_` path, then falls back to the **deprecated** shared
  `AGENT_API_KEY` (`:444`, logs a "migrate to per-agent `cak_`" warning).
- **Key primitives exist.** `server/src/auth/agent_keys.rs`: `generate_agent_key` mints a
  `cak_`-prefixed high-entropy key (`:36`/`:46`); `verify_agent_key` (`:71`).
  `server/src/db/agent_keys.rs` already inserts into `connect_agent_keys (machine_id,
  key_hash, tenant_id)` (`:47`); the v2 tenancy column is present (migration
  `004_v2_secure_session_core.sql`).
- **Identity is a random config UUID, not machine-derived.** This is the root cause of
  duplicates per SPEC-004 (`agent/src/config.rs` `generate_agent_id`, `:90`).
- **Agent mode dispatch:** `agent/src/main.rs` `Commands::Install` (`:160`) → `run_install`;
  `agent/src/config.rs` `detect_run_mode` (`:162`) returns `RunMode::PermanentAgent` when
  embedded config is present.

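
To make the append-based layout in the first bullet concrete, here is a minimal sketch of writing and reading such a trailer. It assumes the length is a little-endian `u32`; that byte order, the function names `append_config` and `read_appended_config`, and the `GCFG` marker used in the usage note are illustrative assumptions and are not taken from `downloads.rs` or `config.rs`.

```rust
/// Build the appended form: PE bytes, then marker, then `len:u32`, then JSON.
/// Assumption: `len` is little-endian; the real byte order lives in the codebase.
pub fn append_config(pe: &[u8], marker: &[u8], json: &[u8]) -> Vec<u8> {
    let mut out = pe.to_vec();
    out.extend_from_slice(marker);
    out.extend_from_slice(&(json.len() as u32).to_le_bytes());
    out.extend_from_slice(json);
    out
}

/// Locate the last `marker` in `image` and return the JSON bytes that follow
/// its length field, or `None` if the trailer is missing or truncated.
pub fn read_appended_config<'a>(image: &'a [u8], marker: &[u8]) -> Option<&'a [u8]> {
    if marker.is_empty() {
        return None;
    }
    // Search from the end so marker-like bytes inside the PE body are not matched first.
    let start = image.windows(marker.len()).rposition(|w| w == marker)?;
    let len_at = start + marker.len();
    let b = image.get(len_at..len_at + 4)?;
    let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
    let json_at = len_at + 4;
    image.get(json_at..json_at.checked_add(len)?)
}
```

Calling `append_config(pe, b"GCFG", json)` always yields a file whose bytes differ from the signed original, which is the property that makes this path incompatible with a sign-once binary.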
## Scope

### Included in v1 (CORE)

1. **`machine_uid`: deterministic machine identity (hardware-salted, per-tenant).** Derive
   a stable id from the Windows `MachineGuid`
   (`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`) **salted with stable hardware
   signals** (SMBIOS UUID / motherboard + disk serial), independent of the config-file
   `agent_id`. Hardware-derived salt is deliberate: it **survives an OS reinstall/re-image
   on the same hardware** (so the row is reused, which is the re-image dedup goal) while
   keeping distinct physical boxes distinct (a per-install *random* salt would break
   re-image dedup and is rejected). Uniqueness is scoped **per-tenant** (dedup key
   `(tenant_id, machine_uid)`), so the same hardware legitimately present in two tenants
   stays two independent rows. (Shared root with SPEC-004; whichever lands first owns the
   impl, the other consumes it.) Used as the dedup key for register/move.

   **Collision-gated activation.** The residual collision case is VMs/templates that share a
   hardware UUID (some hypervisors clone the SMBIOS UUID). When the server detects a
   `machine_uid` collision (a seemingly-different endpoint resolving to an existing uid), the
   endpoint does **not** auto-activate: it drops to a **pending** state, fires an alert, and
   an operator must confirm in the dashboard that the collided endpoint may activate. This is
   the one deliberate exception to auto-approve (see item 6).

2. **Per-site enrollment key + fingerprint.**
   - Long (≥256-bit) server-generated secret per site, stored **hashed** (Argon2id, same
     as `cak_`/passwords), never recoverable in plaintext after issue.
   - A non-secret **fingerprint** = monotonic version + short derived code in **hex**,
     rendered `vN (XXXX)` (e.g. `v3 (7F2A)`), shown in the dashboard, baked into the
     installer filename, and reported by the agent at enrollment. Hex is deliberate,
     **not** the RMM word-style code (`GREEN-FALCON`), so GuruConnect and GuruRMM
     artifacts are never visually conflated.
   - **Rotate** regenerates the secret and bumps the version; old installers are rejected
     for *new* enrollments; existing agents (holding `cak_`) are unaffected.

3. **Self-registration endpoint.** New `POST /api/enroll` (public; not JWT-authenticated,
   gated by the enrollment key instead) accepting `{ site_code, enrollment_key, machine_uid,
   hostname, labels{company,site,department,device_type,tags} }`:
   - Verify `(site_code, enrollment_key)` against the current per-site key.
   - **Dedup by `(tenant_id, machine_uid)`**: if the machine already exists in the site's
     tenant, reuse the row and rotate its `cak_`; else create the machine row.
   - Mint a `cak_` (reuse `generate_agent_key`), store hashed via `db::agent_keys` bound to
     `machine_id` (+ `tenant_id` from the site), return the plaintext `cak_` **once**.
   - Emit an audit event + **new-enrollment alert** (and a **site-move** alert when an
     existing `machine_uid` enrolls under a different site).
   - **Rate-limit + lockout** per `(site_code, source-IP)` as defense-in-depth (the key is
     long, so this is belt-and-suspenders, not load-bearing).

4. **Agent first-run enrollment.** On `RunMode::PermanentAgent` with no stored `cak_`:
   read site config → call `/api/enroll` with `machine_uid` → persist the returned `cak_`
   to the SYSTEM-only protected store (a DPAPI-machine-encrypted blob in a SYSTEM-ACL'd
   location, per Security considerations) →
   connect to `wss://connect.azcomputerguru.com/ws/agent` using the `cak_`. On subsequent
   runs, use the stored `cak_` directly (no re-enroll).

5. **Sign-once base + per-site signed wrapper (resolves SPEC-007 open question).**
   - The base agent is signed once in CI (`release.yml`, already shipped) and stays
     byte-identical for everyone.
   - Per-site customization (labels + enrollment key + fingerprint) is delivered to the
     endpoint **at install time** via a signing-safe channel, NOT appended to the signed
     PE. **v1 produces BOTH a signed bootstrapper `.exe` and a signed MSI per site**
     (ScreenConnect parity: manual installs grab the `.exe`, GPO/Intune fleet pushes take
     the MSI), both wrapping the same sign-once agent and writing the site config to the
     protected config location. The two differ only in packaging (bootstrapper stub vs.
     MSI package, with WiX as the candidate authoring tool); both are signed.
   - **Deprecate the append path** in `downloads.rs` for managed installs (keep only for
     attended/support-code if still needed), eliminating the signature-invalidation defect.

6. **Auto-approve posture (with collision-gate exception).** A self-registered machine is
   live and controllable immediately (ScreenConnect parity); the new-enrollment alert is the
   tripwire. The **one** exception is a detected `machine_uid` collision (item 1), which
   gates the endpoint to **pending** until an operator confirms it in the dashboard.

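
The outcomes described in items 1, 3, and 6 can be sketched as one decision function that runs after the enrollment key has verified. This is a sketch under stated assumptions: the type and function names are hypothetical, and `looks_like_other_endpoint` stands in for whatever collision heuristic planning settles on, since the spec does not define how a collision is detected.

```rust
/// What the server does with an enrollment request once the site key has verified.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EnrollDecision {
    /// No row for `(tenant_id, machine_uid)`: create one, mint a `cak_`, activate.
    CreateAndActivate,
    /// Same machine, same site: reuse the row and rotate its `cak_`.
    ReuseRow,
    /// Same machine, different site: reuse the row, rotate, fire a site-move alert.
    ReuseRowSiteMove,
    /// Looks like a different endpoint sharing the uid: hold as pending and alert.
    PendingCollision,
}

/// The existing row found by the `(tenant_id, machine_uid)` lookup, if any.
pub struct ExistingMachine {
    pub site_id: u64,
    /// Hypothetical flag: the result of a collision heuristic the spec leaves open.
    pub looks_like_other_endpoint: bool,
}

/// Pick the outcome for an enrollment into `enrolling_site_id`.
pub fn decide(existing: Option<&ExistingMachine>, enrolling_site_id: u64) -> EnrollDecision {
    match existing {
        None => EnrollDecision::CreateAndActivate,
        Some(m) if m.looks_like_other_endpoint => EnrollDecision::PendingCollision,
        Some(m) if m.site_id != enrolling_site_id => EnrollDecision::ReuseRowSiteMove,
        Some(_) => EnrollDecision::ReuseRow,
    }
}

/// Only a collision withholds activation; every other outcome is auto-approved.
pub fn auto_activates(decision: EnrollDecision) -> bool {
    decision != EnrollDecision::PendingCollision
}
```

Checking the collision flag before the site comparison means a cloned VM enrolling under another site is held as pending rather than reported as a site move.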
### Explicitly out of scope (ANTICIPATED: reserve room, do NOT build in v1)

The v1 data model and agent mode-dispatch must leave room for these without building them:

- **Per-site enrollment POLICY**: a `sites.enrollment_policy` field (default
  `auto-approve`; future `pending-approval`) plus per-seat/per-endpoint licensing controls.
  Commercial, multi-tenant (the `tenant_id` column already exists). Its own future SPEC.
- **Flag overrides**: `--enroll-key` / `--site-code` (generic installer, key supplied on
  the command line) and `--reassign` (move an existing machine to a new site, gated by
  possession of the destination site's key, with an **explicit accidental-move guard**:
  a different-site re-run refuses unless `--reassign` is passed) + cross-client move policy.
  Backend (`machine_uid` + authorized site + `cak_`) is designed to support it; CLI surface
  is deferred.
- **Technician-assisted interactive install**: `--technician` on a generic installer
  prompts for the tech's own server credentials, and on auth presents a **validated**
  Company/Site/tags picker from the live authorized list (authz-by-identity, full audit
  trail). Heaviest path (interactive UI + auth/list callback); deferred.

All three converge on the **same backend operation** delivered in v1: `machine_uid` +
authorized site + issued `cak_`. v1 only ships the per-site-embedded-key door.

## Architecture

- **Agent** (`agent/`): compute `machine_uid`; first-run enroll → store `cak_`; use stored
  `cak_` thereafter; read site config from the wrapper-written location instead of an
  appended PE blob. Touches `config.rs` (`EmbeddedConfig`/`detect_run_mode`/storage),
  `main.rs` (`Install`/run-mode), a new `enroll` client module, transport auth.
- **Relay-server** (`server/`): new `POST /api/enroll`; per-site key issue/rotate/verify;
  `machine_uid` dedup + site-move on register; audit + alert emission; rate-limit/lockout.
  Touches `api/` (new `enroll.rs`, `sites` key endpoints), `auth/agent_keys.rs`,
  `db/agent_keys.rs`, `relay/mod.rs` (enrollment vs. connect), `main.rs` routes.
- **Dashboard**: per-site enrollment-key display (fingerprint `vN (XXXX)`), **Rotate**
  action, "current installer" download wired to the signed wrapper build. (Builder UI is
  SPEC-007; this spec supplies the key/fingerprint/rotation it consumes.)
- **DB migration:** `site_enrollment_keys` (or columns on the site): `site_id`,
  `key_hash`, `version`, `fingerprint`, `created_at`, `rotated_at`, `active`. Reserve
  `sites.enrollment_policy` (nullable, default `auto-approve`) for the anticipated policy
  work. `connect_machines` gains `machine_uid` (unique per tenant, i.e. on
  `(tenant_id, machine_uid)`).
- **Protobuf** (`proto/guruconnect.proto`): no wire change required for enrollment if
  `/api/enroll` is REST; `AgentStatus` label fields per SPEC-007 (`department`,
  `device_type`) ride along if landed together.

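
The agent-side flow (enroll on first run, reuse the stored `cak_` afterwards) can be sketched with two small traits. Everything named here is hypothetical: `KeyStore`, `EnrollClient`, and `obtain_cak` are illustrative, the in-memory implementations exist only so the flow can be exercised without a server or registry, and the `cak_demo_` value is a placeholder rather than the real key format.

```rust
/// Where the agent keeps its `cak_` (in production: the protected store).
pub trait KeyStore {
    fn load(&self) -> Option<String>;
    fn save(&mut self, cak: &str);
}

/// Client for `POST /api/enroll`; returns the plaintext `cak_` on success.
pub trait EnrollClient {
    fn enroll(&mut self, machine_uid: &str) -> Result<String, String>;
}

/// Return the key to connect with, enrolling only when none is stored yet.
pub fn obtain_cak<S: KeyStore, C: EnrollClient>(
    store: &mut S,
    client: &mut C,
    machine_uid: &str,
) -> Result<String, String> {
    if let Some(cak) = store.load() {
        return Ok(cak); // subsequent runs: no re-enroll
    }
    let cak = client.enroll(machine_uid)?;
    store.save(&cak);
    Ok(cak)
}

/// In-memory stand-in for the protected store.
#[derive(Default)]
pub struct MemStore {
    pub cak: Option<String>,
}

impl KeyStore for MemStore {
    fn load(&self) -> Option<String> {
        self.cak.clone()
    }
    fn save(&mut self, cak: &str) {
        self.cak = Some(cak.to_string());
    }
}

/// Stand-in enroll client that counts calls and returns a placeholder key.
#[derive(Default)]
pub struct CountingClient {
    pub calls: u32,
}

impl EnrollClient for CountingClient {
    fn enroll(&mut self, machine_uid: &str) -> Result<String, String> {
        self.calls += 1;
        Ok(format!("cak_demo_{}", machine_uid))
    }
}
```

Keeping storage and transport behind traits lets the "no re-enroll on subsequent runs" rule be unit-tested without touching HKLM, DPAPI, or the network.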
## Security considerations

- **Two-tier credential model:** low-sensitivity **enrollment key** (gates "may register",
  shared per site, rotatable) vs. high-sensitivity **per-machine `cak_`** (operating
  credential, per-machine revocation). Compromise of an enrollment key is recovered by
  rotating one site, with no fleet-wide re-key.
- **Enrollment keys stored hashed** (Argon2id); plaintext shown once at issue/rotate.
- **`cak_` at rest on the endpoint** is stored as a **DPAPI-machine-encrypted blob inside a
  SYSTEM-ACL'd location** (HKLM value or `ProgramData` file). Both layers matter: the
  SYSTEM ACL stops non-admin users reading it, and DPAPI-machine encryption makes a copied
  file/export inert off the box. (Local admin/SYSTEM can always recover it; that is
  accepted, because the blast radius of one leaked `cak_` is a single, independently
  revocable machine.)
- **`machine_uid` binding** is the spoof-guard SPEC-004 wants: a `cak_` is bound to a
  `machine_uid`; a different box presenting another box's `cak_` is detectable.
- **Authorization model** for moves/enrolls is possession-of-destination-key in v1
  (identity-based authz deferred to the technician-assisted path).
- **Open registration risk** is mitigated by requiring `(site_code + long key)` and
  rate-limit/lockout; auto-approve is acceptable because the enrollment key is the gate and
  every enrollment/site-move fires an alert.
- **Audit events:** enroll, re-enroll/reuse, site-move, and key-rotate are all logged with
  `machine_uid`, site, and source IP.

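
As an illustration of the rate-limit and lockout bullet, the sketch below counts failed key checks per `(site_code, source IP)` and locks the pair out once a threshold is reached. The `EnrollLimiter` type, its thresholds, and the choice to count only failures are assumptions made for the example; the spec does not fix any of them.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

struct Entry {
    failures: u32,
    locked_until: Option<Instant>,
}

/// Failure counter with lockout, keyed by `(site_code, source IP)`.
pub struct EnrollLimiter {
    max_failures: u32,
    lockout: Duration,
    entries: HashMap<(String, IpAddr), Entry>,
}

impl EnrollLimiter {
    pub fn new(max_failures: u32, lockout: Duration) -> Self {
        EnrollLimiter { max_failures, lockout, entries: HashMap::new() }
    }

    /// False while the pair is locked out at time `now`.
    pub fn is_allowed(&self, site_code: &str, ip: IpAddr, now: Instant) -> bool {
        match self.entries.get(&(site_code.to_string(), ip)) {
            Some(Entry { locked_until: Some(until), .. }) => now >= *until,
            _ => true,
        }
    }

    /// Count a failed key check; lock the pair once the threshold is reached.
    pub fn record_failure(&mut self, site_code: &str, ip: IpAddr, now: Instant) {
        let e = self
            .entries
            .entry((site_code.to_string(), ip))
            .or_insert(Entry { failures: 0, locked_until: None });
        // An expired lockout starts a fresh count.
        if let Some(until) = e.locked_until {
            if now >= until {
                e.failures = 0;
                e.locked_until = None;
            }
        }
        e.failures += 1;
        if e.failures >= self.max_failures {
            e.locked_until = Some(now + self.lockout);
        }
    }

    /// A successful enrollment clears the pair's history.
    pub fn record_success(&mut self, site_code: &str, ip: IpAddr) {
        self.entries.remove(&(site_code.to_string(), ip));
    }
}
```

Passing `now` in explicitly keeps the limiter deterministic under test; a production version would also need eviction of stale entries, which is omitted here.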
## Testing strategy

- **Unit:** `machine_uid` derivation stability; enrollment-key verify/rotate; fingerprint
  derivation; `cak_` mint/hash/verify; dedup decision (new vs. reuse vs. move).
- **Integration:** enroll new → row + `cak_` issued; re-enroll same `machine_uid` → reuse,
  no duplicate; enroll with rotated (old) key → rejected; old `cak_` still connects after
  rotation; rate-limit/lockout trips; site-move emits alert; detected `machine_uid`
  collision → pending + alert, no auto-activation.
- **Manual:** build a site wrapper installer → run on a clean VM → appears in console under
  correct site, immediately controllable; re-image VM → same row reused; `signtool verify
  /pa` passes on the distributed wrapper and the laid-down agent.

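
The rotation cases in the integration list can be pinned down with a small in-memory model before any server code exists. This is only a sketch: `SiteModel` is hypothetical, secrets are compared in plaintext purely for brevity (the spec stores them hashed with Argon2id), and `machine_uid` dedup is deliberately left out.

```rust
/// In-memory model of one site's enrollment key and the `cak_` keys issued under it.
/// Assumption: plaintext comparison stands in for the hashed verify the spec calls for.
pub struct SiteModel {
    version: u32,
    enrollment_key: String,
    issued_caks: Vec<String>,
}

impl SiteModel {
    pub fn new(initial_key: &str) -> Self {
        SiteModel {
            version: 1,
            enrollment_key: initial_key.to_string(),
            issued_caks: Vec::new(),
        }
    }

    pub fn version(&self) -> u32 {
        self.version
    }

    /// Enroll with whatever key the installer carries; only the current key is accepted.
    pub fn enroll(&mut self, presented_key: &str, machine_uid: &str) -> Option<String> {
        if presented_key != self.enrollment_key {
            return None;
        }
        // Placeholder key value; the real `cak_` is high-entropy and server-generated.
        let cak = format!("cak_demo_{}_{}", machine_uid, self.issued_caks.len());
        self.issued_caks.push(cak.clone());
        Some(cak)
    }

    /// Rotation replaces the secret and bumps the version; issued `cak_` keys are untouched.
    pub fn rotate(&mut self, new_key: &str) {
        self.enrollment_key = new_key.to_string();
        self.version += 1;
    }

    /// Relay-side check: a machine connects with its own `cak_`, never the enrollment key.
    pub fn can_connect(&self, cak: &str) -> bool {
        self.issued_caks.iter().any(|k| k == cak)
    }
}
```

The model makes the two-tier split visible: `rotate` touches only the enrollment secret, so the list of issued `cak_` keys, and therefore every enrolled machine's ability to connect, is unaffected.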
## Effort estimate & dependencies

- **Size:** X-Large (agent + relay + DB migration + CI build/sign wrapper + dashboard
  key/rotation surface).
- **Depends on:** SPEC-004 `machine_uid` (shared root); the CI signing already shipped
  (SPEC-001 §2 / `release.yml`).
- **Unblocks:** SPEC-007 (installer builder gets a real per-site key + the signing
  resolution), and the parked managed-agent test deployment on the internal beta machines.
- **Relationship to v2 phases:** sits with the Phase-1 secure-session-core (per-agent keys
  + identity) and feeds Phase-2 dashboard work.

## Resolved decisions (2026-06-02, Mike)

1. **Wrapper shape: BOTH.** v1 ships a signed bootstrapper `.exe` *and* a signed MSI per
   site (ScreenConnect offers both; manual installs use the `.exe`, GPO/Intune fleet pushes
   use the MSI). Same sign-once agent inside each.
2. **`cak_` storage: BOTH layers.** DPAPI-machine-encrypted blob stored in a SYSTEM-ACL'd
   location. Non-admins can't read it; a stolen copy is inert off the box.
3. **Fingerprint: hex (`7F2A`).** Deliberately *not* the RMM word-code style, so the two
   products' artifacts are never visually conflated.
4. **`machine_uid`: per-tenant scope, hardware-derived salt, collision-gated.** Dedup key
   `(tenant_id, machine_uid)`; salt from stable hardware signals (survives same-hardware
   re-image, separates distinct boxes); detected collisions (e.g. template-cloned VMs
   sharing a hardware UUID) drop to pending + alert and require dashboard confirmation to
   activate.
5. **Attended (support-code) path: unchanged.** `download_support` is filename-based
   (`GuruConnect-<code>.exe`), not append-based, so renaming never breaks the signature;
   it is already signing-safe. Only the managed `download_agent` append path is retired.

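
Decision 3 fixes the fingerprint's appearance but not how the four hex digits are derived. The sketch below renders and parses the `vN (XXXX)` form, assuming the code is the first two bytes of some non-secret digest of the key; that derivation and both function names are illustrative assumptions.

```rust
/// Render the non-secret fingerprint as `vN (XXXX)`.
/// Assumption: the code is the first two bytes of a non-secret digest of the key.
pub fn render_fingerprint(version: u32, digest: &[u8]) -> Option<String> {
    let (a, b) = (*digest.get(0)?, *digest.get(1)?);
    Some(format!("v{} ({:02X}{:02X})", version, a, b))
}

/// Parse `vN (XXXX)` back into `(version, code)`; rejects anything else,
/// including word-style codes.
pub fn parse_fingerprint(s: &str) -> Option<(u32, u16)> {
    let rest = s.strip_prefix('v')?;
    let (ver, code) = rest.split_once(" (")?;
    let code = code.strip_suffix(')')?;
    let ver_ok = !ver.is_empty() && ver.bytes().all(|b| b.is_ascii_digit());
    let code_ok = code.len() == 4 && code.bytes().all(|b| b.is_ascii_hexdigit());
    if !ver_ok || !code_ok {
        return None;
    }
    Some((ver.parse().ok()?, u16::from_str_radix(code, 16).ok()?))
}
```

With these definitions, `render_fingerprint(3, &[0x7F, 0x2A])` yields `v3 (7F2A)`, and a word-style string such as `GREEN-FALCON` fails to parse, which keeps the two products' codes mechanically distinguishable as well as visually.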
## Remaining for planning

- Exact stable-hardware signal set for the salt (SMBIOS UUID alone vs. + motherboard/disk
  serial) and hypervisor behavior matrix (which hypervisors duplicate the SMBIOS UUID on
  clone → exercise the collision-gate).
- MSI authoring approach (WiX) and whether per-site config rides as a per-site MSI vs. a
  base MSI + property/transform.
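
Because the signal set is still open, a derivation sketch is most useful if it takes the signals as a parameter. The version below is an assumption-heavy illustration: `derive_machine_uid` is a hypothetical name, the inputs are normalized by trimming and upper-casing, and FNV-1a is used only as a dependency-free placeholder where a real implementation would use a cryptographic hash such as SHA-256.

```rust
/// FNV-1a 64-bit. Placeholder only: not collision-resistant, and standing in
/// for a cryptographic hash in this sketch.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Normalize a raw signal so casing and stray whitespace do not change the uid.
fn normalize(signal: &str) -> String {
    signal.trim().to_ascii_uppercase()
}

/// Derive a stable id from `MachineGuid` plus an ordered list of hardware signals.
/// Which signals are passed in is the open planning question; order matters.
pub fn derive_machine_uid(machine_guid: &str, hardware_signals: &[&str]) -> String {
    let mut material = normalize(machine_guid);
    for signal in hardware_signals {
        material.push('\u{1f}'); // unit separator keeps field boundaries unambiguous
        material.push_str(&normalize(signal));
    }
    format!("{:016x}", fnv1a64(material.as_bytes()))
}
```

Taking the signals as a slice lets the hypervisor behavior matrix be explored by swapping inputs: two clones that report the same values for every signal will produce the same uid, which is exactly the case the collision gate exists to catch.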