guru-connect/docs/specs/SPEC-016-zero-touch-enrollment.md
Mike Swanson c286a29b9d
spec: SPEC-016 resolve all 5 open questions (enrollment design decisions)
Fold the 2026-06-02 interview decisions into SPEC-016:
- Installer wrapper: ship BOTH signed .exe and signed MSI per site
- cak_ at-rest storage: DPAPI-machine-encrypted blob in a SYSTEM-ACL'd location
- Fingerprint: hex (7F2A), deliberately unlike RMM word-codes
- machine_uid: per-tenant scope + hardware-derived salt (survives re-image,
  separates distinct boxes) + collision-gated activation (template-cloned VMs
  sharing a hardware UUID drop to pending + alert, need dashboard confirm)
- Attended support-code path: unchanged (filename-based, already signing-safe)

Open Questions section -> Resolved decisions + a short Remaining-for-planning
list (exact hardware salt signal set, WiX/MSI authoring approach).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:54:19 -07:00


SPEC-016: Zero-Touch Per-Site Agent Enrollment

Status: Proposed
Priority: P1
Requested By: Mike (2026-06-02)
Estimated Effort: X-Large

Overview

Give GuruConnect a ScreenConnect-class managed-agent enrollment flow: a technician runs one signed installer per site on every machine at that site — no per-machine key minting, no flags, no typing — and each machine self-registers on first run, the server minting it a per-machine cak_ key bound to a stable, machine-derived machine_uid. Each site installer carries a rotatable per-site enrollment key (a long server-generated secret) plus a short human-readable fingerprint (vN (XXXX)) so an operator can tell at a glance whether an installer is current. Rotating a site's key blocks new enrollments from old installers while leaving already-enrolled machines untouched (they hold their own cak_).

This is the missing piece that turns the v2 secure-session-core (SPEC-004 per-agent keys + machine_uid) into a real product workflow, and it resolves SPEC-007's open signature-vs-appended-config question: the agent binary is signed once in CI (already shipped via release.yml), and per-site customization rides in a thin signed wrapper that writes site config to the endpoint at install time — never appended into the signed PE.

Success criteria:

  1. A tech installs one site installer on N machines; all N appear in the console under the correct company/site, each as a distinct, deduplicated machine — zero per-machine setup.
  2. Re-installing / re-imaging the same hardware reuses the existing machine row (no ghost duplicates — the failure mode SPEC-004 documents).
  3. Rotating a site's enrollment key makes old installers unable to enroll new machines, while every already-enrolled agent keeps working.
  4. Every distributed installer is validly Authenticode-signed (SmartScreen/WDAC clean).

Background — what exists today (confirmed in code)

  • Embedded config is append-based and breaks signing. server/src/api/downloads.rs (download_agent, ~:152) reads static/downloads/guruconnect.exe and appends MAGIC_MARKER + len:u32 + JSON (:196) to the end of the PE. The agent reads it back in agent/src/config.rs (read_embedded_config, :223). Appending bytes after a signed PE invalidates the Authenticode signature — so the current customization path and the newly-shipped CI signing are mutually exclusive.
  • No self-registration exists. Per-agent cak_ keys are minted admin-only in server/src/api/machine_keys.rs (create_key, :119; "Admin issued a per-agent key", :146). There is no endpoint where an agent first-run exchanges an enrollment credential for its own key.
  • Relay already accepts per-agent keys. server/src/relay/mod.rs (validate_agent_api_key, :417) calls crate::auth::agent_keys::verify_agent_key (:422) — the cak_ path — then falls back to the deprecated shared AGENT_API_KEY (:444, logs a "migrate to per-agent cak_" warning).
  • Key primitives exist. server/src/auth/agent_keys.rs: generate_agent_key mints a cak_-prefixed high-entropy key (:36/:46); verify_agent_key (:71). server/src/db/agent_keys.rs already inserts into connect_agent_keys (machine_id, key_hash, tenant_id) (:47) — the v2 tenancy column is present (migration 004_v2_secure_session_core.sql).
  • Identity is a random config UUID, not machine-derived — the root cause of duplicates per SPEC-004 (agent/src/config.rs generate_agent_id, :90).
  • Agent mode dispatch: agent/src/main.rs Commands::Install (:160) → run_install; agent/src/config.rs detect_run_mode (:162) returns RunMode::PermanentAgent when embedded config is present.

Scope

Included in v1 (CORE)

  1. machine_uid — deterministic machine identity (hardware-salted, per-tenant). Derive a stable id from the Windows MachineGuid (HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid) salted with stable hardware signals (SMBIOS UUID / motherboard + disk serial), independent of the config-file agent_id. Hardware-derived salt is deliberate: it survives an OS reinstall/re-image on the same hardware (so the row is reused — the re-image dedup goal) while keeping distinct physical boxes distinct (a per-install random salt would break re-image dedup and is rejected). Uniqueness is scoped per-tenant — dedup key (tenant_id, machine_uid) — so the same hardware legitimately present in two tenants stays two independent rows. (Shared root with SPEC-004; whichever lands first owns the impl, the other consumes it.) Used as the dedup key for register/move.

    Collision-gated activation. The residual collision case is VMs/templates that share a hardware UUID (some hypervisors clone the SMBIOS UUID). When the server detects a machine_uid collision (a seemingly-different endpoint resolving to an existing uid), the endpoint does not auto-activate: it drops to a pending state, fires an alert, and an operator must confirm in the dashboard that the collided endpoint may activate. This is the one deliberate exception to auto-approve (see item 6).

  2. Per-site enrollment key + fingerprint.

    • Long (≥256-bit) server-generated secret per site, stored hashed (Argon2id, same as cak_/passwords), never recoverable in plaintext after issue.
    • A non-secret fingerprint = monotonic version + short derived code in hex, rendered vN (XXXX) (e.g. v3 (7F2A)), shown in the dashboard, baked into the installer filename, and reported by the agent at enrollment. Hex is deliberate — not the RMM word-style code (GREEN-FALCON) — so GuruConnect and GuruRMM artifacts are never visually conflated.
    • Rotate regenerates the secret and bumps the version; old installers are rejected for new enrollments; existing agents (holding cak_) are unaffected.
  3. Self-registration endpoint. New POST /api/enroll (public, unauthenticated by JWT — gated by the enrollment key) accepting { site_code, enrollment_key, machine_uid, hostname, labels{company,site,department,device_type,tags} }:

    • Verify (site_code, enrollment_key) against the current per-site key.
    • Dedup by machine_uid within the tenant (dedup key (tenant_id, machine_uid), per item 1): if the machine exists, reuse the row and rotate its cak_; else create the machine row.
    • Mint a cak_ (reuse generate_agent_key), store hashed via db::agent_keys bound to machine_id (+ tenant_id from the site), return the plaintext cak_ once.
    • Emit an audit event + new-enrollment alert (and a site-move alert when an existing machine_uid enrolls under a different site).
    • Rate-limit + lockout per (site_code, source-IP) as defense-in-depth (the key is long, so this is belt-and-suspenders, not load-bearing).
  4. Agent first-run enrollment. On RunMode::PermanentAgent with no stored cak_: read site config → call /api/enroll with machine_uid → persist the returned cak_ as a DPAPI-machine-encrypted blob in a SYSTEM-ACL'd location (HKLM value or ProgramData file) → connect to wss://connect.azcomputerguru.com/ws/agent using the cak_. On subsequent runs, use the stored cak_ directly (no re-enroll).

  5. Sign-once base + per-site signed wrapper (resolves SPEC-007 open question).

    • The base agent is signed once in CI (release.yml, already shipped) and stays byte-identical for everyone.
    • Per-site customization (labels + enrollment key + fingerprint) is delivered to the endpoint at install time via a signing-safe channel — NOT appended to the signed PE. v1 produces BOTH a signed bootstrapper .exe and a signed MSI per site (ScreenConnect parity — manual installs grab the .exe, GPO/Intune fleet pushes take the MSI), both wrapping the same sign-once agent and writing the site config to the protected config location. The two differ only in packaging (bootstrapper stub vs. WiX bundle); both are signed.
    • Deprecate the append path in downloads.rs for managed installs (keep only for attended/support-code if still needed), eliminating the signature-invalidation defect.
  6. Auto-approve posture (with collision-gate exception). A self-registered machine is live and controllable immediately (ScreenConnect parity); the new-enrollment alert is the tripwire. The one exception is a detected machine_uid collision (item 1), which gates the endpoint to pending until an operator confirms it in the dashboard.
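
The dedup, site-move, and collision-gate rules in items 1, 3, and 6 can be read as one pure decision made by the enroll handler. The following is a minimal sketch under stated assumptions, not the planned implementation: the type and function names are hypothetical, and how the server decides that an endpoint "seems different" is not specified in this spec, so it is passed in as a plain boolean. Audit events and alerts would be emitted by the caller based on the outcome.

```rust
/// Whether the machine may be controlled right after enrollment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Activation {
    Active,  // auto-approved: live and controllable immediately
    Pending, // held until an operator confirms in the dashboard
}

/// What the enroll handler did with the machine row (names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnrollOutcome {
    Created,
    Reused,
    SiteMoved { from_site: u64 },
    CollisionPending,
}

/// Hypothetical view of a row already found by (tenant_id, machine_uid).
pub struct ExistingMachine {
    pub site_id: u64,
}

/// Decide the outcome of an enrollment whose (site_code, enrollment_key)
/// has already been verified. `looks_like_different_endpoint` stands in for
/// the collision heuristic, which the spec leaves open.
pub fn decide_enrollment(
    existing: Option<&ExistingMachine>,
    target_site: u64,
    looks_like_different_endpoint: bool,
) -> (EnrollOutcome, Activation) {
    match existing {
        // No row for this uid in the tenant: create one, auto-approve.
        None => (EnrollOutcome::Created, Activation::Active),
        // Same uid but apparently another box: the one exception to auto-approve.
        Some(_) if looks_like_different_endpoint => {
            (EnrollOutcome::CollisionPending, Activation::Pending)
        }
        // Same machine enrolling under another site: move it.
        Some(row) if row.site_id != target_site => (
            EnrollOutcome::SiteMoved { from_site: row.site_id },
            Activation::Active,
        ),
        // Same machine, same site: reuse the row (and rotate its cak_).
        Some(_) => (EnrollOutcome::Reused, Activation::Active),
    }
}
```

Keeping the decision free of I/O means the "new vs. reuse vs. move" cases can be unit-tested without a database.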

Explicitly out of scope (ANTICIPATED — reserve room, do NOT build in v1)

The v1 data model and agent mode-dispatch must leave room for these without building them:

  • Per-site enrollment POLICY — a sites.enrollment_policy field (default auto-approve; future pending-approval) plus per-seat/per-endpoint licensing controls. Commercial, multi-tenant (the tenant_id column already exists). Its own future SPEC.
  • Flag overrides: --enroll-key / --site-code (generic installer, key supplied on the command line) and --reassign (move an existing machine to a new site, gated by possession of the destination site's key, with an explicit accidental-move guard: a different-site re-run refuses unless --reassign is passed) + cross-client move policy. Backend (machine_uid + authorized site + cak_) is designed to support it; CLI surface is deferred.
  • Technician-assisted interactive install: --technician on a generic installer prompts for the tech's own server credentials, and on auth presents a validated Company/Site/tags picker from the live authorized list (authz-by-identity, full audit trail). Heaviest path (interactive UI + auth/list callback); deferred.

All three converge on the same backend operation delivered in v1: machine_uid + authorized site + issued cak_. v1 only ships the per-site-embedded-key door.

Architecture

  • Agent (agent/): compute machine_uid; first-run enroll → store cak_; use stored cak_ thereafter; read site config from the wrapper-written location instead of an appended PE blob. Touches config.rs (EmbeddedConfig/detect_run_mode/storage), main.rs (Install/run-mode), a new enroll client module, transport auth.
  • Relay-server (server/): new POST /api/enroll; per-site key issue/rotate/verify; machine_uid dedup + site-move on register; audit + alert emission; rate-limit/lockout. Touches api/ (new enroll.rs, sites key endpoints), auth/agent_keys.rs, db/agent_keys.rs, relay/mod.rs (enrollment vs. connect), main.rs routes.
  • Dashboard: per-site enrollment-key display (fingerprint vN (XXXX)), Rotate action, "current installer" download wired to the signed wrapper build. (Builder UI is SPEC-007; this spec supplies the key/fingerprint/rotation it consumes.)
  • DB migration: site_enrollment_keys (or columns on the site): site_id, key_hash, version, fingerprint, created_at, rotated_at, active. Reserve sites.enrollment_policy (nullable, default auto-approve) for the anticipated policy work. connect_machines gains machine_uid (unique per tenant, i.e. a unique constraint on (tenant_id, machine_uid)).
  • Protobuf (proto/guruconnect.proto): no wire change required for enrollment if /api/enroll is REST; AgentStatus label fields per SPEC-007 (department, device_type) ride along if landed together.
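
To make the site_enrollment_keys columns and the rotate behaviour concrete, here is a minimal in-memory sketch. It is illustrative only: the struct mirrors the column list above, but the Rust names, the choice to start versions at 1, and the choice to keep the retired row rather than overwrite it are assumptions. How the short hex code is derived from the secret is not specified in this spec, so it is passed in as a u16; key_hash is treated as an opaque string standing in for the Argon2id hash.

```rust
/// In-memory stand-in for one site_enrollment_keys row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SiteEnrollmentKey {
    pub site_id: u64,
    pub key_hash: String, // opaque here; Argon2id hash in the real table
    pub version: u32,
    pub fingerprint: String,
    pub created_at: u64, // Unix seconds (assumed representation)
    pub rotated_at: Option<u64>,
    pub active: bool,
}

/// Render the non-secret fingerprint as `vN (XXXX)`, four upper-case hex digits.
pub fn render_fingerprint(version: u32, code: u16) -> String {
    format!("v{} ({:04X})", version, code)
}

impl SiteEnrollmentKey {
    /// First key for a site (assumption: versions start at 1).
    pub fn issue(site_id: u64, key_hash: String, code: u16, now: u64) -> Self {
        let version = 1;
        SiteEnrollmentKey {
            site_id,
            key_hash,
            version,
            fingerprint: render_fingerprint(version, code),
            created_at: now,
            rotated_at: None,
            active: true,
        }
    }

    /// Rotate: retire this key and return (retired, current).
    /// The version bumps by one, so old installers no longer match.
    pub fn rotate(&self, new_key_hash: String, new_code: u16, now: u64) -> (Self, Self) {
        let retired = SiteEnrollmentKey {
            rotated_at: Some(now),
            active: false,
            ..self.clone()
        };
        let version = self.version + 1;
        let current = SiteEnrollmentKey {
            site_id: self.site_id,
            key_hash: new_key_hash,
            version,
            fingerprint: render_fingerprint(version, new_code),
            created_at: now,
            rotated_at: None,
            active: true,
        };
        (retired, current)
    }
}
```

For example, `render_fingerprint(3, 0x7F2A)` yields `v3 (7F2A)`, the form shown in the dashboard and installer filename.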

Security considerations

  • Two-tier credential model: low-sensitivity enrollment key (gates "may register", shared per site, rotatable) vs. high-sensitivity per-machine cak_ (operating credential, per-machine revocation). Compromise of an enrollment key is recovered by rotating one site — no fleet-wide re-key.
  • Enrollment keys stored hashed (Argon2id); plaintext shown once at issue/rotate.
  • cak_ at rest on the endpoint is stored as a DPAPI-machine-encrypted blob inside a SYSTEM-ACL'd location (HKLM value or ProgramData file) — both layers: the SYSTEM ACL stops non-admin users reading it, and DPAPI-machine encryption makes a copied file/export inert off the box. (Local admin/SYSTEM can always recover it; that is accepted — blast radius of one leaked cak_ is a single, independently-revocable machine.)
  • machine_uid binding is the spoof-guard SPEC-004 wants: a cak_ is bound to a machine_uid; a different box presenting another box's cak_ is detectable.
  • Authorization model for moves/enrolls is possession-of-destination-key in v1 (identity-based authz deferred to the technician-assisted path).
  • Open registration risk is mitigated by requiring (site_code + long key) and rate-limit/lockout; auto-approve is acceptable because the enrollment key is the gate and every enrollment/site-move fires an alert.
  • Audit events: enroll, re-enroll/reuse, site-move, key-rotate — all logged with machine_uid, site, and source IP.
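
The rate-limit/lockout mitigation keyed on (site_code, source-IP) could look roughly like the sketch below. This is one possible shape, not the planned implementation: the type name, the in-memory HashMap, and the thresholds are all assumptions, and a production server would likely also need eviction of stale entries. The clock is passed in so the behaviour is testable.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

struct Entry {
    failures: u32,
    window_start: Instant,
    locked_until: Option<Instant>,
}

/// Hypothetical failure counter with lockout, keyed on (site_code, source IP).
pub struct EnrollLimiter {
    max_failures: u32,
    window: Duration,
    lockout: Duration,
    entries: HashMap<(String, IpAddr), Entry>,
}

impl EnrollLimiter {
    pub fn new(max_failures: u32, window: Duration, lockout: Duration) -> Self {
        EnrollLimiter { max_failures, window, lockout, entries: HashMap::new() }
    }

    /// True unless this (site_code, ip) pair is currently locked out.
    pub fn is_allowed(&self, site_code: &str, ip: IpAddr, now: Instant) -> bool {
        match self.entries.get(&(site_code.to_string(), ip)) {
            Some(Entry { locked_until: Some(until), .. }) => now >= *until,
            _ => true,
        }
    }

    /// Record a failed key check; lock the pair once the threshold is hit.
    pub fn record_failure(&mut self, site_code: &str, ip: IpAddr, now: Instant) {
        let entry = self
            .entries
            .entry((site_code.to_string(), ip))
            .or_insert(Entry { failures: 0, window_start: now, locked_until: None });
        // Start a fresh counting window if the previous one has expired.
        if now.saturating_duration_since(entry.window_start) > self.window {
            entry.failures = 0;
            entry.window_start = now;
        }
        entry.failures += 1;
        if entry.failures >= self.max_failures {
            entry.locked_until = Some(now + self.lockout);
            entry.failures = 0;
            entry.window_start = now;
        }
    }

    /// A successful enrollment clears the pair's history.
    pub fn record_success(&mut self, site_code: &str, ip: IpAddr) {
        self.entries.remove(&(site_code.to_string(), ip));
    }
}
```

Because the enrollment key is long, the limiter only needs to blunt noisy guessing and log-flooding; it is not what keeps an attacker out.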

Testing strategy

  • Unit: machine_uid derivation stability; enrollment-key verify/rotate; fingerprint derivation; cak_ mint/hash/verify; dedup decision (new vs. reuse vs. move).
  • Integration: enroll new → row + cak_ issued; re-enroll same machine_uid → reuse, no duplicate; enroll with rotated (old) key → rejected; old cak_ still connects after rotation; rate-limit/lockout trips; site-move emits alert.
  • Manual: build a site wrapper installer → run on a clean VM → appears in console under correct site, immediately controllable; re-image VM → same row reused; signtool verify /pa passes on the distributed wrapper and the laid-down agent.

Effort estimate & dependencies

  • Size: X-Large (agent + relay + DB migration + CI build/sign wrapper + dashboard key/rotation surface).
  • Depends on: SPEC-004 machine_uid (shared root); the CI signing already shipped (SPEC-001 §2 / release.yml).
  • Unblocks: SPEC-007 (installer builder gets a real per-site key + the signing resolution), and the parked managed-agent test deployment on the internal beta machines.
  • Relationship to v2 phases: sits with the Phase-1 secure-session-core (per-agent keys + identity) and feeds Phase-2 dashboard work.

Resolved decisions (2026-06-02, Mike)

  1. Wrapper shape — BOTH. v1 ships a signed bootstrapper .exe and a signed MSI per site (ScreenConnect offers both; manual installs use the .exe, GPO/Intune fleet pushes use the MSI). Same sign-once agent inside each.
  2. cak_ storage — BOTH layers. DPAPI-machine-encrypted blob stored in a SYSTEM-ACL'd location. Non-admins can't read it; a stolen copy is inert off the box.
  3. Fingerprint — hex (7F2A). Deliberately not the RMM word-code style, so the two products' artifacts are never visually conflated.
  4. machine_uid — per-tenant scope, hardware-derived salt, collision-gated. Dedup key (tenant_id, machine_uid); salt from stable hardware signals (survives same-hardware re-image, separates distinct boxes); detected collisions (e.g. template-cloned VMs sharing a hardware UUID) drop to pending + alert and require dashboard confirmation to activate.
  5. Attended (support-code) path — unchanged. download_support is filename-based (GuruConnect-<code>.exe), not append-based, so renaming never breaks the signature — it is already signing-safe. Only the managed download_agent append path is retired.
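
Decision 5 rests on the support code living in the file name rather than in the file bytes. As an illustration of why that is signing-safe, the agent side could recover the code with nothing more than a string operation on its own executable name. The function below is a hypothetical sketch, not the existing agent code: only the GuruConnect-<code>.exe pattern comes from this spec, and the code's format is not validated because it is not described here.

```rust
use std::path::Path;

/// Extract `<code>` from a file named `GuruConnect-<code>.exe`.
/// Returns None when the name does not follow that pattern.
/// Hypothetical helper: only the file name is inspected, never the file
/// contents, so the signed bytes are identical for every support code.
pub fn support_code_from_exe_name(exe_path: &Path) -> Option<String> {
    let name = exe_path.file_name()?.to_str()?;
    let stem = name.strip_suffix(".exe")?;
    let code = stem.strip_prefix("GuruConnect-")?;
    if code.is_empty() {
        None
    } else {
        Some(code.to_string())
    }
}
```

At startup the agent could pass the path from `std::env::current_exe()` to such a helper; a plain guruconnect.exe yields None and falls through to the other run modes.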

Remaining for planning

  • Exact stable-hardware signal set for the salt (SMBIOS UUID alone vs. + motherboard/disk serial) and hypervisor behavior matrix (which hypervisors duplicate the SMBIOS UUID on clone → exercise the collision-gate).
  • MSI authoring approach (WiX) and whether per-site config rides as a per-site MSI vs. a base MSI + property/transform.
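
Until the signal set is fixed, the machine_uid derivation can be sketched with the candidate signals as parameters. This is a sketch under loud assumptions: the names are hypothetical, and the 64-bit FNV-1a hash is only a dependency-free stand-in for a cryptographic hash such as SHA-256, which a real implementation should use. It takes MachineGuid as an input because the Scope section says so; since Windows setup typically generates that value at install time, whether it should remain in the hash given the re-image goal is worth confirming during planning. Tenant scoping is not hashed in: it comes from the (tenant_id, machine_uid) dedup key.

```rust
/// 64-bit FNV-1a. STAND-IN ONLY: not collision-resistant enough for
/// production identity; swap for SHA-256 or similar.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Trim and lower-case so formatting differences between reads
/// (e.g. upper-case vs. lower-case UUIDs) do not change the uid.
fn normalize(signal: &str) -> String {
    signal.trim().to_ascii_lowercase()
}

/// Candidate hardware signals; which of these are used is still open.
pub struct HardwareSignals<'a> {
    pub smbios_uuid: &'a str,
    pub board_serial: Option<&'a str>,
    pub disk_serial: Option<&'a str>,
}

/// Derive a deterministic machine_uid from MachineGuid plus hardware signals.
pub fn derive_machine_uid(machine_guid: &str, hw: &HardwareSignals) -> String {
    let fields = [
        machine_guid,
        hw.smbios_uuid,
        hw.board_serial.unwrap_or(""),
        hw.disk_serial.unwrap_or(""),
    ];
    let mut buf = Vec::new();
    for field in fields {
        let value = normalize(field);
        // Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        buf.extend_from_slice(value.as_bytes());
    }
    format!("{:016x}", fnv1a64(&buf))
}
```

The properties the unit tests would pin down are determinism (same inputs, same uid), insensitivity to case and surrounding whitespace, and separation of boxes whose hardware signals differ.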