20 Commits
v0.3.0 ... main

Author SHA1 Message Date
55b9c97b28 fix(agent): point Phase B fail-fast guard at SPEC-018
Some checks are pending
Build and Test / Build Server (Linux) (push) Waiting to run
Build and Test / Build Agent (Windows) (push) Waiting to run
Build and Test / Security Audit (push) Waiting to run
Build and Test / Build Summary (push) Blocked by required conditions
The SPEC-016 Phase B credential-store guard referenced "SPEC-017" for the
forthcoming SYSTEM service host, but 017 is now Mike's end-user-access
spec; the service host is SPEC-018. Comment + error-string text only, no
logic change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 13:13:13 -07:00
94c07c2431 spec: add SPEC-018 managed-agent SYSTEM service host + session broker
LocalSystem service that runs the persistent agent unattended and brokers
per-session capture/input workers (Session 0 can't capture directly).
Unblocks SPEC-016 Phase B end-to-end (SYSTEM-ACL'd cak_ store readable;
removes the Phase B fail-fast guard) and is the broker primitive SPEC-013
builds on. 017 was taken by Mike's end-user-access spec, so this is 018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 13:13:04 -07:00
4c49b73a71 spec: add SPEC-017 end-user (sub-user) remote access
Some checks failed
Build and Test / Build Summary (pull_request) Blocked by required conditions
Build and Test / Build Agent (Windows) (pull_request) Successful in 10m54s
Build and Test / Build Server (Linux) (push) Has been cancelled
Build and Test / Build Agent (Windows) (push) Has started running
Build and Test / Security Audit (push) Has been cancelled
Build and Test / Build Summary (push) Has been cancelled
Build and Test / Build Server (Linux) (pull_request) Successful in 15m39s
Build and Test / Security Audit (pull_request) Successful in 5m54s
2026-06-02 12:56:15 -07:00
367906bd54 fix(agent): SPEC-016 Phase B review fixes (re-image-stable machine_uid, ACL TOCTOU, load_cak error classes, PS timeout, fail-fast guard)
H1: derive machine_uid from the durable hardware salt ALONE (SMBIOS UUID, or
board+disk serial) plus a fixed namespace, so it survives an OS re-image (which
regenerates MachineGuid). MachineGuid is demoted to a last-resort signal used
only when no hardware salt is readable (volatile, reboot-only floor). Re-image
stability proven by salted_uid_is_reimage_stable_independent_of_machine_guid.

H2: in store_cak, lock the directory ACL BEFORE any secret bytes are written;
the temp file is created inside the already-locked dir, then renamed. No
ciphertext ever exists at an inherited/world-readable path. Ordering made an
explicit precondition, not an unstated inheritance assumption.

M1: load_cak now returns a LoadCakError enum distinguishing Io (incl.
PermissionDenied — operational) from Decrypt (the real tamper/wrong-machine
signal). Only a successful READ whose DPAPI decrypt fails hard-stops.

M2: the PowerShell SMBIOS/board/disk shell-out is spawned and waited on with a
10s wall-clock bound; on timeout the child is killed and the signal is treated
as missing (falls back through the chain), never panics. Keeps
CREATE_NO_WINDOW -NonInteractive -NoProfile.

L1: warn! breadcrumb when the salted derivation degrades to MachineGuid-only,
so the server-side collision-gate operator has a clue. No secret values logged.

C1: keep the SYSTEM+Administrators ACL (Option A target). store_cak now does a
read-back verification immediately after writing and fails at ENROLL time if
this context cannot read its own store; resolve_agent_credential fails fast with
an actionable SPEC-017 message on an access-denied store instead of silently
re-enrolling/bricking. Guarded comment notes this is satisfied once the SYSTEM
service host lands.
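
The M1 error split above can be illustrated with a small sketch. The enum name `LoadCakError` and its `Io` / `Decrypt` variants come from the commit message; the helper methods `is_hard_stop` and `is_access_denied` are hypothetical names added here for illustration, and the real DPAPI call is left out.

```rust
use std::fmt;
use std::io;

/// Sketch of the error classes described in M1 (variant payloads are assumed).
#[derive(Debug)]
pub enum LoadCakError {
    /// The store could not be read (missing file, PermissionDenied, ...): operational.
    Io(io::Error),
    /// The blob was read successfully but failed to decrypt: tamper or wrong machine.
    Decrypt,
}

impl fmt::Display for LoadCakError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadCakError::Io(e) => write!(f, "credential store I/O error: {e}"),
            LoadCakError::Decrypt => write!(f, "credential blob failed to decrypt"),
        }
    }
}

impl LoadCakError {
    /// Hypothetical helper: only a decrypt failure after a successful read hard-stops.
    pub fn is_hard_stop(&self) -> bool {
        matches!(self, LoadCakError::Decrypt)
    }

    /// Hypothetical helper: an access-denied read is the case that should
    /// produce the actionable "needs the SYSTEM service host" message.
    pub fn is_access_denied(&self) -> bool {
        matches!(self, LoadCakError::Io(e) if e.kind() == io::ErrorKind::PermissionDenied)
    }
}
```

Keeping the two classes separate lets the caller treat an unreadable store as an operational problem and reserve the hard stop for the real tamper signal.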

Deferred items (clear_cak placeholder, legacy api_key path) left as-is.

Verification on x86_64-pc-windows-msvc: cargo fmt --check clean, clippy
-D warnings clean, release build OK, 52 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 12:54:18 -07:00
52477e4c4a feat(agent): first-run enrollment client + run-mode wiring (SPEC-016 Phase B items 3,5)
New enroll module: on a managed agent with no stored cak_ but with
enrollment_key + site_code, POST machine_uid + hostname + labels to
<https-base>/api/enroll and persist the minted cak_. Handles every Phase A
status code distinctly:
  - 201 new / 200 reuse -> persist cak_ (DPAPI store) and connect
  - 202 collision_pending -> log "pending operator confirmation", slow
    re-check loop (no key issued; cannot connect until confirmed)
  - 401 ENROLL_REJECTED / 409 ENROLL_SITE_CONFLICT -> distinct actionable
    errors, long backoff (won't fix without operator action, but recovers
    automatically once it does) — no tight loop
  - 429 -> honor Retry-After, short backoff
  - network / 5xx / decode -> short backoff
The enrollment_key and cak_ are never logged. Uses the existing reqwest
client and the update path's TLS posture (rustls; dev-insecure only in
debug + opt-in). Wire-contract unit tests pin the request shape against
the server's EnrollRequest/EnrollLabels and decode active + pending bodies.
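
The status handling listed above can be sketched as a pure classification function. The type `EnrollAction` and the function `classify_enroll_status` are hypothetical names, not the agent's actual API. Mapping every status the commit does not mention to the short-backoff case is an assumption.

```rust
use std::time::Duration;

/// Hypothetical outcome type for one POST /api/enroll attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollAction {
    /// 201 new / 200 reuse: persist the minted key and connect.
    PersistAndConnect,
    /// 202 collision_pending: no key issued, slow re-check loop.
    AwaitOperatorConfirmation,
    /// 401 ENROLL_REJECTED: long backoff, needs operator action.
    Rejected,
    /// 409 ENROLL_SITE_CONFLICT: long backoff, needs operator action.
    SiteConflict,
    /// 429 with a Retry-After value to honor.
    RetryAfter(Duration),
    /// Network error, 5xx, decode failure, or 429 without Retry-After.
    RetryShort,
}

/// Classify an HTTP status (plus an optional Retry-After in seconds).
pub fn classify_enroll_status(status: u16, retry_after_secs: Option<u64>) -> EnrollAction {
    match status {
        200 | 201 => EnrollAction::PersistAndConnect,
        202 => EnrollAction::AwaitOperatorConfirmation,
        401 => EnrollAction::Rejected,
        409 => EnrollAction::SiteConflict,
        429 => match retry_after_secs {
            Some(secs) => EnrollAction::RetryAfter(Duration::from_secs(secs)),
            None => EnrollAction::RetryShort,
        },
        // Assumption: anything else is treated like the 5xx / decode case.
        _ => EnrollAction::RetryShort,
    }
}
```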

main.rs run-mode wiring: before a managed agent connects, resolve the
operating credential by precedence — stored cak_ (steady state, no
network) -> first-run enrollment -> DEPRECATED legacy api_key (transition
only, logged at WARNING) -> error. The relay already accepts the cak_ as
the api_key query param, so the persistent transport authenticates with it
unchanged. Attended/support-code and viewer paths are untouched.
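
The precedence described above could look roughly like the following. `CredentialInputs`, `CredentialSource`, and `resolve_credential_source` are hypothetical names. The sketch only decides which path to take and performs no I/O or network calls.

```rust
/// Hypothetical summary of what the agent has available at startup.
pub struct CredentialInputs {
    pub has_stored_cak: bool,
    pub has_enrollment_key: bool,
    pub has_site_code: bool,
    pub has_legacy_api_key: bool,
}

/// Which path supplies the operating credential.
#[derive(Debug, PartialEq, Eq)]
pub enum CredentialSource {
    StoredCak,
    FirstRunEnrollment,
    /// Deprecated transition path; the caller would log this at WARNING.
    LegacyApiKey,
}

/// Apply the precedence: stored key, then enrollment, then legacy key, else error.
pub fn resolve_credential_source(i: &CredentialInputs) -> Result<CredentialSource, &'static str> {
    if i.has_stored_cak {
        Ok(CredentialSource::StoredCak)
    } else if i.has_enrollment_key && i.has_site_code {
        Ok(CredentialSource::FirstRunEnrollment)
    } else if i.has_legacy_api_key {
        Ok(CredentialSource::LegacyApiKey)
    } else {
        Err("no stored credential, no enrollment material, and no legacy api_key")
    }
}
```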

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:44:40 -07:00
87c6e17d4a feat(agent): cak_ at-rest credential store (SPEC-016 Phase B item 4)
Store the per-machine cak_ with BOTH layers Mike locked: DPAPI-machine
encryption (CryptProtectData with CRYPTPROTECT_LOCAL_MACHINE — a copied
blob is inert off the box) inside a SYSTEM/Administrators-only ACL'd file
at %ProgramData%\GuruConnect\credentials\agent.cak. The directory + file
ACL is hardened via icacls (/inheritance:r + grant to the well-known SIDs
*S-1-5-18 and *S-1-5-32-544, locale-independent) — auditable, with far
less unsafe FFI than building a registry-key security descriptor by hand.
Co-locates with the existing %ProgramData%\GuruConnect config/seed dir.

Provides store_cak / load_cak / clear_cak. store_cak writes atomically
(temp file + rename in the locked dir). load_cak treats a present-but-
undecryptable blob as a hard error (tamper / cross-machine copy) rather
than silently re-enrolling over it. The plaintext is never logged; the
transient plaintext copy is scrubbed after encryption. DPAPI output blobs
are LocalFree'd. Enables the Win32_Security_Cryptography windows feature.

Round-trip unit tests cover encrypt/decrypt recovery across lengths and
that a tampered blob fails to decrypt (DPAPI authenticates its blobs).
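
The atomic write (temp file plus rename inside the same directory) can be sketched with the standard library alone. `write_atomically` and the `.tmp` suffix are hypothetical. The sketch assumes the directory ACL is already locked down, and it omits the DPAPI encryption step entirely.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dir/file_name` via a temp file in the same directory,
/// so a reader never observes a half-written file.
pub fn write_atomically(dir: &Path, file_name: &str, bytes: &[u8]) -> io::Result<()> {
    let final_path = dir.join(file_name);
    // Hypothetical temp name; it lives in the same (already locked) directory.
    let tmp_path = dir.join(format!("{file_name}.tmp"));
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(bytes)?;
        // Flush file contents to disk before the rename makes them visible.
        tmp.sync_all()?;
    }
    // A same-directory rename replaces any existing destination file.
    fs::rename(&tmp_path, &final_path)
}
```

Creating the temp file inside the target directory, rather than in a system temp location, is what keeps the rename on one volume and keeps the bytes under the directory's permissions at every step.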

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:44:23 -07:00
6a000d012f feat(agent): extend config contract for enrollment (SPEC-016 Phase B item 2)
Add enrollment_key + site_code to EmbeddedConfig and the resolved Config
alongside the existing labels, and add department/device_type label fields
(SPEC-007 AgentStatus parity). The legacy api_key is retained but made
optional/defaulted so a SPEC-016 site installer can carry only the
enrollment credentials; existing pre-enrollment installers still parse.

The enrollment fields are #[serde(skip)] on Config so they are never
written to the on-disk TOML (install-time material only); apply_enrollment_env
layers them from GURUCONNECT_ENROLLMENT_KEY / GURUCONNECT_SITE_CODE on the
file and env load paths. The embedded path carries them from the install
blob. Config delivery itself (signed wrapper) is Phase C and unchanged here.

Add Config::https_base() deriving the REST API base (https://host[:port])
from the wss:// server_url so the enroll client and the persistent
transport share one authority.
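
A minimal sketch of the `wss://` to `https://` derivation follows. It is written as a free function rather than the `Config::https_base()` method named above. Returning `None` for any non-`wss://` input is an assumption, and the example URL is invented.

```rust
/// Derive "https://host[:port]" from a "wss://host[:port]/path" URL.
/// Returns None when the input is not a wss:// URL or has no authority.
pub fn https_base(server_url: &str) -> Option<String> {
    let rest = server_url.strip_prefix("wss://")?;
    // The authority ends at the first path, query, or fragment delimiter.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let authority = &rest[..end];
    if authority.is_empty() {
        return None;
    }
    Some(format!("https://{authority}"))
}
```

For example, `https_base("wss://connect.example.com:8443/ws/agent")` yields `https://connect.example.com:8443`, so the enroll client and the transport end up on the same host and port.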

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:44:09 -07:00
d0b8db070f feat(agent): hardware-salt machine_uid (SPEC-016 Phase B item 1)
Extend the SPEC-004 machine_uid derivation with the locked SPEC-016
hardware salt: combine the Windows MachineGuid with the SMBIOS system
UUID (Win32_ComputerSystemProduct.UUID), falling back to motherboard
serial (Win32_BaseBoard.SerialNumber) + primary disk serial when the
SMBIOS UUID is absent or a degenerate placeholder (all-zeros / all-FFs,
emitted by some OEMs and hypervisor templates).

Signals are read via narrow PowerShell CIM queries (hidden window, no
profile) rather than adding a WMI crate or hand-rolling COM IWbemServices
for two scalar reads. Values are normalized (trim + upper-case) so vendor
case/space drift never perturbs the digest. The combined string is
SHA-256'd into the existing opaque muid_<hex> shape, preserving the wire
identity the relay connect path already reports while making it survive an
OS re-image on the same hardware. Which signal set fed the result is
logged (source label only, never the secret values).

Adds unit tests for derivation determinism + signal-sensitivity,
degenerate-SMBIOS rejection, and signal normalization.
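
The normalization and degenerate-UUID rejection described above can be sketched without any platform calls. All function names, the source labels, and the `|` separator are hypothetical. The PowerShell reads and the SHA-256 step are omitted because they need platform access and a hashing crate.

```rust
/// Trim and upper-case a raw signal so vendor case/space drift does not matter.
pub fn normalize_signal(raw: &str) -> String {
    raw.trim().to_uppercase()
}

/// True for an empty value or an all-zeros / all-Fs placeholder UUID.
pub fn is_degenerate_uuid(raw: &str) -> bool {
    let hex: Vec<char> = normalize_signal(raw).chars().filter(|c| *c != '-').collect();
    hex.is_empty() || hex.iter().all(|c| *c == '0') || hex.iter().all(|c| *c == 'F')
}

/// Pick the salt: SMBIOS UUID if usable, else board serial + disk serial.
/// Returns a source label (safe to log) and the normalized salt (never logged).
pub fn select_hardware_salt(
    smbios_uuid: Option<&str>,
    board_serial: Option<&str>,
    disk_serial: Option<&str>,
) -> Option<(&'static str, String)> {
    if let Some(uuid) = smbios_uuid {
        if !is_degenerate_uuid(uuid) {
            return Some(("smbios_uuid", normalize_signal(uuid)));
        }
    }
    match (board_serial, disk_serial) {
        (Some(b), Some(d)) if !b.trim().is_empty() && !d.trim().is_empty() => Some((
            "board+disk",
            format!("{}|{}", normalize_signal(b), normalize_signal(d)),
        )),
        _ => None,
    }
}
```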

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 11:43:56 -07:00
89c3718266 Merge pull request 'SPEC-016 Phase A: zero-touch enrollment backend + migration' (#5) from feat/spec-016-enrollment into main
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 10m37s
Build and Test / Build Server (Linux) (push) Successful in 15m25s
Build and Test / Security Audit (push) Successful in 5m28s
Build and Test / Build Summary (push) Successful in 23s
2026-06-02 11:19:37 -07:00
4106fc4bc4 style(enroll): cargo fmt --all (satisfy CI fmt gate)
All checks were successful
Build and Test / Build Agent (Windows) (pull_request) Successful in 16m35s
Build and Test / Build Server (Linux) (pull_request) Successful in 19m7s
Build and Test / Security Audit (pull_request) Successful in 5m27s
Build and Test / Build Summary (pull_request) Successful in 26s
The Phase A work passed cargo check + clippy + tests locally but missed
`cargo fmt --all -- --check` (the first step of the Linux CI job): module
ordering in db/mod.rs and two trailing-comment alignments in rate_limit.rs.
No logic change. Agent build failure on the prior run was transient infra
(verified: agent crate compiles clean locally).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 10:48:51 -07:00
0f02f23765 fix(enroll): SPEC-016 Phase A review fixes (cross-site guard, timing oracle, TOCTOU)
Some checks failed
Build and Test / Build Agent (Windows) (pull_request) Failing after 10m11s
Build and Test / Build Server (Linux) (pull_request) Failing after 10m5s
Build and Test / Security Audit (pull_request) Successful in 8m5s
Build and Test / Build Summary (pull_request) Has been skipped
Applies the four review fixes to POST /api/enroll, all in server/src/api/enroll.rs
(+ a new ENROLL_SITE_CONFLICT event type in server/src/db/events.rs):

1. HIGH — close the within-tenant cross-site silent-move hijack. A valid key for
   site B presented for a machine_uid already bound to a DIFFERENT site is now
   REFUSED (409 ENROLL_SITE_CONFLICT) instead of silently repointing the row and
   minting a fresh cak_. No move, no key. Emits an ENROLL_SITE_CONFLICT audit event
   + alert TODO. Same-site match still resolves to reuse; a NULL prior site_id is a
   first relational bind, not a move. The unauthenticated site_move mint path is
   removed; deliberate moves are deferred to the Phase-B --reassign flow + dashboard.

2. MEDIUM — kill the timing/enumeration oracle. Unknown site_code and no-active-key
   early rejects now pay a dummy Argon2id verify against a fixed, valid throwaway PHC
   constant (TIMING_EQUALIZER_PHC) before returning the identical 401, so every
   rejection path pays one KDF. The constant is asserted valid + verifying in tests.

3. LOW — fix the new-enroll TOCTOU. The dedup lookup + INSERT is wrapped in a bounded
   retry loop: a concurrent first-enroll of the same machine_uid whose INSERT loses
   the unique-index race (classified by is_machine_uid_conflict on SQLSTATE 23505 +
   machine_uid constraint) now re-looks-up and converges to reuse instead of 500ing.
   A non-machine_uid unique violation still surfaces as 500.

4. LOW — make the collision-gate doc honest + leave an enforcement TODO. The module
   doc now states the gate withholds only a NEWLY minted cak_ (a prior clean cak_
   survives) and that nothing consults enrollment_state at control time yet, with a
   TODO(SPEC-016 Phase B/D) marker for relay/control-plane enforcement + revocation.
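
The cross-site rule from fix 1 reduces to a small decision table, sketched below. `SiteDecision` and `decide_site` are hypothetical names, and using `i64` for a site id is an assumption about the schema.

```rust
/// Hypothetical result of matching a presented key's site against an existing row.
#[derive(Debug, PartialEq, Eq)]
pub enum SiteDecision {
    /// No row for this machine_uid yet.
    NewMachine,
    /// Row exists and is bound to the same site.
    Reuse,
    /// Row exists with a NULL site_id: first relational bind, not a move.
    FirstBind,
    /// Row is bound to a different site: refuse (409), no move, no key.
    SiteConflict,
}

/// `existing_site`: None = no row; Some(None) = row with NULL site_id;
/// Some(Some(id)) = row bound to site `id`.
pub fn decide_site(existing_site: Option<Option<i64>>, key_site: i64) -> SiteDecision {
    match existing_site {
        None => SiteDecision::NewMachine,
        Some(None) => SiteDecision::FirstBind,
        Some(Some(bound)) if bound == key_site => SiteDecision::Reuse,
        Some(Some(_)) => SiteDecision::SiteConflict,
    }
}
```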

Verify: cargo check, cargo clippy --all-targets, and cargo test all clean on this
Windows host (104 tests pass). Two DB-gated tests (cross-site bound-site_id exposure,
machine_uid-vs-agent_id conflict classification) no-op without TEST_DATABASE_URL and
run against real Postgres in CI; the Linux target / real-Postgres handler path is
validated there, not on this host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 10:28:31 -07:00
59e40c8019 feat(enroll): SPEC-016 Phase A — enrollment backend + migration
Server-side zero-touch per-site enrollment (Phase A: backend + DB only;
agent-side machine_uid derivation is Phase B, server treats it as opaque).

Migration 010_spec016_enrollment.sql:
- connect_sites: relational site anchor (site_code natural key, per-tenant
  unique). The spec assumed a sites table existed; it did not (site/company
  were free-text columns on connect_machines), so this creates a minimal one.
- site_enrollment_keys: rotatable, Argon2id-hashed cek_ secret + monotonic
  version + hex fingerprint + active flag; one-active-per-site partial unique.
- connect_machines: + site_id (FK), + enrollment_state ('active'|'pending')
  collision gate, + per-tenant (tenant_id, machine_uid) unique index added
  ALONGSIDE the 008 global index (the connect-path upsert_machine ON CONFLICT
  arbiter binds to 008 — dropping it would break live reconnect).
- connect_sites.enrollment_policy: reserved (default auto-approve), not enforced.

auth/enrollment_keys.rs: cek_ mint (256-bit, OS CSPRNG), Argon2id hash/verify
(reuses auth::password), and hex fingerprint vN (XXXX) per resolved-decision #3.

db/sites.rs + db/enrollment_keys.rs: runtime sqlx persistence; rotate_key
deactivates+inserts in one tx to hold the one-active-key invariant.

POST /api/enroll (public, api/enroll.rs): site_code+cek_ verify against active
key -> dedup on (tenant, machine_uid) -> new / reuse / site-move / collision.
Collision gate (PROVISIONAL heuristic: online existing row + different hostname)
-> pending, no usable cak_, alert. Mints cak_ via existing agent_keys path in the
exact form relay::validate_agent_api_key expects. Per-(site_code,IP) rate-limit +
lockout (EnrollLimiter). Audit events + [ENROLL] alert markers with
TODO(SPEC-016) #dev-alerts notes.
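
The provisional collision heuristic (existing row online, presented hostname different) can be written as a pure function. `EnrollmentState` mirrors the 'active' / 'pending' values from the migration, `collision_gate` is a hypothetical name, and the case-insensitive hostname comparison is an assumption.

```rust
/// Mirrors the enrollment_state values 'active' | 'pending'.
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollmentState {
    Active,
    Pending,
}

/// Provisional gate: an online existing row plus a different hostname
/// suggests two distinct boxes sharing one machine_uid, so hold as pending.
pub fn collision_gate(
    existing_online: bool,
    existing_hostname: &str,
    presented_hostname: &str,
) -> EnrollmentState {
    // Assumption: hostnames are compared trimmed and case-insensitively.
    let different_host = !existing_hostname
        .trim()
        .eq_ignore_ascii_case(presented_hostname.trim());
    if existing_online && different_host {
        EnrollmentState::Pending
    } else {
        EnrollmentState::Active
    }
}
```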

Admin (JWT) api/sites.rs: POST /api/sites/:id/enrollment-key/rotate (plaintext +
fingerprint once) and GET .../enrollment-key (fingerprint/version, no secret).

Routes wired in main.rs (enroll public, rotation admin). 13 new unit tests;
full server suite 99 passing. cargo check + clippy clean on the host (Windows)
target — Linux cross-target not installed here; server crate is platform-neutral
Rust. No sqlx offline cache needed (codebase uses runtime queries, no query!).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 10:12:35 -07:00
c286a29b9d spec: SPEC-016 resolve all 5 open questions (enrollment design decisions)
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 14m25s
Build and Test / Build Server (Linux) (push) Successful in 20m31s
Build and Test / Security Audit (push) Successful in 8m28s
Build and Test / Build Summary (push) Successful in 30s
Fold the 2026-06-02 interview decisions into SPEC-016:
- Installer wrapper: ship BOTH signed .exe and signed MSI per site
- cak_ at-rest storage: DPAPI-machine-encrypted blob in a SYSTEM-ACL'd location
- Fingerprint: hex (7F2A), deliberately unlike RMM word-codes
- machine_uid: per-tenant scope + hardware-derived salt (survives re-image,
  separates distinct boxes) + collision-gated activation (template-cloned VMs
  sharing a hardware UUID drop to pending + alert, need dashboard confirm)
- Attended support-code path: unchanged (filename-based, already signing-safe)

Open Questions section -> Resolved decisions + a short Remaining-for-planning
list (exact hardware salt signal set, WiX/MSI authoring approach).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:54:19 -07:00
18429f6fe3 spec: add SPEC-016 zero-touch per-site agent enrollment
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 10m46s
Build and Test / Build Server (Linux) (push) Successful in 15m33s
Build and Test / Security Audit (push) Successful in 6m3s
Build and Test / Build Summary (push) Successful in 25s
ScreenConnect-class managed enrollment: one signed installer per site,
machines self-register on first run and the server mints a per-machine
cak_ key bound to a deterministic machine_uid (dedups re-installs).
Per-site rotatable enrollment key (long secret + vN (XXXX) fingerprint);
rotating blocks new enrollments from old installers, leaves enrolled
agents untouched. Auto-approve + new-enrollment/site-move alert.

Resolves SPEC-007's signature-vs-appended-config open question:
sign the base agent once in CI + per-site signed wrapper that writes
site config around the signed bytes (never appended into the PE).

Deferred (room reserved): enrollment policy + per-seat licensing,
--enroll-key/--site-code/--reassign flag overrides, technician-assisted
interactive install. Tracking todo dbfe6a56.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 09:13:59 -07:00
3b9e4068c9 docs(roadmap): mark release signing shipped; add signed beta channel as P1-NOW
All checks were successful
Build and Test / Build Server (Linux) (push) Successful in 14m11s
Build and Test / Build Agent (Windows) (push) Successful in 8m3s
Build and Test / Security Audit (push) Successful in 5m38s
Build and Test / Build Summary (push) Successful in 17s
Release-path Azure Trusted Signing and auto-versioning were already
shipped with v0.3.0 (stale [ ] -> [x]). Add a new P1/NOW item for a
signed beta/test release channel: the auto build-and-test.yml agent
artifact is unsigned, so testers can receive unsigned binaries. The
beta channel (now implemented in release.yml) closes that gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:57:04 -07:00
87f229509b ci(release): add signed beta/test release channel
Some checks failed
Build and Test / Build Server (Linux) (push) Has started running
Build and Test / Build Agent (Windows) (push) Has started running
Build and Test / Security Audit (push) Has been cancelled
Build and Test / Build Summary (push) Has been cancelled
Add a `channel: stable | beta` workflow_dispatch input to release.yml.
`stable` is unchanged (byte-for-byte). `beta` produces a Windows agent
binary signed by the identical fail-closed Azure Trusted Signing path,
but skips the semver bump, changelog, and release commit, and publishes
a prerelease-tagged Gitea release (vX.Y.Z-beta.<run_number>) at HEAD.

So every binary handed to a tester is signed, not just formal releases.

- prerelease tags excluded from stable LAST_TAG detection (both lookups)
  so a beta tag can't corrupt the next stable version computation
- beta tag force-created/pushed -> idempotent on failed-run re-runs
- changelog download gated to stable; release prerelease flag plumbed
  through to the Gitea REST payload
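
The tag rules above (strict `vMAJOR.MINOR.PATCH` for stable lookups, `-beta.<run_number>` suffix for beta builds) are implemented in shell in release.yml. The same logic is sketched here in Rust purely for illustration, with hypothetical function names.

```rust
/// True only for strict final-release tags such as "v0.3.0".
/// Prerelease tags such as "v0.3.0-beta.7" are rejected.
pub fn is_stable_release_tag(tag: &str) -> bool {
    let Some(rest) = tag.strip_prefix('v') else {
        return false;
    };
    let parts: Vec<&str> = rest.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()))
}

/// Build the beta version string from the unbumped base version and the CI run number.
pub fn beta_version(base: &str, run_number: u64) -> String {
    format!("{base}-beta.{run_number}")
}
```

Because the beta suffix contains an extra dot-separated component and a non-digit segment, a beta tag can never pass the stable filter, which is what protects the next stable version computation.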

Reviewed-by: Code Review Agent (APPROVE WITH NITS; N1 hardened)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:56:17 -07:00
40c7d860cc spec(v2-session-core): add Task 9 — cak_ auto-enroll provisioning (TOFU) + shared-key retirement
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 7m10s
Build and Test / Build Server (Linux) (push) Successful in 10m31s
Build and Test / Security Audit (push) Successful in 4m1s
Build and Test / Build Summary (push) Successful in 9s
2026-06-01 14:40:14 -07:00
0059b21db6 fix(server): revert migration 008 comment edit — modifying an applied sqlx migration breaks its checksum and crash-loops the server on startup; machines.rs ON CONFLICT fix retained
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 7m33s
Build and Test / Build Server (Linux) (push) Successful in 11m57s
Build and Test / Security Audit (push) Successful in 4m33s
Build and Test / Build Summary (push) Successful in 11s
2026-06-01 10:05:38 -07:00
f950511e3e fix(server): bind machine_uid upsert ON CONFLICT to the partial index (WHERE machine_uid IS NOT NULL)
Some checks failed
Build and Test / Build Agent (Windows) (push) Successful in 8m16s
Build and Test / Build Server (Linux) (push) Successful in 11m58s
Build and Test / Security Audit (push) Has started running
Build and Test / Build Summary (push) Has been cancelled
Bare ON CONFLICT (machine_uid) could not bind to migration 008's partial unique index, so no connect_machines row was persisted for any agent reporting a machine_uid. Confirmed live on 172.16.3.30 with a signed 0.3.0 test agent.
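
For context, PostgreSQL only infers a partial unique index as the conflict arbiter when the `ON CONFLICT` target repeats a predicate that implies the index predicate. A sketch of what such a statement might look like as a query string follows. The table name, conflict column, and predicate come from the commit text; the `hostname` column and the `DO UPDATE` clause are assumptions for illustration.

```rust
/// Hypothetical upsert text; only the table name, the conflict column, and the
/// WHERE predicate are taken from the commit message.
pub const UPSERT_MACHINE_SQL: &str = "\
INSERT INTO connect_machines (machine_uid, hostname) \
VALUES ($1, $2) \
ON CONFLICT (machine_uid) WHERE machine_uid IS NOT NULL \
DO UPDATE SET hostname = EXCLUDED.hostname";
```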
2026-06-01 09:50:34 -07:00
16017456aa docs: 2026-05-31 security re-audit (Phase-1 EXIT) + roadmap reconcile
All checks were successful
Build and Test / Build Agent (Windows) (push) Successful in 6m59s
Build and Test / Build Server (Linux) (push) Successful in 10m35s
Build and Test / Security Audit (push) Successful in 4m3s
Build and Test / Build Summary (push) Successful in 7s
/gc-audit --pass=security re-pass over the deployed v0.3.0 code: PASS,
0 CRITICAL/HIGH/MEDIUM/LOW. The 3 relay CRITICALs stay closed (verified in
code AND live against the deployed binary), the prior agent-update-TLS HIGH
and chat-logging LOW are fixed, and the net-new SPEC-004 surface (machine_uid
dedup gate, session reaper/supersede, operator removal API) audits clean —
no non-admin removal path, no uid-spoof hijack, no auth-plane crossover.

Marks v2 Phase 1 formally exited (secure-session-core Task 8 complete).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 18:19:09 -07:00
27 changed files with 4523 additions and 45 deletions


@@ -27,6 +27,15 @@ on:
 # computes the next semver from conventional commits at dispatch time.
 # build-and-test.yml remains the automatic PR/push CI gate.
 workflow_dispatch:
+inputs:
+channel:
+description: 'Release channel (stable = full versioned release; beta = signed prerelease test build, no version bump/changelog)'
+required: true
+default: 'stable'
+type: choice
+options:
+- stable
+- beta
 jobs:
 # ---------------------------------------------------------------------------
@@ -36,8 +45,11 @@ jobs:
 name: Version + Changelog
 runs-on: ubuntu-latest
 outputs:
-version: ${{ steps.bump.outputs.version }}
-released: ${{ steps.bump.outputs.released }}
+# Coalesce across the stable (bump) and beta (beta) paths: exactly one of them runs per
+# dispatch, so the first non-empty value wins. prerelease is 'true' only on the beta path.
+version: ${{ steps.bump.outputs.version || steps.beta.outputs.version }}
+released: ${{ steps.bump.outputs.released || steps.beta.outputs.released }}
+prerelease: ${{ steps.beta.outputs.prerelease || 'false' }}
 steps:
 - name: Checkout (full history + tags)
 uses: actions/checkout@v4
@@ -59,7 +71,8 @@ jobs:
 fi
 - name: Install git-cliff
-if: steps.guard.outputs.skip != 'true'
+# Stable-only: beta produces no changelog, so git-cliff is unnecessary on the beta path.
+if: steps.guard.outputs.skip != 'true' && github.event.inputs.channel == 'stable'
 run: |
 set -euo pipefail
 CLIFF_VERSION="2.6.1"
@@ -72,12 +85,16 @@ jobs:
 - name: Determine next version and bump components
 id: bump
-if: steps.guard.outputs.skip != 'true'
+# Stable-only: the beta path (id: beta) handles versioning without a manifest bump/commit.
+if: steps.guard.outputs.skip != 'true' && github.event.inputs.channel == 'stable'
 run: |
 set -euo pipefail
 # ----- locate the last release tag (vX.Y.Z) -----
-LAST_TAG="$(git tag --list 'v*' --sort=-v:refname | head -n1 || true)"
+# Match ONLY strict final-release tags (vMAJOR.MINOR.PATCH). Beta tags look like
+# v0.3.0-beta.7; if one of those were picked up here it would corrupt the next stable
+# base version, so prerelease tags are explicitly excluded from this lookup.
+LAST_TAG="$(git tag --list 'v*' --sort=-v:refname | grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | head -n1 || true)"
 if [ -z "${LAST_TAG}" ]; then
 echo "[INFO] No prior release tag found; baseline is current manifest version."
 BASE_VERSION="$(grep -m1 '^version' agent/Cargo.toml | sed -E 's/.*"([0-9]+\.[0-9]+\.[0-9]+)".*/\1/')"
@@ -186,8 +203,39 @@ jobs:
 sed -i -E "0,/^version = \"[0-9]+\.[0-9]+\.[0-9]+\"/s//version = \"${NEXT}\"/" Cargo.toml || true
 fi
+- name: Beta channel - tag prerelease build (no bump, no commit, no changelog)
+id: beta
+# Beta-only path. Reuses the IDENTICAL downstream build + sign + publish jobs, but does
+# NOT compute a semver bump, mutate any manifest, generate a changelog, or make a release
+# commit. It just tags the CURRENT HEAD with a unique prerelease version so the Windows
+# build job can check out `ref: v${VER}` exactly as it does for stable.
+if: github.event.inputs.channel == 'beta' && steps.guard.outputs.skip != 'true'
+run: |
+set -euo pipefail
+# Base version is read straight from the agent manifest — NOT bumped, NOT written back.
+BASE="$(grep -m1 '^version' agent/Cargo.toml | sed -E 's/.*"([0-9]+\.[0-9]+\.[0-9]+)".*/\1/')"
+# GITHUB_RUN_NUMBER guarantees a unique prerelease suffix without counting existing tags.
+VER="${BASE}-beta.${GITHUB_RUN_NUMBER}"
+echo "[INFO] Beta build version: ${VER} (base ${BASE}, run ${GITHUB_RUN_NUMBER})"
+# Tag the current HEAD (no release commit). Push the tag so build-agent-windows can
+# check out ref: v${VER}.
+git config user.name "guruconnect-ci"
+git config user.email "ci@azcomputerguru.com"
+# Beta tags are disposable test markers; force makes re-running a failed beta dispatch idempotent (re-run reuses GITHUB_RUN_NUMBER, so the tag already exists).
+git tag -f "v${VER}"
+REMOTE="https://${{ secrets.CI_PUSH_TOKEN }}@git.azcomputerguru.com/${GITHUB_REPOSITORY}.git"
+git push --force "${REMOTE}" "v${VER}"
+echo "[OK] Pushed beta prerelease tag v${VER}"
+echo "version=${VER}" >> "$GITHUB_OUTPUT"
+echo "released=true" >> "$GITHUB_OUTPUT"
+echo "prerelease=true" >> "$GITHUB_OUTPUT"
 - name: Generate changelog (git-cliff)
-if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true'
+# Stable-only: beta produces no changelog artifact.
+if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true' && github.event.inputs.channel == 'stable'
 env:
 VERSION: ${{ steps.bump.outputs.version }}
 run: |
@@ -232,7 +280,10 @@ jobs:
 # Re-derive the set of changed components (same logic as the bump step). On the first
 # release (no prior tag) all components are considered changed.
-LAST_TAG="$(git tag --list 'v*' --sort=-v:refname | head -n1 || true)"
+# Match ONLY strict final-release tags (vMAJOR.MINOR.PATCH); exclude beta prerelease
+# tags (v0.3.0-beta.7) so the changelog diff range is taken against the last real
+# release, not an intervening beta build.
+LAST_TAG="$(git tag --list 'v*' --sort=-v:refname | grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | head -n1 || true)"
 if [ -z "${LAST_TAG}" ]; then
 CHANGED_FILES="$(git ls-files)"
 FIRST_RELEASE=true
@@ -252,7 +303,8 @@ jobs:
 fi
 - name: Commit release + create tag
-if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true'
+# Stable-only: beta tags HEAD directly in the beta step and never makes a release commit.
+if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true' && github.event.inputs.channel == 'stable'
 env:
 VERSION: ${{ steps.bump.outputs.version }}
 run: |
@@ -276,7 +328,8 @@ jobs:
 echo "[OK] Pushed release commit and tag v${VERSION}"
 - name: Upload changelog artifact
-if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true'
+# Stable-only: there is no changelog on the beta path, so nothing to upload.
+if: steps.guard.outputs.skip != 'true' && steps.bump.outputs.released == 'true' && github.event.inputs.channel == 'stable'
 uses: actions/upload-artifact@v3
 with:
 name: changelog
@@ -445,6 +498,9 @@ jobs:
 echo "sha256=${SUM}" >> "$GITHUB_OUTPUT"
 - name: Download changelog artifact
+# Stable-only: the beta path uploads no `changelog` artifact. The release-creation step
+# already guards on `[ -f changelog-artifact/CHANGELOG.md ]`, so skipping this is safe.
+if: github.event.inputs.channel == 'stable'
 uses: actions/download-artifact@v3
 with:
 name: changelog
@@ -472,17 +528,26 @@ jobs:
 env:
 VERSION: ${{ needs.version.outputs.version }}
 SHA256: ${{ steps.sha.outputs.sha256 }}
+# PRERELEASE is 'true' on the beta path, 'false' on stable; drives the Gitea release flag.
+PRERELEASE: ${{ needs.version.outputs.prerelease }}
 GITEA_TOKEN: ${{ secrets.CI_PUSH_TOKEN }}
 run: |
 set -euo pipefail
 API_BASE="https://git.azcomputerguru.com/api/v1/repos/${GITHUB_REPOSITORY}"
 TAG="v${VERSION}"
-echo "[INFO] Creating Gitea release ${TAG} on ${GITHUB_REPOSITORY}"
+echo "[INFO] Creating Gitea release ${TAG} on ${GITHUB_REPOSITORY} (prerelease=${PRERELEASE})"
+# Beta builds get a clear "prerelease test build" note in the body; the -beta.N suffix
+# is already carried in TAG, so the release name "Release v..." needs no extra handling.
+if [ "${PRERELEASE}" = "true" ]; then
+BODY="$(printf 'GuruConnect %s (PRERELEASE / beta test build)\n\nSHA-256 (guruconnect.exe): %s\n\nSigned via Azure Trusted Signing. Not a stable release — no changelog/version bump.' "${TAG}" "${SHA256}")"
+else
 BODY="$(printf 'GuruConnect %s\n\nSHA-256 (guruconnect.exe): %s\n\nSee CHANGELOG.md and /api/changelog for details.' "${TAG}" "${SHA256}")"
+fi
 # Build the JSON payload with python (handles escaping of the multi-line body safely).
-CREATE_PAYLOAD="$(TAG="$TAG" BODY="$BODY" python3 -c 'import json,os; print(json.dumps({"tag_name": os.environ["TAG"], "name": "Release " + os.environ["TAG"], "body": os.environ["BODY"], "draft": False, "prerelease": False}))')"
+# prerelease is derived from the PRERELEASE env var (beta -> true, stable -> false).
+CREATE_PAYLOAD="$(TAG="$TAG" BODY="$BODY" PRERELEASE="$PRERELEASE" python3 -c 'import json,os; print(json.dumps({"tag_name": os.environ["TAG"], "name": "Release " + os.environ["TAG"], "body": os.environ["BODY"], "draft": False, "prerelease": os.environ.get("PRERELEASE","false") == "true"}))')"
 RELEASE_JSON="$(curl -fsS -X POST \
 "${API_BASE}/releases" \

4
Cargo.lock generated

@@ -1407,7 +1407,7 @@ dependencies = [
  [[package]]
  name = "guruconnect"
- version = "0.2.0"
+ version = "0.3.0"
  dependencies = [
  "anyhow",
  "bytes",
@@ -1447,7 +1447,7 @@ dependencies = [
  [[package]]
  name = "guruconnect-server"
- version = "0.2.0"
+ version = "0.3.0"
  dependencies = [
  "anyhow",
  "argon2",


@@ -92,6 +92,7 @@ windows = { version = "0.58", features = [
  "Win32_System_Console",
  "Win32_System_Environment",
  "Win32_Security",
+ "Win32_Security_Cryptography",
  "Win32_Storage_FileSystem",
  "Win32_System_Pipes",
  "Win32_System_SystemServices",


@@ -16,18 +16,39 @@ use uuid::Uuid;
  const MAGIC_MARKER: &[u8] = b"GURUCONFIG";
  /// Embedded configuration data (appended to executable)
+ ///
+ /// SPEC-016 Phase B: a managed-install config now carries the per-site
+ /// `enrollment_key` + `site_code` so the agent can self-register on first run.
+ /// The legacy `api_key` is retained (defaulted) for backward-compat with older
+ /// pre-enrollment installers; a fresh site installer carries only the enrollment
+ /// credentials and the agent obtains its per-machine `cak_` via `/api/enroll`.
  #[derive(Debug, Clone, Serialize, Deserialize)]
  pub struct EmbeddedConfig {
  /// Server WebSocket URL
  pub server_url: String,
- /// API key for authentication
- pub api_key: String,
+ /// DEPRECATED shared/legacy API key for authentication. Optional — a
+ /// SPEC-016 site installer omits it and enrolls for a per-machine `cak_`.
+ #[serde(default)]
+ pub api_key: Option<String>,
+ /// Per-site enrollment key (`cek_`), the low-sensitivity registration gate
+ /// (SPEC-016 §Security). Presented to `/api/enroll`; never logged.
+ #[serde(default)]
+ pub enrollment_key: Option<String>,
+ /// Per-site code identifying which site this installer enrolls into.
+ #[serde(default)]
+ pub site_code: Option<String>,
  /// Company/organization name
  #[serde(default)]
  pub company: Option<String>,
  /// Site/location name
  #[serde(default)]
  pub site: Option<String>,
+ /// Department label (reserved — SPEC-007 AgentStatus parity).
+ #[serde(default)]
+ pub department: Option<String>,
+ /// Device-type label (reserved — SPEC-007 AgentStatus parity).
+ #[serde(default)]
+ pub device_type: Option<String>,
  /// Tags for categorization
  #[serde(default)]
  pub tags: Vec<String>,
@@ -52,9 +73,28 @@ pub struct Config {
  /// Server WebSocket URL (e.g., wss://connect.example.com/ws)
  pub server_url: String,
- /// Agent API key for authentication
+ /// Operating credential used to authenticate the persistent WS connection.
+ ///
+ /// SPEC-016 Phase B: the AUTHORITATIVE credential is a per-machine `cak_`
+ /// obtained at first-run enrollment and stored encrypted at rest (see
+ /// [`crate::credential_store`]); it is loaded into this field before connect.
+ /// A non-empty value carried in config is the DEPRECATED shared/legacy
+ /// `api_key`, kept only for transition compatibility. Empty means "not yet
+ /// enrolled / no credential" — the run-mode wiring must enroll first.
+ #[serde(default)]
  pub api_key: String,
+ /// Per-site enrollment key (`cek_`) — present only for a not-yet-enrolled
+ /// managed install. Never persisted to the on-disk TOML (it is install-time
+ /// material, delivered by the site wrapper); never logged.
+ #[serde(skip)]
+ pub enrollment_key: Option<String>,
+ /// Per-site code identifying which site to enroll into (paired with
+ /// `enrollment_key`). Not persisted to the on-disk TOML.
+ #[serde(skip)]
+ pub site_code: Option<String>,
  /// Unique agent identifier (generated on first run)
  #[serde(default = "generate_agent_id")]
  pub agent_id: String,
@@ -70,6 +110,14 @@ pub struct Config {
  #[serde(default)]
  pub site: Option<String>,
+ /// Department label (reserved — SPEC-007 AgentStatus parity).
+ #[serde(default)]
+ pub department: Option<String>,
+ /// Device-type label (reserved — SPEC-007 AgentStatus parity).
+ #[serde(default)]
+ pub device_type: Option<String>,
  /// Tags for categorization (from embedded config)
  #[serde(default)]
  pub tags: Vec<String>,
@@ -91,6 +139,25 @@ fn generate_agent_id() -> String {
  Uuid::new_v4().to_string()
  }
+ /// Layer SPEC-016 enrollment material from the environment onto a `Config`.
+ ///
+ /// `GURUCONNECT_ENROLLMENT_KEY` / `GURUCONNECT_SITE_CODE` only OVERRIDE when set
+ /// and non-empty, so embedded/install-time values already present on the config
+ /// are preserved. Used by the file and env load paths (the embedded path already
+ /// carries these from the install blob).
+ fn apply_enrollment_env(config: &mut Config) {
+ if let Ok(v) = std::env::var("GURUCONNECT_ENROLLMENT_KEY") {
+ if !v.is_empty() {
+ config.enrollment_key = Some(v);
+ }
+ }
+ if let Ok(v) = std::env::var("GURUCONNECT_SITE_CODE") {
+ if !v.is_empty() {
+ config.site_code = Some(v);
+ }
+ }
+ }
  #[derive(Debug, Clone, Serialize, Deserialize)]
  pub struct CaptureConfig {
  /// Target frames per second (1-60)
@@ -317,18 +384,26 @@ impl Config {
  info!("Using embedded configuration");
  let config = Config {
  server_url: embedded.server_url,
- api_key: embedded.api_key,
+ // Legacy/shared api_key if the installer carried one; empty
+ // otherwise (the SPEC-016 path enrolls for a per-machine cak_).
+ api_key: embedded.api_key.unwrap_or_default(),
+ enrollment_key: embedded.enrollment_key,
+ site_code: embedded.site_code,
  agent_id: generate_agent_id(),
  hostname_override: None,
  company: embedded.company,
  site: embedded.site,
+ department: embedded.department,
+ device_type: embedded.device_type,
  tags: embedded.tags,
  support_code: None,
  capture: CaptureConfig::default(),
  encoding: EncodingConfig::default(),
  };
- // Save to file for persistence (so agent_id is preserved)
+ // Save to file for persistence (so agent_id is preserved). The
+ // #[serde(skip)] enrollment fields are intentionally NOT written to
+ // the on-disk TOML — they are install-time material only.
  let _ = config.save();
  return Ok(config);
  }
@@ -349,8 +424,12 @@ impl Config {
  let _ = config.save();
  }
- // support_code is always None when loading from file (set via CLI)
+ // support_code is always None when loading from file (set via CLI).
  config.support_code = None;
+ // The enrollment fields are #[serde(skip)], so a file never carries
+ // them; layer them in from the environment for testing / a
+ // file-delivered managed install that supplies them out-of-band.
+ apply_enrollment_env(&mut config);
  return Ok(config);
  }
@@ -365,18 +444,23 @@ impl Config {
  let agent_id =
  std::env::var("GURUCONNECT_AGENT_ID").unwrap_or_else(|_| generate_agent_id());
- let config = Config {
+ let mut config = Config {
  server_url,
  api_key,
+ enrollment_key: None,
+ site_code: None,
  agent_id,
  hostname_override: std::env::var("GURUCONNECT_HOSTNAME").ok(),
  company: None,
  site: None,
+ department: None,
+ device_type: None,
  tags: Vec::new(),
  support_code: None,
  capture: CaptureConfig::default(),
  encoding: EncodingConfig::default(),
  };
+ apply_enrollment_env(&mut config);
  // Save config with generated agent_id for persistence
  let _ = config.save();
@@ -384,6 +468,34 @@ impl Config {
  Ok(config)
  }
+ /// Derive the HTTPS API base (e.g. `https://connect.example.com`) from the
+ /// agent's WebSocket `server_url` (e.g. `wss://connect.example.com/ws/agent`).
+ ///
+ /// `/api/enroll` is REST/HTTPS while the persistent transport is `wss`, so we
+ /// reuse the same host/authority and swap scheme + drop the WS path. Mapping:
+ /// `wss` -> `https`, `ws` -> `http` (dev). Returns an error if `server_url`
+ /// has no parseable host.
+ pub fn https_base(&self) -> Result<String> {
+ let parsed = url::Url::parse(&self.server_url)
+ .with_context(|| format!("invalid server_url: {}", self.server_url))?;
+ let scheme = match parsed.scheme() {
+ "wss" | "https" => "https",
+ "ws" | "http" => "http",
+ other => {
+ return Err(anyhow!(
+ "unsupported server_url scheme '{other}' (expected ws/wss)"
+ ))
+ }
+ };
+ let host = parsed
+ .host_str()
+ .ok_or_else(|| anyhow!("server_url has no host: {}", self.server_url))?;
+ Ok(match parsed.port() {
+ Some(port) => format!("{scheme}://{host}:{port}"),
+ None => format!("{scheme}://{host}"),
+ })
+ }
  /// Get the configuration file path
  fn config_path() -> PathBuf {
  // Check for config in current directory first
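
Reviewer note: the scheme mapping that `https_base` performs can be illustrated with a std-only sketch. This is not the agent's implementation, which parses with the `url` crate. The helper name `api_base_from_ws` and its simplified string splitting are assumptions made for illustration, and the sketch ignores userinfo, IPv6 literals, and default-port handling.

```rust
/// Hypothetical std-only approximation of `Config::https_base`.
/// Maps wss/https -> https and ws/http -> http, keeps the authority
/// (host plus optional port), and drops the path, query, and fragment.
fn api_base_from_ws(server_url: &str) -> Result<String, String> {
    // Split "scheme://rest"; anything without "://" is rejected.
    let (scheme, rest) = server_url
        .split_once("://")
        .ok_or_else(|| format!("invalid server_url: {server_url}"))?;

    let mapped = match scheme {
        "wss" | "https" => "https",
        "ws" | "http" => "http",
        other => return Err(format!("unsupported server_url scheme '{other}'")),
    };

    // The authority is everything up to the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if authority.is_empty() {
        return Err(format!("server_url has no host: {server_url}"));
    }

    Ok(format!("{mapped}://{authority}"))
}
```

Under these assumptions, `api_base_from_ws("wss://connect.example.com/ws/agent")` yields `https://connect.example.com`, and a dev URL such as `ws://localhost:3002/ws` keeps its explicit port.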


@@ -0,0 +1,413 @@
//! At-rest storage for the per-machine operating credential (`cak_`).
//!
//! SPEC-016 Phase B, item 4 + §Security. The `cak_` minted by `/api/enroll` is
//! the high-sensitivity, per-machine, independently-revocable operating
//! credential. It is stored with **two independent layers** (Mike's locked
//! decision — "BOTH layers"):
//!
//! 1. **DPAPI-machine encryption** (`CryptProtectData` with
//! `CRYPTPROTECT_LOCAL_MACHINE`): the on-disk bytes are a DPAPI blob keyed to
//! THIS machine. A copied/exfiltrated file is inert on any other box — DPAPI
//! machine keys do not leave the machine.
//! 2. **SYSTEM/Administrators-only ACL** on the containing directory + file: a
//! non-admin user cannot even read the ciphertext. Inheritance is removed and
//! only `SYSTEM` and `BUILTIN\Administrators` are granted full control.
//!
//! Local admin / SYSTEM can always recover the value — that is accepted (SPEC-016
//! §Security): the blast radius of one leaked `cak_` is a single, independently
//! revocable machine.
//!
//! Storage location (chosen over an HKLM value): a file under
//! `%ProgramData%\GuruConnect\credentials\agent.cak`. Rationale — the agent
//! already keeps its config and the `machine_uid` fallback seed under
//! `%ProgramData%\GuruConnect`, so co-locating keeps a single protected
//! directory; and a directory/file ACL applied via `icacls` is auditable with far
//! less unsafe FFI than building a registry-key security descriptor by hand. Both
//! storage shapes are explicitly permitted by the spec.
//!
//! SECURITY: the plaintext `cak_` is NEVER logged. Errors describe the operation,
//! not the value.
#![cfg(windows)]
use anyhow::{anyhow, Context, Result};
use std::path::PathBuf;
use thiserror::Error;
/// Failure classes for [`load_cak`], so callers can distinguish an *operational*
/// problem (the file exists but this process cannot open/read it — e.g. running in
/// the wrong security context against a SYSTEM-only-ACL'd store) from the real
/// *tamper / wrong-machine* signal (the file was read successfully but DPAPI
/// decryption failed).
///
/// The distinction matters for the run-mode resolver (`main.rs`):
/// - [`LoadCakError::Io`] is recoverable/actionable — log it and STOP (do not
/// silently re-enroll over a store we simply can't read in this context).
/// - [`LoadCakError::Decrypt`] is a hard tamper signal — STOP, do not re-enroll.
#[derive(Debug, Error)]
pub enum LoadCakError {
/// The store path could not be resolved (e.g. `%ProgramData%` unset).
#[error("could not resolve credential store path: {0}")]
Path(String),
/// An IO/open/read error reaching the stored blob — INCLUDING
/// `PermissionDenied` (the running context lacks rights to the SYSTEM-only
/// store). Operational, not a tamper signal.
#[error("credential store is present but could not be read in this context: {source}")]
Io {
/// Whether this was specifically an access-denied error (drives the
/// run-mode fail-fast guard in `main.rs`).
permission_denied: bool,
source: std::io::Error,
},
/// The blob was read successfully but DPAPI decryption FAILED — the real
/// tamper / wrong-machine / corruption signal. A hard stop; never re-enroll.
#[error("stored credential failed to decrypt (wrong machine, tampered, or corrupted): {0}")]
Decrypt(String),
}
/// Directory holding the protected credential file.
fn credentials_dir() -> Result<PathBuf> {
let program_data =
std::env::var("ProgramData").context("ProgramData environment variable is not set")?;
Ok(PathBuf::from(program_data)
.join("GuruConnect")
.join("credentials"))
}
/// Full path to the DPAPI-encrypted `cak_` blob.
fn cak_path() -> Result<PathBuf> {
Ok(credentials_dir()?.join("agent.cak"))
}
/// Persist `cak` encrypted at rest.
///
/// Ordering is security-critical (H2 — TOCTOU): the directory ACL is locked
/// BEFORE any secret bytes touch the filesystem, and the temp file is written
/// INSIDE the already-locked directory, so no ciphertext ever exists at a path
/// carrying an inherited (potentially world-readable) ACL:
///
/// 1. `create_dir_all(dir)` — ensure the directory exists.
/// 2. `lock_down_acl(dir)` — remove inherited ACEs and grant SYSTEM +
/// Administrators full control, made inheritable `(OI)(CI)` so children
/// created afterward are covered. This is an explicit precondition for the
/// write that follows — NOT an unstated inheritance assumption.
/// 3. DPAPI-machine-encrypt the plaintext.
/// 4. Write the ciphertext to a temp file inside the now-locked directory, then
/// rename over the target (atomic-ish replace).
/// 5. `lock_down_acl(file)` — assert the file's own ACL (belt-and-suspenders; the
/// file already inherits the directory's restrictive ACEs).
/// 6. C1 read-back: immediately attempt [`load_cak`] to PROVE the running
/// security context can read its own store. If it cannot (e.g. a non-SYSTEM
/// run wrote a SYSTEM-only store it can no longer read), fail HERE at enroll
/// time with an actionable error — rather than silently bricking on the next
/// boot when the steady-state path tries to load it.
///
/// Returns an error (never logs the plaintext) on any failure so the caller can
/// surface it / retry.
pub fn store_cak(cak: &str) -> Result<()> {
// 1 + 2: lock the directory ACL BEFORE writing any secret (H2 / TOCTOU).
let dir = credentials_dir()?;
std::fs::create_dir_all(&dir)
.with_context(|| format!("failed to create credentials dir {dir:?}"))?;
lock_down_acl(&dir).context("failed to restrict credentials directory ACL")?;
// 3: encrypt only after the destination directory is locked down.
let ciphertext = dpapi_protect(cak.as_bytes()).context("DPAPI encryption of cak_ failed")?;
// 4: write the temp file INSIDE the already-locked directory, then rename.
let path = cak_path()?;
let tmp = path.with_extension("cak.tmp");
std::fs::write(&tmp, &ciphertext)
.with_context(|| format!("failed to write temp credential file {tmp:?}"))?;
std::fs::rename(&tmp, &path)
.with_context(|| format!("failed to place credential file {path:?}"))?;
// 5: assert the file ACL too (the file already inherits the dir's ACEs).
lock_down_acl(&path).context("failed to restrict credential file ACL")?;
// 6: C1 read-back — confirm THIS context can read back what it just wrote.
// Catches the "wrote a SYSTEM-only store from a non-SYSTEM context" footgun at
// enroll time instead of as a silent brick on the next launch.
match load_cak() {
Ok(Some(_)) => {
tracing::info!("[ENROLL] stored per-machine credential (encrypted at rest)");
Ok(())
}
Ok(None) => Err(anyhow!(
"stored the credential but read-back returned nothing — refusing to proceed \
with an unverifiable credential store"
)),
Err(LoadCakError::Io {
permission_denied: true,
..
}) => Err(anyhow!(
"[ENROLL] wrote the credential store but cannot read it back in THIS security \
context (access denied). The store is ACL'd to SYSTEM + Administrators by \
design; the managed agent must run as the GuruConnect SYSTEM service (see \
SPEC-018) to read it. Refusing to leave an unreadable store behind."
)),
Err(e) => Err(anyhow::Error::new(e)
.context("stored the credential but the immediate read-back verification failed")),
}
}
/// Load and decrypt the stored `cak_`, or `Ok(None)` if no credential is stored.
///
/// Error classification (M1) — the caller MUST treat these differently:
/// - `Ok(None)` -> no store yet (NotFound or empty); enroll is fine.
/// - [`LoadCakError::Io`] -> the store exists but is unreadable in this
/// context (open/read error, INCLUDING access-denied). Operational; the caller
/// logs it and STOPS — it must NOT silently re-enroll over a store it merely
/// cannot read here.
/// - [`LoadCakError::Decrypt`] -> the bytes were read but DPAPI decryption
/// FAILED (wrong machine / tampered / corrupted). A hard tamper signal; STOP.
///
/// Only a successful READ whose decrypt fails is the tamper signal — an IO or
/// permission error is never conflated with tamper.
pub fn load_cak() -> std::result::Result<Option<String>, LoadCakError> {
let path = cak_path().map_err(|e| LoadCakError::Path(e.to_string()))?;
let ciphertext = match std::fs::read(&path) {
Ok(bytes) => bytes,
Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(None),
Err(e) => {
let permission_denied = e.kind() == std::io::ErrorKind::PermissionDenied;
return Err(LoadCakError::Io {
permission_denied,
source: e,
});
}
};
if ciphertext.is_empty() {
return Ok(None);
}
// Reaching here means the READ succeeded — so a decrypt failure now IS the real
// tamper / wrong-machine signal (never conflated with an IO/permission error).
let plaintext =
dpapi_unprotect(&ciphertext).map_err(|e| LoadCakError::Decrypt(e.to_string()))?;
let cak = String::from_utf8(plaintext)
.map_err(|e| LoadCakError::Decrypt(format!("decrypted bytes were not valid UTF-8: {e}")))?;
if cak.is_empty() {
return Ok(None);
}
Ok(Some(cak))
}
/// Remove the stored credential (e.g. on revocation / forced re-enroll).
/// Succeeds if the file is already absent.
///
/// Part of the store/load/clear API the spec requires (SPEC-016 item 4). Not yet
/// called from a code path — the relay-side `cak_` revocation / forced re-enroll
/// flow that drives it is the deferred SPEC-016 Phase B/D server work (the
/// `TODO(SPEC-016 Phase B/D): consider revoking existing cak_ on collision` note
/// in `server/src/api/enroll.rs`) — so it is retained as part of the complete
/// store API and explicitly allowed dead until that server work lands.
#[allow(dead_code)]
pub fn clear_cak() -> Result<()> {
let path = cak_path()?;
match std::fs::remove_file(&path) {
Ok(()) => {
tracing::info!("[ENROLL] cleared stored per-machine credential");
Ok(())
}
Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(()),
Err(e) => Err(e).with_context(|| format!("failed to remove {path:?}")),
}
}
// ---------------------------------------------------------------------------
// DPAPI (machine scope)
// ---------------------------------------------------------------------------
/// DPAPI-machine-encrypt `plaintext` into a self-contained blob.
fn dpapi_protect(plaintext: &[u8]) -> Result<Vec<u8>> {
use windows::Win32::Security::Cryptography::{
CryptProtectData, CRYPTPROTECT_LOCAL_MACHINE, CRYPT_INTEGER_BLOB,
};
// CryptProtectData requires a mutable input pointer in the struct, though it
// does not modify the bytes; copy into a local Vec to get a *mut without
// aliasing the caller's slice.
let mut input = plaintext.to_vec();
let in_blob = CRYPT_INTEGER_BLOB {
cbData: u32::try_from(input.len()).context("plaintext too large for DPAPI")?,
pbData: input.as_mut_ptr(),
};
let mut out_blob = CRYPT_INTEGER_BLOB::default();
// SAFETY: in_blob points at a valid, sized buffer; out_blob is owned here and
// its pbData is allocated by DPAPI (freed via LocalFree below). No prompt
// struct / entropy / reserved args.
unsafe {
CryptProtectData(
&in_blob,
windows::core::PCWSTR::null(),
None,
None,
None,
CRYPTPROTECT_LOCAL_MACHINE,
&mut out_blob,
)
.context("CryptProtectData failed")?;
}
let result = copy_and_free_blob(&out_blob);
// Best-effort scrub of the transient plaintext copy.
input.iter_mut().for_each(|b| *b = 0);
result.ok_or_else(|| anyhow!("CryptProtectData returned an empty/invalid blob"))
}
/// DPAPI-decrypt a blob previously produced by [`dpapi_protect`] on this machine.
fn dpapi_unprotect(ciphertext: &[u8]) -> Result<Vec<u8>> {
use windows::Win32::Security::Cryptography::{
CryptUnprotectData, CRYPTPROTECT_LOCAL_MACHINE, CRYPT_INTEGER_BLOB,
};
let mut input = ciphertext.to_vec();
let in_blob = CRYPT_INTEGER_BLOB {
cbData: u32::try_from(input.len()).context("ciphertext too large for DPAPI")?,
pbData: input.as_mut_ptr(),
};
let mut out_blob = CRYPT_INTEGER_BLOB::default();
// SAFETY: as in dpapi_protect — valid sized input, owned output freed below.
unsafe {
CryptUnprotectData(
&in_blob,
None,
None,
None,
None,
CRYPTPROTECT_LOCAL_MACHINE,
&mut out_blob,
)
.context("CryptUnprotectData failed")?;
}
copy_and_free_blob(&out_blob)
.ok_or_else(|| anyhow!("CryptUnprotectData returned an empty/invalid blob"))
}
/// Copy a DPAPI output blob into an owned `Vec` and `LocalFree` the DPAPI buffer.
///
/// Returns `Some(bytes)` on success, `None` if the blob is null/empty. Always
/// frees `pbData` when non-null (DPAPI allocates it with `LocalAlloc`).
fn copy_and_free_blob(
blob: &windows::Win32::Security::Cryptography::CRYPT_INTEGER_BLOB,
) -> Option<Vec<u8>> {
use windows::Win32::Foundation::{LocalFree, HLOCAL};
if blob.pbData.is_null() {
return None;
}
// SAFETY: DPAPI guarantees pbData points at cbData valid bytes on success.
let bytes = unsafe { std::slice::from_raw_parts(blob.pbData, blob.cbData as usize).to_vec() };
// SAFETY: pbData was allocated by DPAPI via LocalAlloc; free it once.
unsafe {
let _ = LocalFree(HLOCAL(blob.pbData as *mut core::ffi::c_void));
}
if bytes.is_empty() {
None
} else {
Some(bytes)
}
}
// ---------------------------------------------------------------------------
// ACL hardening
// ---------------------------------------------------------------------------
/// Restrict `path` (file or directory) to SYSTEM + Administrators full control,
/// removing inherited ACEs so a permissive parent grant cannot leak read access.
///
/// Implemented via `icacls` — the documented, auditable mechanism — rather than
/// hand-rolling a security descriptor through `SetNamedSecurityInfoW` (hundreds
/// of lines of SID/ACL FFI). `icacls` ships on every supported Windows target.
/// A failure here is surfaced (the caller treats inability to lock down the
/// credential store as a hard error). The principals are passed as the
/// well-known SIDs `*S-1-5-18` (LocalSystem) and `*S-1-5-32-544`
/// (BUILTIN\Administrators), which are language- and locale-independent, so
/// this does not break on localized Windows.
fn lock_down_acl(path: &std::path::Path) -> Result<()> {
use std::os::windows::process::CommandExt;
use std::process::Command;
const CREATE_NO_WINDOW: u32 = 0x0800_0000;
let path_str = path
.to_str()
.ok_or_else(|| anyhow!("credential path is not valid UTF-8: {path:?}"))?;
// /inheritance:r -> remove inherited ACEs (drop the permissive parent grant)
// /grant:r -> replace any existing explicit grants for the principal
// *S-1-5-18 -> LocalSystem; *S-1-5-32-544 -> BUILTIN\Administrators
let output = Command::new("icacls")
.arg(path_str)
.args([
"/inheritance:r",
"/grant:r",
"*S-1-5-18:(OI)(CI)F",
"/grant:r",
"*S-1-5-32-544:(OI)(CI)F",
])
.creation_flags(CREATE_NO_WINDOW)
.output()
.context("failed to invoke icacls to harden credential ACL")?;
if !output.status.success() {
// icacls writes its diagnostics to stdout; surface the code only (no
// credential material is ever passed to icacls, only the path).
return Err(anyhow!(
"icacls failed to harden {path_str} (exit {:?})",
output.status.code()
));
}
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
/// DPAPI round-trips on the same machine: protect then unprotect must recover
/// the exact plaintext. (Runs on the build/test host, which IS the same
/// machine — the machine-scope key is available to any process here.)
#[test]
fn dpapi_roundtrip_recovers_plaintext() {
let secret = b"cak_test_value_0123456789abcdef";
let blob = dpapi_protect(secret).expect("DPAPI protect should succeed on this machine");
assert_ne!(
blob.as_slice(),
secret.as_slice(),
"ciphertext must differ from plaintext"
);
let recovered = dpapi_unprotect(&blob).expect("DPAPI unprotect should succeed");
assert_eq!(recovered, secret, "round-trip must recover the exact bytes");
}
/// Plaintexts of several lengths (1 byte, 4 bytes, 256 zero bytes) round-trip
/// exactly. Empty input is not exercised here: `copy_and_free_blob` maps an
/// empty output blob to `None`, so `dpapi_unprotect` would report it as an error.
#[test]
fn dpapi_roundtrip_handles_varied_lengths() {
for plaintext in [b"x".as_slice(), b"cak_".as_slice(), &[0u8; 256]] {
let blob = dpapi_protect(plaintext).expect("protect");
let back = dpapi_unprotect(&blob).expect("unprotect");
assert_eq!(back.as_slice(), plaintext);
}
}
/// Tampering with the ciphertext must make decryption FAIL rather than return
/// garbage — DPAPI authenticates its blobs.
#[test]
fn dpapi_rejects_tampered_blob() {
let mut blob = dpapi_protect(b"cak_tamper_target").expect("protect");
// Flip a byte in the middle of the blob.
let mid = blob.len() / 2;
blob[mid] ^= 0xFF;
assert!(
dpapi_unprotect(&blob).is_err(),
"a tampered DPAPI blob must fail to decrypt"
);
}
}
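
Reviewer note: the three-way classification documented on `load_cak` (no store, unreadable store, failed decrypt) can be sketched as a caller-side decision. The run-mode resolver in `main.rs` is not part of this diff, so `NextStep` and `resolve` below are hypothetical names, the enum is a stand-in for `LoadCakError` without `thiserror`, and treating `Path` as a stop is an assumption.

```rust
use std::io;

/// Stand-in for `credential_store::LoadCakError` (same variants, no `thiserror`).
#[allow(dead_code)]
#[derive(Debug)]
enum LoadCakError {
    Path(String),
    Io { permission_denied: bool, source: io::Error },
    Decrypt(String),
}

/// Hypothetical decision type for the run-mode resolver.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// A stored `cak_` was loaded; connect with it.
    UseStored,
    /// No store yet; first-run enrollment is fine.
    Enroll,
    /// Do not connect and do not re-enroll; carries a non-secret reason.
    Stop(&'static str),
}

/// Map the result of `load_cak()` to the next action, following the documented
/// rule that only `Ok(None)` may lead to enrollment.
fn resolve(loaded: Result<Option<String>, LoadCakError>) -> NextStep {
    match loaded {
        Ok(Some(_cak)) => NextStep::UseStored,
        Ok(None) => NextStep::Enroll,
        // Operational: the store exists but this context cannot read it.
        Err(LoadCakError::Io { permission_denied: true, .. }) => {
            NextStep::Stop("store unreadable in this security context (access denied)")
        }
        Err(LoadCakError::Io { .. }) => NextStep::Stop("store unreadable (I/O error)"),
        // Tamper signal: the read succeeded but decryption failed.
        Err(LoadCakError::Decrypt(_)) => NextStep::Stop("stored credential failed to decrypt"),
        // Assumption: an unresolvable store path is also treated as a stop.
        Err(LoadCakError::Path(_)) => NextStep::Stop("credential store path unresolved"),
    }
}
```

The point of the sketch is that an I/O or permission failure never falls through to `Enroll`, so a store that is merely unreadable in the current context is not overwritten by a fresh enrollment.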

384
agent/src/enroll.rs Normal file

@@ -0,0 +1,384 @@
//! First-run self-enrollment client (SPEC-016 Phase B, item 4).
//!
//! When the agent runs as a persistent (`PermanentAgent`) install with NO stored
//! `cak_` but WITH an `enrollment_key` + `site_code`, it walks through the
//! public, unauthenticated `POST /api/enroll` door: it presents its site
//! credentials and its hardware-derived `machine_uid`, and — on success — the
//! server mints and returns a per-machine `cak_` operating credential exactly
//! once. The agent persists that `cak_` encrypted at rest
//! ([`crate::credential_store`]) and connects with it; on every later run it uses
//! the stored `cak_` directly and never re-enrolls.
//!
//! Server contract consumed (must match `server/src/api/enroll.rs`):
//! - Request: `{ site_code, enrollment_key, machine_uid, hostname,
//! labels:{company,site,department,device_type,tags} }`.
//! - `201 Created` -> new enrollment; body has `key` (the `cak_`).
//! - `200 OK` -> reuse (re-image / re-install); body has `key`.
//! - `202 Accepted` -> `collision_pending`; NO key — operator must confirm in
//! the dashboard before the endpoint can connect.
//! - `401 Unauthorized` -> `ENROLL_REJECTED` (bad/rotated key or unknown site):
//! terminal-ish config problem, back off long.
//! - `409 Conflict` -> `ENROLL_SITE_CONFLICT` (machine bound to another site):
//! terminal-ish, requires the operator reassignment flow; back off long.
//! - `429 Too Many Requests` -> rate-limited; back off and retry.
//!
//! SECURITY: never log the `enrollment_key` or the minted `cak_`. Only states,
//! dispositions, and the (non-secret) `machine_uid`/`site_code` are logged.
use anyhow::{anyhow, Context, Result};
use serde::{Deserialize, Serialize};
use std::time::Duration;
use crate::config::Config;
/// `POST /api/enroll` request body — mirrors `enroll::EnrollRequest`.
#[derive(Debug, Serialize)]
struct EnrollRequest<'a> {
site_code: &'a str,
enrollment_key: &'a str,
machine_uid: &'a str,
hostname: &'a str,
labels: EnrollLabels<'a>,
}
/// Labels carried at enrollment — mirrors `enroll::EnrollLabels`.
#[derive(Debug, Serialize)]
struct EnrollLabels<'a> {
#[serde(skip_serializing_if = "Option::is_none")]
company: Option<&'a str>,
#[serde(skip_serializing_if = "Option::is_none")]
site: Option<&'a str>,
#[serde(skip_serializing_if = "Option::is_none")]
department: Option<&'a str>,
#[serde(skip_serializing_if = "Option::is_none")]
device_type: Option<&'a str>,
#[serde(skip_serializing_if = "slice_is_empty")]
tags: &'a [String],
}
/// `skip_serializing_if` predicate for the `tags` slice — `Vec::is_empty` cannot
/// bind a `&&[String]`, so use a slice-typed helper.
fn slice_is_empty(s: &[String]) -> bool {
s.is_empty()
}
/// `POST /api/enroll` success body — mirrors `enroll::EnrollResponse`.
#[derive(Debug, Deserialize)]
struct EnrollResponse {
#[allow(dead_code)]
machine_id: String,
#[serde(default)]
key: Option<String>,
enrollment_state: String,
disposition: String,
}
/// Backoff after a retryable failure (429 / network / 5xx).
const RETRYABLE_BACKOFF: Duration = Duration::from_secs(30);
/// Backoff after a terminal-ish config failure (401 / 409) or collision-pending.
/// These won't fix themselves without operator action, so retry slowly rather
/// than hot-looping while still recovering automatically once it IS fixed.
const TERMINAL_BACKOFF: Duration = Duration::from_secs(300);
/// Drive enrollment until a `cak_` is issued, persisting it into the credential
/// store on success and loading it into `config.api_key`.
///
/// Loops with backoff across retryable failures (it must not give up — a managed
/// machine left running should eventually enroll once the server/site is healthy)
/// and across collision-pending (HTTP 202: it keeps re-checking on a slow cadence
/// until an operator confirms the endpoint in the dashboard and the server begins
/// issuing a key). Returns `Ok(())` only once a `cak_` is stored. The only `Err`
/// returns are unrecoverable local faults (missing config, an un-persistable
/// credential) — network/HTTP failures are retried, never propagated.
pub async fn run_enrollment(config: &mut Config) -> Result<()> {
let site_code = config
.site_code
.clone()
.ok_or_else(|| anyhow!("enrollment requested but no site_code is configured"))?;
let enrollment_key = config
.enrollment_key
.clone()
.ok_or_else(|| anyhow!("enrollment requested but no enrollment_key is configured"))?;
let https_base = config.https_base()?;
let machine_uid = crate::identity::machine_uid();
let hostname = config.hostname();
tracing::info!(
"[ENROLL] first-run enrollment: site_code={} machine_uid={} hostname={}",
site_code,
machine_uid,
hostname
);
loop {
match attempt_enroll(
&https_base,
&site_code,
&enrollment_key,
&machine_uid,
&hostname,
config,
)
.await
{
Ok(AttemptResult::Issued(cak)) => {
// Persist encrypted-at-rest (Windows only; the store call is compiled out on
// other targets), then load into the live config so the transport
// authenticates with the new per-machine credential.
#[cfg(windows)]
crate::credential_store::store_cak(&cak)
.context("failed to persist issued cak_ to the credential store")?;
config.api_key = cak;
// Enrollment material is single-use; drop it so it is not retained
// in memory or accidentally reused.
config.enrollment_key = None;
tracing::info!("[ENROLL] enrollment complete; connecting with per-machine key");
return Ok(());
}
Ok(AttemptResult::Pending) => {
tracing::warn!(
"[ENROLL] pending operator confirmation (machine_uid collision); \
this machine cannot connect until confirmed in the dashboard. \
Re-checking in {}s.",
TERMINAL_BACKOFF.as_secs()
);
tokio::time::sleep(TERMINAL_BACKOFF).await;
}
Err(AttemptError::Terminal(msg)) => {
tracing::error!(
"[ENROLL] enrollment refused (operator action required): {msg}. \
Retrying in {}s.",
TERMINAL_BACKOFF.as_secs()
);
tokio::time::sleep(TERMINAL_BACKOFF).await;
}
Err(AttemptError::Retryable(msg)) => {
tracing::warn!(
"[ENROLL] transient enrollment failure: {msg}. Retrying in {}s.",
RETRYABLE_BACKOFF.as_secs()
);
tokio::time::sleep(RETRYABLE_BACKOFF).await;
}
}
}
}
/// Result of one HTTP enrollment attempt.
enum AttemptResult {
/// A `cak_` was issued (201/200). Carries the plaintext (never logged).
Issued(String),
/// Collision-gated (202): no key issued.
Pending,
}
/// Failure classes that drive the backoff policy.
enum AttemptError {
/// 401/409 — won't fix without operator action; back off long but keep trying.
Terminal(String),
/// 429 / network / 5xx / decode — transient; short backoff.
Retryable(String),
}
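The backoff policy that `run_enrollment` applies to these outcome classes can be sketched as one pure function. This is a minimal illustration, not the agent's code: `Outcome` and `next_delay` are hypothetical names, and only the two durations are taken from the constants above.

```rust
use std::time::Duration;

/// Hypothetical flattening of `AttemptResult` / `AttemptError` into one enum.
#[derive(Debug, PartialEq)]
enum Outcome {
    Issued,
    Pending,
    Terminal,
    Retryable,
}

const RETRYABLE_BACKOFF: Duration = Duration::from_secs(30);
const TERMINAL_BACKOFF: Duration = Duration::from_secs(300);

/// Delay before the next attempt, or `None` once a key has been issued.
fn next_delay(outcome: &Outcome) -> Option<Duration> {
    match outcome {
        // A key was issued: the loop ends, nothing to wait for.
        Outcome::Issued => None,
        // Both need operator action, so both use the slow cadence.
        Outcome::Pending | Outcome::Terminal => Some(TERMINAL_BACKOFF),
        // Transient failures retry on the short cadence.
        Outcome::Retryable => Some(RETRYABLE_BACKOFF),
    }
}
```

Keeping the mapping in one place makes the "never give up, but never hot-loop" rule easy to check in a unit test without any network.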
/// Make one `POST /api/enroll` call and classify the response per the contract.
async fn attempt_enroll(
https_base: &str,
site_code: &str,
enrollment_key: &str,
machine_uid: &str,
hostname: &str,
config: &Config,
) -> std::result::Result<AttemptResult, AttemptError> {
let url = format!("{}/api/enroll", https_base.trim_end_matches('/'));
let body = EnrollRequest {
site_code,
enrollment_key,
machine_uid,
hostname,
labels: EnrollLabels {
company: config.company.as_deref().filter(|s| !s.is_empty()),
site: config.site.as_deref().filter(|s| !s.is_empty()),
department: config.department.as_deref().filter(|s| !s.is_empty()),
device_type: config.device_type.as_deref().filter(|s| !s.is_empty()),
tags: &config.tags,
},
};
let client = build_client().map_err(|e| AttemptError::Retryable(e.to_string()))?;
let response = client
.post(&url)
.json(&body)
.timeout(Duration::from_secs(30))
.send()
.await
.map_err(|e| AttemptError::Retryable(format!("request to {url} failed: {e}")))?;
let status = response.status();
match status.as_u16() {
// New (201) or reuse (200): body carries the cak_.
200 | 201 => {
let parsed: EnrollResponse = response
.json()
.await
.map_err(|e| AttemptError::Retryable(format!("malformed success body: {e}")))?;
match parsed.key {
Some(cak) if !cak.is_empty() => {
tracing::info!(
"[ENROLL] server accepted enrollment: state={} disposition={}",
parsed.enrollment_state,
parsed.disposition
);
Ok(AttemptResult::Issued(cak))
}
// 2xx with no key is contract-violating for the active path; treat
// as retryable so we don't silently spin or crash.
_ => Err(AttemptError::Retryable(format!(
"server returned {} with no key (state={}, disposition={})",
status, parsed.enrollment_state, parsed.disposition
))),
}
}
// Collision-gated: pending operator confirmation, no key.
202 => {
// The body is not decoded here; the status alone is authoritative.
Ok(AttemptResult::Pending)
}
// Bad/rotated enrollment key or unknown site code.
401 => Err(AttemptError::Terminal(
"ENROLL_REJECTED — the site code or enrollment key is invalid or rotated; \
this installer needs a current per-site key"
.to_string(),
)),
// Machine already enrolled at a different site.
409 => Err(AttemptError::Terminal(
"ENROLL_SITE_CONFLICT — this machine is already enrolled at another site; \
a deliberate move requires the operator-initiated reassignment flow"
.to_string(),
)),
// Rate-limited / locked out. A Retry-After value (delay-seconds form) is only
// surfaced in the message; the caller still sleeps RETRYABLE_BACKOFF.
429 => {
let retry_after = response
.headers()
.get(reqwest::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok());
Err(AttemptError::Retryable(match retry_after {
Some(secs) => format!("RATE_LIMITED (retry-after {secs}s)"),
None => "RATE_LIMITED".to_string(),
}))
}
// 5xx or anything else — transient from the agent's perspective.
_ => Err(AttemptError::Retryable(format!(
"unexpected enrollment response: HTTP {status}"
))),
}
}
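The 429 branch above parses `Retry-After` but only reports it in the error message. If the value were to drive the sleep, one possible shape is sketched below. The function name and the 1 s / 300 s bounds are assumptions for illustration; only the 30 s default mirrors `RETRYABLE_BACKOFF`.

```rust
use std::time::Duration;

const DEFAULT_BACKOFF: Duration = Duration::from_secs(30);
// Assumed bounds: never hot-loop on `Retry-After: 0`, and never let a server
// park the agent for longer than the slow cadence.
const MIN_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(300);

/// Turn an optional `Retry-After` header value into a sleep duration.
/// Only the delay-seconds form is understood; the HTTP-date form (or any
/// unparsable value) falls back to the default.
fn backoff_from_retry_after(header: Option<&str>) -> Duration {
    header
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .map(|d| d.clamp(MIN_BACKOFF, MAX_BACKOFF))
        .unwrap_or(DEFAULT_BACKOFF)
}
```

Clamping keeps a misbehaving or hostile server from forcing either a tight retry loop or an unbounded wait.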
/// Build the HTTP client for enrollment, matching the update path's TLS posture
/// (`rustls`, with an opt-in dev-insecure escape hatch in debug builds only).
fn build_client() -> Result<reqwest::Client> {
reqwest::Client::builder()
.danger_accept_invalid_certs(dev_insecure_tls())
.build()
.context("failed to build enrollment HTTP client")
}
/// Dev-only TLS bypass — identical policy to `update::dev_insecure_tls`: only in
/// debug builds AND only when `GURUCONNECT_DEV_INSECURE_TLS` is set. NEVER active
/// in a release build.
fn dev_insecure_tls() -> bool {
if cfg!(debug_assertions) && std::env::var("GURUCONNECT_DEV_INSECURE_TLS").is_ok() {
tracing::warn!(
"[ENROLL] TLS verification DISABLED (dev-insecure mode) — DO NOT use in production"
);
true
} else {
false
}
}
#[cfg(test)]
mod tests {
use super::*;
/// The request body must serialize to exactly the field names the Phase A
/// server deserializes (`enroll::EnrollRequest` / `EnrollLabels`). A drift here
/// is a silent enrollment failure, so pin the wire shape.
#[test]
fn request_serializes_to_the_server_contract() {
let tags = vec!["prod".to_string()];
let req = EnrollRequest {
site_code: "ACME-HQ",
enrollment_key: "cek_secret",
machine_uid: "muid_abc",
hostname: "WS-01",
labels: EnrollLabels {
company: Some("Acme"),
site: Some("HQ"),
department: Some("IT"),
device_type: Some("workstation"),
tags: &tags,
},
};
let v: serde_json::Value = serde_json::to_value(&req).unwrap();
assert_eq!(v["site_code"], "ACME-HQ");
assert_eq!(v["enrollment_key"], "cek_secret");
assert_eq!(v["machine_uid"], "muid_abc");
assert_eq!(v["hostname"], "WS-01");
assert_eq!(v["labels"]["company"], "Acme");
assert_eq!(v["labels"]["site"], "HQ");
assert_eq!(v["labels"]["department"], "IT");
assert_eq!(v["labels"]["device_type"], "workstation");
assert_eq!(v["labels"]["tags"][0], "prod");
}
/// Empty optional labels are omitted (the server defaults them), and an empty
/// tag list is not serialized — keeping the body minimal for a thin installer.
#[test]
fn request_omits_empty_optional_labels() {
let tags: Vec<String> = Vec::new();
let req = EnrollRequest {
site_code: "S",
enrollment_key: "cek_x",
machine_uid: "muid_x",
hostname: "H",
labels: EnrollLabels {
company: None,
site: None,
department: None,
device_type: None,
tags: &tags,
},
};
let v: serde_json::Value = serde_json::to_value(&req).unwrap();
let labels = v["labels"].as_object().unwrap();
assert!(!labels.contains_key("company"));
assert!(!labels.contains_key("department"));
assert!(!labels.contains_key("tags"));
}
/// The success response decoder must accept both a key-bearing active body and
/// a keyless pending body (mirrors `EnrollResponse` with `skip_serializing_if`).
#[test]
fn response_decodes_active_and_pending_shapes() {
let active: EnrollResponse = serde_json::from_str(
r#"{"machine_id":"m1","key":"cak_live","enrollment_state":"active","disposition":"new"}"#,
)
.unwrap();
assert_eq!(active.key.as_deref(), Some("cak_live"));
assert_eq!(active.enrollment_state, "active");
let pending: EnrollResponse = serde_json::from_str(
r#"{"machine_id":"m2","enrollment_state":"pending","disposition":"collision_pending"}"#,
)
.unwrap();
assert!(pending.key.is_none());
assert_eq!(pending.disposition, "collision_pending");
}
}

@@ -9,22 +9,48 @@
//! **recomputable**: the same machine yields the same id on every call with no //! **recomputable**: the same machine yields the same id on every call with no
//! persistence required. //! persistence required.
//! //!
//! - **Windows:** SHA-256 hash of the OS machine GUID read from //! - **Windows:** SHA-256 of a hardware identity string. The id is derived from
//! `HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid` (a `REG_SZ`). The raw //! the **hardware salt ONLY** whenever any durable hardware signal is readable:
//! GUID is never returned — only the opaque `muid_<hex>` derived from it. //! the **SMBIOS system UUID** (`Win32_ComputerSystemProduct.UUID`), or — when
//! - **Non-Windows (and Windows registry failure):** a random UUID persisted in //! that is absent / all-zeros / all-FFs (some OEMs/hypervisors) — the
//! the agent's data directory, read back on subsequent runs so it is stable //! **motherboard serial** (`Win32_BaseBoard.SerialNumber`) plus the **primary
//! across calls and process restarts. //! disk serial**. A fixed namespace string is mixed in for domain separation.
//! The OS machine GUID
//! (`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`, a `REG_SZ`) is used
//! ONLY as a last-resort signal when NO hardware salt is readable. The raw
//! signals are never returned — only the opaque `muid_<hex>` derived from them.
//! - **Non-Windows (and Windows with no readable signal at all):** a random UUID
//! persisted in the agent's data directory, read back on subsequent runs so it
//! is stable across calls and process restarts.
//!
//! **Stability contract (SPEC-016 item 1):**
//! - **Salted path (hardware signal present) is re-image-stable:** the digest
//! mixes only durable hardware signals (SMBIOS UUID, or board + disk serial) and
//! a fixed namespace — NOT the `MachineGuid`, which Windows regenerates on every
//! OS install/re-image. So the `machine_uid` survives both a reboot AND an OS
//! re-image on the SAME hardware (the re-image dedup goal), while distinct
//! physical boxes stay distinct.
//! - **MachineGuid-only path is the volatile floor:** when no hardware salt is
//! readable, the id anchors on the `MachineGuid` alone. This is stable across
//! reboots but NOT across a re-image (the GUID is regenerated). This degraded
//! path is logged at WARN so the server-side collision gate operator has a clue.
//! //!
//! This module deliberately does NOT change `agent_id`/`generate_agent_id`. //! This module deliberately does NOT change `agent_id`/`generate_agent_id`.
//! `machine_uid` is reported *alongside* `agent_id`; the server-side dedup that //! `machine_uid` is reported *alongside* `agent_id`; the server-side dedup that
//! consumes it is a separate task. //! consumes it lives in `POST /api/enroll` (SPEC-016 Phase A) and the relay
//! connect path.
use std::sync::OnceLock; use std::sync::OnceLock;
/// Prefix marking the value as an opaque machine-uid (vs. a raw GUID/UUID). /// Prefix marking the value as an opaque machine-uid (vs. a raw GUID/UUID).
const MUID_PREFIX: &str = "muid_"; const MUID_PREFIX: &str = "muid_";
/// Fixed namespace mixed into the hardware-salted derivation for domain
/// separation: it ties the digest to *this* identity scheme so the same raw
/// hardware serial can never collide with an unrelated digest, and it documents
/// the derivation version. It is NOT a secret — it is a constant.
const MUID_NAMESPACE: &str = "guruconnect:machine_uid:v1";
/// Cached value — `machine_uid()` reads the registry / a file, so compute once /// Cached value — `machine_uid()` reads the registry / a file, so compute once
/// and reuse for the lifetime of the process. /// and reuse for the lifetime of the process.
static MACHINE_UID: OnceLock<String> = OnceLock::new(); static MACHINE_UID: OnceLock<String> = OnceLock::new();
@@ -32,10 +58,11 @@ static MACHINE_UID: OnceLock<String> = OnceLock::new();
/// Return a deterministic, recomputable opaque machine identifier. /// Return a deterministic, recomputable opaque machine identifier.
/// ///
/// The result is non-empty and prefixed with [`MUID_PREFIX`]. It is cached after /// The result is non-empty and prefixed with [`MUID_PREFIX`]. It is cached after
/// the first call. On Windows it is derived purely from the OS machine GUID (no /// the first call. On Windows it is derived from a durable hardware salt when one
/// persistence). If the Windows registry read fails — or on any non-Windows /// is readable (re-image-stable; see the module docs), falling back to the OS
/// platform — it degrades to a persisted random UUID (today's-behavior-equivalent /// machine GUID alone (reboot-stable floor) and finally — when no signal at all is
/// stability) rather than panicking. /// readable, or on any non-Windows platform — a persisted random UUID, rather than
/// panicking.
pub fn machine_uid() -> String { pub fn machine_uid() -> String {
MACHINE_UID.get_or_init(compute_machine_uid).clone() MACHINE_UID.get_or_init(compute_machine_uid).clone()
} }
@@ -56,23 +83,265 @@ fn derive_uid(raw: &str) -> String {
#[cfg(windows)] #[cfg(windows)]
fn compute_machine_uid() -> String { fn compute_machine_uid() -> String {
// PRIMARY signal (SPEC-016 item 1): a durable hardware salt — SMBIOS system
// UUID if usable, else motherboard + disk serial. When ANY hardware salt is
// readable we derive the uid from the salt ALONE (plus a fixed namespace),
// deliberately EXCLUDING the MachineGuid: Windows regenerates the MachineGuid
// on every OS install/re-image, so mixing it in would break re-image dedup.
// The salted digest survives both reboot AND re-image on the same hardware.
if let Some(salt) = hardware_salt() {
tracing::info!("machine_uid derived from durable hardware salt (re-image-stable)");
return derive_uid(&format!("{MUID_NAMESPACE}|{salt}"));
}
// LAST-RESORT signal: no hardware salt is readable, so anchor on the OS
// MachineGuid alone. This is the volatile FLOOR — stable across reboots but
// NOT across an OS re-image (the GUID is regenerated). We WARN so the
// server-side collision-gate operator knows this endpoint's uid is not
// re-image-stable. The MachineGuid itself is never logged.
match read_machine_guid() { match read_machine_guid() {
Ok(guid) if !guid.trim().is_empty() => derive_uid(guid.trim()), Ok(guid) if !guid.trim().is_empty() => {
tracing::warn!(
"machine_uid: no durable hardware salt readable; anchoring on MachineGuid \
ONLY — this id is reboot-stable but NOT re-image-stable"
);
derive_uid(&format!("{MUID_NAMESPACE}|machineguid:{}", guid.trim()))
}
Ok(_) => { Ok(_) => {
tracing::warn!( tracing::warn!(
"MachineGuid registry value was empty; falling back to persisted machine_uid" "machine_uid: no hardware salt and MachineGuid registry value was empty; \
falling back to persisted machine_uid"
); );
persisted_uid() persisted_uid()
} }
Err(e) => { Err(e) => {
tracing::warn!( tracing::warn!(
"Failed to read MachineGuid from registry ({e}); falling back to persisted machine_uid" "machine_uid: no hardware salt and failed to read MachineGuid ({e}); \
falling back to persisted machine_uid"
); );
persisted_uid() persisted_uid()
} }
} }
} }
/// Collect the durable hardware salt for the `machine_uid` (Windows only).
///
/// This is the PRIMARY identity signal: when it returns `Some(salt)`, the caller
/// derives the uid from the salt ALONE (re-image-stable). Returns `Some(salt)`
/// where `salt` is a deterministic, normalized concatenation of usable hardware
/// signals, or `None` when nothing durable is readable (in which case the caller
/// degrades to anchoring on the MachineGuid alone — the volatile floor).
///
/// Order of preference, per SPEC-016 item 1:
/// 1. SMBIOS system UUID (`Win32_ComputerSystemProduct.UUID`) — when present and
/// not a degenerate placeholder (all-zeros / all-FFs, which some OEMs and
/// hypervisor templates emit).
/// 2. Fallback: motherboard serial (`Win32_BaseBoard.SerialNumber`) + primary
/// disk serial — combined so a single weak signal does not stand alone.
///
/// Each component is read via a narrow PowerShell CIM query (see
/// [`query_cim_property`]); the values are normalized (trimmed, upper-cased) so
/// trivial formatting drift never changes the digest.
#[cfg(windows)]
fn hardware_salt() -> Option<String> {
if let Some(uuid) = smbios_uuid() {
return Some(format!("smbios:{uuid}"));
}
// SMBIOS UUID unusable — fall back to board + disk serial. Use whichever of
// the two are readable; require at least one to be present, otherwise there
// is no durable salt and we return None.
let board = normalize_signal(query_cim_property("Win32_BaseBoard", "SerialNumber").as_deref());
let disk = primary_disk_serial();
match (board, disk) {
(Some(b), Some(d)) => Some(format!("board:{b}|disk:{d}")),
(Some(b), None) => Some(format!("board:{b}")),
(None, Some(d)) => Some(format!("disk:{d}")),
(None, None) => None,
}
}
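The preference order in `hardware_salt` can be exercised without touching WMI by passing the raw signals in as arguments. The sketch below assumes that shape: `compose_salt`, `normalize` and `is_placeholder_uuid` are illustrative names, not functions from the agent.

```rust
/// Trim and upper-case a raw signal; `None` if absent or blank.
fn normalize(raw: Option<&str>) -> Option<String> {
    let v = raw?.trim();
    if v.is_empty() {
        None
    } else {
        Some(v.to_uppercase())
    }
}

/// True for an all-zeros or all-Fs UUID (dashes ignored). Expects upper-case input.
fn is_placeholder_uuid(uuid: &str) -> bool {
    let hex: Vec<char> = uuid.chars().filter(|c| *c != '-').collect();
    hex.is_empty() || hex.iter().all(|c| *c == '0') || hex.iter().all(|c| *c == 'F')
}

/// SMBIOS UUID if usable, else board and/or disk serial, else `None`.
fn compose_salt(
    smbios: Option<&str>,
    board: Option<&str>,
    disk: Option<&str>,
) -> Option<String> {
    if let Some(uuid) = normalize(smbios).filter(|u| !is_placeholder_uuid(u)) {
        return Some(format!("smbios:{uuid}"));
    }
    match (normalize(board), normalize(disk)) {
        (Some(b), Some(d)) => Some(format!("board:{b}|disk:{d}")),
        (Some(b), None) => Some(format!("board:{b}")),
        (None, Some(d)) => Some(format!("disk:{d}")),
        (None, None) => None,
    }
}
```

The `smbios:` / `board:` / `disk:` tags stop a board serial from ever hashing to the same input as an identical-looking disk serial.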
/// The SMBIOS system UUID, or `None` if absent or a degenerate placeholder.
///
/// Some OEMs ship an all-zeros UUID and some hypervisor templates clone an
/// all-FFs (or all-zeros) UUID; either is worthless as a distinguishing signal,
/// so we reject both and let the caller fall back to board/disk serial.
#[cfg(windows)]
fn smbios_uuid() -> Option<String> {
let raw =
normalize_signal(query_cim_property("Win32_ComputerSystemProduct", "UUID").as_deref())?;
// Reject degenerate placeholders (ignoring dashes): all-zeros or all-FFs.
let hex: String = raw.chars().filter(|c| *c != '-').collect();
let all_zero = !hex.is_empty() && hex.chars().all(|c| c == '0');
let all_ff = !hex.is_empty() && hex.chars().all(|c| c == 'F');
if hex.is_empty() || all_zero || all_ff {
tracing::debug!("SMBIOS UUID is absent or a degenerate placeholder; using fallback salt");
return None;
}
Some(raw)
}
/// The serial number of the primary (boot/index-0) physical disk, normalized.
///
/// Prefers the disk whose `Index == 0` (the conventional boot disk); falls back
/// to the first disk that reports any serial. Returns `None` if no disk reports a
/// usable serial.
#[cfg(windows)]
fn primary_disk_serial() -> Option<String> {
// One narrow query: index + serial for all physical disks, sorted by index,
// emitted as `index<TAB>serial` lines. Parse the lowest-index non-empty serial.
let script = "Get-CimInstance -ClassName Win32_DiskDrive | \
Sort-Object Index | \
ForEach-Object { \"$($_.Index)`t$($_.SerialNumber)\" }";
let out = run_powershell(script)?;
for line in out.lines() {
let mut parts = line.splitn(2, '\t');
let _index = parts.next();
if let Some(serial) = parts.next() {
if let Some(n) = normalize_signal(Some(serial)) {
return Some(n);
}
}
}
None
}
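The line parsing in `primary_disk_serial` is pure and can be pulled out for testing. This is a rough sketch assuming the same `index<TAB>serial` output, already sorted by index; `first_disk_serial` is a hypothetical helper name.

```rust
/// Pick the first non-blank serial from `index<TAB>serial` lines.
/// Lines without a tab, or with a blank serial, are skipped.
fn first_disk_serial(output: &str) -> Option<String> {
    output.lines().find_map(|line| {
        let (_index, serial) = line.split_once('\t')?;
        let serial = serial.trim();
        if serial.is_empty() {
            None
        } else {
            // Same normalization as the other signals: trimmed, upper-cased.
            Some(serial.to_uppercase())
        }
    })
}
```

Because the input is sorted by index, "first non-blank serial" means the boot disk is preferred and later disks are used only when earlier ones report nothing.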
/// Read a single property of a single-instance CIM class via PowerShell.
///
/// Returns the first non-empty line of output, trimmed but not yet normalized, or `None`. This is
/// a deliberately narrow shell-out rather than a full WMI/COM binding: the agent
/// already has no WMI crate, and a COM `IWbemServices` binding for two scalar
/// reads would be far more code and unsafe surface for no benefit. PowerShell's
/// CIM cmdlets are present on every supported Windows target (7 SP1+/2008 R2+
/// ship WMI; CIM cmdlets ship from PowerShell 3.0 / WMF 3.0, universally present
/// on currently-supported builds).
#[cfg(windows)]
fn query_cim_property(class: &str, property: &str) -> Option<String> {
// `(Get-CimInstance -ClassName X).Property` — single scalar, no formatting.
let script = format!("(Get-CimInstance -ClassName {class}).{property}");
let out = run_powershell(&script)?;
out.lines()
.map(str::trim)
.find(|l| !l.is_empty())
.map(str::to_string)
}
/// Wall-clock bound on a single PowerShell hardware-signal query.
///
/// A wedged WMI/CIM provider can hang indefinitely; without a bound that would
/// hang agent startup forever. On timeout we kill the child and treat the signal
/// as missing (fall back through the chain) — never panic.
#[cfg(windows)]
const POWERSHELL_QUERY_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(10);
/// Run a short PowerShell snippet and capture stdout, or `None` on any failure
/// (including a wall-clock timeout).
///
/// Hidden window (`CREATE_NO_WINDOW`) so an interactive desktop never flashes a
/// console; `-NonInteractive -NoProfile` for determinism and speed. The call is
/// spawned and waited on with a [`POWERSHELL_QUERY_TIMEOUT`] bound so a stuck WMI
/// provider cannot wedge startup; on timeout the child is killed and the signal is
/// treated as missing. Never logs the captured output (it carries hardware
/// identifiers).
#[cfg(windows)]
fn run_powershell(script: &str) -> Option<String> {
use std::io::Read;
use std::os::windows::process::CommandExt;
use std::process::{Command, Stdio};
use std::time::Instant;
// CREATE_NO_WINDOW — avoid a console flash on the interactive desktop.
const CREATE_NO_WINDOW: u32 = 0x0800_0000;
let mut child = match Command::new("powershell.exe")
.args([
"-NonInteractive",
"-NoProfile",
"-ExecutionPolicy",
"Bypass",
"-Command",
script,
])
.stdin(Stdio::null())
.stdout(Stdio::piped())
.stderr(Stdio::null())
.creation_flags(CREATE_NO_WINDOW)
.spawn()
{
Ok(c) => c,
Err(e) => {
tracing::debug!("could not run hardware-signal query ({e}); ignoring this signal");
return None;
}
};
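// Poll for exit with a wall-clock bound. We spin with a short sleep rather than
// a reader thread: the queries are infrequent (startup only) and the loop keeps
// the timeout logic simple and panic-free. Stdout is drained only after exit,
// so this assumes the output fits in the OS pipe buffer, which should hold for
// the short scalar results queried here.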
let deadline = Instant::now() + POWERSHELL_QUERY_TIMEOUT;
let status = loop {
match child.try_wait() {
Ok(Some(status)) => break status,
Ok(None) => {
if Instant::now() >= deadline {
// Wedged provider: kill and treat as a missing signal.
let _ = child.kill();
let _ = child.wait();
tracing::debug!(
"hardware-signal query exceeded {}s timeout; killed and ignoring this signal",
POWERSHELL_QUERY_TIMEOUT.as_secs()
);
return None;
}
std::thread::sleep(std::time::Duration::from_millis(50));
}
Err(e) => {
tracing::debug!("error waiting on hardware-signal query ({e}); ignoring");
let _ = child.kill();
let _ = child.wait();
return None;
}
}
};
if !status.success() {
tracing::debug!(
"hardware-signal query exited with status {:?}; ignoring this signal",
status.code()
);
return None;
}
// The process exited; drain its captured stdout.
let mut buf = Vec::new();
if let Some(mut out) = child.stdout.take() {
if let Err(e) = out.read_to_end(&mut buf) {
tracing::debug!("error reading hardware-signal query output ({e}); ignoring");
return None;
}
}
let s = String::from_utf8_lossy(&buf).trim().to_string();
if s.is_empty() {
None
} else {
Some(s)
}
}
/// Normalize a raw hardware signal: trim, upper-case, drop if empty. Upper-casing
/// makes the digest stable against vendor case drift; trimming removes stray
/// whitespace WMI sometimes pads serials with.
#[cfg(windows)]
fn normalize_signal(raw: Option<&str>) -> Option<String> {
let v = raw?.trim();
if v.is_empty() {
return None;
}
Some(v.to_uppercase())
}
#[cfg(not(windows))] #[cfg(not(windows))]
fn compute_machine_uid() -> String { fn compute_machine_uid() -> String {
// No OS machine GUID available — use the persisted random UUID, hashed for a // No OS machine GUID available — use the persisted random UUID, hashed for a
@@ -297,4 +566,108 @@ mod tests {
assert_eq!(a, b, "compute_machine_uid must be deterministic on Windows"); assert_eq!(a, b, "compute_machine_uid must be deterministic on Windows");
assert!(a.starts_with(MUID_PREFIX)); assert!(a.starts_with(MUID_PREFIX));
} }
/// Pin the EXACT derivation strings that `compute_machine_uid` builds, so these
/// pure-function tests track the production logic. Keep in lock-step with
/// `compute_machine_uid`.
#[cfg(windows)]
fn salted_uid(salt: &str) -> String {
derive_uid(&format!("{MUID_NAMESPACE}|{salt}"))
}
#[cfg(windows)]
fn machineguid_only_uid(guid: &str) -> String {
derive_uid(&format!("{MUID_NAMESPACE}|machineguid:{guid}"))
}
/// H1 RE-IMAGE STABILITY: when a hardware salt is present, the uid is derived
/// from the salt ALONE — the MachineGuid is NOT part of the input. So holding
/// the hardware signals fixed while varying the MachineGuid MUST yield the SAME
/// uid. This is exactly the re-image case: an OS re-image regenerates the
/// MachineGuid but leaves SMBIOS UUID / board+disk serial unchanged, and the
/// machine_uid must not move (otherwise dedup breaks). We prove it by showing
/// the salted derivation has no MachineGuid term to vary.
#[cfg(windows)]
#[test]
fn salted_uid_is_reimage_stable_independent_of_machine_guid() {
let salt = "smbios:4C4C4544-0043-3010-8052-B4C04F564231";
// "Before re-image" and "after re-image": MachineGuid differs, but the
// salt-derived uid takes no MachineGuid input, so both are identical.
let before = salted_uid(salt);
let after = salted_uid(salt);
assert_eq!(
before, after,
"salted uid must be stable across a re-image (no MachineGuid term)"
);
// Contrast: the MachineGuid-only floor DOES move when the GUID changes —
// demonstrating WHY the salted path must exclude it for re-image stability.
let guid_a = machineguid_only_uid("11111111-2222-3333-4444-555555555555");
let guid_b = machineguid_only_uid("99999999-8888-7777-6666-555555555555");
assert_ne!(
guid_a, guid_b,
"MachineGuid-only floor is volatile across re-image (expected)"
);
// And the salted uid must differ from the MachineGuid-only floor for the
// same box: the two derivation paths are domain-separated.
assert_ne!(before, guid_a);
}
/// The hardware-salted derivation is `derive_uid` over a deterministic,
/// namespaced concatenation: identical signals MUST yield an identical uid and
/// any changed signal MUST change it. Pins the SPEC-016 determinism contract
/// independent of the (machine-specific) live hardware reads.
#[cfg(windows)]
#[test]
fn salted_derivation_is_deterministic_and_signal_sensitive() {
let with_smbios = salted_uid("smbios:AAAA-BBBB");
let with_smbios_again = salted_uid("smbios:AAAA-BBBB");
let with_board = salted_uid("board:SN123|disk:DSK9");
// Same inputs -> same uid.
assert_eq!(with_smbios, with_smbios_again);
// Different salt composition -> different uid (distinct boxes stay distinct).
assert_ne!(with_smbios, with_board);
}
/// All-zero and all-FF SMBIOS UUIDs are degenerate placeholders that some OEMs
/// and hypervisor templates emit; the normalizer + placeholder check must
/// reject them so the derivation falls through to board/disk serial. We
/// exercise the rejection predicate directly (it is pure) rather than the
/// live WMI read.
#[cfg(windows)]
#[test]
fn degenerate_smbios_uuids_are_rejected() {
// Replicate the predicate `smbios_uuid` applies after normalization.
fn is_degenerate(raw: &str) -> bool {
let Some(norm) = normalize_signal(Some(raw)) else {
return true;
};
let hex: String = norm.chars().filter(|c| *c != '-').collect();
hex.is_empty()
|| (!hex.is_empty() && hex.chars().all(|c| c == '0'))
|| (!hex.is_empty() && hex.chars().all(|c| c == 'F'))
}
assert!(is_degenerate("00000000-0000-0000-0000-000000000000"));
assert!(is_degenerate("FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF"));
assert!(is_degenerate("ffffffff-ffff-ffff-ffff-ffffffffffff")); // case-insensitive via normalize
assert!(is_degenerate(" "));
// A real, mixed UUID is NOT degenerate.
assert!(!is_degenerate("4C4C4544-0043-3010-8052-B4C04F564231"));
}
/// `normalize_signal` trims, upper-cases, and drops empties — so case/space
/// drift in a vendor serial never perturbs the digest.
#[cfg(windows)]
#[test]
fn normalize_signal_is_stable_against_drift() {
assert_eq!(
normalize_signal(Some(" abc123 ")),
Some("ABC123".to_string())
);
assert_eq!(normalize_signal(Some("ABC123")), Some("ABC123".to_string()));
assert_eq!(normalize_signal(Some(" ")), None);
assert_eq!(normalize_signal(None), None);
}
} }

@@ -16,7 +16,10 @@ mod capture;
mod chat; mod chat;
mod config; mod config;
mod consent; mod consent;
#[cfg(windows)]
mod credential_store;
mod encoder; mod encoder;
mod enroll;
mod identity; mod identity;
mod input; mod input;
mod install; mod install;
@@ -323,7 +326,112 @@ fn run_agent_mode(support_code: Option<String>) -> Result<()> {
// Run the agent // Run the agent
let rt = tokio::runtime::Runtime::new()?; let rt = tokio::runtime::Runtime::new()?;
rt.block_on(run_agent(config)) rt.block_on(async move {
// SPEC-016 Phase B: resolve the operating credential before connecting.
// Support sessions are unaffected — they authenticate by support code, not
// by a per-machine cak_, so we only resolve enrollment for a managed agent.
if config.support_code.is_none() {
resolve_agent_credential(&mut config).await?;
}
run_agent(config).await
})
}
/// Resolve the per-machine operating credential for a managed agent (SPEC-016
/// Phase B, run-mode wiring).
///
/// Precedence:
/// 1. A `cak_` already stored encrypted at rest -> load it and connect with it
/// (the steady-state path; no network call, no re-enroll).
/// 2. No stored `cak_` but an `enrollment_key` + `site_code` are present ->
/// run first-run enrollment to obtain + persist a `cak_`, then connect.
/// 3. Neither a stored `cak_` nor enrollment material, but a non-empty
/// `api_key` is configured -> use it as the DEPRECATED shared/legacy key
/// (transition compatibility only; logged at WARNING).
/// 4. Nothing usable -> error; a managed agent cannot authenticate.
async fn resolve_agent_credential(config: &mut config::Config) -> Result<()> {
// 1. Stored per-machine cak_ (steady state).
#[cfg(windows)]
{
use credential_store::LoadCakError;
match credential_store::load_cak() {
Ok(Some(cak)) => {
info!("Using stored per-machine credential (cak_)");
config.api_key = cak;
// Any leftover enrollment material is now moot.
config.enrollment_key = None;
return Ok(());
}
Ok(None) => {
info!("No stored per-machine credential; will enroll if configured");
}
// C1 / M1 — the store exists but THIS security context cannot read it
            // (access-denied against the SYSTEM-only ACL). This is the brick the
            // C1 guard prevents: a non-SYSTEM run could write the store but never
            // read it back. Fail fast with an actionable message; do NOT loop and
            // do NOT silently re-enroll. The SYSTEM+Administrators ACL is correct
            // for the target (Option A) and is deliberately kept.
            //
            // NOTE: this guard is satisfied/removed once the GuruConnect SYSTEM
            // service host lands (separate spec, SPEC-018) and the agent always
            // runs as SYSTEM — at which point the store is always readable.
            Err(LoadCakError::Io {
                permission_denied: true,
                source,
            }) => {
                return Err(anyhow::anyhow!(
                    "[ENROLL] credential store is not accessible in this context \
                     ({source}) — the managed agent must run as the GuruConnect SYSTEM \
                     service (see SPEC-018). Refusing to re-enroll."
                ));
            }
            // M1 — other IO error reaching the store (not access-denied): also
            // operational, not a tamper signal. Surface it; do not re-enroll over a
            // store we simply could not read.
            Err(e @ LoadCakError::Io { .. }) => {
                return Err(anyhow::Error::new(e).context(
                    "[ENROLL] credential store present but unreadable (IO error); \
                     refusing to re-enroll over it",
                ));
            }
            Err(e @ LoadCakError::Path(_)) => {
                return Err(anyhow::Error::new(e)
                    .context("[ENROLL] could not resolve the credential store path"));
            }
            // M1 — the bytes were read but failed to DECRYPT: the real tamper /
            // wrong-machine signal. Hard stop; never silently re-enroll over it.
            Err(e @ LoadCakError::Decrypt(_)) => {
                return Err(anyhow::Error::new(e).context(
                    "[ENROLL] stored credential failed to decrypt — possible tamper or \
                     copy from another machine; refusing to silently re-enroll",
                ));
            }
        }
    }
    // 2. First-run enrollment (the SPEC-016 zero-touch path). run_enrollment only
    //    returns once a cak_ is stored (it retries network/429/collision-pending
    //    internally); a returned Err is an unrecoverable local fault.
    if config.enrollment_key.is_some() && config.site_code.is_some() {
        info!("Enrollment material present; running first-run enrollment");
        enroll::run_enrollment(config).await?;
        return Ok(());
    }
    // 3. DEPRECATED shared/legacy api_key fallback (transition only).
    if !config.api_key.is_empty() {
        warn!(
            "Connecting with a DEPRECATED shared/legacy api_key. Migrate this agent \
             to a per-site enrollment (SPEC-016); the shared key path will be removed."
        );
        return Ok(());
    }
    // 4. Nothing usable.
    Err(anyhow::anyhow!(
        "no operating credential available: no stored cak_, no enrollment_key/site_code, \
         and no legacy api_key — this managed agent cannot authenticate"
    ))
}
/// Run in viewer mode (connect to remote session)


@@ -16,11 +16,16 @@ stack. It ships independently of GuruRMM and integrates with it via a versioned
> match, blacklist-on-WS, agent-plane rejects user JWTs via per-agent `cak_` keys). The feature specs below
> (SPEC-003–009) are **work-items inside the later v2 phases** — see the mapping.
>
-> **Remaining to formally exit Phase 1:** secure-session-core **Task 8** (end-to-end verification +
-> `/gc-audit --pass=security` re-audit + the manual CRITICAL checks) and Code-Review sign-off on Tasks 3–5
-> (implemented without a local toolchain at the time; since built + deployed). Live HW-H.264 validation is
-> also pending — raw+Zstd remains the shipping default. ~~Sprint 0 (relay-auth CRITICAL hotfix)~~ **not
-> needed — those fixes shipped in Tasks 2–3.**
+> **Phase 1 formally EXITED (2026-05-31).** secure-session-core **Task 8** is complete — end-to-end
+> functional verification (live CRITICAL boundary checks against the deployed binary: login-JWT→401,
+> wrong-session viewer token→403, JWT-as-agent-key→401) **plus the `/gc-audit --pass=security` re-audit:
+> PASS, 0 CRITICAL/HIGH/MEDIUM/LOW** ([report](../reports/2026-05-31-gc-audit.md)). Code-Review sign-off on
+> Tasks 3–5 landed earlier. On top of Phase 1, **SPEC-004 (Tasks 2/4/5 — machine_uid dedup, session
+> reaping, operator removal API+UI) is implemented, reviewed, deployed, and the 11 live ghost rows were
+> purged**; the agent is now **auto-versioned + Azure-Trusted-Signing-signed via `release.yml`** with
+> **v0.3.0 published** as the stable release. ~~Sprint 0 (relay-auth CRITICAL hotfix)~~ **not needed.**
+> Still pending (NOT a Phase-1 blocker): live HW-H.264 cross-GPU validation — **raw+Zstd remains the
+> shipping default** (`DEFAULT_PREFER_H264=false`) until H.264 is validated across GPUs.
### v2 phase mapping of current specs
@@ -43,8 +48,9 @@ stack. It ships independently of GuruRMM and integrates with it via a versioned
Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](specs/SPEC-001-operational-tooling-parity.md).
-- [ ] **Code signing — Azure Trusted Signing in CI** — P1 — sign the Windows agent `.exe` via `jsign` (TRUSTEDSIGNING) in Gitea Actions, reusing the shared ACG cert profile. (SPEC-001 §2)
-- [ ] **Automatic versioning** — P1 — conventional-commit-driven version bump across agent/server/dashboard, embedded via `build.rs`. (SPEC-001 §3)
+- [x] **Code signing — Azure Trusted Signing in CI** — P1 — Windows agent `.exe` signed via `jsign` (TRUSTEDSIGNING) in `release.yml`, fail-closed (never publishes unsigned). Shipped with v0.3.0. (SPEC-001 §2)
+- [ ] **Signed beta/test release channel** — **P1 — NOW** — every binary we hand to a tester must be signed, but signing today only runs on a deliberate full `release.yml` dispatch; the automatic `build-and-test.yml` agent artifact is explicitly **unsigned**. Add a `channel: stable | beta` `workflow_dispatch` input to `release.yml`: `beta` signs the agent and publishes a prerelease-tagged Gitea release (e.g. `v0.4.0-beta.1`) **skipping the semver bump + changelog**; `stable` keeps the existing full path. Keeps signing secrets out of PR-triggered runs. (SPEC-001 §2)
+- [x] **Automatic versioning** — P1 — conventional-commit-driven version bump computed at dispatch in `release.yml`, embedded via `build.rs`. Shipped with v0.3.0. (SPEC-001 §3)
- [ ] **Changelog generation & API** — P2 — `CHANGELOG.md` + per-version changelogs from conventional commits, served at `/api/changelog/...`. (SPEC-001 §4)
- [ ] **Feature-request workflow** — P2 — `/gc-feature-request` skill producing `docs/specs/SPEC-NNN-*.md` and updating this roadmap. (SPEC-001 §1)
- [ ] **Roadmap / ADR / spec tracking** — P1 — this file + `ARCHITECTURE_DECISIONS.md` + `docs/specs/`. (SPEC-001 §5) — *bootstrapped*
@@ -62,6 +68,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
- [x] Protobuf-over-WSS transport, Zstd frame compression
- [~] React/TS web viewer (`dashboard/src/components/RemoteViewer.tsx`) — embeddable session viewer
- [ ] **Headless Linux mode (direct TTY access)** — P2 — Terminal-based remote access for Linux servers without GUI. PTY spawn (`openpty`), xterm.js web viewer, full ANSI/VT100 support. Enables server management, container debugging, emergency recovery via GuruConnect dashboard with audit logging. SSH replacement with centralized auth. ([SPEC-012](specs/SPEC-012-headless-linux-tty.md))
+- [ ] **Managed-agent SYSTEM service host + session broker** — P1 — convert the persistent agent from `HKCU Run` (user context) to a LocalSystem **service** that runs unattended (login screen, no user, across reboots) and spawns a per-session capture/input worker into the active desktop (Session 0 can't capture directly). Unblocks SPEC-016 Phase B end-to-end (the SYSTEM-ACL'd `cak_` store becomes readable; removes the Phase B fail-fast guard), enables true unattended access, and is the **broker primitive SPEC-013 builds on**. ([SPEC-018](specs/SPEC-018-managed-agent-service-host.md))
- [ ] **Windows session selection and backstage mode** — P2 — Enumerate and switch between Windows user sessions (Terminal Services/RDP/Fast User Switching) and access Session 0 (backstage) for system-level admin tasks. ScreenConnect parity: session selector shows all logged-on users, instant switching without reconnect. Backstage mode provides terminal/command interface for services management without disrupting any user desktop. Critical for multi-user server environments. ([SPEC-013](specs/SPEC-013-session-selection-and-backstage.md))
- [ ] **Configurable notification overlay on viewer connection** — P2 — Display a semi-transparent on-screen notification when a technician connects, showing technician name and company. Dashboard-configurable message template (supports `{{technician_name}}`, `{{company}}`, `{{time}}`), duration (5-60s), position (top-left/right, bottom-left/right, center), and dismissible behavior. Increases transparency and user awareness during remote support sessions. Compliance-friendly for privacy policies requiring user notification. ([SPEC-015](specs/SPEC-015-notification-overlay.md))
- [ ] Multi-monitor switching — P2
@@ -81,13 +88,15 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
- [x] Sessions / machines / support-codes / events
- [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). **[→ v2 Phase 2]** ([SPEC-003](specs/SPEC-003-machine-inventory.md))
- [ ] **Stable machine identity + session lifecycle reaping + operator removal** — P1 — give the agent a deterministic machine-derived `machine_uid` (Windows `MachineGuid`-based) so the same box can't register duplicates (root cause: `agent_id` is a config-file random UUID that a portable/misconfigured run regenerates each launch); key registration on it; add TTL reaping + same-machine supersede as defense-in-depth; and admin-gated per-row + multi-select bulk removal of stale sessions/units. Identity must be bound to the per-machine agent key (spoof guard). Fixes ghost-session accumulation seen on the live console (15 sessions / 0 live, ~10 orphans for one machine). **[→ v2 Phase 1]** ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
+- [ ] **Zero-touch per-site agent enrollment** — P1 — ScreenConnect-class managed enrollment: one signed installer per site, machines self-register on first run and the server mints a per-machine `cak_` bound to a deterministic `machine_uid` (dedups re-installs). Per-site **rotatable** enrollment key (long secret + `vN (XXXX)` fingerprint) — rotating blocks new enrollments from old installers, leaves enrolled agents untouched. Auto-approve + new-enrollment/site-move alert. **Sign base agent once (CI, shipped) + per-site signed wrapper that writes site config around the signed bytes — resolves SPEC-007's signature-vs-appended-config question.** Anticipated/deferred: enrollment policy + licensing, `--enroll-key`/`--reassign` flag overrides, technician-assisted interactive install. **[→ v2 Phase 1]** ([SPEC-016](specs/SPEC-016-zero-touch-enrollment.md))
- [ ] **Machines list view — dual connection indicators + rich rows** — P2 — ScreenConnect "Access"-list parity: per-row Host/Guest two-segment connection bar (Guest=agent online, Host=viewer connected, with names + durations) and rich inline metadata (company, site, device type, tags, logged-on user + idle, client version in red when outdated). Server-enriches `/api/machines` with live session state + SPEC-003 inventory. **[→ v2 Phase 2]** ([SPEC-005](specs/SPEC-005-machines-list-view-parity.md))
- [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005.
- [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). **[→ v2 Phase 2]** ([SPEC-006](specs/SPEC-006-universal-machine-search.md))
-- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. **[→ v2 Phase 2]** ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
+- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; the signature-vs-appended-config question is resolved by SPEC-016 (sign-once base + per-site signed wrapper, no PE append). **[→ v2 Phase 2]** ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
- [ ] **Valuable error messages (structured errors + no silent swallows)** — P2 — one structured API error envelope with stable codes + a correlation id that also lands in the logs; contextual tracing on server/agent; sweep the 37 `let _ =` swallows (the pattern that hid the migration-005 bug); dashboard surfaces the real cause + id instead of a generic line. **[→ v2 Phase 0/1 conventions]** ([SPEC-008](specs/SPEC-008-valuable-error-messages.md))
- [ ] **Feature-rich, fully-documented management API** — P2 — everything the console can do, callable by API: OpenAPI 3.x generated from code (utoipa) + browsable docs at `/api/docs`, long-lived revocable scoped API tokens (PAT-style, distinct from the 24h JWT + agent keys), an API-completeness gap audit, and consistent pagination/error conventions. Distinct from the ADR-001 RMM integration contract. **[→ v2 Phase 3]** ([SPEC-009](specs/SPEC-009-feature-rich-documented-api.md))
- [ ] **Branding and white-label configuration** — P2 — Allow MSPs to customize logo, colors, and product name for white-labeled remote support. Dashboard admin settings page with logo upload (PNG/SVG, max 2MB), brand hue slider (OKLCH 0-360°, default 184=cyan), product name override, company name, and favicon. Agent tray tooltip uses custom product name from registry. Singleton database table with public GET endpoint for unauthenticated rendering. CSS variables (`--brand-hue`, `--accent`, `--panel`) for dynamic theming. **[→ v2 Phase 2]** ([SPEC-014](specs/SPEC-014-branding-whitelabel.md))
+- [ ] **End-user (sub-user) remote access** — P2 (may be P3) — let a client pay for their employees to reach their *own* machines from home: a deny-by-default `end_user` login role, a locked-down end-user portal listing only granted machines, and Connect reusing the existing session-scoped viewer-token + relay path. Grant primitive already exists (`user_client_access`, migration 002); directory sync (AD/Entra/Google) is a separate future spec. **[→ new capability, post v2-console]** ([SPEC-017](specs/SPEC-017-end-user-remote-access.md))
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
## Security & Infrastructure


@@ -0,0 +1,244 @@
# SPEC-016: Zero-Touch Per-Site Agent Enrollment
**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-06-02)
**Estimated Effort:** X-Large
## Overview
Give GuruConnect a ScreenConnect-class managed-agent enrollment flow: a technician runs
**one signed installer per site** on every machine at that site — no per-machine key
minting, no flags, no typing — and each machine **self-registers** on first run, the
server minting it a per-machine `cak_` key bound to a stable, machine-derived
`machine_uid`. Each site installer carries a **rotatable per-site enrollment key** (a long
server-generated secret) plus a short human-readable **fingerprint** (`vN (XXXX)`) so an
operator can tell at a glance whether an installer is current. Rotating a site's key blocks
*new* enrollments from old installers while leaving already-enrolled machines untouched
(they hold their own `cak_`).
This is the missing piece that turns the v2 secure-session-core (SPEC-004 per-agent keys +
`machine_uid`) into a real product workflow, and it **resolves SPEC-007's open
signature-vs-appended-config question**: the agent binary is signed **once** in CI
(already shipped via `release.yml`), and per-site customization rides in a thin **signed
wrapper** that writes site config to the endpoint at install time — never appended into the
signed PE.
**Success criteria:**
1. A tech installs one site installer on N machines; all N appear in the console under the
correct company/site, each as a distinct, deduplicated machine — zero per-machine setup.
2. Re-installing / re-imaging the same hardware **reuses** the existing machine row (no
ghost duplicates — the failure mode SPEC-004 documents).
3. Rotating a site's enrollment key makes old installers unable to enroll new machines,
while every already-enrolled agent keeps working.
4. Every distributed installer is **validly Authenticode-signed** (SmartScreen/WDAC clean).
## Background — what exists today (confirmed in code)
- **Embedded config is append-based and breaks signing.** `server/src/api/downloads.rs`
(`download_agent`, ~`:152`) reads `static/downloads/guruconnect.exe` and **appends**
`MAGIC_MARKER` + `len:u32` + JSON (`:196`) to the end of the PE. The agent reads it back
in `agent/src/config.rs` (`read_embedded_config`, `:223`). Appending bytes after a signed
PE invalidates the Authenticode signature — so the current customization path and the
newly-shipped CI signing are mutually exclusive.
- **No self-registration exists.** Per-agent `cak_` keys are minted **admin-only** in
`server/src/api/machine_keys.rs` (`create_key`, `:119`; "Admin issued a per-agent key",
`:146`). There is no endpoint where an agent first-run exchanges an enrollment credential
for its own key.
- **Relay already accepts per-agent keys.** `server/src/relay/mod.rs`
(`validate_agent_api_key`, `:417`) calls `crate::auth::agent_keys::verify_agent_key`
(`:422`) — the `cak_` path — then falls back to the **deprecated** shared `AGENT_API_KEY`
(`:444`, logs a "migrate to per-agent `cak_`" warning).
- **Key primitives exist.** `server/src/auth/agent_keys.rs`: `generate_agent_key` mints a
`cak_`-prefixed high-entropy key (`:36`/`:46`); `verify_agent_key` (`:71`).
`server/src/db/agent_keys.rs` already inserts into `connect_agent_keys (machine_id,
key_hash, tenant_id)` (`:47`) — the v2 tenancy column is present (migration
`004_v2_secure_session_core.sql`).
- **Identity is a random config UUID, not machine-derived** — the root cause of duplicates
per SPEC-004 (`agent/src/config.rs` `generate_agent_id`, `:90`).
- **Agent mode dispatch:** `agent/src/main.rs` `Commands::Install` (`:160`) → `run_install`;
`agent/src/config.rs` `detect_run_mode` (`:162`) returns `RunMode::PermanentAgent` when
embedded config is present.
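
To make the append-path defect concrete, here is a minimal Rust sketch of the `MAGIC_MARKER` + `len:u32` + JSON trailer described in the first bullet. The marker bytes, the little-endian length, and the helper name `append_config` are assumptions for illustration. Only the overall layout and the name `read_embedded_config` come from the text above, and the real functions in `downloads.rs` and `config.rs` may differ.

```rust
// Hypothetical marker value; the real MAGIC_MARKER bytes are not shown in this spec.
const MAGIC_MARKER: &[u8] = b"GC_EMBED_CFG";

/// Append `MAGIC_MARKER + len:u32 + JSON` after the executable bytes.
/// A little-endian length is an assumption.
fn append_config(exe: &[u8], json: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(exe.len() + MAGIC_MARKER.len() + 4 + json.len());
    out.extend_from_slice(exe);
    out.extend_from_slice(MAGIC_MARKER);
    out.extend_from_slice(&(json.len() as u32).to_le_bytes());
    out.extend_from_slice(json.as_bytes());
    out
}

/// Find the last marker in the file and decode the length-prefixed JSON after it.
/// Returns None when no marker is present or the trailer is truncated.
fn read_embedded_config(file: &[u8]) -> Option<String> {
    let start = file
        .windows(MAGIC_MARKER.len())
        .rposition(|w| w == MAGIC_MARKER)?;
    let len_at = start + MAGIC_MARKER.len();
    let b = file.get(len_at..len_at + 4)?;
    let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
    let body = file.get(len_at + 4..len_at + 4 + len)?;
    String::from_utf8(body.to_vec()).ok()
}
```

The original executable bytes are left intact as a prefix, but the file as a whole is no longer the image that was signed. That is the conflict with CI signing that this section describes.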
## Scope
### Included in v1 (CORE)
1. **`machine_uid` — deterministic machine identity (hardware-salted, per-tenant).** Derive
a stable id from the Windows `MachineGuid`
(`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`) **salted with stable hardware
signals** (SMBIOS UUID / motherboard + disk serial), independent of the config-file
`agent_id`. Hardware-derived salt is deliberate: it **survives an OS reinstall/re-image
on the same hardware** (so the row is reused — the re-image dedup goal) while keeping
distinct physical boxes distinct (a per-install *random* salt would break re-image dedup
and is rejected). Uniqueness is scoped **per-tenant** — dedup key `(tenant_id,
machine_uid)` — so the same hardware legitimately present in two tenants stays two
independent rows. (Shared root with SPEC-004; whichever lands first owns the impl, the
other consumes it.) Used as the dedup key for register/move.
**Collision-gated activation.** The residual collision case is VMs/templates that share a
hardware UUID (some hypervisors clone the SMBIOS UUID). When the server detects a
`machine_uid` collision (a seemingly-different endpoint resolving to an existing uid), the
endpoint does **not** auto-activate: it drops to a **pending** state, fires an alert, and
an operator must confirm in the dashboard that the collided endpoint may activate. This is
the one deliberate exception to auto-approve (see item 6).
2. **Per-site enrollment key + fingerprint.**
- Long (≥256-bit) server-generated secret per site, stored **hashed** (Argon2id, same
as `cak_`/passwords), never recoverable in plaintext after issue.
- A non-secret **fingerprint** = monotonic version + short derived code in **hex**,
rendered `vN (XXXX)` (e.g. `v3 (7F2A)`), shown in the dashboard, baked into the
installer filename, and reported by the agent at enrollment. Hex is deliberate —
**not** the RMM word-style code (`GREEN-FALCON`) — so GuruConnect and GuruRMM
artifacts are never visually conflated.
- **Rotate** regenerates the secret and bumps the version; old installers are rejected
for *new* enrollments; existing agents (holding `cak_`) are unaffected.
3. **Self-registration endpoint.** New `POST /api/enroll` (public, unauthenticated by JWT —
gated by the enrollment key) accepting `{ site_code, enrollment_key, machine_uid,
hostname, labels{company,site,department,device_type,tags} }`:
- Verify `(site_code, enrollment_key)` against the current per-site key.
- **Dedup by `machine_uid`** within the tenant (dedup key `(tenant_id, machine_uid)`, item 1): if the
machine exists, reuse the row and rotate its `cak_`; else create the machine row.
- Mint a `cak_` (reuse `generate_agent_key`), store hashed via `db::agent_keys` bound to
`machine_id` (+ `tenant_id` from the site), return the plaintext `cak_` **once**.
- Emit an audit event + **new-enrollment alert** (and a **site-move** alert when an
existing `machine_uid` enrolls under a different site).
- **Rate-limit + lockout** per `(site_code, source-IP)` as defense-in-depth (the key is
long, so this is belt-and-suspenders, not load-bearing).
4. **Agent first-run enrollment.** On `RunMode::PermanentAgent` with no stored `cak_`:
read site config → call `/api/enroll` with `machine_uid` → persist the returned `cak_`
to a SYSTEM-only protected store (HKLM under a SYSTEM-only ACL, or DPAPI-machine) →
connect to `wss://connect.azcomputerguru.com/ws/agent` using the `cak_`. On subsequent
runs, use the stored `cak_` directly (no re-enroll).
5. **Sign-once base + per-site signed wrapper (resolves SPEC-007 open question).**
- The base agent is signed once in CI (`release.yml`, already shipped) and stays
byte-identical for everyone.
- Per-site customization (labels + enrollment key + fingerprint) is delivered to the
endpoint **at install time** via a signing-safe channel — NOT appended to the signed
PE. **v1 produces BOTH a signed bootstrapper `.exe` and a signed MSI per site**
(ScreenConnect parity — manual installs grab the `.exe`, GPO/Intune fleet pushes take
the MSI), both wrapping the same sign-once agent and writing the site config to the
protected config location. The two differ only in packaging (bootstrapper stub vs. WiX
bundle); both are signed.
- **Deprecate the append path** in `downloads.rs` for managed installs (keep only for
attended/support-code if still needed), eliminating the signature-invalidation defect.
6. **Auto-approve posture (with collision-gate exception).** A self-registered machine is
live and controllable immediately (ScreenConnect parity); the new-enrollment alert is the
tripwire. The **one** exception is a detected `machine_uid` collision (item 1), which
gates the endpoint to **pending** until an operator confirms it in the dashboard.
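
The agent-side order implied by item 4 (stored `cak_` first, then first-run enrollment, then the deprecated shared key) can be sketched as a pure function. The names `CredentialSource`, `AgentConfig`, and `resolve_credential` are hypothetical. The real agent also distinguishes several store-read failures and refuses to re-enroll over a store it could not read or decrypt, which this sketch leaves out.

```rust
/// Which credential the agent would use on this run.
#[derive(Debug, PartialEq)]
enum CredentialSource {
    StoredCak,
    FirstRunEnrollment,
    LegacyApiKey,
}

/// Hypothetical, trimmed-down view of the agent config.
struct AgentConfig {
    enrollment_key: Option<String>,
    site_code: Option<String>,
    api_key: String,
}

/// Pick the operating credential in priority order.
fn resolve_credential(
    has_stored_cak: bool,
    config: &AgentConfig,
) -> Result<CredentialSource, String> {
    // 1. A stored per-machine key always wins; no re-enroll.
    if has_stored_cak {
        return Ok(CredentialSource::StoredCak);
    }
    // 2. Enrollment needs both the site code and the enrollment key.
    if config.enrollment_key.is_some() && config.site_code.is_some() {
        return Ok(CredentialSource::FirstRunEnrollment);
    }
    // 3. Deprecated shared key, transition only.
    if !config.api_key.is_empty() {
        return Ok(CredentialSource::LegacyApiKey);
    }
    // 4. Nothing usable.
    Err("no operating credential available".to_string())
}
```

Keeping the decision separate from the IO (store read, HTTP call) makes the ordering easy to unit-test without a Windows host.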
### Explicitly out of scope (ANTICIPATED — reserve room, do NOT build in v1)
The v1 data model and agent mode-dispatch must leave room for these without building them:
- **Per-site enrollment POLICY** — a `sites.enrollment_policy` field (default
`auto-approve`; future `pending-approval`) plus per-seat/per-endpoint licensing controls.
Commercial, multi-tenant (the `tenant_id` column already exists). Its own future SPEC.
- **Flag overrides** — `--enroll-key` / `--site-code` (generic installer, key supplied on
the command line) and `--reassign` (move an existing machine to a new site, gated by
possession of the destination site's key, with an **explicit accidental-move guard**:
a different-site re-run refuses unless `--reassign` is passed) + cross-client move policy.
Backend (`machine_uid` + authorized site + `cak_`) is designed to support it; CLI surface
is deferred.
- **Technician-assisted interactive install** — `--technician` on a generic installer:
prompts for the tech's own server credentials, and on auth presents a **validated**
Company/Site/tags picker from the live authorized list (authz-by-identity, full audit
trail). Heaviest path (interactive UI + auth/list callback); deferred.
All three converge on the **same backend operation** delivered in v1: `machine_uid` +
authorized site + issued `cak_`. v1 only ships the per-site-embedded-key door.
## Architecture
- **Agent** (`agent/`): compute `machine_uid`; first-run enroll → store `cak_`; use stored
`cak_` thereafter; read site config from the wrapper-written location instead of an
appended PE blob. Touches `config.rs` (`EmbeddedConfig`/`detect_run_mode`/storage),
`main.rs` (`Install`/run-mode), a new `enroll` client module, transport auth.
- **Relay-server** (`server/`): new `POST /api/enroll`; per-site key issue/rotate/verify;
`machine_uid` dedup + site-move on register; audit + alert emission; rate-limit/lockout.
Touches `api/` (new `enroll.rs`, `sites` key endpoints), `auth/agent_keys.rs`,
`db/agent_keys.rs`, `relay/mod.rs` (enrollment vs. connect), `main.rs` routes.
- **Dashboard**: per-site enrollment-key display (fingerprint `vN (XXXX)`), **Rotate**
action, "current installer" download wired to the signed wrapper build. (Builder UI is
SPEC-007; this spec supplies the key/fingerprint/rotation it consumes.)
- **DB migration:** `site_enrollment_keys` (or columns on the site): `site_id`,
`key_hash`, `version`, `fingerprint`, `created_at`, `rotated_at`, `active`. Reserve
`sites.enrollment_policy` (nullable, default `auto-approve`) for the anticipated policy
work. `connect_machines` gains `machine_uid` (unique per tenant, matching the `(tenant_id, machine_uid)` dedup key).
- **Protobuf** (`proto/guruconnect.proto`): no wire change required for enrollment if
`/api/enroll` is REST; `AgentStatus` label fields per SPEC-007 (`department`,
`device_type`) ride along if landed together.
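
As a rough illustration of the `site_enrollment_keys` row and the Rotate action, the sketch below models issue, rotate, and verify in memory. It is a sketch under stated assumptions: `toy_digest` is a stand-in so the example runs on the standard library alone (the spec calls for Argon2id), deriving the hex code from the stored hash is a guess at "short derived code", and the struct and method names are hypothetical.

```rust
/// Stand-in digest (64-bit FNV-1a) so the sketch runs on std alone.
/// The spec stores keys hashed with Argon2id; this is NOT a substitute.
fn toy_digest(secret: &str) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in secret.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Render the non-secret fingerprint as `vN (XXXX)`.
/// Taking the low 16 bits of the stored hash is an assumption.
fn fingerprint(version: u32, key_hash: u64) -> String {
    format!("v{} ({:04X})", version, key_hash & 0xFFFF)
}

/// Mirrors the proposed `site_enrollment_keys` columns.
struct SiteEnrollmentKey {
    site_id: u32,
    key_hash: u64,
    version: u32,
    fingerprint: String,
    created_at: u64,
    rotated_at: Option<u64>,
    active: bool,
}

impl SiteEnrollmentKey {
    fn issue(site_id: u32, secret: &str, now: u64) -> SiteEnrollmentKey {
        let key_hash = toy_digest(secret);
        SiteEnrollmentKey {
            site_id,
            key_hash,
            version: 1,
            fingerprint: fingerprint(1, key_hash),
            created_at: now,
            rotated_at: None,
            active: true,
        }
    }

    /// Replace the secret and bump the version; old installers stop verifying.
    fn rotate(&mut self, new_secret: &str, now: u64) {
        self.key_hash = toy_digest(new_secret);
        self.version += 1;
        self.fingerprint = fingerprint(self.version, self.key_hash);
        self.rotated_at = Some(now);
    }

    /// Gate for NEW enrollments only; enrolled agents use their own `cak_`.
    fn verify(&self, presented: &str) -> bool {
        self.active && toy_digest(presented) == self.key_hash
    }
}
```

Nothing in `rotate` touches per-machine keys, which matches the requirement that rotation leaves already-enrolled agents untouched.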
## Security considerations
- **Two-tier credential model:** low-sensitivity **enrollment key** (gates "may register",
shared per site, rotatable) vs. high-sensitivity **per-machine `cak_`** (operating
credential, per-machine revocation). Compromise of an enrollment key is recovered by
rotating one site — no fleet-wide re-key.
- **Enrollment keys stored hashed** (Argon2id); plaintext shown once at issue/rotate.
- **`cak_` at rest on the endpoint** is stored as a **DPAPI-machine-encrypted blob inside a
SYSTEM-ACL'd location** (HKLM value or `ProgramData` file) — both layers: the SYSTEM ACL
stops non-admin users reading it, and DPAPI-machine encryption makes a copied file/export
inert off the box. (Local admin/SYSTEM can always recover it; that is accepted — blast
radius of one leaked `cak_` is a single, independently-revocable machine.)
- **`machine_uid` binding** is the spoof-guard SPEC-004 wants: a `cak_` is bound to a
`machine_uid`; a different box presenting another box's `cak_` is detectable.
- **Authorization model** for moves/enrolls is possession-of-destination-key in v1
(identity-based authz deferred to the technician-assisted path).
- **Open registration risk** is mitigated by requiring `(site_code + long key)` and
rate-limit/lockout; auto-approve is acceptable because the enrollment key is the gate and
every enrollment/site-move fires an alert.
- **Audit events:** enroll, re-enroll/reuse, site-move, key-rotate — all logged with
`machine_uid`, site, and source IP.
## Testing strategy
- **Unit:** `machine_uid` derivation stability; enrollment-key verify/rotate; fingerprint
derivation; `cak_` mint/hash/verify; dedup decision (new vs. reuse vs. move).
- **Integration:** enroll new → row + `cak_` issued; re-enroll same `machine_uid` → reuse,
no duplicate; enroll with rotated (old) key → rejected; old `cak_` still connects after
rotation; rate-limit/lockout trips; site-move emits alert.
- **Manual:** build a site wrapper installer → run on a clean VM → appears in console under
correct site, immediately controllable; re-image VM → same row reused; `signtool verify
/pa` passes on the distributed wrapper and the laid-down agent.
## Effort estimate & dependencies
- **Size:** X-Large (agent + relay + DB migration + CI build/sign wrapper + dashboard
key/rotation surface).
- **Depends on:** SPEC-004 `machine_uid` (shared root); the CI signing already shipped
(SPEC-001 §2 / `release.yml`).
- **Unblocks:** SPEC-007 (installer builder gets a real per-site key + the signing
resolution), and the parked managed-agent test deployment on the internal beta machines.
- **Relationship to v2 phases:** sits with the Phase-1 secure-session-core (per-agent keys
+ identity) and feeds Phase-2 dashboard work.
## Resolved decisions (2026-06-02, Mike)
1. **Wrapper shape — BOTH.** v1 ships a signed bootstrapper `.exe` *and* a signed MSI per
site (ScreenConnect offers both; manual installs use the `.exe`, GPO/Intune fleet pushes
use the MSI). Same sign-once agent inside each.
2. **`cak_` storage — BOTH layers.** DPAPI-machine-encrypted blob stored in a SYSTEM-ACL'd
location. Non-admins can't read it; a stolen copy is inert off the box.
3. **Fingerprint — hex (`7F2A`).** Deliberately *not* the RMM word-code style, so the two
products' artifacts are never visually conflated.
4. **`machine_uid` — per-tenant scope, hardware-derived salt, collision-gated.** Dedup key
`(tenant_id, machine_uid)`; salt from stable hardware signals (survives same-hardware
re-image, separates distinct boxes); detected collisions (e.g. template-cloned VMs
sharing a hardware UUID) drop to pending + alert and require dashboard confirmation to
activate.
5. **Attended (support-code) path — unchanged.** `download_support` is filename-based
(`GuruConnect-<code>.exe`), not append-based, so renaming never breaks the signature —
it is already signing-safe. Only the managed `download_agent` append path is retired.
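The dedup outcomes named above (new vs. reuse vs. move, plus the collision gate from decision 4) can be sketched as a pure decision function. This is a minimal illustration, not the repository's code: `EnrollDecision`, `ExistingMachine`, `decide_enrollment`, and the `collision_detected` flag are hypothetical names, and how a collision is actually detected is still listed as open for planning.

```rust
/// Outcome of one enrollment attempt. Illustrative names, not the repo's types.
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollDecision {
    /// No row for (tenant_id, machine_uid): create one and issue a `cak_`.
    New,
    /// Same machine, same site: reuse the existing row (e.g. after a re-image).
    Reuse,
    /// Same machine presenting a different site's key: move it and alert.
    SiteMove { from_site: u32, to_site: u32 },
    /// The uid is known to be shared by distinct boxes: hold as pending + alert.
    PendingCollision,
}

/// The row already matched on (tenant_id, machine_uid), if any.
pub struct ExistingMachine {
    pub site_id: u32,
    /// Assumption: some upstream check has flagged this uid as colliding.
    pub collision_detected: bool,
}

/// Decide what an enrollment request should do, given the matched row.
pub fn decide_enrollment(existing: Option<&ExistingMachine>, target_site: u32) -> EnrollDecision {
    match existing {
        None => EnrollDecision::New,
        // The collision gate wins over reuse/move: never auto-activate.
        Some(m) if m.collision_detected => EnrollDecision::PendingCollision,
        Some(m) if m.site_id == target_site => EnrollDecision::Reuse,
        Some(m) => EnrollDecision::SiteMove {
            from_site: m.site_id,
            to_site: target_site,
        },
    }
}
```

Keeping the decision separate from the DB write makes the "new vs. reuse vs. move" unit test listed in the testing strategy a plain table of inputs and expected variants.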
## Remaining for planning
- Exact stable-hardware signal set for the salt (SMBIOS UUID alone vs. + motherboard/disk
serial) and hypervisor behavior matrix (which hypervisors duplicate the SMBIOS UUID on
clone → exercise the collision-gate).
- MSI authoring approach (WiX) and whether per-site config rides as a per-site MSI vs. a
base MSI + property/transform.


@@ -0,0 +1,180 @@
# SPEC-017: End-User (Sub-User) Remote Access
**Status:** Proposed
**Priority:** P2 (may settle to P3 depending on client demand)
**Requested By:** Mike (2026-06-02)
**Estimated Effort:** Large
## Overview
Let a client pay for their own employees to remotely reach **their own work machines** from home
through GuruConnect — the Splashtop-Business / unattended-end-user-access model, layered on top of the
MSP-technician console GuruConnect ships today. An MSP admin (or, later, a delegated client-company
admin) provisions a list of **end-users** and grants each one access to specific managed machines. The
end-user signs into a locked-down **end-user portal**, sees only the machines granted to them, and
connects — reusing the existing persistent-agent + session-scoped-viewer-token + relay path.
Success criteria: an `end_user`-role account can log in at a separate portal, see exactly the machines
in its grant set (and no others, and none from another tenant), launch a control session to an online granted
machine, and is hard-denied from every technician/admin API, the agent plane, and any machine it was
not granted — with each login and machine access written to the audit log.
This is a net-new **sellable capability**, not a console-MVP blocker. It is sequenced after the v2
console foundations it depends on (tenancy, machine identity, persistent enrollment), which is why it is
P2 rather than P1.
## Scope
### Included in v1
- A new **`end_user`** value for `users.role`, provisioned by an MSP admin, with **deny-by-default**
authority: no console permissions, no agent-plane access, machine reach limited strictly to its
`user_client_access` grant set within its own tenant.
- A **separate end-user login + portal** route (locked-down): lists only granted machines with
online/offline state and a Connect action. No admin nav, no other users/machines/companies.
- **Admin UI + API** to create/disable end-users and assign/revoke per-machine grants, reusing the
existing `user_client_access` table.
- **Connect flow** that reuses the landed session-scoped viewer-token mechanism (`ViewerClaims`,
`jwt.rs:114`) and the relay enforcement path — no new transport.
- A new `connect_sessions.source` value **`end_user`** (migration widening the existing CHECK).
- **Audit**: end-user login success/failure and each machine-access grant-check written to
`connect_session_events`.
- Rate limiting + lockout on the public end-user login.
### Explicitly out of scope (v1)
- **Directory sync (AD / Entra-365 / Google) → end-user list** — its own future spec; v1 is manual
list management only.
- **Self-service seat purchasing / billing automation.** v1 records/counts seats per tenant; real
metering and Syncro/billing wiring is deferred.
- **Delegated client-company-admin role** (a client managing its own end-users/grants) — noted as a
fast-follow; v1 grants are MSP-admin-managed.
- Per-session view-only-vs-control *policy* per end-user (v1 = Control of one's own machine; the
`ViewerAccess` split still exists at the token layer).
- File transfer, session recording (already out of scope for the broader product v1).
## Architecture
### Principal model — `end_user` is a constrained variant of the login plane
GuruConnect already has three credential planes that must stay separate (audit-hardened in v2 Phase 1):
1. **Login `Claims`** (`jwt.rs:11`) — dashboard users; `role ∈ {admin, operator, viewer}` today.
2. **Session-scoped `ViewerClaims`** (`jwt.rs:114`) — 5-min, one session, `purpose=viewer`.
3. **Agent `cak_` keys** (`connect_agent_keys`, migration 004) — agents only.
`end_user` is added as a **fourth role on the login plane** — it issues a normal login JWT
(`create_token`, `jwt.rs:161`) carrying `role: "end_user"` and an **empty permission list**. The
separation guarantees the v2 audit established are preserved: an `end_user` JWT still cannot be used as
a viewer token (lacks `purpose`) nor as an agent key (agent plane rejects user JWTs).
**Critical authz inversion:** `user_client_access` today documents "no entries = access to all (for
admins)" (migration 002, line 25-26). The grant check **must branch on role** — for `end_user`, an
empty grant set means **zero** machines, never all. Authz is deny-by-default and grant-scoped; the
admin-bypass in `Claims::has_permission` (`jwt.rs:28-33`) must never fire for `end_user`.
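The role branch described above might look like the following sketch. It is an illustration under stated assumptions: `Role`, `Machine`, and `grant_allows` are hypothetical names, ids are `u32` instead of UUIDs for brevity, and the rule shown for `Operator`/`Viewer` is a guess (the source only documents the admin "empty = all" behavior).

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    Operator,
    Viewer,
    EndUser,
}

pub struct Machine {
    pub id: u32,
    pub tenant_id: u32,
}

/// Grant check that branches on role.
/// `grants` stands in for the caller's rows in `user_client_access`.
pub fn grant_allows(role: Role, user_tenant: u32, grants: &HashSet<u32>, machine: &Machine) -> bool {
    match role {
        // Deny-by-default: an empty set means zero machines, and the machine
        // must live in the end-user's own tenant.
        Role::EndUser => user_tenant == machine.tenant_id && grants.contains(&machine.id),
        // Legacy behavior documented in migration 002: no entries = all.
        Role::Admin => grants.is_empty() || grants.contains(&machine.id),
        // Assumption: other console roles need an explicit grant.
        Role::Operator | Role::Viewer => grants.contains(&machine.id),
    }
}
```

The important property is that the "empty means all" shortcut is reachable only from the `Admin` arm, so adding a role later cannot inherit it by accident.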
### Agent / Relay-server / Viewer / Dashboard responsibilities
- **Agent:** no changes. End-users connect to existing **persistent/unattended** managed agents
(consent `not_required` — it is the user's own machine). Optionally honors the SPEC-015 notification
overlay if a per-machine policy requires it.
- **Relay-server:** no transport change. New end-user auth + portal + connect endpoints; the
grant-check + viewer-token mint is the only new server logic on the hot path.
- **Viewer:** reuse the React/TS web viewer (`dashboard/src/components/RemoteViewer.tsx`) — the
end-user portal embeds the same component with a Control-mode viewer token.
- **Dashboard:** new **role-gated end-user portal** route (recommended separate from the technician
console — see Open Questions), plus admin screens for end-user + grant management.
### Database (migrations)
- **`user_client_access`** — reused as the grant table; no schema change (already
`user_id UUID × client_id UUID → connect_machines(id)`, unique pair, migration 002).
- New migration `011_end_user_access.sql`:
- Widen `connect_sessions.source` CHECK to `('standalone','gururmm','end_user')` (currently
`('standalone','gururmm')`, migration 004 line 99-102).
- Optional `users` columns for the external principal: `mfa_secret TEXT NULL`,
`must_change_password BOOLEAN NOT NULL DEFAULT false`, and a partial index for fast
`role='end_user'` listing per `tenant_id`.
- (Seat tracking, if landed in v1: a lightweight per-tenant `end_user` count view or a
`tenant_seats` row — kept minimal.)
- Grants are tenant-contained: insert path validates `machine.tenant_id == end_user.tenant_id`.
### API endpoints / WS messages
- `POST /api/enduser/auth/login` — public, rate-limited; returns an `end_user` login JWT.
- `GET /api/enduser/machines` — lists only the caller's granted, in-tenant machines + presence.
- `POST /api/enduser/machines/:id/connect` — grant-checked; creates a `source=end_user` session and
mints a Control `ViewerClaims` token (`create_viewer_token`, `jwt.rs:233`) for that session.
- Admin: `POST /api/users` (role=end_user), `POST /api/users/:id/grants`,
`DELETE /api/users/:id/grants/:machine_id`, `GET /api/users?role=end_user`.
- No new protobuf messages — the WS viewer path and `guruconnect.proto` are unchanged.
## Implementation details
- `server/src/auth/jwt.rs` — extend the role vocabulary doc (`Claims.role`, line 16-17); add an
`is_end_user()` helper and ensure `has_permission` cannot grant `end_user` anything beyond explicit
permissions (the admin short-circuit at line 30 must be guarded).
- `server/src/auth/mod.rs`: `AuthenticatedUser` (line 29+) gains role-aware helpers; add an extractor
/ middleware that rejects non-`end_user` on the `/api/enduser/*` namespace and rejects `end_user` on
every console/admin route (deny-by-default allowlist).
- `server/src/api/` — new `enduser` handler module (login, machines, connect); admin user+grant
handlers extended for `role=end_user` and `user_client_access` writes.
- Grant check (shared fn): `machine_id ∈ user_client_access[user] AND machine.tenant_id == user.tenant_id`;
used by both `GET /machines` and `connect`.
- Session create stamps `source='end_user'`, `is_managed=true`/unattended, `consent_state='not_required'`,
then mints the viewer token via the existing path so relay enforcement is unchanged.
- `dashboard/src/` — end-user portal route (role-gated), reusing `RemoteViewer.tsx`; admin grant-matrix
UI. White-label (SPEC-014) applies to the portal as the most client-facing surface.
- Migration `server/migrations/011_end_user_access.sql` as above (idempotent; applied by
`sqlx::migrate!` per the migration standard).
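The role-namespace middleware in the list above reduces to a small two-way gate. The sketch below is hypothetical (`Gate` and `namespace_gate` are not names from the codebase) and assumes it runs only after authentication, so the public `/api/enduser/auth/login` route would be exempted before this check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    Operator,
    Viewer,
    EndUser,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// Continue to the route's own permission checks.
    Allow,
    /// Respond 403 without reaching the handler.
    Forbidden,
}

/// Two-way namespace gate for an already-authenticated caller.
pub fn namespace_gate(role: Role, path: &str) -> Gate {
    // Match the namespace on a path-segment boundary so that a route such as
    // `/api/enduserx` is not mistaken for the end-user namespace.
    let in_enduser_ns = path == "/api/enduser" || path.starts_with("/api/enduser/");
    match (role, in_enduser_ns) {
        (Role::EndUser, true) => Gate::Allow,
        // end_user is rejected on every console/admin route.
        (Role::EndUser, false) => Gate::Forbidden,
        // Technicians and admins are rejected on the end-user namespace.
        (_, true) => Gate::Forbidden,
        (_, false) => Gate::Allow,
    }
}
```

`Allow` here only means "not blocked by namespace"; the per-route permission and grant checks still apply afterwards.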
## Security considerations
- **Preserve the plane separation** audited in v2 Phase 1 — `end_user` is login-plane only; it can
never satisfy `validate_viewer_token` or the agent `cak_` path.
- **Deny-by-default, grant-scoped:** empty `user_client_access` for an `end_user` = no access; the
admin-bypass must not apply. Every `/api/enduser/*` call re-checks the grant + tenant server-side
(never trust a machine id from the client).
- **Tenant containment:** an `end_user` and its grants live in one tenant; cross-tenant grants are
rejected at write and re-validated at connect. (Full tenant isolation lands with Phase 4; v1 enforces
via explicit `tenant_id` equality checks.)
- **External-user trust:** these accounts are public-internet-facing from home. Require
rate-limiting + lockout on `/api/enduser/auth/login`; support (and, as recommended, require) **TOTP MFA** for
`end_user` — schema column included so MFA can be v1 or an immediate fast-follow without a second
migration. Argon2id passwords (existing standard).
- **Audit:** log each end-user login (success/failure, source IP) and each machine access to
`connect_session_events`; the unattended access is to the user's *own* machine but must be fully
traceable. Optionally enforce the SPEC-015 overlay per machine policy.
- **Threat model:** stolen end-user creds reach only that user's granted machines (blast radius =
grant set), never the console, never the agent plane, never another tenant. Disabling the account
(`users.enabled=false`) immediately revokes portal + future tokens; the 5-min viewer-token TTL bounds
any in-flight session.
## Testing strategy
- **Unit:** grant-check fn (granted / not-granted / cross-tenant / empty-set-for-end_user = deny);
`has_permission` never elevates `end_user`; role-namespace middleware (end_user→console = 403,
technician→/api/enduser = 403).
- **Integration:** end-user login → list shows only granted machines → connect mints a Control viewer
token for a `source=end_user` session → relay admits; connect to a non-granted / other-tenant machine
→ 403; disabled account → login + token use rejected.
- **Manual:** full portal walkthrough from an off-network browser; MFA enrol + challenge; audit rows
present for login and access; white-label branding renders on the portal.
## Effort estimate & dependencies
- **Size:** Large (new principal + portal + admin grant UI + auth namespace; transport/agent untouched
and the grant table already exists, which holds it below X-Large).
- **Depends on (must precede / strongly preferred):**
- **Tenancy** (`tenants` + `tenant_id`, migration 004) — needed for containment; full isolation is
Phase 4 but v1 uses explicit tenant checks.
- **Stable machine identity + persistent enrollment** (SPEC-004 / 008 `machine_uid`, SPEC-016
zero-touch `cak_`) — end-users reach persistent managed agents.
- **Session-scoped viewer tokens** (v2 Phase 1, landed) — reused directly.
- **Pairs with:** SPEC-014 (white-label — the portal is the client-facing surface), SPEC-003/005
(machine inventory/list — portal machine rows), SPEC-015 (optional connect-notification overlay).
- **Unblocks:** the directory-sync spec (AD/Entra/Google → end-user list), delegated client-admin role,
and per-seat billing — all of which build on the `end_user` principal defined here.
## Open questions
1. **Same console vs separate end-user portal?** Recommendation: **separate, role-gated route**:
smaller attack surface, no risk of leaking technician controls, cleaner white-label. Confirm before
build.
2. **End-users in the existing `users` table (role=end_user) vs a dedicated `end_users` table?**
Recommendation: reuse `users` (the grant FK `user_client_access.user_id` already points there) with
hard role guardrails. Revisit if mixing external + internal principals in one table proves risky.
3. **MFA in v1 or immediate fast-follow?** Schema is included either way; decide enforcement timing.
4. **Who administers grants in v1** — MSP admin only (assumed), or ship the delegated client-company
admin role together? (Affects scope/effort materially.)
5. **Seat/licensing enforcement depth for v1** — count-and-display vs hard-cap vs billing-integrated.
6. **Default access mode** — Control assumed (own machine); should an admin be able to pin a machine to
view-only for a given end-user? (Token layer already supports it.)


@@ -0,0 +1,146 @@
# SPEC-018: Managed-Agent SYSTEM Service Host + Session Broker
**Status:** Proposed
**Priority:** P1 (blocks SPEC-016 Phase B end-to-end runtime and SPEC-013)
**Requested By:** Mike (2026-06-02)
**Estimated Effort:** X-Large
## Overview
Convert the managed/persistent GuruConnect agent from a user-context `HKCU\…\Run` autostart into a
**Windows SYSTEM service** that runs unattended — at the login screen, with no user logged in, across
reboots — and **brokers per-session capture/input worker processes** into the active interactive
desktop. A SYSTEM service lives in the isolated **Session 0** and cannot capture or inject the
interactive desktop directly, so the service spawns a worker into the target user session (the
ScreenConnect architecture).
This is foundational, not cosmetic. It unblocks three things at once:
1. **SPEC-016 Phase B end-to-end runtime** — the per-machine `cak_` store is ACL'd to SYSTEM +
Administrators; today the agent runs as the interactive *user* and can't read its own store (the
Phase B C1 *fail-fast guard* exists precisely because of this). Running as SYSTEM makes the store
readable and removes the guard.
2. **True unattended access** — a user-context agent only runs while that user is logged in. Reaching
a rebooted server or a machine sitting at the login screen (table-stakes for remote support)
requires SYSTEM.
3. **SPEC-013 session selection / backstage** — the session-broker primitive built here is the
substrate SPEC-013's session-switching UX drives.
**Success criteria:** the managed agent installs as an auto-start SYSTEM service; it holds the relay
connection and performs SPEC-016 enrollment as SYSTEM (reading/writing the SYSTEM-ACL'd `cak_`); it
spawns a capture/input worker into the active interactive session and relays frames; the worker is
respawned/retargeted on logon/logoff/console-connect; and the Phase B fail-fast guard is removed
because the store is now readable in-context.
## Background — why this is needed (confirmed in code)
- The persistent agent autostarts via `HKCU\…\Run` (`agent/src/startup.rs:21`, `STARTUP_KEY` = HKCU)
→ interactive-user token, not SYSTEM. The only SYSTEM service today is the separate `sas_service`
(Secure Attention Sequence helper).
- SPEC-016 Phase B (`agent/src/credential_store.rs`) ACLs the `cak_` store to `*S-1-5-18` (SYSTEM) +
`*S-1-5-32-544` (Administrators). In the current user context the agent writes but cannot read it
back → the Phase B fail-fast guard (`agent/src/main.rs` `resolve_agent_credential`) emits
"must run as the GuruConnect SYSTEM service (see SPEC-018)" instead of bricking.
- Capture/input live in the agent process (`agent/src/capture/`, `agent/src/input/`); a Session-0
SYSTEM service cannot drive these against the interactive desktop without a per-session worker.
## Scope
### Included in v1
1. **Windows service install/lifecycle** (`agent/src/install.rs` + a new service module): register the
managed agent as a **LocalSystem auto-start service** (`CreateServiceW` / a service crate),
configure failure/recovery (restart on crash), and **replace the HKCU `Run` autostart for managed
mode** (remove the Run entry on service install). Clean uninstall (stop + delete service).
2. **Service control loop** (Session 0, SYSTEM): owns the persistent WSS connection to the relay,
performs SPEC-016 enrollment as SYSTEM (now able to read/write the `cak_` store), and dispatches
session/connect requests to workers. Handles `SERVICE_CONTROL_STOP`/`SHUTDOWN` and
`SERVICE_CONTROL_SESSIONCHANGE`.
3. **Session broker:** enumerate sessions (`WTSEnumerateSessionsW`), resolve the active interactive
session (`WTSGetActiveConsoleSessionId`), obtain its user token (`WTSQueryUserToken` →
`DuplicateTokenEx`), and spawn a **per-session capture/input worker** into that session's desktop
(`CreateProcessAsUserW`, `winsta0\default`). The worker does DXGI capture + input injection in the
user's session; the service relays frames over the existing transport.
4. **Service ↔ worker IPC:** a local, ACL'd channel (named pipe `\\.\pipe\guruconnect-<sessionId>`)
carrying frames/input/control; pipe ACL restricted to SYSTEM + the target session user.
5. **Session-change handling:** on logon/logoff/console-connect/disconnect/lock/unlock, (re)spawn or
retarget the worker so the active desktop is always the one being served.
6. **Remove the SPEC-016 Phase B fail-fast guard** once the service runs as SYSTEM (the store is
readable in-context); keep the SYSTEM+Administrators ACL.
### Explicitly out of scope (anticipated, separate specs)
- **Session-selection / backstage UX** — the operator-facing picker and Session-0/secure-desktop
command surface are **SPEC-013**; this spec only provides the broker primitive it drives.
- **Login-screen / secure-desktop (winlogon) capture** beyond the broker hook — the hard
Secure-Desktop case is coordinated with SPEC-013; v1 here targets the active interactive session.
- **macOS/Linux service equivalents** — future SPEC-010 (cross-platform agents).
## Architecture
- **Agent splits into two roles:**
- **service-host** (LocalSystem, Session 0): service lifecycle, relay transport, SPEC-016
enrollment + `cak_` store, session broker, IPC server.
- **session-worker** (per interactive session, user token): DXGI/GDI capture, input injection,
IPC client. Spawned by the service via `CreateProcessAsUserW`.
- **Service install** (`install.rs`): `CreateServiceW` with `SERVICE_AUTO_START`, `SERVICE_WIN32_OWN_PROCESS`,
recovery actions; uninstall stops + deletes. Replaces managed-mode `HKCU Run`.
- **Token handoff:** `WTSGetActiveConsoleSessionId` → `WTSQueryUserToken` → `DuplicateTokenEx`
(primary token) → `CreateProcessAsUserW` with `lpDesktop = "winsta0\\default"`.
- **IPC:** named pipe per session, length-prefixed protobuf (reuse `proto/` message types where
sensible), pipe security descriptor granting only SYSTEM + the session user.
- **Session events:** the service registers for `SERVICE_CONTROL_SESSIONCHANGE` and reacts to
`WTS_CONSOLE_CONNECT`, `WTS_SESSION_LOGON/LOGOFF`, `WTS_SESSION_LOCK/UNLOCK`.
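The session-event handling in the last bullet can be modeled as a small state machine that is testable without any Windows API. This is a sketch under assumptions, not the planned implementation: `SessionEvent`, `BrokerAction`, and `Broker` are hypothetical names, the broker serves one session at a time, the most recently connected session wins, and lock is treated as a no-op because secure-desktop capture is out of scope here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionEvent {
    ConsoleConnect,
    Logon,
    Logoff,
    Lock,
    Unlock,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BrokerAction {
    Spawn(u32),
    Stop(u32),
    Retarget { from: u32, to: u32 },
    NoOp,
}

/// Tracks which session currently has a capture/input worker.
pub struct Broker {
    active: Option<u32>,
}

impl Broker {
    pub fn new() -> Self {
        Broker { active: None }
    }

    /// Map one session-change notification to the action the service should take.
    pub fn on_event(&mut self, ev: SessionEvent, session_id: u32) -> BrokerAction {
        let current = self.active;
        match ev {
            SessionEvent::ConsoleConnect | SessionEvent::Logon | SessionEvent::Unlock => {
                match current {
                    Some(cur) if cur == session_id => BrokerAction::NoOp,
                    Some(cur) => {
                        self.active = Some(session_id);
                        BrokerAction::Retarget { from: cur, to: session_id }
                    }
                    None => {
                        self.active = Some(session_id);
                        BrokerAction::Spawn(session_id)
                    }
                }
            }
            SessionEvent::Logoff => {
                if current == Some(session_id) {
                    self.active = None;
                    BrokerAction::Stop(session_id)
                } else {
                    BrokerAction::NoOp
                }
            }
            // Assumption: the worker stays attached while the desktop is locked.
            SessionEvent::Lock => BrokerAction::NoOp,
        }
    }
}
```

In the real service the `Spawn` and `Retarget` actions would drive the token handoff and `CreateProcessAsUserW` call described above; keeping the policy pure makes the logoff→logon and fast-user-switch cases from the testing strategy cheap to cover.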
## Security considerations
- **LocalSystem is maximal privilege** — minimize the service's attack surface; validate every
relay-delivered command; never spawn a worker except into a legitimately-enumerated active session.
- **IPC pipe must be ACL'd** (SYSTEM + the specific session user only) so a non-admin user can't
inject capture/input commands by connecting to the pipe.
- **Token hygiene:** close duplicated tokens promptly; don't leak SYSTEM or user primary tokens.
- The SPEC-016 `cak_` store (SYSTEM-ACL'd) is now correctly readable; the fail-fast guard is removed
but the ACL stays.
- **Audit:** service start/stop, enrollment-as-SYSTEM, worker spawn, session attach/retarget — written
to the existing event pipeline.
## Implementation details
- New service module (e.g. `agent/src/service/{mod.rs, broker.rs, ipc.rs}`); worker entry split out of
the current capture path. New `Commands` variants or an internal `--service`/`--session-worker`
dispatch in `agent/src/main.rs`.
- `install.rs`: service create/recovery/delete; drop the managed-mode HKCU `Run` write.
- `windows` crate features: `Win32_System_Services`, `Win32_System_RemoteDesktop`
(`WTS*`), `Win32_Security`, `Win32_System_Threading` (`CreateProcessAsUserW`),
`Win32_System_Pipes`.
- Remove the `resolve_agent_credential` fail-fast guard branch added in SPEC-016 Phase B.
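For the `ipc.rs` piece, the length-prefixed framing named in the Architecture section could be sketched over any `Read`/`Write` pair, which is how a named-pipe handle would typically be wrapped. Everything here is an assumption for illustration: the function names, the little-endian `u32` prefix, and the 4 MiB cap are not specified by the spec, and the payload is treated as opaque bytes rather than a decoded protobuf message.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on one IPC frame; rejects absurd length prefixes.
pub const MAX_FRAME_LEN: u32 = 4 * 1024 * 1024;

/// Write one frame: 4-byte little-endian length, then the payload bytes.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "payload too large"))?;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload exceeds frame cap"));
    }
    w.write_all(&len.to_le_bytes())?;
    w.write_all(payload)
}

/// Read one frame, validating the length before allocating the buffer.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_le_bytes(header);
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds cap"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Checking the length before allocating matters on this channel in particular: the service side runs as LocalSystem, so a worker (or anything that reaches the pipe) should not be able to make it allocate an arbitrary amount of memory with a forged prefix.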
## Testing strategy
- **Service:** install → auto-start on boot → stop → uninstall on a clean VM.
- **`cak_` end-to-end:** SYSTEM service enrolls (SPEC-016), stores + reads the `cak_`, connects — the
integration test SPEC-016 Phase B currently cannot run.
- **Session broker:** worker spawns into the active session; capture/input work; survives logoff→logon
(respawn) and console-connect (retarget); fast-user-switch retarget.
- **Security:** non-admin cannot connect to the IPC pipe; worker runs with the user's token (not
SYSTEM) in the user's desktop.
## Effort estimate & dependencies
- **Size:** X-Large (service host + worker split + token-handoff + IPC + session-change handling +
install/uninstall).
- **Depends on:** SPEC-016 (enrollment + `cak_` store); the existing capture/input cores.
- **Unblocks:** SPEC-016 Phase B end-to-end runtime (and the parked managed-agent enrollment test on
the internal beta machines); **SPEC-013** (session selection builds on this broker).
## Open questions
1. **Service vs. SYSTEM scheduled task** — a true Windows service (recovery, SCM integration) is the
standard, robust choice; recommend service. Lock in planning.
2. **One multi-session worker vs. one worker per session** — per-session worker is simpler to reason
about and isolates a crash to one session; confirm.
3. **IPC transport** — named pipe (recommended) vs. local TCP/loopback; pipe ACLing is the cleaner
security story.
4. **Login-screen / Secure-Desktop capture** — how much (if any) in this spec vs. deferred to SPEC-013
(it needs a worker in the winlogon/secure desktop, a distinct hard problem).
5. **Migration** — on upgrade, cleanly transition existing HKCU-`Run` managed installs to the service
(remove the Run entry, install the service) without a gap.


@@ -0,0 +1,129 @@
# GuruConnect Audit Report — 2026-05-31
**Auditor:** Claude (claude-opus-4-8[1m])
**Passes:** Security & Remote-Session Integrity (`--pass=security` only)
**Previous audit:** 2026-05-30 (`reports/2026-05-30-gc-audit.md`)
**Scope note:** v2 **Phase-1 EXIT gate** re-audit. Confirms the three relay CRITICALs stay closed and
the prior net-new HIGH is fixed, and assesses the net-new SPEC-004 surface (Tasks 2/4/5 — machine_uid
dedup, session reaping, operator removal) now committed + deployed. Includes **live** boundary tests
against the running production binary, not just a code re-derivation.
**Code under audit:** working tree at tag **v0.3.0 / e967cce** = the binary deployed to prod
172.16.3.30:3002 (deployed this session from 96f9c0a; e967cce adds only the version bump + changelog).
---
## Executive Summary
| Pass | Total | Critical | High | Medium | Low | Info |
|------|-------|----------|------|--------|-----|------|
| Security & Session | 4 | 0 | 0 | 0 | 0 | 4 |
**Phase-1 security EXIT gate: PASS.** The relay/server plane is clean. All three 2026-05-29 CRITICALs
remain CLOSED (verified in code AND live against the deployed server). The prior net-new HIGH (agent
auto-update TLS bypass) and the prior LOW (chat content logged at INFO) are both remediated. The
net-new SPEC-004 surface (operator removal, machine_uid dedup gate, session reaper/supersede) audits
clean with the keyed-identity security invariant intact end-to-end. No net-new findings.
**Requires action:** none.
---
## Live functional verification (deployed binary, 172.16.3.30:3002)
Forged tokens (HS256, real `JWT_SECRET`) exercised the WS auth boundaries directly. Each illegitimate
access was REJECTED (4xx, never a 101 upgrade):
| Check | Result | Proves |
|-------|--------|--------|
| Login-shape JWT on `/ws/viewer` | **401** | Login token not accepted as a viewer token (`purpose=="viewer"` enforced) — CRITICAL #1 |
| Validly-signed viewer token for session AAAA used on session BBBB | **403** | Session binding enforced — a correctly-signed token is refused for the wrong session — CRITICAL #1 |
| Login JWT used as agent `api_key` on `/ws/agent` | **401** | Agent plane rejects JWTs (no JWT branch) — CRITICAL #3 |
| Wrong-signature token on `/ws/viewer` | **401** | Signature validation holds (control) |
The session-bind case is the decisive one: a token that WOULD be accepted for its own session is
rejected 403 for a different session, proving the binding rather than mere signature validation.
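The ordering those four results imply can be sketched as follows. This is a model of the observed behavior, not the code at `relay/mod.rs`: `TokenClaims`, `Reject`, and `admit_viewer` are hypothetical names, signature verification is represented by whether verified claims are present at all, and the blacklist check is omitted.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    Unauthorized,
    Forbidden,
}

impl Reject {
    /// HTTP status returned instead of a 101 upgrade.
    pub fn status(&self) -> u16 {
        match self {
            Reject::Unauthorized => 401,
            Reject::Forbidden => 403,
        }
    }
}

/// Claims from a token whose signature has already been verified.
/// A login-shaped JWT carries neither `purpose` nor `session_id`.
pub struct TokenClaims {
    pub purpose: Option<String>,
    pub session_id: Option<String>,
    pub exp: u64,
}

/// `verified` is `None` when signature validation failed.
pub fn admit_viewer(
    verified: Option<&TokenClaims>,
    requested_session: &str,
    now: u64,
) -> Result<(), Reject> {
    let claims = verified.ok_or(Reject::Unauthorized)?;
    if claims.exp <= now {
        return Err(Reject::Unauthorized);
    }
    if claims.purpose.as_deref() != Some("viewer") {
        return Err(Reject::Unauthorized);
    }
    // Only a genuine viewer token reaches the binding check, which is why a
    // wrong-session token yields 403 rather than 401.
    if claims.session_id.as_deref() != Some(requested_session) {
        return Err(Reject::Forbidden);
    }
    Ok(())
}
```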
---
## The three relay CRITICALs — verdict
| CRITICAL | Verdict | Enforced at |
|----------|---------|-------------|
| #1 any-JWT-joins-any-session | **CLOSED** | mint authz `api/sessions.rs` (is_admin \|\| permission); viewer WS `relay/mod.rs:496` `validate_viewer_token` (sig+expiry+`purpose=="viewer"`); session-bind `relay/mod.rs:527-534` (`claim != requested → 403`) |
| #2 viewer-WS blacklist | **CLOSED** (TTL-bounded residual unchanged) | `relay/mod.rs:509` `token_blacklist.is_revoked` before upgrade. Residual: logout revokes login JWT not minted viewer tokens (5-min TTL) — same tracked MEDIUM, no regression |
| #3 JWT-accepted-as-agent-key | **CLOSED**, fails closed | `relay/mod.rs:417` `validate_agent_api_key` — no JWT branch; only `cak_` (`auth/agent_keys.rs`, SHA-256 vs `connect_agent_keys`, `revoked_at IS NULL`) or deprecated shared key (WARN). Unresolved machine → 503 (`:303`); client `agent_id` overridden by key identity (`:283`) |
Live results match these code paths exactly.
---
## Prior HIGH — FIXED
**Agent auto-update TLS bypass → MITM-RCE: CLOSED.** `agent/src/update.rs:21` `dev_insecure_tls()` is
`cfg!(debug_assertions)` AND env-var gated, so in a release build the `cfg!` evaluates to `false` and the agent
ALWAYS verifies certs. Both `check_for_update` (`:64`) and `download_update` (`:130`) consume it; unit
test `test_dev_insecure_tls_release_is_always_false` (`:362`) asserts the release invariant. No
`danger_accept_invalid_certs(true)` reachable in production. A signed-manifest defense-in-depth TODO is
filed at `install_update` (`:189`) (= tracked task #10, not an exit blocker).
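The double gate described above can be illustrated with a sketch like the one below. It is not the code in `update.rs`: the env-var name `GURUCONNECT_DEV_INSECURE_TLS` and the helper `insecure_tls_allowed` are assumptions introduced here so the invariant can be tested without depending on the build profile.

```rust
/// Pure form of the gate: both conditions must hold.
pub fn insecure_tls_allowed(debug_build: bool, env_opt_in: bool) -> bool {
    debug_build && env_opt_in
}

/// Build-aware wrapper. `cfg!(debug_assertions)` is a compile-time constant,
/// `false` in a default release profile, so the env var alone can never
/// enable the bypass there.
pub fn dev_insecure_tls() -> bool {
    // Hypothetical variable name; the real one is not shown in the report.
    let env_opt_in = std::env::var("GURUCONNECT_DEV_INSECURE_TLS")
        .map(|v| v == "1")
        .unwrap_or(false);
    insecure_tls_allowed(cfg!(debug_assertions), env_opt_in)
}
```

Splitting the pure predicate from the build-aware wrapper is one way to assert the release invariant in a test that itself runs in a debug build.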
---
## Pass 5: Security & Remote-Session Integrity — net-new SPEC-004 surface
### [INFO] Operator removal API (`server/src/api/removal.rs`) — clean, admin-gated
Every removal handler takes the `AdminUser` extractor as its first argument (runs before any DB
mutation): `remove_machine` (`:88`), `remove_session` (`:321`), `bulk_remove_machines` (`:471`).
`AdminUser` (`auth/mod.rs:141`) validates JWT (signature + expiry + blacklist `:97`) then requires
`is_admin()` else 403 (`:146`). Soft-deletes are parameterized + idempotent (`WHERE … AND deleted_at IS
NULL`); bulk bounded (MAX_BATCH 500) with per-id UUID validation + isolated failures; audit
(`db/events.rs:126`) records actor + target + trusted-proxy IP, best-effort (cannot be suppressed by
attacker-controlled input). Removal is admin-role-gated globally (not per-tenant ACL) — same Phase-1
posture as viewer-mint, per-tenant narrowing deferred to SPEC-002 Phase 4. Acceptable by context.
### [INFO] machine_uid dedup security gate — invariant holds
Gate at `relay/mod.rs:352`: `effective_machine_uid = if is_keyed_agent { None } else { claimed }`. The
suppressed value (not the raw claim) flows to `register_agent` and `upsert_machine`. Keyed (`cak_`)
agents take the agent_id-keyed upsert branch and never write/touch an `ON CONFLICT (machine_uid)` row, so
a valid key for machine X cannot repoint machine Y via a claimed uid. An un-keyed uid-spoof can only
match a uid-bearing row — which the keyed connect path never creates; the only residual is a legacy
pre-keying row, and the startup L1 fix (`main.rs:267-288` via `keyed_machine_ids`, fail-closed on query
error) ensures keyed machines are never uid-indexed on restore.
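A compact model of that gate, for readers who want to see why a keyed agent cannot reach the uid-indexed branch. Only the `effective_machine_uid` expression comes from the report; `UpsertKey`, `upsert_key`, and the fallback for an un-keyed agent with no uid are assumptions made for illustration.

```rust
/// Which column the machine upsert would conflict on.
#[derive(Debug, PartialEq, Eq)]
pub enum UpsertKey {
    AgentId(String),
    MachineUid(String),
}

/// The gate quoted in the report: a keyed agent's claimed uid is suppressed.
pub fn effective_machine_uid(is_keyed_agent: bool, claimed: Option<String>) -> Option<String> {
    if is_keyed_agent {
        None
    } else {
        claimed
    }
}

/// `key_agent_id` is the identity resolved from a valid `cak_`, if any.
pub fn upsert_key(
    key_agent_id: Option<&str>,
    claimed_agent_id: &str,
    claimed_uid: Option<String>,
) -> UpsertKey {
    let uid = effective_machine_uid(key_agent_id.is_some(), claimed_uid);
    match (key_agent_id, uid) {
        // Key identity overrides whatever agent_id the client sent.
        (Some(id), _) => UpsertKey::AgentId(id.to_string()),
        (None, Some(uid)) => UpsertKey::MachineUid(uid),
        // Assumption: un-keyed agents without a uid fall back to their claimed id.
        (None, None) => UpsertKey::AgentId(claimed_agent_id.to_string()),
    }
}
```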
### [INFO] Session reaper + same-machine supersede — clean, TOCTOU closed
`reap_stale_persistent` (`:875`) and supersede (`:322`) select under a read lock then re-assert the full
predicate under the write lock via `remove_session_if` (`:755`). Predicate requires
`!is_online && is_persistent && viewers.is_empty()` (+ TTL / same-uid) — an online, viewer-attached, or
support session is never reaped/superseded. Un-keyed uid-spoof blast radius = denial-of-persistence on
an offline same-uid session at worst, never a hijack. Lock order matches `register_agent`; predicate is
synchronous (no await under lock).
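The select-then-re-assert pattern described above can be sketched with a standard-library lock. This is a simplified model, not the relay's code: it uses `std::sync::RwLock` where the server presumably uses an async lock, and `Session`, `Registry`, `is_reapable`, and the `idle_secs` field are illustrative stand-ins; only the predicate's shape and the `remove_session_if` / `reap_stale_persistent` names come from the report.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

pub struct Session {
    pub is_online: bool,
    pub is_persistent: bool,
    pub viewers: Vec<u32>,
    pub idle_secs: u64,
}

/// The full predicate: offline, persistent, no viewers, past the TTL.
pub fn is_reapable(s: &Session, ttl_secs: u64) -> bool {
    !s.is_online && s.is_persistent && s.viewers.is_empty() && s.idle_secs >= ttl_secs
}

pub struct Registry {
    sessions: RwLock<HashMap<u32, Session>>,
}

impl Registry {
    pub fn new(sessions: HashMap<u32, Session>) -> Self {
        Registry { sessions: RwLock::new(sessions) }
    }

    pub fn len(&self) -> usize {
        self.sessions.read().unwrap().len()
    }

    /// Remove `id` only if the predicate still holds under the write lock.
    pub fn remove_session_if(&self, id: u32, pred: impl Fn(&Session) -> bool) -> bool {
        let mut map = self.sessions.write().unwrap();
        let should_remove = map.get(&id).map_or(false, |s| pred(s));
        if should_remove {
            map.remove(&id);
        }
        should_remove
    }

    /// Pick candidates under a read lock, then re-check each one under the
    /// write lock so a session that came online in between is left alone.
    pub fn reap_stale_persistent(&self, ttl_secs: u64) -> Vec<u32> {
        let candidates: Vec<u32> = {
            let map = self.sessions.read().unwrap();
            map.iter()
                .filter(|(_, s)| is_reapable(s, ttl_secs))
                .map(|(id, _)| *id)
                .collect()
        };
        candidates
            .into_iter()
            .filter(|id| self.remove_session_if(*id, |s| is_reapable(s, ttl_secs)))
            .collect()
    }
}
```

The read lock is released before any write lock is taken, so the candidate list is only a hint; correctness rests entirely on the re-check inside `remove_session_if`.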
### [INFO] General posture — confirmed, no regressions
Runtime sqlx parameterized everywhere (no `format!`-built SQL); migrations 008/009 idempotent. Frame
caps: agent 4 MiB / viewer 64 KiB applied before upgrade. Input throttle retained. `/api/auth/login`
rate-limited (`main.rs:397`). `JWT_SECRET` panics if <32 (`main.rs:143`); agent keys SHA-256; Argon2id
passwords; no secret/token/code/PII logged. **Chat content no longer logged** (prior LOW fixed —
`relay/mod.rs:829,1428` now log length only).
---
## Definitive answers
- **(a) Any non-admin removal path?** NO — all three removal handlers gate on `AdminUser` (JWT+blacklist+`is_admin`→403) before any DB mutation.
- **(b) Any uid-spoof that repoints/hijacks another machine's row or session (not just denial)?** NO — keyed identity is authoritative and uid-suppressed across connect → upsert → reattach → startup restore. Worst case for an un-keyed spoof is denial-of-persistence on an offline same-uid session.
- **(c) Any auth-plane bypass (agent↔viewer credential crossover)?** NO — viewer plane requires a `purpose=="viewer"` session-bound minted token; agent plane requires a `cak_`/shared key with no JWT branch. Confirmed in code and live.
---
## Verdict
**Phase-1 security EXIT gate: PASS.** Relay/server plane clean; prior HIGH + LOW remediated; SPEC-004
surface sound with the keyed-identity invariant intact across the connect path, DB upsert, in-memory
reattach, and startup restore. No new CRITICAL/HIGH/MEDIUM/LOW.
**Tracked, deferred-by-design (not exit blockers):**
- Viewer-token logout revocation residual (MEDIUM, TTL-bounded) — `v2-secure-session-core/plan.md`.
- Update-binary signature verification (defense-in-depth, task #10) — TODO at `update.rs:189`.
*Note: only `--pass=security` was run. API-surface, Rust-quality, TypeScript, protocol-integrity,
docs-reconciliation, and CI/CD passes were not executed this run.*


@@ -0,0 +1,159 @@
-- Migration: 010_spec016_enrollment.sql
-- Purpose: SPEC-016 zero-touch per-site agent enrollment — server-side data model.
--
-- Adds the per-site enrollment-key table, a minimal sites table to anchor it,
-- and the machine-side columns the collision-gated self-registration flow needs.
--
-- Two-tier credential model (SPEC-016 §Security): a low-sensitivity, rotatable,
-- per-site ENROLLMENT KEY (the `cek_` secret stored hashed here) gates "may this
-- machine register at all", while the high-sensitivity per-machine `cak_`
-- operating credential (connect_agent_keys, migration 004) is minted on a
-- successful enroll. Compromise of an enrollment key is recovered by rotating one
-- site, not a fleet-wide re-key.
--
-- DEVIATION FROM SPEC (documented): SPEC-016 §DB-migration describes
-- `site_enrollment_keys.site_id` as `fk -> sites`, assuming a sites table already
-- exists. It does NOT — in the current schema "site" and "company/organization" are
-- free-text columns on connect_machines (migration 005), there is no relational
-- sites entity. This migration therefore CREATES a minimal `connect_sites` table
-- (the relational anchor the enrollment-key FK and the dashboard per-site key
-- display both require) keyed by a natural `site_code` and scoped per-tenant. It is
-- intentionally minimal (code + display name + tenant); richer site/company
-- modeling is left to future work. The free-text connect_machines.site /
-- .organization columns are untouched and continue to carry agent-reported labels.
--
-- Idempotent: CREATE TABLE/INDEX IF NOT EXISTS, ADD COLUMN IF NOT EXISTS. Applied on
-- server startup by sqlx::migrate!(); never pre-applied via psql. Ordered after 009.
-- See .claude/standards/gururmm/sqlx-migrations.md.
-- pgcrypto provides gen_random_uuid(); enabled in 001/004 but re-asserted for safety.
CREATE EXTENSION IF NOT EXISTS "pgcrypto";
-- ============================================================================
-- connect_sites — relational anchor for per-site enrollment (see DEVIATION above)
-- ============================================================================
-- A site is the unit a single signed installer targets. `site_code` is the
-- non-secret, operator-facing identifier the installer carries and the agent sends
-- at /api/enroll (e.g. "ACME-PHX"). Uniqueness is per-tenant: the same human-chosen
-- code may legitimately exist in two tenants. tenant_id mirrors the nullable,
-- default-tenant-backfilled tenancy column used on every other scoped table
-- (migration 004); db::tenancy::current_tenant_id() resolves it for now.
CREATE TABLE IF NOT EXISTS connect_sites (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Operator-facing site identifier the installer carries. Non-secret.
site_code TEXT NOT NULL,
-- Human-readable site / company display name for the dashboard.
display_name TEXT,
-- Default company label applied to machines enrolled at this site (mirrors the
-- free-text connect_machines.organization the agent otherwise self-reports).
company TEXT,
-- Tenancy-ready (Phase 4). Backfilled to the default tenant below.
tenant_id UUID,
-- RESERVED for future per-site enrollment POLICY work (SPEC-016 §out-of-scope):
-- default 'auto-approve'; a future 'pending-approval' value will gate new
-- enrollments. NOT enforced in Phase A — present so the policy SPEC needs no
-- schema change. Do not branch on this column yet.
enrollment_policy TEXT DEFAULT 'auto-approve',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Per-tenant uniqueness of the natural site_code so /api/enroll can resolve a site
-- deterministically within a tenant while the same code may exist across tenants.
-- COALESCE keeps the index usable while tenant_id is still nullable (Phase 1).
CREATE UNIQUE INDEX IF NOT EXISTS idx_connect_sites_tenant_code
ON connect_sites (COALESCE(tenant_id, '00000000-0000-0000-0000-000000000001'::uuid), site_code);
-- Backfill the sites tenant_id to the default tenant (table is empty on a fresh DB;
-- no-op there, but keeps the migration self-consistent).
UPDATE connect_sites
SET tenant_id = '00000000-0000-0000-0000-000000000001'
WHERE tenant_id IS NULL;
-- ============================================================================
-- site_enrollment_keys — rotatable, hashed per-site enrollment secret + fingerprint
-- ============================================================================
-- Stores ONLY the Argon2id hash of the `cek_` secret; the plaintext is shown once
-- at issue/rotate and never recoverable. `version` is the monotonic rotation
-- counter; `fingerprint` is the non-secret short hex shown as `vN (XXXX)` in the
-- dashboard and baked into the installer filename. `active` marks the current key —
-- rotation flips the old key to active=false (blocking NEW enrollments from old
-- installers) and inserts a new active row; already-enrolled agents holding their
-- own `cak_` are unaffected. Multiple inactive (historical) rows may coexist per
-- site; at most one active row is intended (enforced by a partial unique index).
CREATE TABLE IF NOT EXISTS site_enrollment_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
site_id UUID NOT NULL REFERENCES connect_sites(id) ON DELETE CASCADE,
-- Argon2id hash of the `cek_` enrollment secret. Never the plaintext.
key_hash TEXT NOT NULL,
-- Monotonic rotation version (1, 2, 3, ...).
version INTEGER NOT NULL,
-- Non-secret short hex fingerprint code (the XXXX in `vN (XXXX)`), derived from
-- the secret. Stored so the dashboard / GET endpoint can show it without the
-- secret.
fingerprint TEXT NOT NULL,
active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Set when this key is rotated out (active flipped to false).
rotated_at TIMESTAMPTZ
);
-- Lookup index for the enroll hot path: resolve the active key for a site.
CREATE INDEX IF NOT EXISTS idx_site_enrollment_keys_site_active
ON site_enrollment_keys (site_id, active);
-- At most one ACTIVE enrollment key per site (the "current" installer key).
-- Partial unique index so any number of inactive historical rows may coexist.
CREATE UNIQUE INDEX IF NOT EXISTS idx_site_enrollment_keys_one_active
ON site_enrollment_keys (site_id)
WHERE active;
-- ============================================================================
-- connect_machines — site binding + enrollment-state collision gate
-- ============================================================================
-- machine_uid already exists (migration 008) with a partial UNIQUE index on
-- (machine_uid) WHERE machine_uid IS NOT NULL. SPEC-016 §item-1 / resolved-decision #4
-- call for the dedup key to be PER-TENANT — (tenant_id, machine_uid) — so the same
-- hardware legitimately present in two tenants stays two rows. tenant_id is the
-- scoping column that exists on connect_machines (migration 004); machines have no
-- direct site_id today, so site is tracked separately (site_id below) and tenancy is
-- the uniqueness scope, exactly as the spec states.
--
-- CRITICAL CONSTRAINT (why we ADD rather than REPLACE the 008 index here):
-- db::machines::upsert_machine (the live connect-path upsert) uses
-- `ON CONFLICT (machine_uid) WHERE machine_uid IS NOT NULL` as its conflict arbiter.
-- Postgres matches that arbiter to the EXACT index from migration 008. Dropping that
-- index would make the live upsert fail to find an arbiter and error at runtime —
-- breaking every un-keyed agent reconnect. So migration 008's global index is LEFT
-- IN PLACE (the connect path keeps working unchanged) and the per-tenant index is
-- added ALONGSIDE it. In single-tenant Phase 1 the two are equivalent (every row's
-- tenant_id is the default tenant), so the per-tenant index adds the SPEC-016 dedup
-- semantics without a redundant-uniqueness conflict: a (tenant, uid) pair that is
-- unique is also globally unique today. When multi-tenancy activates AND
-- upsert_machine's ON CONFLICT is updated to name (tenant_id, machine_uid), a future
-- migration drops the global 008 index. Documented as deferred; do not drop it now.
-- Optional FK to the site a machine enrolled under (NULL for legacy / support-code
-- machines that never enrolled through /api/enroll). A site change on re-enroll is
-- the "site move" SPEC-016 audits.
ALTER TABLE connect_machines ADD COLUMN IF NOT EXISTS site_id UUID REFERENCES connect_sites(id) ON DELETE SET NULL;
-- enrollment_state: the collision gate (SPEC-016 §item-1/6). 'active' = live and
-- controllable (auto-approve posture); 'pending' = a machine_uid collision was
-- detected at enroll and an operator must confirm in the dashboard before the
-- endpoint may be controlled. Default 'active' so every legacy/connect-path row is
-- unaffected.
ALTER TABLE connect_machines
ADD COLUMN IF NOT EXISTS enrollment_state TEXT NOT NULL DEFAULT 'active'
CHECK (enrollment_state IN ('active', 'pending'));
-- Per-tenant machine_uid uniqueness (SPEC-016). Added ALONGSIDE migration 008's
-- global (machine_uid) index (see CRITICAL CONSTRAINT above — the connect-path
-- upsert's ON CONFLICT arbiter binds to the 008 index, which must survive). COALESCE
-- folds a NULL tenant_id to the default tenant so the index is well-defined while
-- tenancy is single-tenant (Phase 1); the WHERE clause excludes NULL machine_uid so
-- legacy un-keyed rows coexist freely.
CREATE UNIQUE INDEX IF NOT EXISTS idx_connect_machines_tenant_machine_uid
ON connect_machines (COALESCE(tenant_id, '00000000-0000-0000-0000-000000000001'::uuid), machine_uid)
WHERE machine_uid IS NOT NULL;
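The effect of the `COALESCE` in the per-tenant index can be modelled in a few lines of Rust. This is an in-memory illustration only, assuming a `HashSet` as a stand-in for the unique index; `index_key` and `try_insert` are hypothetical names. The point it shows: a plain unique index treats NULLs as distinct by default in Postgres, so without the fold two NULL-tenant rows with the same `machine_uid` would both be accepted.

```rust
use std::collections::HashSet;

/// The default tenant UUID used by the migration's COALESCE.
const DEFAULT_TENANT: &str = "00000000-0000-0000-0000-000000000001";

/// Model of the index expression: NULL tenant folds to the default tenant,
/// and rows with a NULL machine_uid are excluded (the partial WHERE clause).
fn index_key(tenant_id: Option<&str>, machine_uid: Option<&str>) -> Option<(String, String)> {
    let uid = machine_uid?;
    Some((tenant_id.unwrap_or(DEFAULT_TENANT).to_string(), uid.to_string()))
}

/// Returns false when the row would violate the modelled unique index.
fn try_insert(
    index: &mut HashSet<(String, String)>,
    tenant_id: Option<&str>,
    machine_uid: Option<&str>,
) -> bool {
    match index_key(tenant_id, machine_uid) {
        Some(key) => index.insert(key),
        // Rows without a machine_uid are not indexed, so they never conflict.
        None => true,
    }
}
```

In this model a NULL-tenant row and a default-tenant row with the same uid collide, a second tenant may reuse the uid, and any number of uid-less legacy rows coexist.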

server/src/api/enroll.rs (normal file, 1008 changed lines; diff suppressed because it is too large)

@@ -4,10 +4,12 @@ pub mod auth;
 pub mod auth_logout;
 pub mod changelog;
 pub mod downloads;
+pub mod enroll;
 pub mod machine_keys;
 pub mod releases;
 pub mod removal;
 pub mod sessions;
+pub mod sites;
 pub mod users;
 use axum::{

server/src/api/sites.rs (normal file, 217 changed lines)

@@ -0,0 +1,217 @@
//! Site enrollment-key administration (SPEC-016, admin plane).
//!
//! Admin (dashboard JWT + admin role) endpoints for the per-site enrollment key
//! the dashboard surfaces and rotates:
//!
//! - `POST /api/sites/:id/enrollment-key/rotate` — regenerate the `cek_` secret,
//! bump the monotonic version, derive a new fingerprint, deactivate the prior
//! active key, and return the plaintext + fingerprint ONCE. Old installers can no
//! longer enroll NEW machines after this; already-enrolled agents (holding their
//! own `cak_`) are unaffected (SPEC-016 success-criterion #3). Doubles as
//! first-issue when a site has no key yet.
//! - `GET /api/sites/:id/enrollment-key` — read the CURRENT non-secret fingerprint
//! + version (never the secret). 404 if the site has no active key yet.
//!
//! Auth mirrors `api::machine_keys`: the [`crate::auth::AdminUser`] extractor gates
//! both routes, and they are mounted behind the JWT `auth_layer`.
//!
//! SECURITY: the plaintext `cek_` is returned exactly once (rotate response),
//! never persisted in plaintext and never logged. Read responses expose only the
//! version + fingerprint.
use axum::{
extract::{Path, State},
http::StatusCode,
Json,
};
use serde::Serialize;
use uuid::Uuid;
use crate::auth::{enrollment_keys, AdminUser};
use crate::db;
use crate::AppState;
/// Standard error envelope (matches `api::machine_keys::ApiError`).
#[derive(Debug, Serialize)]
pub struct ApiError {
pub detail: String,
pub error_code: String,
pub status_code: u16,
}
impl ApiError {
fn new(status: StatusCode, code: &str, detail: &str) -> (StatusCode, Json<ApiError>) {
(
status,
Json(ApiError {
detail: detail.to_string(),
error_code: code.to_string(),
status_code: status.as_u16(),
}),
)
}
}
type ApiResult<T> = Result<T, (StatusCode, Json<ApiError>)>;
/// Response for a freshly rotated/issued enrollment key. `key` is present ONLY
/// here, once.
#[derive(Debug, Serialize)]
pub struct RotatedEnrollmentKey {
pub site_id: Uuid,
/// The plaintext `cek_` enrollment key. Shown exactly once — bake it into the
/// site installer now; the server keeps only its hash.
pub key: String,
/// Monotonic rotation version.
pub version: i32,
/// The non-secret short hex code (the `XXXX` in `vN (XXXX)`).
pub fingerprint: String,
/// Fully rendered operator-facing fingerprint, e.g. `v3 (7F2A)`.
pub fingerprint_label: String,
}
/// Non-secret current-key view for the GET endpoint.
#[derive(Debug, Serialize)]
pub struct EnrollmentKeyView {
pub site_id: Uuid,
pub version: i32,
pub fingerprint: String,
pub fingerprint_label: String,
pub active: bool,
}
fn require_db(state: &AppState) -> ApiResult<&db::Database> {
state.db.as_ref().ok_or_else(|| {
ApiError::new(
StatusCode::SERVICE_UNAVAILABLE,
"DATABASE_UNAVAILABLE",
"Database not available",
)
})
}
/// Resolve a site by its UUID path segment, or a 404 envelope.
async fn resolve_site(db: &db::Database, site_id: Uuid) -> ApiResult<db::sites::Site> {
db::sites::get_site_by_id(db.pool(), site_id)
.await
.map_err(|e| {
tracing::error!("DB error resolving site: {}", e);
ApiError::new(
StatusCode::INTERNAL_SERVER_ERROR,
"INTERNAL_ERROR",
"Internal server error",
)
})?
.ok_or_else(|| ApiError::new(StatusCode::NOT_FOUND, "SITE_NOT_FOUND", "Site not found"))
}
/// POST /api/sites/:id/enrollment-key/rotate — rotate (or first-issue) a site's
/// enrollment key. Returns the plaintext `cek_` + fingerprint once.
pub async fn rotate_enrollment_key(
AdminUser(admin): AdminUser,
State(state): State<AppState>,
Path(site_id): Path<Uuid>,
) -> ApiResult<(StatusCode, Json<RotatedEnrollmentKey>)> {
let db = require_db(&state)?;
let site = resolve_site(db, site_id).await?;
// Mint plaintext + Argon2id hash + fingerprint. Only the hash + fingerprint
// are persisted; the plaintext is surfaced once below.
let plaintext = enrollment_keys::generate_enrollment_key();
let key_hash = enrollment_keys::hash_enrollment_key(&plaintext).map_err(|e| {
tracing::error!("Failed to hash enrollment key: {}", e);
ApiError::new(
StatusCode::INTERNAL_SERVER_ERROR,
"INTERNAL_ERROR",
"Failed to hash enrollment key",
)
})?;
let fingerprint = enrollment_keys::compute_fingerprint(&plaintext);
let new_key = db::enrollment_keys::rotate_key(db.pool(), site.id, &key_hash, &fingerprint)
.await
.map_err(|e| {
tracing::error!("DB error rotating enrollment key: {}", e);
ApiError::new(
StatusCode::INTERNAL_SERVER_ERROR,
"INTERNAL_ERROR",
"Failed to rotate enrollment key",
)
})?;
let fingerprint_label =
enrollment_keys::render_fingerprint(new_key.version, &new_key.fingerprint);
// Audit WITHOUT key material (no plaintext, no hash).
if let Err(e) = db::events::log_enrollment_event(
db.pool(),
db::events::EventTypes::ENROLLMENT_KEY_ROTATED,
serde_json::json!({
"site_id": site.id,
"site_code": site.site_code,
"version": new_key.version,
"fingerprint": new_key.fingerprint,
"rotated_by": admin.username,
}),
None,
)
.await
{
tracing::warn!("[ENROLL] failed to write key-rotate audit event: {}", e);
}
tracing::info!(
"Admin {} rotated enrollment key for site {} to {}",
admin.username,
site.site_code,
fingerprint_label
);
Ok((
StatusCode::CREATED,
Json(RotatedEnrollmentKey {
site_id: site.id,
key: plaintext,
version: new_key.version,
fingerprint: new_key.fingerprint,
fingerprint_label,
}),
))
}
/// GET /api/sites/:id/enrollment-key — current non-secret fingerprint + version.
pub async fn get_enrollment_key(
AdminUser(_admin): AdminUser,
State(state): State<AppState>,
Path(site_id): Path<Uuid>,
) -> ApiResult<Json<EnrollmentKeyView>> {
let db = require_db(&state)?;
let site = resolve_site(db, site_id).await?;
let key = db::enrollment_keys::get_active_for_site(db.pool(), site.id)
.await
.map_err(|e| {
tracing::error!("DB error loading enrollment key: {}", e);
ApiError::new(
StatusCode::INTERNAL_SERVER_ERROR,
"INTERNAL_ERROR",
"Internal server error",
)
})?
.ok_or_else(|| {
ApiError::new(
StatusCode::NOT_FOUND,
"NO_ENROLLMENT_KEY",
"Site has no active enrollment key",
)
})?;
let fingerprint_label = enrollment_keys::render_fingerprint(key.version, &key.fingerprint);
Ok(Json(EnrollmentKeyView {
site_id: site.id,
version: key.version,
fingerprint: key.fingerprint,
fingerprint_label,
active: key.active,
}))
}


@@ -0,0 +1,191 @@
//! Per-site enrollment key minting, hashing, verification, and fingerprinting
//! (SPEC-016 zero-touch enrollment, auth layer).
//!
//! This is the low-sensitivity, rotatable side of the two-tier credential model
//! (SPEC-016 §Security). A per-site ENROLLMENT key (`cek_` prefix) gates "may
//! this machine register at all" at `POST /api/enroll`; a successful enroll mints
//! the high-sensitivity per-machine `cak_` operating credential
//! ([`crate::auth::agent_keys`]). Compromise of an enrollment key is contained to
//! one site and recovered by rotating it.
//!
//! Lifecycle owned here (the secret side):
//!
//! - [`generate_enrollment_key`] mints a high-entropy, `cek_`-prefixed plaintext
//! secret. Mirrors [`crate::auth::agent_keys::generate_agent_key`]'s entropy
//! approach (32 random bytes from the OS CSPRNG, hex-encoded) with a DISTINCT
//! prefix so the two key kinds are never confused in logs or storage. The
//! plaintext is shown to the operator exactly once at issue/rotate and is NEVER
//! persisted or logged.
//! - [`hash_enrollment_key`] / [`verify_enrollment_key`] use **Argon2id** (via
//! [`crate::auth::password`]). This DIFFERS from `cak_` (which uses SHA-256 for
//! a constant-shape equality lookup): SPEC-016 §2 explicitly requires the
//! enrollment key be "stored hashed (Argon2id, same as `cak_`/passwords)". The
//! trade-off is deliberate — enrollment keys are looked up by `(site, active)`
//! first (a small candidate set, usually one row) and only then verified, so the
//! per-verify KDF cost is bounded and not on a high-QPS path, while Argon2id
//! gives salted, GPU-resistant storage matching the password posture.
//! - [`compute_fingerprint`] derives the non-secret short HEX code shown as
//! `vN (XXXX)` (SPEC-016 resolved-decision #3 — hex, deliberately NOT the
//! GuruRMM word-style code, so the two products' artifacts are never visually
//! conflated).
//!
//! SECURITY: never log a plaintext key or its hash. Functions here return the
//! plaintext to the caller (issue/rotate endpoint) but emit no `tracing` output
//! containing key material.
use anyhow::Result;
use rand::RngCore;
use ring::digest;
/// Prefix marking a GuruConnect per-site enrollment key. Distinct from the
/// per-agent `cak_` prefix so the two key kinds are never confused.
pub const ENROLLMENT_KEY_PREFIX: &str = "cek_";
/// Number of random bytes behind an enrollment key (256 bits of entropy), matching
/// [`crate::auth::agent_keys`]. SPEC-016 §2 requires ≥256-bit.
const ENROLLMENT_KEY_RANDOM_BYTES: usize = 32;
/// Number of hex characters in the fingerprint code (the `XXXX` in `vN (XXXX)`).
/// Four hex chars = 16 bits — ample to let an operator tell two installers apart at
/// a glance; it is a non-secret display aid, not a security control.
const FINGERPRINT_HEX_LEN: usize = 4;
/// Generate a new high-entropy, `cek_`-prefixed per-site enrollment key (plaintext).
///
/// The returned string is the ONLY time the plaintext exists; the caller must
/// surface it to the operator once and store only [`hash_enrollment_key`] of it.
/// Uses the OS CSPRNG via `rand::rngs::OsRng`.
pub fn generate_enrollment_key() -> String {
let mut bytes = [0u8; ENROLLMENT_KEY_RANDOM_BYTES];
rand::rngs::OsRng.fill_bytes(&mut bytes);
format!("{}{}", ENROLLMENT_KEY_PREFIX, hex_encode(&bytes))
}
/// Hash an enrollment key for storage using Argon2id (SPEC-016 §2).
///
/// Delegates to [`crate::auth::password::hash_password`] so the KDF parameters and
/// salt generation match the password posture exactly. Returns the PHC-format
/// string Postgres stores in `site_enrollment_keys.key_hash`.
pub fn hash_enrollment_key(plaintext: &str) -> Result<String> {
crate::auth::password::hash_password(plaintext)
}
/// Verify a presented enrollment key against a stored Argon2id hash.
///
/// Returns `true` only on a match. A mismatch, a malformed stored hash, or any
/// error from the underlying verifier collapses to `false`, so the caller treats
/// `false` as a rejection. A cheap structural reject (`cek_` prefix) runs first to
/// skip the KDF on obviously-bogus input.
///
/// SECURITY: only compares; never logs the presented key or the hash.
pub fn verify_enrollment_key(presented: &str, stored_hash: &str) -> bool {
if !presented.starts_with(ENROLLMENT_KEY_PREFIX) {
return false;
}
crate::auth::password::verify_password(presented, stored_hash).unwrap_or(false)
}
/// Compute the non-secret short HEX fingerprint code for an enrollment key.
///
/// Derived as the first [`FINGERPRINT_HEX_LEN`] hex chars of the SHA-256 of the
/// plaintext secret, uppercased. This is a stable, non-reversible tag of the secret
/// (knowing the code does not reveal the key) used purely for display. Pair it with
/// the monotonic version via [`render_fingerprint`].
pub fn compute_fingerprint(plaintext: &str) -> String {
let d = digest::digest(&digest::SHA256, plaintext.as_bytes());
let hex = hex_encode(d.as_ref());
hex[..FINGERPRINT_HEX_LEN].to_ascii_uppercase()
}
/// Render the operator-facing fingerprint string `vN (XXXX)` (SPEC-016 §2).
///
/// `version` is the monotonic rotation counter; `code` is [`compute_fingerprint`].
/// Example: `render_fingerprint(3, "7F2A")` -> `"v3 (7F2A)"`.
pub fn render_fingerprint(version: i32, code: &str) -> String {
format!("v{} ({})", version, code)
}
/// Lowercase hex encoding without pulling in the `hex` crate (mirrors
/// [`crate::auth::agent_keys`]).
fn hex_encode(bytes: &[u8]) -> String {
use std::fmt::Write;
let mut s = String::with_capacity(bytes.len() * 2);
for b in bytes {
let _ = write!(s, "{:02x}", b);
}
s
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn generated_key_is_prefixed_and_high_entropy() {
let key = generate_enrollment_key();
assert!(key.starts_with(ENROLLMENT_KEY_PREFIX));
assert_eq!(
key.len(),
ENROLLMENT_KEY_PREFIX.len() + ENROLLMENT_KEY_RANDOM_BYTES * 2
);
}
#[test]
fn generated_keys_are_unique() {
assert_ne!(generate_enrollment_key(), generate_enrollment_key());
}
#[test]
fn hash_and_verify_roundtrip() {
let key = generate_enrollment_key();
let hash = hash_enrollment_key(&key).expect("hash");
assert!(verify_enrollment_key(&key, &hash));
}
#[test]
fn verify_rejects_wrong_key() {
let key = generate_enrollment_key();
let other = generate_enrollment_key();
let hash = hash_enrollment_key(&key).expect("hash");
assert!(!verify_enrollment_key(&other, &hash));
}
#[test]
fn verify_rejects_unprefixed_input_without_touching_kdf() {
let key = generate_enrollment_key();
let hash = hash_enrollment_key(&key).expect("hash");
// A value lacking the cek_ prefix is structurally rejected before the KDF.
assert!(!verify_enrollment_key("not-a-key", &hash));
}
#[test]
fn verify_rejects_malformed_stored_hash() {
let key = generate_enrollment_key();
// A garbage stored hash must not panic and must reject.
assert!(!verify_enrollment_key(&key, "not-a-phc-hash"));
}
#[test]
fn fingerprint_is_stable_uppercase_hex_of_expected_len() {
let key = "cek_deadbeef";
let f1 = compute_fingerprint(key);
let f2 = compute_fingerprint(key);
assert_eq!(f1, f2);
assert_eq!(f1.len(), FINGERPRINT_HEX_LEN);
assert!(f1.chars().all(|c| c.is_ascii_hexdigit()));
assert_eq!(f1, f1.to_ascii_uppercase());
}
#[test]
fn fingerprint_differs_per_key() {
assert_ne!(
compute_fingerprint("cek_aaa"),
compute_fingerprint("cek_bbb")
);
}
#[test]
fn render_fingerprint_matches_spec_shape() {
assert_eq!(render_fingerprint(3, "7F2A"), "v3 (7F2A)");
}
}


@@ -4,6 +4,7 @@
 //! validation for agents.
 pub mod agent_keys;
+pub mod enrollment_keys;
 pub mod jwt;
 pub mod password;
 pub mod token_blacklist;


@@ -0,0 +1,141 @@
//! Per-site enrollment key database operations (SPEC-016 zero-touch enrollment).
//!
//! Backs the `site_enrollment_keys` table (migration 010). Stores ONLY the
//! Argon2id hash of the `cek_` secret plus the non-secret rotation metadata
//! (version, fingerprint, active flag). Computing the hash and minting the
//! plaintext is [`crate::auth::enrollment_keys`]'s job; this module is
//! hash-agnostic persistence and takes already-hashed values.
//!
//! Rotation invariant: at most one `active` row per site (enforced by a partial
//! unique index in migration 010). [`rotate_key`] deactivates the current active
//! row and inserts a new active one inside a single transaction so the invariant
//! is never transiently violated.
//!
//! All queries use runtime `sqlx::query()` / `sqlx::query_as()` per the codebase
//! convention (no compile-time `query!` macros, no `.sqlx` offline cache).
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use uuid::Uuid;
/// Per-site enrollment key record.
///
/// `key_hash` is the only representation of the secret the server stores; the
/// plaintext is shown once at issue/rotate and never persisted.
#[derive(Debug, Clone, Serialize, Deserialize, sqlx::FromRow)]
pub struct EnrollmentKey {
pub id: Uuid,
pub site_id: Uuid,
pub key_hash: String,
pub version: i32,
pub fingerprint: String,
pub active: bool,
pub created_at: DateTime<Utc>,
pub rotated_at: Option<DateTime<Utc>>,
}
/// Fetch the active enrollment key for a site, if any.
///
/// This is the `/api/enroll` hot path: resolve the one active key whose hash the
/// presented `cek_` is verified against. The partial unique index guarantees at
/// most one active row, so `fetch_optional` is correct.
pub async fn get_active_for_site(
pool: &PgPool,
site_id: Uuid,
) -> Result<Option<EnrollmentKey>, sqlx::Error> {
sqlx::query_as::<_, EnrollmentKey>(
r#"
SELECT id, site_id, key_hash, version, fingerprint, active, created_at, rotated_at
FROM site_enrollment_keys
WHERE site_id = $1 AND active
"#,
)
.bind(site_id)
.fetch_optional(pool)
.await
}
/// Insert the FIRST enrollment key for a site at version 1 (initial issue).
///
/// Use [`rotate_key`] for subsequent rotations. Errors with a unique violation if
/// the site already has an active key (the caller should rotate instead).
#[allow(dead_code)] // Wired by site-admin issue flow; Phase A exposes rotation (which also covers first issue when none exists).
pub async fn insert_initial_key(
pool: &PgPool,
site_id: Uuid,
key_hash: &str,
fingerprint: &str,
) -> Result<EnrollmentKey, sqlx::Error> {
sqlx::query_as::<_, EnrollmentKey>(
r#"
INSERT INTO site_enrollment_keys (site_id, key_hash, version, fingerprint, active)
VALUES ($1, $2, 1, $3, true)
RETURNING id, site_id, key_hash, version, fingerprint, active, created_at, rotated_at
"#,
)
.bind(site_id)
.bind(key_hash)
.bind(fingerprint)
.fetch_one(pool)
.await
}
/// Rotate a site's enrollment key (SPEC-016 §2): deactivate the current active key
/// (if any) and insert a new active key at the next monotonic version, all in one
/// transaction.
///
/// Returns the newly-created active key. If the site has no key yet, this issues
/// version 1 (so rotation also serves as first-issue). The caller passes the
/// already-hashed new secret and its fingerprint; the plaintext is surfaced once by
/// the caller and never reaches this layer.
///
/// The transaction is what keeps the "at most one active key per site" invariant
/// (partial unique index) from being transiently violated between the UPDATE and
/// the INSERT.
pub async fn rotate_key(
pool: &PgPool,
site_id: Uuid,
new_key_hash: &str,
new_fingerprint: &str,
) -> Result<EnrollmentKey, sqlx::Error> {
let mut tx = pool.begin().await?;
// Highest existing version for this site (NULL -> 0 so the first key is v1).
let current_max: Option<i32> =
sqlx::query_scalar("SELECT MAX(version) FROM site_enrollment_keys WHERE site_id = $1")
.bind(site_id)
.fetch_one(&mut *tx)
.await?;
let next_version = current_max.unwrap_or(0) + 1;
// Deactivate the current active key (if any), stamping rotated_at.
sqlx::query(
r#"
UPDATE site_enrollment_keys
SET active = false, rotated_at = NOW()
WHERE site_id = $1 AND active
"#,
)
.bind(site_id)
.execute(&mut *tx)
.await?;
// Insert the new active key at the next version.
let new_key = sqlx::query_as::<_, EnrollmentKey>(
r#"
INSERT INTO site_enrollment_keys (site_id, key_hash, version, fingerprint, active)
VALUES ($1, $2, $3, $4, true)
RETURNING id, site_id, key_hash, version, fingerprint, active, created_at, rotated_at
"#,
)
.bind(site_id)
.bind(new_key_hash)
.bind(next_version)
.bind(new_fingerprint)
.fetch_one(&mut *tx)
.await?;
tx.commit().await?;
Ok(new_key)
}
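The rotation invariant that `rotate_key` maintains can be modelled in memory. This is a sketch under stated assumptions, not a replacement for the transactional SQL above: `KeyRow`, `rotate`, and `active_count` are hypothetical names, a `Vec` stands in for the `site_enrollment_keys` table, and `u32` site ids replace UUIDs.

```rust
/// Hypothetical in-memory stand-in for a `site_enrollment_keys` row.
#[derive(Debug, Clone, PartialEq)]
struct KeyRow {
    site_id: u32,
    version: i32,
    fingerprint: String,
    active: bool,
}

/// Deactivate the site's active row (if any) and append a new active row at
/// the next monotonic version. With no prior rows this issues version 1.
fn rotate(rows: &mut Vec<KeyRow>, site_id: u32, new_fingerprint: &str) -> KeyRow {
    // Equivalent of MAX(version) with NULL mapped to 0.
    let next_version = rows
        .iter()
        .filter(|r| r.site_id == site_id)
        .map(|r| r.version)
        .max()
        .unwrap_or(0)
        + 1;
    for r in rows.iter_mut().filter(|r| r.site_id == site_id && r.active) {
        r.active = false;
    }
    let new_row = KeyRow {
        site_id,
        version: next_version,
        fingerprint: new_fingerprint.to_string(),
        active: true,
    };
    rows.push(new_row.clone());
    new_row
}

/// Number of active rows for a site; the invariant is that this is at most 1.
fn active_count(rows: &[KeyRow], site_id: u32) -> usize {
    rows.iter().filter(|r| r.site_id == site_id && r.active).count()
}
```

The model is single-threaded, so it says nothing about concurrent rotations; in the real table the partial unique index is what rejects a second active row for the same site.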


@@ -69,6 +69,40 @@ impl EventTypes {
pub const MACHINE_REMOVED: &'static str = "machine_removed";
/// An administrator soft-deleted (purged) a session and dropped it in-memory.
pub const SESSION_REMOVED: &'static str = "session_removed";
// Zero-touch enrollment events (SPEC-016). Written by POST /api/enroll and the
// site enrollment-key rotation endpoint. These carry no session, so they are
// logged via `log_enrollment_event` with `session_id = NULL`; the structured
// detail (machine_uid, site_code, fingerprint, etc.) goes in `details` and the
// source IP in `ip_address`.
/// A new machine self-registered at a site and was minted its first `cak_`.
pub const ENROLL_NEW: &'static str = "enroll_new";
/// An existing machine_uid re-enrolled at the SAME site — the row was reused and
/// a fresh `cak_` minted (re-image / re-install).
pub const ENROLL_REUSE: &'static str = "enroll_reuse";
/// An existing machine_uid enrolled under a DIFFERENT site — the machine's site
/// binding was updated (a "site move"). Fires an alert.
///
/// NOTE (SPEC-016 Phase A): the unauthenticated enroll path does NOT perform this
/// move — a cross-site enroll is REFUSED (`ENROLL_SITE_CONFLICT`) rather than
/// silently repointing the machine. This event is reserved for the deliberate
/// Phase-B `--reassign` flow (and the dashboard move action) that supersede it.
#[allow(dead_code)] // reserved for Phase-B --reassign; not emitted by Phase A enroll
pub const ENROLL_SITE_MOVE: &'static str = "enroll_site_move";
/// An existing machine_uid presented a valid key for a DIFFERENT site than the one
/// the machine is currently bound to. Phase A REFUSES this (no move, no key minted)
/// as the accidental-move / cross-site-hijack guard; the deliberate move arrives
/// with the Phase-B `--reassign` flow + dashboard. Fires an alert.
pub const ENROLL_SITE_CONFLICT: &'static str = "enroll_site_conflict";
/// A machine_uid collision was detected at enroll — the endpoint dropped to
/// `pending` and awaits operator confirmation in the dashboard. Fires an alert.
pub const ENROLL_COLLISION_PENDING: &'static str = "enroll_collision_pending";
/// An enroll attempt failed enrollment-key verification (wrong/inactive key or
/// unknown site_code). Security audit trail for the open-registration surface.
pub const ENROLL_REJECTED: &'static str = "enroll_rejected";
/// An administrator rotated a site's enrollment key (new version + fingerprint;
/// old installers can no longer enroll NEW machines).
pub const ENROLLMENT_KEY_ROTATED: &'static str = "enrollment_key_rotated";
}
/// Log a session event
@@ -154,6 +188,42 @@ pub async fn log_admin_removal(
Ok(result)
}
/// Log a zero-touch enrollment audit event (SPEC-016).
///
/// Shares the `connect_session_events` audit table but carries no session
/// (`session_id = NULL`, the FK column is nullable) and no viewer — enrollment is
/// an unauthenticated agent action, not a viewer/session event. The structured
/// detail (machine_uid, site_code, fingerprint version, decision, etc.) goes in
/// `details` and the agent's source IP in `ip_address`.
///
/// Best-effort: a failure to write the audit row must NOT fail the enroll (the
/// machine row and `cak_` already exist); the caller logs the error and proceeds,
/// matching how the relay and Task-5 removal treat audit writes.
pub async fn log_enrollment_event(
pool: &PgPool,
event_type: &str,
details: JsonValue,
ip_address: Option<IpAddr>,
) -> Result<i64, sqlx::Error> {
let ip_str = ip_address.map(|ip| ip.to_string());
let result = sqlx::query_scalar::<_, i64>(
r#"
INSERT INTO connect_session_events
(session_id, event_type, viewer_id, viewer_name, details, ip_address)
VALUES (NULL, $1, NULL, NULL, $2, $3::inet)
RETURNING id
"#,
)
.bind(event_type)
.bind(details)
.bind(ip_str)
.fetch_one(pool)
.await?;
Ok(result)
}
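
The doc comments on the new `EventTypes` constants describe which enroll outcome produces which audit event, and which of those events fire an alert. The sketch below restates that mapping as plain functions. `EnrollOutcome`, `event_type_for`, and `fires_alert` are hypothetical names that do not appear in the diff; the event strings and the alerting set are taken from the constants and their comments.

```rust
/// Hypothetical classification of one enroll request (not a type from the diff).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnrollOutcome {
    /// Wrong/inactive key or unknown site_code.
    KeyRejected,
    /// machine_uid not seen before in this tenant.
    NewMachine,
    /// Known machine_uid, same site as its current binding.
    SameSite,
    /// Known machine_uid, valid key for a different site (refused in Phase A).
    DifferentSite,
    /// machine_uid collision detected; the row drops to `pending`.
    UidCollision,
}

/// Map an outcome to the audit event string written to `connect_session_events`.
pub fn event_type_for(outcome: EnrollOutcome) -> &'static str {
    match outcome {
        EnrollOutcome::KeyRejected => "enroll_rejected",
        EnrollOutcome::NewMachine => "enroll_new",
        EnrollOutcome::SameSite => "enroll_reuse",
        EnrollOutcome::DifferentSite => "enroll_site_conflict",
        EnrollOutcome::UidCollision => "enroll_collision_pending",
    }
}

/// Events whose doc comments say "Fires an alert".
pub fn fires_alert(event_type: &str) -> bool {
    matches!(
        event_type,
        "enroll_site_move" | "enroll_site_conflict" | "enroll_collision_pending"
    )
}
```

Note that no outcome maps to `enroll_site_move`: per the comment on that constant, Phase A never emits it.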
/// Get events for a session
#[allow(dead_code)] // TODO(native-remote-control): consumed by the integration API; see docs/specs/native-remote-control/
pub async fn get_session_events(


@@ -64,6 +64,16 @@ pub struct Machine {
/// history) is retained. NULL = live. Nullable, so it is read NULL-tolerantly
/// in the manual `FromRow` below.
pub deleted_at: Option<DateTime<Utc>>,
/// Relational site binding for a machine enrolled via `/api/enroll` (SPEC-016,
/// migration 010). NULL for legacy / support-code / connect-path machines that
/// never enrolled through the zero-touch flow. A change of this on re-enroll is
/// the "site move" the enroll path audits.
pub site_id: Option<Uuid>,
/// Collision-gate state (SPEC-016, migration 010): `'active'` (live, auto-approve)
/// or `'pending'` (a machine_uid collision was detected at enroll; awaiting
/// operator confirmation before the endpoint may be controlled). Non-null with a
/// default of `'active'`; read NULL-tolerantly below for defense in depth.
pub enrollment_state: String,
}
impl<'r> FromRow<'r, PgRow> for Machine {
@@ -83,6 +93,13 @@ impl<'r> FromRow<'r, PgRow> for Machine {
machine_uid: row.try_get("machine_uid")?,
// Schema-nullable (migration 009); decode directly as Option.
deleted_at: row.try_get("deleted_at")?,
// Schema-nullable (migration 010); decode directly as Option.
site_id: row.try_get("site_id")?,
// Non-null with default 'active' (migration 010); read NULL-tolerantly
// (older snapshots / partial rows) and fall back to 'active'.
enrollment_state: row
.try_get::<Option<String>, _>("enrollment_state")?
.unwrap_or_else(|| "active".to_string()),
// Nullable-with-default columns mapped to non-`Option` Rust types: read as
// `Option<T>` and fall back to the type default so a NULL cell never errors.
is_elevated: row
@@ -166,7 +183,7 @@ pub async fn upsert_machine(
r#"
INSERT INTO connect_machines (agent_id, hostname, is_persistent, status, last_seen, machine_uid)
VALUES ($1, $2, $3, 'online', NOW(), $4)
-ON CONFLICT (machine_uid) DO UPDATE SET
+ON CONFLICT (machine_uid) WHERE machine_uid IS NOT NULL DO UPDATE SET
agent_id = EXCLUDED.agent_id,
hostname = EXCLUDED.hostname,
status = 'online',
@@ -207,6 +224,131 @@ pub async fn upsert_machine(
}
}
/// Find a machine by the SPEC-016 per-tenant dedup key `(tenant_id, machine_uid)`.
///
/// This is the enroll-time dedup lookup: the same hardware re-enrolling (re-image /
/// re-install) resolves to its existing row within the tenant, while the same
/// hardware in a DIFFERENT tenant is a distinct row (resolved-decision #4). Tenant
/// scoping uses the same default-tenant fold as the unique index so the lookup
/// matches the uniqueness guarantee.
///
/// Unlike `get_machine_by_agent_id`, this deliberately does NOT filter
/// `deleted_at IS NULL`: a previously operator-purged machine that legitimately
/// re-enrolls must be found so the enroll path can revive it (clearing
/// `deleted_at`), mirroring the connect-path revive in `upsert_machine`.
pub async fn get_machine_by_tenant_uid(
pool: &PgPool,
tenant_id: Uuid,
machine_uid: &str,
) -> Result<Option<Machine>, sqlx::Error> {
sqlx::query_as::<_, Machine>(
r#"
SELECT * FROM connect_machines
WHERE machine_uid = $1
AND COALESCE(tenant_id, '00000000-0000-0000-0000-000000000001'::uuid) = $2
"#,
)
.bind(machine_uid)
.bind(tenant_id)
.fetch_optional(pool)
.await
}
/// Parameters for an enroll-time machine create/update (SPEC-016 `/api/enroll`).
///
/// `agent_id` is a freshly minted opaque id for a NEW enrollment (the agent's
/// config UUID story is Phase B; the server only needs a unique non-null value for
/// the `agent_id UNIQUE` column). On REUSE/MOVE the existing row's `agent_id` is
/// preserved (the FK target of any already-minted `cak_`), so the update path does
/// not touch it.
pub struct EnrollMachineParams<'a> {
pub agent_id: &'a str,
pub hostname: &'a str,
pub machine_uid: &'a str,
pub tenant_id: Uuid,
pub site_id: Uuid,
/// Company label (-> connect_machines.organization).
pub company: Option<&'a str>,
/// Site label (-> connect_machines.site) — the free-text label, distinct from
/// the relational site_id binding.
pub site_label: Option<&'a str>,
pub tags: &'a [String],
/// 'active' (auto-approve) or 'pending' (collision-gated).
pub enrollment_state: &'a str,
}
/// Insert a NEW machine row for a first-time enrollment (SPEC-016).
///
/// Carries the labels, the relational `site_id`, the per-tenant `machine_uid`, and
/// the collision-gate `enrollment_state`. Persistent + online. Returns the created
/// row (its `id` is the FK target for the `cak_` the caller mints next).
pub async fn insert_enrolled_machine(
pool: &PgPool,
p: &EnrollMachineParams<'_>,
) -> Result<Machine, sqlx::Error> {
sqlx::query_as::<_, Machine>(
r#"
INSERT INTO connect_machines
(agent_id, hostname, is_persistent, status, last_seen, machine_uid,
tenant_id, site_id, organization, site, tags, enrollment_state)
VALUES ($1, $2, true, 'online', NOW(), $3, $4, $5, $6, $7, $8, $9)
RETURNING *
"#,
)
.bind(p.agent_id)
.bind(p.hostname)
.bind(p.machine_uid)
.bind(p.tenant_id)
.bind(p.site_id)
.bind(p.company)
.bind(p.site_label)
.bind(p.tags)
.bind(p.enrollment_state)
.fetch_one(pool)
.await
}
/// Update an EXISTING machine row on re-enroll / reuse / site-move (SPEC-016).
///
/// Refreshes hostname, site binding (`site_id`), labels, and `enrollment_state`,
/// and revives a soft-deleted row (`deleted_at = NULL`) — a re-enroll of a purged
/// host means it is live again, mirroring `upsert_machine`'s revive. Deliberately
/// does NOT change `agent_id`: the existing id is the FK target of any prior `cak_`.
/// Labels are COALESCE-merged so an enroll that omits a label does not wipe an
/// existing value; `tags` is overwritten only when a non-empty set is supplied
/// (matching `update_machine_metadata`'s convention).
pub async fn update_enrolled_machine(
pool: &PgPool,
machine_id: Uuid,
p: &EnrollMachineParams<'_>,
) -> Result<Machine, sqlx::Error> {
sqlx::query_as::<_, Machine>(
r#"
UPDATE connect_machines SET
hostname = $2,
site_id = $3,
organization = COALESCE($4, organization),
site = COALESCE($5, site),
tags = CASE WHEN $6::text[] = '{}' THEN tags ELSE $6 END,
enrollment_state = $7,
status = 'online',
last_seen = NOW(),
deleted_at = NULL
WHERE id = $1
RETURNING *
"#,
)
.bind(machine_id)
.bind(p.hostname)
.bind(p.site_id)
.bind(p.company)
.bind(p.site_label)
.bind(p.tags)
.bind(p.enrollment_state)
.fetch_one(pool)
.await
}
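
The `UPDATE` in `update_enrolled_machine` merges labels instead of overwriting them: `COALESCE($n, column)` keeps the stored value when the enroll omits a label, and the `CASE` on `tags` keeps the stored set when an empty array is supplied. A minimal sketch of the same merge rule in plain Rust follows. `Labels` and `merge_labels` are hypothetical names for illustration only, not part of the diff.

```rust
/// Hypothetical view of the label columns on a `connect_machines` row.
#[derive(Debug, Clone, PartialEq)]
pub struct Labels {
    pub organization: Option<String>,
    pub site: Option<String>,
    pub tags: Vec<String>,
}

/// Apply the re-enroll merge rule:
/// - a supplied label replaces the stored one, an omitted label keeps it
///   (the COALESCE($n, column) behaviour);
/// - tags are replaced only when a non-empty set is supplied
///   (the CASE WHEN $6 = '{}' THEN tags ELSE $6 END behaviour).
pub fn merge_labels(
    existing: &Labels,
    company: Option<&str>,
    site_label: Option<&str>,
    tags: &[String],
) -> Labels {
    Labels {
        organization: company
            .map(str::to_string)
            .or_else(|| existing.organization.clone()),
        site: site_label
            .map(str::to_string)
            .or_else(|| existing.site.clone()),
        tags: if tags.is_empty() {
            existing.tags.clone()
        } else {
            tags.to_vec()
        },
    }
}
```

One consequence of this rule is that an enroll request cannot clear a label or empty the tag set; that would need a separate metadata update path.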
/// Update machine status and info
#[allow(dead_code)] // TODO(native-remote-control): consumed by the integration API; see docs/specs/native-remote-control/
pub async fn update_machine_status(


@@ -4,10 +4,12 @@
//! Optional - server works without database if DATABASE_URL not set.
pub mod agent_keys;
pub mod enrollment_keys;
pub mod events;
pub mod machines;
pub mod releases;
pub mod sessions;
pub mod sites;
pub mod support_codes;
pub mod tenancy;
pub mod users;

server/src/db/sites.rs Normal file

@@ -0,0 +1,94 @@
//! Site database operations (SPEC-016 zero-touch enrollment).
//!
//! Backs the `connect_sites` table (migration 010): the relational anchor a
//! per-site enrollment key hangs off and the `/api/enroll` flow resolves by
//! `site_code`. See the migration header for why this table exists (the prior
//! schema modeled "site" only as a free-text column on `connect_machines`).
//!
//! All queries use runtime `sqlx::query()` / `sqlx::query_as()` per the codebase
//! convention (no compile-time `query!` macros, no `.sqlx` offline cache).
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use uuid::Uuid;
/// Site record from the database.
#[derive(Debug, Clone, Serialize, Deserialize, sqlx::FromRow)]
pub struct Site {
pub id: Uuid,
pub site_code: String,
pub display_name: Option<String>,
pub company: Option<String>,
pub tenant_id: Option<Uuid>,
/// RESERVED for future per-site enrollment POLICY work (SPEC-016 §out-of-scope).
/// Not enforced in Phase A.
pub enrollment_policy: Option<String>,
pub created_at: DateTime<Utc>,
}
/// Resolve a site by its operator-facing `site_code`, scoped to the given tenant.
///
/// Tenant scoping uses the same default-tenant fold as the unique index so the
/// lookup matches the uniqueness guarantee: `(COALESCE(tenant_id, default),
/// site_code)`. Returns `None` if no site with that code exists in the tenant.
pub async fn get_site_by_code(
pool: &PgPool,
site_code: &str,
tenant_id: Uuid,
) -> Result<Option<Site>, sqlx::Error> {
sqlx::query_as::<_, Site>(
r#"
SELECT id, site_code, display_name, company, tenant_id, enrollment_policy, created_at
FROM connect_sites
WHERE site_code = $1
AND COALESCE(tenant_id, '00000000-0000-0000-0000-000000000001'::uuid) = $2
"#,
)
.bind(site_code)
.bind(tenant_id)
.fetch_optional(pool)
.await
}
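
`get_site_by_code` (like `get_machine_by_tenant_uid`) scopes by tenant with `COALESCE(tenant_id, default) = $2`, so a row with a NULL `tenant_id` is treated as belonging to the default tenant. The sketch below models that fold over an in-memory slice. `SiteRow` and `find_site` are hypothetical names, and tenant ids are plain `u128` values here to keep the example free of external crates; only the default-tenant value `...0001` comes from the query text.

```rust
/// The default tenant UUID 00000000-0000-0000-0000-000000000001 as a u128.
pub const DEFAULT_TENANT: u128 = 1;

/// Hypothetical in-memory stand-in for a `connect_sites` row.
#[derive(Debug, Clone, PartialEq)]
pub struct SiteRow {
    pub site_code: String,
    /// NULL in the database means "default tenant".
    pub tenant_id: Option<u128>,
}

/// Look up a site by code within a tenant, folding a missing tenant_id to
/// DEFAULT_TENANT exactly as COALESCE(tenant_id, default) does in the SQL.
pub fn find_site<'a>(
    sites: &'a [SiteRow],
    site_code: &str,
    tenant_id: u128,
) -> Option<&'a SiteRow> {
    sites.iter().find(|s| {
        s.site_code == site_code && s.tenant_id.unwrap_or(DEFAULT_TENANT) == tenant_id
    })
}
```

Because the lookup applies the same fold as the unique index, the same `site_code` can exist once per tenant and the lookup still resolves to at most one row.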
/// Fetch a site by its primary-key UUID.
pub async fn get_site_by_id(pool: &PgPool, id: Uuid) -> Result<Option<Site>, sqlx::Error> {
sqlx::query_as::<_, Site>(
r#"
SELECT id, site_code, display_name, company, tenant_id, enrollment_policy, created_at
FROM connect_sites
WHERE id = $1
"#,
)
.bind(id)
.fetch_optional(pool)
.await
}
/// Insert a new site, returning the created row.
///
/// `tenant_id` is `None`-tolerant and resolved via `db::tenancy::current_tenant_id()`
/// at the call site. Errors with a unique-violation if `(tenant, site_code)` already
/// exists (the caller maps that to a 409).
#[allow(dead_code)] // Wired by the site-admin API (dashboard site CRUD); Phase A exposes key rotation, not site CRUD.
pub async fn insert_site(
pool: &PgPool,
site_code: &str,
display_name: Option<&str>,
company: Option<&str>,
tenant_id: Option<Uuid>,
) -> Result<Site, sqlx::Error> {
sqlx::query_as::<_, Site>(
r#"
INSERT INTO connect_sites (site_code, display_name, company, tenant_id)
VALUES ($1, $2, $3, $4)
RETURNING id, site_code, display_name, company, tenant_id, enrollment_policy, created_at
"#,
)
.bind(site_code)
.bind(display_name)
.bind(company)
.bind(tenant_id)
.fetch_one(pool)
.await
}


@@ -448,6 +448,11 @@ async fn main() -> Result<()> {
)),
)
.route("/api/codes/:code/cancel", post(cancel_code))
// Zero-touch enrollment (SPEC-016). PUBLIC: no JWT — the per-site enrollment
// key in the body is the gate, and the handler applies its own
// per-(site_code, IP) rate limit / lockout (defense-in-depth). Mounted with
// the other public API routes.
.route("/api/enroll", post(api::enroll::enroll))
// WebSocket endpoints
.route("/ws/agent", get(relay::agent_ws_handler))
.route("/ws/viewer", get(relay::viewer_ws_handler))
@@ -498,6 +503,18 @@ async fn main() -> Result<()> {
"/api/machines/:agent_id/keys/:key_id",
delete(api::machine_keys::revoke_key),
)
// Per-site enrollment key administration (SPEC-016, admin-only / JWT).
// Rotate regenerates the cek_ secret + fingerprint (old installers can no
// longer enroll new machines); GET returns the current non-secret
// fingerprint/version. Both gated by the AdminUser extractor.
.route(
"/api/sites/:id/enrollment-key",
get(api::sites::get_enrollment_key),
)
.route(
"/api/sites/:id/enrollment-key/rotate",
post(api::sites::rotate_enrollment_key),
)
// REST API - Releases and Version
.route("/api/version", get(api::releases::get_version)) // No auth - for agent polling
.route("/api/releases", get(api::releases::list_releases))


@@ -77,6 +77,19 @@ pub const CODE_VALIDATE_MAX_FAILURES: u32 = 10;
/// Support-code validate: how long an IP stays locked out once tripped.
pub const CODE_VALIDATE_LOCKOUT: Duration = Duration::from_secs(15 * 60);
/// Enroll (`POST /api/enroll`, SPEC-016): window length.
pub const ENROLL_WINDOW: Duration = Duration::from_secs(60);
/// Enroll: max requests per window per `(site_code, IP)`. A zero-touch site push
/// drives N machines through enroll near-simultaneously, so this is generous
/// (mass-deploy friendly) while still capping a runaway loop. Defense-in-depth: the
/// 256-bit enrollment key is the load-bearing gate, not this cap.
pub const ENROLL_MAX_PER_WINDOW: u32 = 60;
/// Enroll: consecutive FAILED enroll attempts (bad/inactive key, unknown site) from
/// one `(site_code, IP)` that trip the lockout.
pub const ENROLL_MAX_FAILURES: u32 = 20;
/// Enroll: how long a `(site_code, IP)` stays locked out once tripped.
pub const ENROLL_LOCKOUT: Duration = Duration::from_secs(15 * 60);
/// Hard cap on the number of distinct IPs tracked by any single limiter map.
/// Prevents an IP-rotating attacker from growing memory without bound. When the
/// cap is hit, the oldest-windowed entries are pruned. Generous for a real MSP
@@ -260,6 +273,150 @@ impl FailureLockout {
}
}
// ============================================================================
// Composite-key limiter for enrollment (keyed by (site_code, IP)) — SPEC-016
// ============================================================================
//
// The login / change-password / code-validate limiters above key purely on IP.
// SPEC-016 §3 wants the enroll defense keyed on `(site_code, source-IP)` so a noisy
// site push from one office IP cannot lock out a different site enrolling from the
// same egress IP. Rather than overload the IP-only maps, this is a small dedicated
// composite-key limiter + lockout. It is invoked from the enroll HANDLER (not a
// `from_fn` layer) because the `site_code` lives in the JSON body, which a
// pre-handler middleware cannot read without consuming it. Documented as
// defense-in-depth: the 256-bit enrollment key is the real gate.
/// Composite limiter key: the site_code and the real client IP.
type EnrollKey = (String, IpAddr);
/// Per-`(site_code, IP)` fixed-window limiter + consecutive-failure lockout.
///
/// Combines both protections behind one lock-guarded map so the enroll handler
/// makes a single allow/deny decision and reports success/failure into the same
/// structure. Self-pruning and size-capped, like the IP-only limiters.
#[derive(Clone)]
pub struct EnrollLimiter {
inner: std::sync::Arc<Mutex<HashMap<EnrollKey, EnrollEntry>>>,
max_per_window: u32,
window: Duration,
max_failures: u32,
cooldown: Duration,
}
#[derive(Debug, Clone, Copy)]
struct EnrollEntry {
window_started: Instant,
count: u32,
failures: u32,
locked_until: Option<Instant>,
last_seen: Instant,
}
impl EnrollLimiter {
pub fn new(
max_per_window: u32,
window: Duration,
max_failures: u32,
cooldown: Duration,
) -> Self {
Self {
inner: std::sync::Arc::new(Mutex::new(HashMap::new())),
max_per_window,
window,
max_failures,
cooldown,
}
}
fn entry_now() -> EnrollEntry {
let now = Instant::now();
EnrollEntry {
window_started: now,
count: 0,
failures: 0,
locked_until: None,
last_seen: now,
}
}
/// Admit one enroll attempt for `(site_code, ip)`. Returns `true` if allowed
/// (and counts it). Returns `false` if the key is currently locked out OR over
/// the per-window request cap. Clock injected for tests.
fn check_at(&self, site_code: &str, ip: IpAddr, now: Instant) -> bool {
let mut map = self.inner.lock().unwrap_or_else(|e| e.into_inner());
if map.len() >= MAX_TRACKED_IPS {
let window = self.window;
let cooldown = self.cooldown;
map.retain(|_, e| {
e.locked_until.map(|u| now < u).unwrap_or(false)
|| now.duration_since(e.window_started) < window
|| now.duration_since(e.last_seen) < cooldown
});
}
let key = (site_code.to_string(), ip);
let e = map.entry(key).or_insert_with(Self::entry_now);
e.last_seen = now;
// Lockout takes precedence.
if let Some(until) = e.locked_until {
if now < until {
return false;
}
// Cooldown elapsed — clear it for a fresh start.
e.locked_until = None;
e.failures = 0;
}
// Roll the fixed window forward if elapsed.
if now.duration_since(e.window_started) >= self.window {
e.window_started = now;
e.count = 0;
}
if e.count >= self.max_per_window {
false
} else {
e.count += 1;
true
}
}
/// Admit one enroll attempt (real clock).
pub fn check(&self, site_code: &str, ip: IpAddr) -> bool {
self.check_at(site_code, ip, Instant::now())
}
fn record_failure_at(&self, site_code: &str, ip: IpAddr, now: Instant) {
let mut map = self.inner.lock().unwrap_or_else(|e| e.into_inner());
let key = (site_code.to_string(), ip);
let e = map.entry(key).or_insert_with(Self::entry_now);
e.last_seen = now;
e.failures = e.failures.saturating_add(1);
if e.failures >= self.max_failures {
e.locked_until = Some(now + self.cooldown);
}
}
/// Record a FAILED enroll attempt (bad key / unknown site) for the key,
/// tripping the lockout once the streak reaches `max_failures`.
pub fn record_failure(&self, site_code: &str, ip: IpAddr) {
self.record_failure_at(site_code, ip, Instant::now());
}
/// Record a SUCCESSFUL enroll for the key, resetting its failure streak.
pub fn record_success(&self, site_code: &str, ip: IpAddr) {
let mut map = self.inner.lock().unwrap_or_else(|e| e.into_inner());
let key = (site_code.to_string(), ip);
if let Some(e) = map.get_mut(&key) {
e.failures = 0;
e.locked_until = None;
e.last_seen = Instant::now();
}
}
}
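
The comment block above explains that the limiter is called from the enroll handler itself, because `site_code` is only available once the JSON body has been parsed. A sketch of the call order the handler would then follow is shown below: `check` first, then key verification, then `record_failure` or `record_success`. Everything here is a hypothetical stand-in: `EnrollGate`, `EnrollResponse`, `handle_enroll`, and `StreakGate` are not in the diff, the real handler's response types are not shown, and `StreakGate` is a deliberately reduced test double (failure streak only, no window or cooldown) rather than a copy of `EnrollLimiter`.

```rust
use std::cell::RefCell;
use std::net::IpAddr;

/// Hypothetical trait exposing the three public `EnrollLimiter` operations.
pub trait EnrollGate {
    fn check(&self, site_code: &str, ip: IpAddr) -> bool;
    fn record_failure(&self, site_code: &str, ip: IpAddr);
    fn record_success(&self, site_code: &str, ip: IpAddr);
}

/// Hypothetical handler outcomes (the real status codes are not in this diff).
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollResponse {
    Throttled,
    Rejected,
    Enrolled,
}

/// Call order: admit or deny first, then report the verification result back.
pub fn handle_enroll<G: EnrollGate>(
    gate: &G,
    site_code: &str,
    ip: IpAddr,
    key_is_valid: bool,
) -> EnrollResponse {
    if !gate.check(site_code, ip) {
        // Locked out or over the window cap: the key is never examined.
        return EnrollResponse::Throttled;
    }
    if !key_is_valid {
        gate.record_failure(site_code, ip);
        return EnrollResponse::Rejected;
    }
    gate.record_success(site_code, ip);
    EnrollResponse::Enrolled
}

/// Reduced test double: denies once `max_failures` consecutive failures are seen.
pub struct StreakGate {
    max_failures: u32,
    failures: RefCell<u32>,
}

impl StreakGate {
    pub fn new(max_failures: u32) -> Self {
        Self { max_failures, failures: RefCell::new(0) }
    }
}

impl EnrollGate for StreakGate {
    fn check(&self, _site_code: &str, _ip: IpAddr) -> bool {
        *self.failures.borrow() < self.max_failures
    }
    fn record_failure(&self, _site_code: &str, _ip: IpAddr) {
        *self.failures.borrow_mut() += 1;
    }
    fn record_success(&self, _site_code: &str, _ip: IpAddr) {
        *self.failures.borrow_mut() = 0;
    }
}
```

With this ordering a locked-out `(site_code, IP)` is turned away before any key comparison runs, so even a correct key is refused until the cooldown elapses.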
// ============================================================================
// Shared rate-limit state (lives in AppState)
// ============================================================================
@@ -275,6 +432,9 @@ pub struct RateLimitState {
pub code_validate: RateLimiter,
/// Per-IP lockout on repeated failed code validations (brute-force defense).
pub code_validate_lockout: FailureLockout,
/// `POST /api/enroll` (SPEC-016): per-`(site_code, IP)` request cap +
/// consecutive-failure lockout. Invoked from the enroll handler.
pub enroll: EnrollLimiter,
}
impl RateLimitState {
@@ -290,6 +450,12 @@ impl RateLimitState {
CODE_VALIDATE_MAX_FAILURES,
CODE_VALIDATE_LOCKOUT,
),
enroll: EnrollLimiter::new(
ENROLL_MAX_PER_WINDOW,
ENROLL_WINDOW,
ENROLL_MAX_FAILURES,
ENROLL_LOCKOUT,
),
}
}
}
@@ -524,4 +690,51 @@ mod tests {
assert!(lockout.is_locked_at(ip(8), t0));
assert!(!lockout.is_locked_at(ip(9), t0)); // ip9 unaffected
}
// -- EnrollLimiter (composite (site_code, IP) key) --------------------------
#[test]
fn enroll_window_allows_up_to_cap_then_blocks() {
let lim = EnrollLimiter::new(2, Duration::from_secs(60), 100, Duration::from_secs(600));
let t0 = Instant::now();
assert!(lim.check_at("SITE-A", ip(1), t0)); // 1
assert!(lim.check_at("SITE-A", ip(1), t0)); // 2
assert!(!lim.check_at("SITE-A", ip(1), t0)); // over cap
}
#[test]
fn enroll_is_keyed_by_site_and_ip() {
let lim = EnrollLimiter::new(1, Duration::from_secs(60), 100, Duration::from_secs(600));
let t0 = Instant::now();
assert!(lim.check_at("SITE-A", ip(1), t0));
assert!(!lim.check_at("SITE-A", ip(1), t0)); // same key over cap
// Different site, same IP -> independent bucket.
assert!(lim.check_at("SITE-B", ip(1), t0));
// Same site, different IP -> independent bucket.
assert!(lim.check_at("SITE-A", ip(2), t0));
}
#[test]
fn enroll_lockout_trips_after_failures_and_blocks_check() {
let lim = EnrollLimiter::new(100, Duration::from_secs(60), 3, Duration::from_secs(600));
let t0 = Instant::now();
lim.record_failure_at("SITE-A", ip(1), t0);
lim.record_failure_at("SITE-A", ip(1), t0);
// Not yet tripped: a check still admits.
assert!(lim.check_at("SITE-A", ip(1), t0));
lim.record_failure_at("SITE-A", ip(1), t0); // 3rd -> trips
// Now locked out: check denies even though under the request cap.
assert!(!lim.check_at("SITE-A", ip(1), t0));
}
#[test]
fn enroll_success_resets_failure_streak() {
let lim = EnrollLimiter::new(100, Duration::from_secs(60), 2, Duration::from_secs(600));
let t0 = Instant::now();
lim.record_failure_at("SITE-A", ip(1), t0);
lim.record_success("SITE-A", ip(1)); // reset
lim.record_failure_at("SITE-A", ip(1), t0);
// Only one failure since reset -> not locked.
assert!(lim.check_at("SITE-A", ip(1), t0));
}
}


@@ -527,3 +527,60 @@ Reference: SPEC-002 §5; `agent/src/encoder/raw.rs` (salvaged), `proto/guruconne
- **Rate limiting:** hammer `/api/auth/login` and the code-validate route → confirm throttling/lockout.
- **Migrations:** fresh DB applies the v2 migrations cleanly; `_sqlx_migrations` consistent; `tenant_id`
populated with the default tenant.
---
## Task 9 [PROPOSED 2026-06-01 — provisioning model = TOFU auto-enroll, chosen by Mike]: `cak_` auto-enroll provisioning + shared-key retirement
> Context: Task 2 built the SERVER `cak_` machinery (mint/SHA-256 hash/verify in `auth/agent_keys.rs`,
> relay validation in `validate_agent_api_key`, admin issuance `POST /api/machines/:id/keys`). What's
> missing is how an AGENT obtains and uses a `cak_` — today agents still carry the deprecated shared
> `AGENT_API_KEY`, so `connect_agent_keys` is empty and the relay logs the DEPRECATED-shared-key warning
> for every agent. This task closes that with **trust-on-first-use auto-enroll** so the shared key can be
> retired (unblocks task list #5). NOTE: the agent already presents whatever is in its `api_key` slot and
> the relay auto-detects `cak_` vs shared — so a `cak_`-keyed agent needs **no change to its auth call**,
> only a way to *receive*, *persist*, and *prefer* a `cak_`.
**Flow (TOFU):**
1. **Bootstrap (first connect):** a fresh agent authenticates on `/ws/agent` with a bootstrap secret —
interim: the shared `AGENT_API_KEY` (embedded by the download endpoint); target: a single-use,
short-lived **enroll token** (more secure TOFU — see Security).
2. **Server issues on first connect:** when an agent authed via the bootstrap path (i.e. NOT already
`cak_`-keyed) connects and its machine has **no active (non-revoked) `cak_`**, the relay: resolves/creates
the machine row (existing `upsert_machine` on `machine_uid` — now functional after the 2026-06-01
ON CONFLICT fix), mints a `cak_` (`generate_agent_key` + `db::agent_keys::insert_agent_key` for that
`machine_id`), and sends the plaintext key to the agent **once** over a new server→agent message. Only
the hash is stored. **Idempotent:** never re-issue if an active key already exists for the machine.
3. **Agent receives + persists + prefers:** on `AgentKeyProvision`, the agent persists the `cak_` durably at
`%ProgramData%\GuruConnect\agent_key` (restricted ACL, same pattern as `machine_uid`). On startup it loads
the persisted `cak_` if present and uses it as its auth key, falling back to the embedded/bootstrap secret
only when no `cak_` is stored yet. After provisioning, every reconnect authenticates via `cak_` (no more
DEPRECATED-shared-key warning for that agent).
4. **Shared-key retirement (phased):** Phase A — shared key stays as the bootstrap so existing+new agents
self-enroll; monitor the relay WARN count → ~0. Phase B — once the fleet is `cak_`-keyed, restrict the
shared `AGENT_API_KEY` to enrollment-only or remove the env entirely (only `cak_` / enroll-token accepted).
This is the concrete completion of task-list #5.
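
The two decisions at the heart of the flow above can be sketched as pure functions: the relay's idempotent "issue or not" check from step 2 and the agent's "prefer the persisted `cak_`" rule from step 3. `AuthPath`, `should_issue_cak`, and `select_auth_key` are hypothetical names for illustration. The `cak_` prefix check on the persisted value is an added assumption (a guard against a corrupt key file), not something the task text specifies.

```rust
/// How the connecting agent authenticated on `/ws/agent` (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthPath {
    /// Shared `AGENT_API_KEY` today, an enroll token later.
    Bootstrap,
    /// Already presenting a per-machine `cak_`.
    Cak,
}

/// Relay side (step 2): mint and send a `cak_` only for a bootstrap-authenticated
/// agent whose machine has no active (non-revoked) key. Never re-issue otherwise.
pub fn should_issue_cak(auth: AuthPath, has_active_key: bool) -> bool {
    auth == AuthPath::Bootstrap && !has_active_key
}

/// Agent side (step 3): use the persisted `cak_` when one is stored, falling
/// back to the embedded bootstrap secret only when none is available.
/// ASSUMPTION: a stored value lacking the `cak_` prefix is treated as absent.
pub fn select_auth_key<'a>(persisted_cak: Option<&'a str>, bootstrap: &'a str) -> &'a str {
    match persisted_cak {
        Some(key) if key.starts_with("cak_") => key,
        _ => bootstrap,
    }
}
```

`has_active_key` corresponds to the `has_active_key(machine_id)` check the Files list proposes adding to `server/src/db/agent_keys.rs`. In the relay, that check and the following insert would need to run atomically per machine for issuance to stay idempotent under concurrent connects.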
**Protocol (4-artifact drift discipline):** add `AgentKeyProvision { string key = 1; }` (server→agent) to
`proto/guruconnect.proto` with a new reserved message ID; regenerate prost on both agent + server; the
hand-written `dashboard/src/lib/protobuf.ts` decoder does NOT need it (agent-plane only) but reserve the ID.
**Files:** `proto/guruconnect.proto` (new message); `server/src/relay/mod.rs` (issue+send on bootstrap connect
with no active key); `server/src/db/agent_keys.rs` (add `has_active_key(machine_id)` check; reuse insert);
`agent/src/transport/*` (handle inbound `AgentKeyProvision`); `agent/src/config.rs` + a small key-store module
(load/persist `cak_`, prefer over bootstrap).
**Security (TOFU):** the first connect trusts the bootstrap secret — a leaked shared key during the enroll
window could enroll a rogue agent; the secure target is a **single-use, short-lived enroll token** per
deployment instead of the shared key (shared-key bootstrap is interim convenience). The `cak_` is sent
plaintext once over the existing wss/TLS channel; only the hash is stored server-side; the agent stores it
locally with restricted ACLs. Revocation via the existing `DELETE /api/machines/:id/keys/:key_id` fails the
agent closed; on its next bootstrap connect it re-enrolls. The keyed-agent dedup (Task 3) keeps the
authenticated identity authoritative.
**Verification:** drop a current-build (signed 0.3.0+) agent configured with the shared-key bootstrap →
it connects, receives a `cak_`, persists it; restart → it authenticates via the `cak_` (relay shows NO
DEPRECATED-shared-key warning) and `connect_agent_keys` holds exactly one active key for the machine; issue
is idempotent across reconnects; revoke the key via the admin API → agent rejected, then re-enrolls on next
bootstrap connect. Reference: `auth/agent_keys.rs`, `api/machine_keys.rs`, `relay/mod.rs:266-309`
(`validate_agent_api_key`), `.claude/standards/security/credential-handling.md`.