SPEC-016 Phase A: zero-touch enrollment backend + migration (#5)

Merged
azcomputerguru merged 3 commits from feat/spec-016-enrollment into main 2026-06-02 11:19:41 -07:00

Phase A of SPEC-016 (zero-touch per-site agent enrollment) — server backend + DB migration only. Agent-side machine_uid derivation, installers, and dashboard are later phases.

What's here

  • Migration 010: connect_sites, site_enrollment_keys, connect_machines.{site_id,machine_uid,enrollment_state}, connect_sites.enrollment_policy (reserved, not enforced). Per-tenant partial-unique machine_uid index (coexists with 008's global index; dropping the 008 index is deferred).
  • POST /api/enroll (public): verify (site_code + per-site enrollment key), dedup by (tenant, machine_uid), mint per-machine cak_, audit + alert hooks, rate-limit/lockout.
  • Per-site enrollment-key module (cek_ mint, Argon2id hash/verify, hex vN (XXXX) fingerprint).
  • Admin (JWT) key rotate + fingerprint GET endpoints.

Review status

Code-reviewed (APPROVE WITH NITS) + a focused re-review confirming all fixes closed:

  • HIGH cross-site silent-move hijack -> now 409 ENROLL_SITE_CONFLICT before any repoint/mint.
  • MEDIUM timing/enumeration oracle -> all reject paths pay one Argon2id verify.
  • LOW new-enroll TOCTOU -> converges to reuse on machine_uid conflict.
  • LOW collision-gate doc inaccuracy -> doc made accurate; enforcement TODO left for Phase B/D.
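
The site-binding rule behind the HIGH fix can be sketched as a pure function. This is a minimal illustration and not the code in api/enroll.rs: `EnrollDecision`, `ExistingMachine`, `classify`, and `http_status` are hypothetical names, `site_id` is assumed to be an integer, the 200 success status is an assumption, and the collision gate is left out entirely.

```rust
/// Outcome of matching an enroll request against an existing machine row.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum EnrollDecision {
    /// No row for (tenant, machine_uid): insert a row and mint a cak_.
    New,
    /// Row exists and is already bound to the requesting site.
    Reuse,
    /// Row exists with a NULL site_id: first relational bind, not a move.
    FirstBind,
    /// Row is bound to a different site: refuse, no move, no key.
    SiteConflict,
}

/// Hypothetical stand-in for the connect_machines columns consulted here.
/// The integer site_id is an assumption; the real column type is not stated.
pub struct ExistingMachine {
    pub site_id: Option<i64>,
}

/// Classifies a request whose site_code + key have already been verified.
pub fn classify(existing: Option<&ExistingMachine>, requested_site_id: i64) -> EnrollDecision {
    match existing {
        None => EnrollDecision::New,
        Some(m) => match m.site_id {
            None => EnrollDecision::FirstBind,
            Some(id) if id == requested_site_id => EnrollDecision::Reuse,
            Some(_) => EnrollDecision::SiteConflict,
        },
    }
}

/// Maps a decision to an HTTP status. Only the 409 comes from the PR text;
/// 200 for the other outcomes is assumed.
pub fn http_status(decision: &EnrollDecision) -> u16 {
    match decision {
        EnrollDecision::SiteConflict => 409,
        _ => 200,
    }
}
```

Keeping the decision separate from the database write makes it easy to check that the conflict branch returns before any repoint or mint happens.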

CI must validate (couldn't verify on the Windows dev host)

  • Linux target build (x86_64-unknown-linux-gnu).
  • Postgres-gated tests (machine_uid vs agent_id conflict classification; bound site_id exposure) — need ephemeral Postgres / TEST_DATABASE_URL.

Inert until later phases

No installer issues enrollment keys yet and no agent calls /api/enroll, so the endpoint is unreachable in practice until Phase B/C. The deprecated shared AGENT_API_KEY relay fallback is untouched.

Spec: docs/specs/SPEC-016-zero-touch-enrollment.md. Tracking todo dbfe6a56.

azcomputerguru added 2 commits 2026-06-02 10:34:06 -07:00
Server-side zero-touch per-site enrollment (Phase A: backend + DB only;
agent-side machine_uid derivation is Phase B, server treats it as opaque).

Migration 010_spec016_enrollment.sql:
- connect_sites: relational site anchor (site_code natural key, per-tenant
  unique). The spec assumed a sites table existed; it did not (site/company
  were free-text columns on connect_machines), so this creates a minimal one.
- site_enrollment_keys: rotatable, Argon2id-hashed cek_ secret + monotonic
  version + hex fingerprint + active flag; one-active-per-site partial unique.
- connect_machines: + site_id (FK), + enrollment_state ('active'|'pending')
  collision gate, + per-tenant (tenant_id, machine_uid) unique index added
  ALONGSIDE the 008 global index (the connect-path upsert_machine ON CONFLICT
  arbiter binds to 008 — dropping it would break live reconnect).
- connect_sites.enrollment_policy: reserved (default auto-approve), not enforced.

auth/enrollment_keys.rs: cek_ mint (256-bit, OS CSPRNG), Argon2id hash/verify
(reuses auth::password), and hex fingerprint vN (XXXX) per resolved-decision #3.

db/sites.rs + db/enrollment_keys.rs: runtime sqlx persistence; rotate_key
deactivates+inserts in one tx to hold the one-active-key invariant.

POST /api/enroll (public, api/enroll.rs): site_code+cek_ verify against active
key -> dedup on (tenant, machine_uid) -> new / reuse / site-move / collision.
Collision gate (PROVISIONAL heuristic: online existing row + different hostname)
-> pending, no usable cak_, alert. Mints cak_ via existing agent_keys path in the
exact form relay::validate_agent_api_key expects. Per-(site_code,IP) rate-limit +
lockout (EnrollLimiter). Audit events + [ENROLL] alert markers with
TODO(SPEC-016) #dev-alerts notes.
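
A per-(site_code, IP) limiter with lockout might look roughly like the following. This is a sketch under stated assumptions and not the actual EnrollLimiter: the `SketchLimiter` name, the thresholds, and the fixed-window policy are all illustrative, since the commit does not state them. The current time is passed in so the behaviour can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

// Assumed policy values; the real thresholds are not given in the commit.
const MAX_FAILURES: u32 = 5;
const WINDOW: Duration = Duration::from_secs(60);
const LOCKOUT: Duration = Duration::from_secs(300);

struct Entry {
    window_start: Instant,
    failures: u32,
    locked_until: Option<Instant>,
}

/// Illustrative limiter keyed by (site_code, client IP).
#[derive(Default)]
pub struct SketchLimiter {
    entries: HashMap<(String, IpAddr), Entry>,
}

impl SketchLimiter {
    /// Returns false while the key is inside an active lockout.
    pub fn is_allowed(&self, site_code: &str, ip: IpAddr, now: Instant) -> bool {
        match self.entries.get(&(site_code.to_owned(), ip)) {
            Some(Entry { locked_until: Some(until), .. }) => now >= *until,
            _ => true,
        }
    }

    /// Records a failed attempt and starts a lockout once the window fills up.
    pub fn record_failure(&mut self, site_code: &str, ip: IpAddr, now: Instant) {
        let e = self
            .entries
            .entry((site_code.to_owned(), ip))
            .or_insert(Entry { window_start: now, failures: 0, locked_until: None });
        // Start a fresh window once the previous one has expired.
        if now.duration_since(e.window_start) >= WINDOW {
            e.window_start = now;
            e.failures = 0;
            e.locked_until = None;
        }
        e.failures += 1;
        if e.failures >= MAX_FAILURES {
            e.locked_until = Some(now + LOCKOUT);
        }
    }
}
```

Keying on the pair rather than on the IP alone means one noisy client cannot lock out a whole site, and a lockout on one site_code does not affect attempts against another from the same address.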

Admin (JWT) api/sites.rs: POST /api/sites/:id/enrollment-key/rotate (plaintext +
fingerprint once) and GET .../enrollment-key (fingerprint/version, no secret).

Routes wired in main.rs (enroll public, rotation admin). 13 new unit tests;
full server suite 99 passing. cargo check + clippy clean on the host (Windows)
target — Linux cross-target not installed here; server crate is platform-neutral
Rust. No sqlx offline cache needed (codebase uses runtime queries, no query!).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(enroll): SPEC-016 Phase A review fixes (cross-site guard, timing oracle, TOCTOU)
Some checks failed
Build and Test / Build Agent (Windows) (pull_request) Failing after 10m11s
Build and Test / Build Server (Linux) (pull_request) Failing after 10m5s
Build and Test / Security Audit (pull_request) Successful in 8m5s
Build and Test / Build Summary (pull_request) Has been skipped
0f02f23765
Applies the four review fixes to POST /api/enroll, all in server/src/api/enroll.rs
(+ a new ENROLL_SITE_CONFLICT event type in server/src/db/events.rs):

1. HIGH — close the within-tenant cross-site silent-move hijack. A valid key for
   site B presented for a machine_uid already bound to a DIFFERENT site is now
   REFUSED (409 ENROLL_SITE_CONFLICT) instead of silently repointing the row and
   minting a fresh cak_. No move, no key. Emits an ENROLL_SITE_CONFLICT audit event
   + alert TODO. Same-site match still resolves to reuse; a NULL prior site_id is a
   first relational bind, not a move. The unauthenticated site_move mint path is
   removed; deliberate moves are deferred to the Phase-B --reassign flow + dashboard.

2. MEDIUM — kill the timing/enumeration oracle. Unknown site_code and no-active-key
   early rejects now pay a dummy Argon2id verify against a fixed, valid throwaway PHC
   constant (TIMING_EQUALIZER_PHC) before returning the identical 401, so every
   rejection path pays one KDF. The constant is asserted valid + verifying in tests.

3. LOW — fix the new-enroll TOCTOU. The dedup lookup + INSERT is wrapped in a bounded
   retry loop: a concurrent first-enroll of the same machine_uid whose INSERT loses
   the unique-index race (classified by is_machine_uid_conflict on SQLSTATE 23505 +
   machine_uid constraint) now re-looks-up and converges to reuse instead of 500ing.
   A non-machine_uid unique violation still surfaces as 500.

4. LOW — make the collision-gate doc honest + leave an enforcement TODO. The module
   doc now states the gate withholds only a NEWLY minted cak_ (a prior clean cak_
   survives) and that nothing consults enrollment_state at control time yet, with a
   TODO(SPEC-016 Phase B/D) marker for relay/control-plane enforcement + revocation.
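
The convergence behaviour described in fix 3 can be illustrated without a database. This is a rough sketch, not the handler code: `InsertError`, `Enrolled`, `enroll_with_retry`, the retry bound, and the substring check on the constraint name are assumptions, and the lookup and INSERT are passed in as closures so the race can be simulated. Only the function name `is_machine_uid_conflict` and the SQLSTATE 23505 classification come from the commit message.

```rust
/// Errors an INSERT can surface in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum InsertError {
    /// Unique violation (SQLSTATE 23505) on the named constraint.
    UniqueViolation { constraint: String },
    /// Any other database failure.
    Other(String),
}

/// Treats only a unique violation on a machine_uid constraint as a benign race.
/// Matching by substring is an assumption made for this example.
pub fn is_machine_uid_conflict(err: &InsertError) -> bool {
    matches!(
        err,
        InsertError::UniqueViolation { constraint } if constraint.contains("machine_uid")
    )
}

/// Result of a successful enroll: the row id that was inserted or reused.
#[derive(Debug, PartialEq)]
pub enum Enrolled {
    New(u64),
    Reused(u64),
}

// Assumed bound; the commit only says the loop is bounded.
const MAX_ATTEMPTS: usize = 3;

/// Looks up the machine, inserts if absent, and re-looks-up when the INSERT
/// loses the unique-index race instead of failing the request.
pub fn enroll_with_retry<L, I>(mut lookup: L, mut insert: I) -> Result<Enrolled, InsertError>
where
    L: FnMut() -> Option<u64>,
    I: FnMut() -> Result<u64, InsertError>,
{
    for _ in 0..MAX_ATTEMPTS {
        if let Some(id) = lookup() {
            return Ok(Enrolled::Reused(id));
        }
        match insert() {
            Ok(id) => return Ok(Enrolled::New(id)),
            // A concurrent enroll won the race: loop and find its row.
            Err(e) if is_machine_uid_conflict(&e) => continue,
            // Any other violation is a real error and is surfaced as-is.
            Err(e) => return Err(e),
        }
    }
    Err(InsertError::Other("enroll retries exhausted".to_string()))
}
```

In this shape the caller would map `Err` to a 500, so a unique violation on any constraint other than machine_uid still surfaces as an error, as the commit message describes.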

Verify: cargo check, cargo clippy --all-targets, and cargo test all clean on this
Windows host (104 tests pass). Two DB-gated tests (cross-site bound-site_id exposure,
machine_uid-vs-agent_id conflict classification) no-op without TEST_DATABASE_URL and
run against real Postgres in CI; the Linux target / real-Postgres handler path is
validated there, not on this host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
azcomputerguru added 1 commit 2026-06-02 10:49:00 -07:00
style(enroll): cargo fmt --all (satisfy CI fmt gate)
All checks were successful
Build and Test / Build Agent (Windows) (pull_request) Successful in 16m35s
Build and Test / Build Server (Linux) (pull_request) Successful in 19m7s
Build and Test / Security Audit (pull_request) Successful in 5m27s
Build and Test / Build Summary (pull_request) Successful in 26s
4106fc4bc4
The Phase A work passed cargo check + clippy + tests locally but missed
`cargo fmt --all -- --check` (the first step of the Linux CI job): module
ordering in db/mod.rs and two trailing-comment alignments in rate_limit.rs.
No logic change. Agent build failure on the prior run was transient infra
(verified: agent crate compiles clean locally).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
azcomputerguru merged commit 89c3718266 into main 2026-06-02 11:19:41 -07:00
azcomputerguru deleted branch feat/spec-016-enrollment 2026-06-02 11:19:43 -07:00

Reference: azcomputerguru/guru-connect#5