spec(v2-session-core): add Task 9 — cak_ auto-enroll provisioning (TOFU) + shared-key retirement

2026-06-01 14:40:14 -07:00
parent 0059b21db6
commit 40c7d860cc
1 changed file with 57 additions and 0 deletions
--- a/specs/v2-secure-session-core/plan.md
+++ b/specs/v2-secure-session-core/plan.md
@@ -527,3 +527,60 @@ Reference: SPEC-002 §5; `agent/src/encoder/raw.rs` (salvaged), `proto/guruconne
 - **Rate limiting:** hammer `/api/auth/login` and the code-validate route → confirm throttling/lockout.
 - **Migrations:** fresh DB applies the v2 migrations cleanly; `_sqlx_migrations` consistent; `tenant_id`
  populated with the default tenant.
+
+---
+
+## Task 9 [PROPOSED 2026-06-01 — provisioning model = TOFU auto-enroll, chosen by Mike]: `cak_` auto-enroll provisioning + shared-key retirement
+
+> Context: Task 2 built the SERVER `cak_` machinery (mint/SHA-256 hash/verify in `auth/agent_keys.rs`,
+> relay validation in `validate_agent_api_key`, admin issuance `POST /api/machines/:id/keys`). What's
+> missing is how an AGENT obtains and uses a `cak_` — today agents still carry the deprecated shared
+> `AGENT_API_KEY`, so `connect_agent_keys` is empty and the relay logs the DEPRECATED-shared-key warning
+> for every agent. This task closes that gap with **trust-on-first-use auto-enroll** so the shared key can
+> be retired (unblocks task-list #5). NOTE: the agent already presents whatever is in its `api_key` slot
+> and the relay auto-detects `cak_` vs shared, so a `cak_`-keyed agent needs **no change to its auth
+> call**, only a way to *receive*, *persist*, and *prefer* a `cak_`.
+
+**Flow (TOFU):**
+1. **Bootstrap (first connect):** a fresh agent authenticates on `/ws/agent` with a bootstrap secret.
+   Interim: the shared `AGENT_API_KEY` (embedded by the download endpoint). Target: a single-use,
+   short-lived **enroll token** (more secure TOFU; see Security).
+2. **Server issues on first connect:** when an agent that authenticated via the bootstrap path (i.e. NOT
+   already `cak_`-keyed) connects and its machine has **no active (non-revoked) `cak_`**, the relay:
+   (a) resolves or creates the machine row (existing `upsert_machine` on `machine_uid`, functional since
+   the 2026-06-01 ON CONFLICT fix); (b) mints a `cak_` (`generate_agent_key` +
+   `db::agent_keys::insert_agent_key` for that `machine_id`); (c) sends the plaintext key to the agent
+   **once** in a new server→agent `AgentKeyProvision` message. Only the hash is stored. **Idempotent:**
+   never re-issue if an active key already exists for the machine.
+3. **Agent receives + persists + prefers:** on `AgentKeyProvision`, the agent persists the `cak_` durably at
+   `%ProgramData%\GuruConnect\agent_key` (restricted ACL, same pattern as `machine_uid`). On startup it loads
+   the persisted `cak_` if present and uses it as its auth key, falling back to the embedded/bootstrap secret
+   only when no `cak_` is stored yet. After provisioning, every reconnect authenticates via `cak_` (no more
+   DEPRECATED-shared-key warning for that agent).
+4. **Shared-key retirement (phased):** Phase A: the shared key stays as the bootstrap secret so existing
+   and new agents self-enroll; monitor the relay's DEPRECATED-shared-key WARN count until it drops to ~0.
+   Phase B: once the fleet is `cak_`-keyed, restrict the shared `AGENT_API_KEY` to enrollment-only or
+   remove the env var entirely (only `cak_` / enroll-token accepted). This is the concrete completion of
+   task-list #5.
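
The issuance rule in step 2 can be sketched as one decision function over an in-memory stand-in for `connect_agent_keys`. This is a minimal sketch under assumptions, not the relay's code: `AuthPath`, `KeyStore`, `on_agent_connect`, the `u64` machine ID, and the placeholder key strings are hypothetical names for illustration. The real relay would presumably call `has_active_key`, `generate_agent_key`, and `insert_agent_key` against the database instead.

```rust
use std::collections::HashMap;

/// How the connecting agent authenticated (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum AuthPath {
    Bootstrap, // shared AGENT_API_KEY or, later, an enroll token
    AgentKey,  // agent already presented a cak_
}

/// In-memory stand-in for the connect_agent_keys table.
/// Maps machine_id -> one flag per issued key (true = revoked).
#[derive(Default)]
struct KeyStore {
    keys: HashMap<u64, Vec<bool>>,
    minted: u64,
}

impl KeyStore {
    /// True if the machine has at least one non-revoked key.
    fn has_active_key(&self, machine_id: u64) -> bool {
        self.keys
            .get(&machine_id)
            .map_or(false, |ks| ks.iter().any(|revoked| !revoked))
    }

    /// Placeholder for generate_agent_key + insert_agent_key.
    /// Returns a fake plaintext; a real store would keep only its hash.
    fn mint(&mut self, machine_id: u64) -> String {
        self.minted += 1;
        self.keys.entry(machine_id).or_default().push(false);
        format!("cak_placeholder_{}", self.minted)
    }

    /// Marks every key for the machine as revoked.
    fn revoke_all(&mut self, machine_id: u64) {
        if let Some(ks) = self.keys.get_mut(&machine_id) {
            ks.iter_mut().for_each(|revoked| *revoked = true);
        }
    }
}

/// Returns Some(plaintext) only when a key must be provisioned on this connect:
/// the agent came in via the bootstrap path and the machine has no active key.
fn on_agent_connect(store: &mut KeyStore, machine_id: u64, auth: AuthPath) -> Option<String> {
    if auth == AuthPath::Bootstrap && !store.has_active_key(machine_id) {
        Some(store.mint(machine_id))
    } else {
        None
    }
}
```

With this shape, a second bootstrap connect for the same machine yields `None` (idempotent), while a bootstrap connect after revocation mints a fresh key, which matches the re-enroll behaviour described under Security.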
+
+**Protocol (4-artifact drift discipline):** add `AgentKeyProvision { string key = 1; }` (server→agent) to
+`proto/guruconnect.proto` under a newly allocated message ID; regenerate prost on both agent + server. The
+hand-written `dashboard/src/lib/protobuf.ts` decoder does NOT need to handle it (agent-plane only), but
+reserve the ID so it is not reused.
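
For reference, the body of `AgentKeyProvision { string key = 1; }` is small enough to encode by hand, which helps when eyeballing captures during drift checks. The sketch below covers only that inner message. It makes no assumption about the envelope or message ID used in `proto/guruconnect.proto`, and the function names are hypothetical. In practice the prost-generated type would be used; this hand-rolled codec is purely illustrative.

```rust
/// Tag byte for field number 1 with wire type 2 (length-delimited): (1 << 3) | 2.
const KEY_FIELD_TAG: u8 = 0x0A;

/// Encode the message body as: tag, varint length, UTF-8 bytes of the key.
/// Note: a canonical proto3 encoder omits the field entirely for an empty string.
fn encode_agent_key_provision(key: &str) -> Vec<u8> {
    let mut out = vec![KEY_FIELD_TAG];
    let mut len = key.len() as u64;
    loop {
        let byte = (len & 0x7F) as u8;
        len >>= 7;
        if len == 0 {
            out.push(byte); // last varint byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
    out.extend_from_slice(key.as_bytes());
    out
}

/// Decode a body holding exactly one `key` field; None on any malformed input.
fn decode_agent_key_provision(buf: &[u8]) -> Option<String> {
    let (&tag, rest) = buf.split_first()?;
    if tag != KEY_FIELD_TAG {
        return None;
    }
    // Read the varint length prefix.
    let mut len: u64 = 0;
    let mut shift: u32 = 0;
    let mut idx = 0usize;
    loop {
        let b = *rest.get(idx)?;
        idx += 1;
        len |= u64::from(b & 0x7F) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
        if shift >= 64 {
            return None; // varint too long
        }
    }
    let len = usize::try_from(len).ok()?;
    let end = idx.checked_add(len)?;
    let body = rest.get(idx..end)?; // None if the buffer is truncated
    if end != rest.len() {
        return None; // this sketch rejects trailing fields
    }
    String::from_utf8(body.to_vec()).ok()
}
```

A real decoder would skip unknown fields rather than reject them; the strictness here only keeps the sketch short.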
+
+**Files:** `proto/guruconnect.proto` (new message); `server/src/relay/mod.rs` (issue+send on bootstrap connect
+with no active key); `server/src/db/agent_keys.rs` (add `has_active_key(machine_id)` check; reuse insert);
+`agent/src/transport/*` (handle inbound `AgentKeyProvision`); `agent/src/config.rs` + a small key-store module
+(load/persist `cak_`, prefer over bootstrap).
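
The key-store module mentioned above could look roughly like the following. It is a sketch under assumptions: the function names (`load_agent_key`, `persist_agent_key`, `select_auth_key`) are hypothetical, the caller supplies the path, and ACL restriction is omitted because it is platform-specific. Only the `cak_` prefix check and the prefer-over-bootstrap rule come from the plan itself.

```rust
use std::fs;
use std::io;
use std::path::Path;

const KEY_PREFIX: &str = "cak_";

/// Load a previously provisioned key.
/// Returns None if the file is missing, unreadable, or does not hold a cak_.
fn load_agent_key(path: &Path) -> Option<String> {
    let raw = fs::read_to_string(path).ok()?;
    let key = raw.trim();
    if key.starts_with(KEY_PREFIX) && key.len() > KEY_PREFIX.len() {
        Some(key.to_string())
    } else {
        None
    }
}

/// Persist the key by writing a temp file and renaming it into place,
/// so a crash mid-write does not leave a truncated key behind.
/// Restricting the file's ACL is deliberately left out of this sketch.
fn persist_agent_key(path: &Path, key: &str) -> io::Result<()> {
    if !key.starts_with(KEY_PREFIX) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a cak_ key"));
    }
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, key)?;
    fs::rename(&tmp, path)
}

/// Prefer the stored cak_; fall back to the bootstrap secret only when none is stored.
fn select_auth_key(stored: Option<String>, bootstrap: &str) -> String {
    stored.unwrap_or_else(|| bootstrap.to_string())
}
```

On startup the agent would call `select_auth_key(load_agent_key(path), bootstrap)`, and on receiving `AgentKeyProvision` it would call `persist_agent_key` before the next reconnect.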
+
+**Security (TOFU):** the first connect trusts the bootstrap secret, so anyone holding a leaked shared key
+during the enroll window could enroll a rogue agent. The secure target is a **single-use, short-lived
+enroll token** per deployment instead of the shared key (shared-key bootstrap is an interim convenience).
+The `cak_` plaintext is sent once over the existing wss/TLS channel; only the hash is stored server-side;
+the agent stores it locally with restricted ACLs. Revoking a key via the existing
+`DELETE /api/machines/:id/keys/:key_id` fails the agent closed; on its next bootstrap connect it
+re-enrolls. The keyed-agent dedup (Task 3) keeps the authenticated identity authoritative.
+
+**Verification:** deploy a current-build (signed 0.3.0+) agent configured with the shared-key bootstrap →
+it connects, receives a `cak_`, and persists it; restart → it authenticates via the `cak_` (relay shows NO
+DEPRECATED-shared-key warning) and `connect_agent_keys` holds exactly one active key for the machine;
+reconnect repeatedly → issuance is idempotent (no additional keys minted); revoke the key via the admin
+API → agent rejected, then re-enrolls on next bootstrap connect. Reference: `auth/agent_keys.rs`,
+`api/machine_keys.rs`, `relay/mod.rs:266-309` (`validate_agent_api_key`),
+`.claude/standards/security/credential-handling.md`.