spec: add v2-stable-identity implementation plan (SPEC-004 breakdown)

Ordered, execution-ready plan for SPEC-004 (stable machine identity + session reaping + operator removal).

Works out the core integration: machine_uid = deterministic MachineGuid-based hardware identity (recomputable, so config loss can't duplicate); per-agent cak_ key stays the credential/trust boundary; they compose so one cak_ key per machine_uid = one key per real machine (the prerequisite the fleet key-migration #7 needs).

Root cause grounded in code: agent_id is a random UUID (config.rs:90), connect_machines dedups on ON CONFLICT (agent_id), so config loss -> duplicate rows (DESKTOP-I66IM5Q x9 live).

5 ordered tasks (agent uid -> server dedup -> reconcile/age-out -> reaping -> operator removal). Unblocks #7 -> #5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
specs/v2-stable-identity/plan.md
@@ -0,0 +1,84 @@
# v2 Stable Machine Identity + Session Reaping + Operator Removal: Implementation Plan

> Status: planned 2026-05-30. Parent: [SPEC-004](../../docs/specs/SPEC-004-session-lifecycle-and-removal.md)
> (v2 Phase 1/2). Builds on the v2-secure-session-core per-agent `cak_` keys.
> Unblocks the fleet per-agent-key migration (task #7) → retire shared key (task #5).

## Why now

Live evidence of the problem: `connect_machines` holds **15 persistent rows for 5 real hosts**
(`DESKTOP-I66IM5Q` alone appears 9 times). This is the duplicate-registration bug. Until the fleet
is deduped, you cannot mint one `cak_` key per real machine, so the shared `AGENT_API_KEY` can't be
retired.

## The core integration: `machine_uid` vs. per-agent `cak_` key

Worked out against the current code so this composes with the just-built auth, not against it:

- **Today:** `agent_id` = random UUID from `generate_agent_id()` (`agent/src/config.rs:90`), persisted
  in the config file. Lost/missing config → a fresh UUID → a new row, because
  `connect_machines.agent_id` is `UNIQUE` and `upsert_machine` dedups on `ON CONFLICT (agent_id)`
  (`server/src/db/machines.rs:101,111`). Unstable id → duplicates.
- **Per-agent keys already prevent duplicates for KEYED agents:** a `cak_` key binds to a
  `connect_machines` row via `connect_agent_keys.machine_id`, and reattach uses the **key's** machine
  identity, ignoring a client-supplied `agent_id`. The duplicate problem is confined to the
  **shared-key / support-code / config-loss** fleet, which has no stable identity.
- **`machine_uid` = deterministic hardware identity** (Windows `MachineGuid` from
  `HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`, hashed; optionally folded with a board serial),
  **recomputable** so a lost config self-heals to the *same* id. It is the IDENTITY (who this box is);
  the `cak_` key is the CREDENTIAL (proof it's authorized as that box). They compose: **one `cak_` key
  per `machine_uid` = one key per real machine**, which is exactly what task #7 needs.
- **Security stance (unchanged from SPEC-004):** a client-asserted `machine_uid` is spoofable, so it is
  **not** a trust boundary on its own. For keyed agents the `cak_` key stays authoritative (server uses
  the key's machine, not the claimed uid). For un-keyed agents `machine_uid` is **dedup-only**
  (correctness, not trust). Reimage/clone caveats per SPEC-004 (`MachineGuid` regenerates on sysprep;
  clones collide, caught by the key binding).

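The precedence described in the bullets above (key binding first, then `machine_uid`, then the legacy `agent_id`) can be sketched as a small pure function. This is only an illustration: `ConnectClaim`, `MachineLookup`, and `resolve_machine` are hypothetical names, not types from the codebase, and the sketch assumes the `cak_` key has already been verified and mapped to its machine row before this step runs.

```rust
/// What the server knows about a connecting agent (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ConnectClaim {
    /// Machine row bound to the presented `cak_` key, resolved server-side.
    /// `None` for un-keyed (shared-key / support-code) agents.
    pub key_machine_id: Option<u64>,
    /// Client-asserted hardware identity: spoofable, so dedup-only.
    pub machine_uid: Option<String>,
    /// Legacy random id read from the agent's config file.
    pub agent_id: String,
}

/// Which column the server should use to find (or create) the machine row.
#[derive(Debug, PartialEq, Eq)]
pub enum MachineLookup {
    ByKeyBinding(u64),
    ByMachineUid(String),
    ByLegacyAgentId(String),
}

pub fn resolve_machine(claim: &ConnectClaim) -> MachineLookup {
    // Keyed agents: the key's machine is authoritative; claimed ids are ignored.
    if let Some(id) = claim.key_machine_id {
        return MachineLookup::ByKeyBinding(id);
    }
    // Un-keyed agents: machine_uid is used for dedup only, never for trust.
    match &claim.machine_uid {
        Some(uid) if !uid.is_empty() => MachineLookup::ByMachineUid(uid.clone()),
        // Legacy agents that report no uid keep the old agent_id behaviour.
        _ => MachineLookup::ByLegacyAgentId(claim.agent_id.clone()),
    }
}
```

A keyed agent that asserts someone else's `machine_uid` still resolves to its own key's machine, which is the property the security stance relies on.
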
## Tasks (ordered; each Coding Agent + Code Review, gates green)

### Task 1: Agent derives + reports `machine_uid`
- New `agent/src/identity.rs`: `machine_uid()` = stable hash of `MachineGuid` (Windows; registry read),
  recomputable; non-Windows fallback = a persisted random UUID. Cache in config but **recompute if
  absent** (don't depend on the config file for correctness).
- Send `machine_uid` in the connect handshake (`agent/src/transport/websocket.rs:40` query) and on
  `AgentStatus` (proto). Keep the legacy random `agent_id` as a migration fallback only.
- Tests: deterministic (same machine inputs → same uid); a wiped config recomputes the same uid
  (Windows path; the non-Windows fallback still depends on its persisted UUID).

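A minimal sketch of the derivation and the recompute-if-absent rule from Task 1. Several things here are assumptions rather than the plan's decisions: the hash is a dependency-free 64-bit FNV-1a standing in for whatever cryptographic hash (for example SHA-256) the real `identity.rs` would use, the `machine_uid:v1:` domain tag and the trim/lowercase normalisation are illustrative, and the registry read is injected as a closure so the logic stays testable off Windows.

```rust
/// 64-bit FNV-1a. Chosen only because it is stable across Rust releases and
/// needs no crate; it is a stand-in for a cryptographic hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Deterministically derive a uid from the raw MachineGuid string.
pub fn derive_machine_uid(machine_guid: &str) -> String {
    // Assumed normalisation so formatting differences don't change the uid.
    let normalized = machine_guid.trim().to_ascii_lowercase();
    let mut input = b"machine_uid:v1:".to_vec(); // hypothetical domain tag
    input.extend_from_slice(normalized.as_bytes());
    format!("{:016x}", fnv1a64(&input))
}

/// Use the cached value if the config still has one; otherwise recompute from
/// the machine, and only fall back to the persisted random UUID when no
/// MachineGuid is available (the non-Windows path).
pub fn machine_uid(
    cached: Option<&str>,
    read_machine_guid: impl Fn() -> Option<String>,
    fallback_uuid: &str,
) -> String {
    if let Some(uid) = cached {
        return uid.to_string();
    }
    match read_machine_guid() {
        Some(guid) => derive_machine_uid(&guid),
        None => fallback_uuid.to_string(),
    }
}
```

Because the uid depends only on the `MachineGuid`, calling `machine_uid(None, ...)` after the config has been wiped yields the same value the agent reported before.
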
### Task 2: Server schema + dedup on `machine_uid`
- Migration `008_machine_uid.sql`: add `connect_machines.machine_uid TEXT` (nullable for legacy) +
  a unique index `WHERE machine_uid IS NOT NULL`. Idempotent; startup-applied.
- `upsert_machine` keys on `machine_uid` when present (`ON CONFLICT (machine_uid)`), falling back to
  `agent_id` for legacy agents. Session reuse / reattach key on `machine_uid`. The `cak_` key's machine
  binding stays authoritative for keyed agents.
- Tests: same `machine_uid` with varying `agent_id` → ONE row; legacy (no uid) path unchanged.

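The dedup rule in Task 2 can be modelled in memory to make the expected test outcomes concrete. This is a sketch of the semantics only, not the SQL: `MachineTable` and `MachineRow` are hypothetical, and the real `upsert_machine` would express the same choice through its `ON CONFLICT` target.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MachineRow {
    pub id: u64,
    pub agent_id: String,
    pub machine_uid: Option<String>, // NULL for legacy agents
    pub hostname: String,
}

/// In-memory stand-in for the `connect_machines` table.
#[derive(Default)]
pub struct MachineTable {
    rows: Vec<MachineRow>,
    next_id: u64,
}

impl MachineTable {
    pub fn len(&self) -> usize {
        self.rows.len()
    }

    /// Upsert keyed on `machine_uid` when present, else on `agent_id`.
    /// Not modelled: a legacy row whose agent later starts reporting a uid.
    pub fn upsert_machine(
        &mut self,
        agent_id: &str,
        machine_uid: Option<&str>,
        hostname: &str,
    ) -> u64 {
        let existing = match machine_uid {
            Some(uid) => self
                .rows
                .iter_mut()
                .find(|r| r.machine_uid.as_deref() == Some(uid)),
            None => self.rows.iter_mut().find(|r| r.agent_id == agent_id),
        };
        if let Some(row) = existing {
            // Conflict: refresh the mutable columns, keep the same row id.
            row.agent_id = agent_id.to_string();
            row.hostname = hostname.to_string();
            return row.id;
        }
        self.next_id += 1;
        self.rows.push(MachineRow {
            id: self.next_id,
            agent_id: agent_id.to_string(),
            machine_uid: machine_uid.map(str::to_string),
            hostname: hostname.to_string(),
        });
        self.next_id
    }
}
```

With this rule, an agent that loses its config and reconnects with a fresh `agent_id` but the same `machine_uid` updates its existing row instead of creating a ghost.
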
### Task 3: Reconcile existing duplicate rows
- Prefer **age-out + operator removal over a risky backfill** (per the #7 audit note): once Tasks 4/5
  land, the 14 duplicate ghost rows are reaped or purged naturally. If a deterministic collapse is
  wanted later, map duplicates by hostname→`machine_uid` and repoint sessions/keys, but only as a
  separate, reviewed migration. v1 of this plan does NOT auto-collapse.

### Task 4: Session lifecycle reaping (`server/src/session/mod.rs`)
- `reap_stale_persistent(ttl)` on `SessionManager`: periodic sweep (spawned at startup, ~60s) removing
  persistent sessions that have been offline (`is_online == false`) past a TTL (default 10 min,
  measured from `last_heartbeat_instant`).
- On reconnect, **supersede** prior same-machine (`machine_uid`) sessions so a fresh `agent_id` can't
  strand the old one.
- Tests: offline-past-TTL reaped; online / within-TTL spared; same-machine reconnect supersedes; never
  reap an online or viewer-attached session.

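A sketch of the sweep and the supersede step from Task 4, under stated assumptions. The `Session` fields other than `is_online` and `last_heartbeat_instant` are hypothetical, `now` is passed in explicitly (the plan's signature takes only `ttl`) so the behaviour can be tested without sleeping, and viewer hand-off during supersede is not modelled.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone)]
pub struct Session {
    pub machine_uid: Option<String>,
    pub is_persistent: bool,
    pub is_online: bool,
    pub viewer_count: usize, // hypothetical: number of attached viewers
    pub last_heartbeat_instant: Instant,
}

#[derive(Default)]
pub struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    pub fn insert(&mut self, id: u64, session: Session) {
        self.sessions.insert(id, session);
    }

    pub fn contains(&self, id: u64) -> bool {
        self.sessions.contains_key(&id)
    }

    /// Remove persistent sessions that are offline, have no viewers, and whose
    /// last heartbeat is older than `ttl`. Returns the reaped ids, sorted.
    pub fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<u64> {
        let mut reaped: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();
        reaped.sort_unstable();
        for id in &reaped {
            self.sessions.remove(id);
        }
        reaped
    }

    /// On reconnect, drop every other session for the same machine so the
    /// session `keep_id` is the only one left for that `machine_uid`.
    pub fn supersede_same_machine(&mut self, machine_uid: &str, keep_id: u64) -> Vec<u64> {
        let mut superseded: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(id, s)| **id != keep_id && s.machine_uid.as_deref() == Some(machine_uid))
            .map(|(id, _)| *id)
            .collect();
        superseded.sort_unstable();
        for id in &superseded {
            self.sessions.remove(id);
        }
        superseded
    }
}
```

In the server, the periodic task spawned at startup would call `reap_stale_persistent` roughly every 60 seconds with the configured TTL.
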
### Task 5: Operator removal API + dashboard
- `deleted_at` on `connect_sessions` (+ machines as needed); `DELETE …?purge=true` (in-memory remove +
  DB soft-delete) distinct from the live-only disconnect; a bulk endpoint; per-row + multi-select removal
  on the dashboard machines/sessions views. Admin-gated, audited to `events`.
- This is also the **immediate fix for the live ghost rows**: once it lands, purge the 14 duplicates.

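The difference between a live-only disconnect and a purge can be sketched as follows. Everything in this block is illustrative: `Store`, `RemovalError`, and the event string are hypothetical, the database is modelled as a map from session id to `deleted_at`, and the HTTP routing and real admin check are left out.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum RemovalError {
    NotAdmin,
    NotFound,
}

#[derive(Default)]
pub struct Store {
    live: HashSet<u64>,                    // in-memory sessions
    deleted_at: HashMap<u64, Option<u64>>, // DB rows: id -> deleted_at (unix secs)
    events: Vec<String>,                   // stand-in for the `events` audit table
}

impl Store {
    pub fn add(&mut self, id: u64) {
        self.live.insert(id);
        self.deleted_at.insert(id, None);
    }

    /// Row exists and has not been soft-deleted.
    pub fn is_visible(&self, id: u64) -> bool {
        matches!(self.deleted_at.get(&id), Some(None))
    }

    pub fn audit_len(&self) -> usize {
        self.events.len()
    }

    /// Live-only disconnect: drops the in-memory session, keeps the DB row.
    pub fn disconnect(&mut self, id: u64) -> bool {
        self.live.remove(&id)
    }

    /// Purge: admin-gated in-memory remove + DB soft-delete + audit event.
    pub fn purge(&mut self, id: u64, is_admin: bool, now_unix: u64) -> Result<(), RemovalError> {
        if !is_admin {
            return Err(RemovalError::NotAdmin);
        }
        let slot = match self.deleted_at.get_mut(&id) {
            Some(slot) => slot,
            None => return Err(RemovalError::NotFound),
        };
        if slot.is_some() {
            return Err(RemovalError::NotFound); // already soft-deleted
        }
        *slot = Some(now_unix);
        self.live.remove(&id);
        self.events
            .push(format!("session_purged id={} at={}", id, now_unix));
        Ok(())
    }

    /// Bulk variant: one result per requested id, in request order.
    pub fn purge_bulk(
        &mut self,
        ids: &[u64],
        is_admin: bool,
        now_unix: u64,
    ) -> Vec<Result<(), RemovalError>> {
        ids.iter()
            .map(|id| self.purge(*id, is_admin, now_unix))
            .collect()
    }
}
```

Returning one result per id from the bulk call is a design choice made for this sketch; it lets the dashboard's multi-select view report partial failures instead of failing the whole batch.
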
## Exit criteria
One machine = one record/session; a config-loss or portable run can't duplicate; admins can purge stale
rows individually and in bulk; the fleet is deduped enough to mint one `cak_` key per real machine,
which unblocks task #7 (fleet key migration) → task #5 (retire shared `AGENT_API_KEY`).

## Open questions
1. `machine_uid` recipe: `MachineGuid` alone vs. folded with a board/BIOS serial? (Proposed: MachineGuid
   primary, hashed.)
2. Should `machine_uid` REPLACE `agent_id` as the primary key, or sit alongside it (legacy fallback)?
   (Proposed: alongside, dedup prefers `machine_uid`; agent_id retained for legacy + transition.)
3. Reap TTL default (10 min proposed) and whether managed vs. support sessions differ.