Ordered, execution-ready plan for SPEC-004 (stable machine identity + session reaping + operator removal). Works out the core integration: machine_uid = deterministic MachineGuid-based hardware identity (recomputable, so config loss can't duplicate); per-agent cak_ key stays the credential/trust boundary; they compose so one cak_ key per machine_uid = one key per real machine (the prerequisite the fleet key-migration #7 needs). Root cause grounded in code: agent_id is a random UUID (config.rs:90), connect_machines dedups on ON CONFLICT (agent_id), so config loss -> duplicate rows (DESKTOP-I66IM5Q x9 live). 5 ordered tasks (agent uid -> server dedup -> reconcile/age-out -> reaping -> operator removal). Unblocks #7 -> #5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v2 Stable Machine Identity + Session Reaping + Operator Removal — Implementation Plan
Status: planned 2026-05-30. Parent: SPEC-004 (v2 Phase 1/2). Builds on the v2-secure-session-core per-agent
cak_ keys. Unblocks the fleet per-agent-key migration (task #7) → retire shared key (task #5).
Why now
Live evidence of the problem: connect_machines holds 15 persistent rows for 5 real hosts
(DESKTOP-I66IM5Q ×9) — the duplicate-registration bug. Until the fleet is deduped, you
cannot mint one cak_ key per real machine, so the shared AGENT_API_KEY can't be retired.
The core integration: machine_uid vs. per-agent cak_ key
Worked out against the current code so this composes with the just-built auth, not against it:
- Today: agent_id = random UUID from generate_agent_id() (agent/src/config.rs:90), persisted in the config file. Lost/missing config → a fresh UUID → a new row, because connect_machines.agent_id is UNIQUE and upsert_machine dedups on ON CONFLICT (agent_id) (server/src/db/machines.rs:101,111). Unstable id → duplicates.
- Per-agent keys already prevent duplicates for KEYED agents: a cak_ key binds to a connect_machines row via connect_agent_keys.machine_id, and reattach uses the key's machine identity, ignoring a client-supplied agent_id. The duplicate problem is the shared-key / support-code / config-loss fleet, which has no stable identity.
- machine_uid = deterministic hardware identity (Windows MachineGuid from HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid, hashed; optionally folded with a board serial), recomputable so a lost config self-heals to the same id. It is the IDENTITY (who this box is); the cak_ key is the CREDENTIAL (proof it's authorized as that box). They compose: one cak_ key per machine_uid = one key per real machine, exactly what task #7 needs.
- Security stance (unchanged from SPEC-004): a client-asserted machine_uid is spoofable, so it is not a trust boundary on its own. For keyed agents the cak_ key stays authoritative (server uses the key's machine, not the claimed uid). For un-keyed agents machine_uid is dedup-only (correctness, not trust). Reimage/clone caveats per SPEC-004 (MachineGuid regenerates on sysprep, clones collide, caught by the key binding).
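The precedence described in the bullets above (key binding first, then machine_uid for dedup only, then the legacy agent_id) can be sketched as a small pure function. This is an illustrative sketch, not the server's code: ResolvedIdentity, resolve_identity, and the i64 type for the machine row id are all hypothetical names and choices made for the example.

```rust
/// How a connection is mapped to a machine record.
/// Hypothetical type; the real server may model this differently.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolvedIdentity {
    /// Keyed agent: the cak_ key's bound machine row wins; client claims are ignored.
    KeyBound { machine_id: i64 },
    /// Un-keyed agent that reported a machine_uid: used for dedup only, not for trust.
    DedupByUid(String),
    /// Legacy agent with no usable machine_uid: fall back to the random agent_id.
    LegacyAgentId(String),
}

/// `key_machine_id` is the machine row the presented cak_ key is bound to, if any.
/// The two `claimed_*` values come from the client and are treated as untrusted.
pub fn resolve_identity(
    key_machine_id: Option<i64>,
    claimed_machine_uid: Option<&str>,
    claimed_agent_id: &str,
) -> ResolvedIdentity {
    // 1. A key binding is authoritative, whatever uid the client asserts.
    if let Some(machine_id) = key_machine_id {
        return ResolvedIdentity::KeyBound { machine_id };
    }
    // 2. Otherwise a non-empty machine_uid is the dedup key; 3. else the legacy id.
    match claimed_machine_uid.map(str::trim).filter(|uid| !uid.is_empty()) {
        Some(uid) => ResolvedIdentity::DedupByUid(uid.to_string()),
        None => ResolvedIdentity::LegacyAgentId(claimed_agent_id.to_string()),
    }
}
```

Treating a blank machine_uid the same as a missing one is an assumption of the sketch; it keeps an empty string from becoming a shared dedup key across unrelated agents.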
Tasks (ordered; each Coding Agent + Code Review, gates green)
Task 1 — Agent derives + reports machine_uid
- New agent/src/identity.rs: machine_uid() = stable hash of MachineGuid (Windows; registry read), recomputable; non-Windows fallback = a persisted random UUID. Cache in config but recompute if absent (don't depend on the config file for correctness).
- Send machine_uid in the connect handshake (agent/src/transport/websocket.rs:40 query) and on AgentStatus (proto). Keep the legacy random agent_id as a migration fallback only.
- Tests: deterministic (same machine inputs → same uid); a wiped config recomputes the same uid.
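A minimal sketch of the derivation in Task 1, under stated assumptions. The registry read is left out and the raw MachineGuid is passed in as a string, so the example stays platform-independent. FNV-1a is used only because it fits in a few lines of std-only Rust; it is a stand-in for whatever hash the plan settles on (a cryptographic hash such as SHA-256 would be the more likely choice). The normalisation step, the "muid-" prefix, and the function names are hypothetical.

```rust
// FNV-1a 64-bit constants. Stand-in hash for the sketch only, NOT a recommendation.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Derive a uid from a raw MachineGuid string.
/// Assumed normalisation: trim whitespace, strip surrounding braces, lowercase,
/// so cosmetic differences in how the GUID is read do not change the uid.
pub fn machine_uid_from_guid(raw_guid: &str) -> Option<String> {
    let normalised = raw_guid
        .trim()
        .trim_matches(|c: char| c == '{' || c == '}')
        .to_ascii_lowercase();
    if normalised.is_empty() {
        return None;
    }
    Some(format!("muid-{:016x}", fnv1a64(normalised.as_bytes())))
}

/// Recompute from hardware identity when available; otherwise use the persisted
/// random UUID (the non-Windows fallback). Nothing here reads the config file
/// to obtain the hardware-derived value.
pub fn machine_uid(machine_guid: Option<&str>, persisted_fallback_uuid: &str) -> String {
    machine_guid
        .and_then(machine_uid_from_guid)
        .unwrap_or_else(|| persisted_fallback_uuid.to_string())
}
```

Because the uid is a pure function of the MachineGuid, a wiped config recomputes the same value on the next start, which is the property the Task 1 tests check.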
Task 2 — Server schema + dedup on machine_uid
- Migration 008_machine_uid.sql: add connect_machines.machine_uid TEXT (nullable for legacy) + a unique index WHERE machine_uid IS NOT NULL. Idempotent; startup-applied.
- upsert_machine keys on machine_uid when present (ON CONFLICT (machine_uid)), falling back to agent_id for legacy agents. Session reuse / reattach key on machine_uid. The cak_ key's machine binding stays authoritative for keyed agents.
- Tests: same machine_uid with varying agent_id → ONE row; legacy (no uid) path unchanged.
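The dedup rule in Task 2 can be modelled in memory to make the expected test outcomes concrete. This is a sketch of the behaviour only, assuming a map keyed by the conflict target; it is not the SQL upsert itself, and MachineTable, DedupKey, and MachineRow are hypothetical names.

```rust
use std::collections::HashMap;

/// The conflict target: machine_uid when present, agent_id for legacy agents.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum DedupKey {
    MachineUid(String),
    AgentId(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MachineRow {
    pub agent_id: String,
    pub machine_uid: Option<String>,
    pub hostname: String,
}

/// In-memory stand-in for connect_machines (hypothetical, for illustration).
#[derive(Default)]
pub struct MachineTable {
    rows: HashMap<DedupKey, MachineRow>,
}

impl MachineTable {
    /// Insert or update. A repeated machine_uid overwrites the existing row,
    /// including its agent_id, instead of creating a second row.
    pub fn upsert_machine(&mut self, agent_id: &str, machine_uid: Option<&str>, hostname: &str) {
        let key = match machine_uid {
            Some(uid) => DedupKey::MachineUid(uid.to_string()),
            None => DedupKey::AgentId(agent_id.to_string()),
        };
        let row = MachineRow {
            agent_id: agent_id.to_string(),
            machine_uid: machine_uid.map(str::to_string),
            hostname: hostname.to_string(),
        };
        self.rows.insert(key, row);
    }

    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}
```

Two upserts with the same machine_uid and different agent_id values leave one row; two legacy upserts with different agent_id values and no uid leave two, matching the unchanged legacy path.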
Task 3 — Reconcile existing duplicate rows
- Prefer age-out + operator removal over a risky backfill (per the #7 audit note): once Tasks 4/5 land, the 14 duplicate ghost rows reap/purge naturally. If a deterministic collapse is wanted later, map duplicates by hostname → machine_uid and repoint sessions/keys, but only as a separate, reviewed migration. v1 of this plan does NOT auto-collapse.
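If a later collapse is considered, a read-only report of which hostnames carry more than one row would be a natural first step. The sketch below only groups and reports; it changes nothing. The function name, the (hostname, agent_id) input shape, and the case-insensitive hostname comparison are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Group (hostname, agent_id) pairs by hostname and keep only the hostnames
/// that appear on more than one row. Hostnames are compared case-insensitively
/// (an assumption of this sketch). Read-only: nothing is merged or deleted.
pub fn duplicate_hostnames(rows: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (hostname, agent_id) in rows {
        groups
            .entry(hostname.to_ascii_uppercase())
            .or_default()
            .push(agent_id.to_string());
    }
    // Drop hostnames with a single row; they are not duplicates.
    groups.retain(|_, agent_ids| agent_ids.len() > 1);
    groups
}
```

Hostname alone is a weak key (two distinct machines can share a name), which is one reason to keep any collapse behind a separate, reviewed migration as stated above.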
Task 4 — Session lifecycle reaping (server/src/session/mod.rs)
- reap_stale_persistent(ttl) on SessionManager: periodic sweep (spawned at startup, ~60s) removing persistent sessions offline (is_online == false) past a TTL (default 10 min, via last_heartbeat_instant).
- On reconnect, supersede prior same-machine (machine_uid) sessions so a fresh agent_id can't strand the old one.
- Tests: offline-past-TTL reaped; online / within-TTL spared; same-machine reconnect supersedes; never reap an online or viewer-attached session.
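The reap predicate from Task 4 can be sketched as follows. This is a simplified model, not the contents of server/src/session/mod.rs: the Session fields other than is_online and last_heartbeat_instant, the u64 session id, and the explicit `now` parameter (added so the sweep can be tested without sleeping) are assumptions. The periodic task that would call this every ~60s is omitted.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simplified session record; field set is assumed for the sketch.
pub struct Session {
    pub is_persistent: bool,
    pub is_online: bool,
    pub viewer_count: usize,
    pub last_heartbeat_instant: Instant,
}

#[derive(Default)]
pub struct SessionManager {
    pub sessions: HashMap<u64, Session>,
}

impl SessionManager {
    /// Remove persistent sessions that are offline, have no viewer attached,
    /// and whose last heartbeat is older than `ttl`. Returns the reaped ids,
    /// sorted so callers get a stable order for logging.
    pub fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<u64> {
        let mut reaped: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    // saturating: a heartbeat "after" `now` counts as zero elapsed.
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();
        reaped.sort_unstable();
        for id in &reaped {
            self.sessions.remove(id);
        }
        reaped
    }
}
```

Passing `now` in rather than calling Instant::now() inside the method is a testing convenience; the online and viewer checks sit in the predicate itself so that no TTL value can cause an online or viewer-attached session to be reaped.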
Task 5 — Operator removal API + dashboard
- deleted_at on connect_sessions (+ machines as needed); DELETE …?purge=true (in-memory remove + DB soft-delete) distinct from the live-only disconnect; a bulk endpoint; per-row + multi-select removal on the dashboard machines/sessions views. Admin-gated, audited to events.
- This is also the immediate fix for the live ghost rows: once it lands, purge the 14 duplicates.
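The difference between the live-only disconnect and a purge can be illustrated with a small in-memory model. It is a sketch under assumptions: SessionStore and its methods are hypothetical, deleted_at is modelled as an optional Unix timestamp, the audit log is a plain list standing in for the events table, and admin gating is not modelled (the admin name is only recorded).

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the live session map plus the persisted rows.
#[derive(Default)]
pub struct SessionStore {
    live: HashSet<u64>,                    // sessions currently held in memory
    deleted_at: HashMap<u64, Option<u64>>, // row id -> soft-delete time (None = active)
    audit: Vec<String>,                    // stand-in for the events table
}

impl SessionStore {
    pub fn register(&mut self, id: u64) {
        self.live.insert(id);
        self.deleted_at.insert(id, None);
    }

    /// Live-only disconnect: drops the in-memory session, leaves the row visible.
    pub fn disconnect(&mut self, id: u64) -> bool {
        self.live.remove(&id)
    }

    /// Purge: in-memory remove plus soft-delete of the row, with an audit entry.
    /// Returns false if no such row exists. Re-purging keeps the first timestamp.
    pub fn purge(&mut self, id: u64, now_unix: u64, admin: &str) -> bool {
        let row = match self.deleted_at.get_mut(&id) {
            Some(row) => row,
            None => return false,
        };
        if row.is_none() {
            *row = Some(now_unix);
            self.audit.push(format!("{} purged session {}", admin, id));
        }
        self.live.remove(&id);
        true
    }

    /// Bulk purge; returns how many of the requested ids matched a row.
    pub fn purge_bulk(&mut self, ids: &[u64], now_unix: u64, admin: &str) -> usize {
        let mut purged = 0;
        for id in ids {
            if self.purge(*id, now_unix, admin) {
                purged += 1;
            }
        }
        purged
    }

    /// Rows a dashboard would list: everything not soft-deleted, sorted by id.
    pub fn visible(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .deleted_at
            .iter()
            .filter(|(_, deleted)| deleted.is_none())
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }

    pub fn audit_log(&self) -> &[String] {
        &self.audit
    }
}
```

After a disconnect the row still appears in visible(); after a purge it does not, which is the distinction the dashboard removal controls depend on.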
Exit criteria
One machine = one record/session; a config-loss or portable run can't duplicate; admins can purge stale
rows individually and in bulk; the fleet is deduped enough to mint one cak_ key per real machine —
unblocking task #7 (fleet key migration) → task #5 (retire shared AGENT_API_KEY).
Open questions
- machine_uid recipe: MachineGuid alone vs. folded with a board/BIOS serial? (Proposed: MachineGuid primary, hashed.)
- Should machine_uid REPLACE agent_id as the primary key, or sit alongside it (legacy fallback)? (Proposed: alongside, dedup prefers machine_uid; agent_id retained for legacy + transition.)
- Reap TTL default (10 min proposed) and whether managed vs. support sessions differ.