guru-connect/specs/v2-stable-identity/plan.md
Mike Swanson 92bc522c3a
spec: add v2-stable-identity implementation plan (SPEC-004 breakdown)
Ordered, execution-ready plan for SPEC-004 (stable machine identity + session
reaping + operator removal). Works out the core integration: machine_uid =
deterministic MachineGuid-based hardware identity (recomputable, so config loss
can't duplicate); per-agent cak_ key stays the credential/trust boundary; they
compose so one cak_ key per machine_uid = one key per real machine (the
prerequisite the fleet key-migration #7 needs). Root cause grounded in code:
agent_id is a random UUID (config.rs:90), connect_machines dedups on ON CONFLICT
(agent_id), so config loss -> duplicate rows (DESKTOP-I66IM5Q x9 live). 5 ordered
tasks (agent uid -> server dedup -> reconcile/age-out -> reaping -> operator
removal). Unblocks #7 -> #5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:17:49 -07:00


v2 Stable Machine Identity + Session Reaping + Operator Removal — Implementation Plan

Status: planned 2026-05-30. Parent: SPEC-004 (v2 Phase 1/2). Builds on the v2-secure-session-core per-agent cak_ keys. Unblocks the fleet per-agent-key migration (task #7) → retire shared key (task #5).

Why now

Live evidence of the problem: connect_machines holds 15 persistent rows for 5 real hosts (DESKTOP-I66IM5Q ×9) — the duplicate-registration bug. Until the fleet is deduped, you cannot mint one cak_ key per real machine, so the shared AGENT_API_KEY can't be retired.

The core integration: machine_uid vs. per-agent cak_ key

This design was worked out against the current code so that it composes with the just-built per-agent auth rather than working against it:

  • Today: agent_id is a random UUID from generate_agent_id() (agent/src/config.rs:90), persisted in the config file. A lost or missing config yields a fresh UUID, and a fresh UUID yields a new row: connect_machines.agent_id is UNIQUE and upsert_machine uses ON CONFLICT (agent_id) (server/src/db/machines.rs:101,111), so an id the table has never seen cannot conflict and always inserts. Unstable id → duplicates.
  • Per-agent keys already prevent duplicates for KEYED agents: a cak_ key binds to a connect_machines row via connect_agent_keys.machine_id, and reattach uses the key's machine identity, ignoring a client-supplied agent_id. The duplicate problem is the shared-key / support-code / config-loss fleet, which has no stable identity.
  • machine_uid = deterministic hardware identity (Windows MachineGuid from HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid, hashed; optionally folded with a board serial), recomputable so a lost config self-heals to the same id. It is the IDENTITY (who this box is); the cak_ key is the CREDENTIAL (proof it's authorized as that box). They compose: one cak_ key per machine_uid = one key per real machine — exactly what task #7 needs.
  • Security stance (unchanged from SPEC-004): a client-asserted machine_uid is spoofable, so it is not a trust boundary on its own. For keyed agents the cak_ key stays authoritative (server uses the key's machine, not the claimed uid). For un-keyed agents machine_uid is dedup-only (correctness, not trust). Reimage/clone caveats per SPEC-004 (MachineGuid regenerates on sysprep, clones collide — caught by the key binding).
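The precedence described in the bullets above can be sketched as a small lookup-resolution function. This is a minimal illustration, not the server's code: `Handshake`, `MachineLookup`, and `resolve_lookup` are hypothetical names, and the `i64` row id is an assumption about the type of `connect_agent_keys.machine_id`.

```rust
/// What an incoming agent connection presents (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Handshake {
    /// Machine row bound to a validated cak_ key, if the agent is keyed.
    /// The i64 type is an assumption for illustration.
    pub key_machine_id: Option<i64>,
    /// Client-asserted hardware identity. Spoofable, so dedup-only.
    pub machine_uid: Option<String>,
    /// Legacy random UUID persisted in the agent config file.
    pub agent_id: String,
}

/// Which identifier the server uses to locate the connect_machines row.
#[derive(Debug, PartialEq, Eq)]
pub enum MachineLookup {
    /// Keyed agent: the key's binding is authoritative.
    ByKeyBinding(i64),
    /// Un-keyed agent that reports a machine_uid: dedup on it.
    ByMachineUid(String),
    /// Legacy agent with no machine_uid: fall back to agent_id.
    ByLegacyAgentId(String),
}

/// Resolve the lookup in trust order: key binding, then machine_uid,
/// then the legacy agent_id.
pub fn resolve_lookup(h: &Handshake) -> MachineLookup {
    if let Some(machine_id) = h.key_machine_id {
        // A claimed machine_uid or agent_id is ignored for keyed agents.
        return MachineLookup::ByKeyBinding(machine_id);
    }
    match &h.machine_uid {
        Some(uid) if !uid.is_empty() => MachineLookup::ByMachineUid(uid.clone()),
        _ => MachineLookup::ByLegacyAgentId(h.agent_id.clone()),
    }
}
```

The point of the ordering is that a spoofed `machine_uid` can at worst affect dedup for un-keyed agents; it never overrides the row a valid cak_ key is bound to.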

Tasks (ordered; each Coding Agent + Code Review, gates green)

Task 1 — Agent derives + reports machine_uid

  • New agent/src/identity.rs: machine_uid() = stable hash of MachineGuid (Windows; registry read), recomputable; non-Windows fallback = a persisted random UUID. Cache in config but recompute if absent (don't depend on the config file for correctness).
  • Send machine_uid in the connect handshake (agent/src/transport/websocket.rs:40 query) and on AgentStatus (proto). Keep the legacy random agent_id as a migration fallback only.
  • Tests: deterministic (same machine inputs → same uid); a wiped config recomputes the same uid.
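A rough sketch of the derivation in Task 1, under stated assumptions: the raw MachineGuid string is passed in rather than read from the registry (which would need a Windows-only crate), FNV-1a stands in for whatever hash the implementation settles on (a cryptographic hash such as SHA-256 would be the more likely choice), and the `muid_` prefix, the `machine_uid/v1:` label, and both function names are hypothetical.

```rust
/// 64-bit FNV-1a. Used here only as a dependency-free, deterministic
/// stand-in; it is NOT a cryptographic hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Derive a stable uid from the raw MachineGuid string.
/// Returns None when the input is blank (e.g. the registry read failed).
pub fn derive_machine_uid(machine_guid: &str) -> Option<String> {
    // Normalize so case or whitespace differences cannot change the uid.
    let normalized = machine_guid.trim().to_ascii_lowercase();
    if normalized.is_empty() {
        return None;
    }
    // Hypothetical label so the hash is specific to this purpose.
    let mut input = b"machine_uid/v1:".to_vec();
    input.extend_from_slice(normalized.as_bytes());
    Some(format!("muid_{:016x}", fnv1a64(&input)))
}

/// Use the cached value when present, otherwise recompute from hardware.
/// Correctness never depends on the cache: a wiped config recomputes
/// the same uid from the same MachineGuid.
pub fn resolve_machine_uid(cached: Option<&str>, machine_guid: Option<&str>) -> Option<String> {
    match cached {
        Some(uid) if !uid.is_empty() => Some(uid.to_string()),
        _ => machine_guid.and_then(derive_machine_uid),
    }
}
```

The non-Windows fallback (a persisted random UUID) is left out here because it needs a random source; in this sketch it would correspond to `resolve_machine_uid` returning `None` and the caller generating and persisting a UUID.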

Task 2 — Server schema + dedup on machine_uid

  • Migration 008_machine_uid.sql: add connect_machines.machine_uid TEXT (nullable for legacy) + a unique index WHERE machine_uid IS NOT NULL. Idempotent; startup-applied.
  • upsert_machine keys on machine_uid when present (ON CONFLICT (machine_uid)), falling back to agent_id for legacy agents. Session reuse / reattach key on machine_uid. The cak_ key's machine binding stays authoritative for keyed agents.
  • Tests: same machine_uid with varying agent_id → ONE row; legacy (no uid) path unchanged.
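The dedup rule in Task 2 can be modeled in memory to make the intended behavior concrete. This is only a sketch of the semantics, not the SQL: `MachineTable`, `MachineRow`, and their fields are hypothetical, and the real `upsert_machine` would express the same choice through its ON CONFLICT target.

```rust
/// One row of a simplified, in-memory connect_machines (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MachineRow {
    pub id: u64,
    pub agent_id: String,
    pub machine_uid: Option<String>,
    pub hostname: String,
}

/// In-memory stand-in for the table, for illustrating dedup only.
#[derive(Default)]
pub struct MachineTable {
    rows: Vec<MachineRow>,
    next_id: u64,
}

impl MachineTable {
    /// Upsert keyed on machine_uid when present, else on agent_id.
    /// Returns the id of the row that was updated or inserted.
    pub fn upsert(&mut self, agent_id: &str, machine_uid: Option<&str>, hostname: &str) -> u64 {
        let existing = match machine_uid {
            // New path: the conflict target is machine_uid.
            Some(uid) => self
                .rows
                .iter_mut()
                .find(|r| r.machine_uid.as_deref() == Some(uid)),
            // Legacy path: unchanged, the conflict target is agent_id.
            None => self.rows.iter_mut().find(|r| r.agent_id == agent_id),
        };
        if let Some(row) = existing {
            // Same machine: refresh mutable fields, keep the row.
            row.agent_id = agent_id.to_string();
            row.hostname = hostname.to_string();
            return row.id;
        }
        self.next_id += 1;
        self.rows.push(MachineRow {
            id: self.next_id,
            agent_id: agent_id.to_string(),
            machine_uid: machine_uid.map(str::to_string),
            hostname: hostname.to_string(),
        });
        self.next_id
    }

    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}
```

With this model, two registrations that share a `machine_uid` but carry different `agent_id` values land on one row, while agents that report no uid behave exactly as they do today.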

Task 3 — Reconcile existing duplicate rows

  • Prefer age-out + operator removal over a risky backfill (per the #7 audit note): once Tasks 4/5 land, the 14 duplicate ghost rows reap/purge naturally. If a deterministic collapse is wanted later, map duplicates by hostname→machine_uid and repoint sessions/keys — but only as a separate, reviewed migration. v1 of this plan does NOT auto-collapse.

Task 4 — Session lifecycle reaping (server/src/session/mod.rs)

  • reap_stale_persistent(ttl) on SessionManager: periodic sweep (spawned at startup, ~60s) removing persistent sessions offline (is_online == false) past a TTL (default 10 min, via last_heartbeat_instant).
  • On reconnect, supersede prior same-machine (machine_uid) sessions so a fresh agent_id can't strand the old one.
  • Tests: offline-past-TTL reaped; online / within-TTL spared; same-machine reconnect supersedes; never reap an online or viewer-attached session.
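The reaping rule in Task 4 might look roughly like the following. It is a sketch under assumptions: the `Session` fields other than `is_online` and `last_heartbeat_instant` are hypothetical, the session id type is a placeholder, and `now` is passed in explicitly (unlike the planned `reap_stale_persistent(ttl)` signature) so the sweep can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simplified session state (field set is an assumption).
pub struct Session {
    pub is_persistent: bool,
    pub is_online: bool,
    /// Number of attached viewers; hypothetical field name.
    pub viewer_count: usize,
    pub last_heartbeat_instant: Instant,
}

#[derive(Default)]
pub struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    pub fn insert(&mut self, id: u64, session: Session) {
        self.sessions.insert(id, session);
    }

    pub fn contains(&self, id: u64) -> bool {
        self.sessions.contains_key(&id)
    }

    /// Remove persistent sessions that are offline, have no viewers,
    /// and whose last heartbeat is older than `ttl`.
    /// Returns the reaped ids in ascending order.
    pub fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<u64> {
        let mut reaped = Vec::new();
        self.sessions.retain(|id, s| {
            let idle = now.saturating_duration_since(s.last_heartbeat_instant);
            let stale = s.is_persistent && !s.is_online && s.viewer_count == 0 && idle > ttl;
            if stale {
                reaped.push(*id);
            }
            !stale // keep everything that is not stale
        });
        reaped.sort_unstable();
        reaped
    }
}
```

In the server this would presumably be driven by the periodic task spawned at startup, calling the sweep about every 60 seconds with the configured TTL and `Instant::now()`.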

Task 5 — Operator removal API + dashboard

  • deleted_at on connect_sessions (+ machines as needed); DELETE …?purge=true (in-memory remove + DB soft-delete) distinct from the live-only disconnect; a bulk endpoint; per-row + multi-select removal on the dashboard machines/sessions views. Admin-gated, audited to events.
  • This is also the immediate fix for the live ghost rows — once it lands, purge the 14 duplicates.

Exit criteria

One machine = one record/session; a config-loss or portable run can't duplicate; admins can purge stale rows individually and in bulk; the fleet is deduped enough to mint one cak_ key per real machine — unblocking task #7 (fleet key migration) → task #5 (retire shared AGENT_API_KEY).

Open questions

  1. machine_uid recipe — MachineGuid alone vs. folded with a board/BIOS serial? (Proposed: MachineGuid primary, hashed.)
  2. Should machine_uid REPLACE agent_id as the primary key, or sit alongside it (legacy fallback)? (Proposed: alongside, dedup prefers machine_uid; agent_id retained for legacy + transition.)
  3. Reap TTL default (10 min proposed) and whether managed vs. support sessions differ.