guru-connect/specs/v2-stable-identity/plan.md
Mike Swanson 92bc522c3a
spec: add v2-stable-identity implementation plan (SPEC-004 breakdown)
Ordered, execution-ready plan for SPEC-004 (stable machine identity + session
reaping + operator removal). Works out the core integration: machine_uid =
deterministic MachineGuid-based hardware identity (recomputable, so config loss
can't duplicate); per-agent cak_ key stays the credential/trust boundary; they
compose so one cak_ key per machine_uid = one key per real machine (the
prerequisite the fleet key-migration #7 needs). Root cause grounded in code:
agent_id is a random UUID (config.rs:90), connect_machines dedups on ON CONFLICT
(agent_id), so config loss -> duplicate rows (DESKTOP-I66IM5Q x9 live). 5 ordered
tasks (agent uid -> server dedup -> reconcile/age-out -> reaping -> operator
removal). Unblocks #7 -> #5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 21:17:49 -07:00

# v2 Stable Machine Identity + Session Reaping + Operator Removal — Implementation Plan
> Status: planned 2026-05-30. Parent: [SPEC-004](../../docs/specs/SPEC-004-session-lifecycle-and-removal.md)
> (v2 Phase 1/2). Builds on the v2-secure-session-core per-agent `cak_` keys.
> Unblocks the fleet per-agent-key migration (task #7) → retire shared key (task #5).
## Why now
Live evidence of the problem: `connect_machines` holds **15 persistent rows for 5 real hosts**
(`DESKTOP-I66IM5Q` ×9) — the duplicate-registration bug. Until the fleet is deduped, you
cannot mint one `cak_` key per real machine, so the shared `AGENT_API_KEY` can't be retired.
## The core integration: `machine_uid` vs. per-agent `cak_` key
Worked out against the current code so this composes with the just-built auth, not against it:
- **Today:** `agent_id` = random UUID from `generate_agent_id()` (`agent/src/config.rs:90`), persisted
in the config file. Lost/missing config → a fresh UUID → a new row, because
`connect_machines.agent_id` is `UNIQUE` and `upsert_machine` dedups on `ON CONFLICT (agent_id)`
(`server/src/db/machines.rs:101,111`). Unstable id → duplicates.
- **Per-agent keys already prevent duplicates for KEYED agents:** a `cak_` key binds to a
`connect_machines` row via `connect_agent_keys.machine_id`, and reattach uses the **key's** machine
identity, ignoring a client-supplied `agent_id`. The duplicate problem is the **shared-key /
support-code / config-loss** fleet, which has no stable identity.
- **`machine_uid` = deterministic hardware identity** (Windows `MachineGuid` from
`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`, hashed; optionally folded with a board serial),
**recomputable** so a lost config self-heals to the *same* id. It is the IDENTITY (who this box is);
the `cak_` key is the CREDENTIAL (proof it's authorized as that box). They compose: **one `cak_` key
per `machine_uid` = one key per real machine** — exactly what task #7 needs.
- **Security stance (unchanged from SPEC-004):** a client-asserted `machine_uid` is spoofable, so it is
**not** a trust boundary on its own. For keyed agents the `cak_` key stays authoritative (server uses
the key's machine, not the claimed uid). For un-keyed agents `machine_uid` is **dedup-only**
(correctness, not trust). Reimage/clone caveats per SPEC-004 (MachineGuid regenerates on sysprep,
clones collide — caught by the key binding).
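The precedence described above (key binding first, then `machine_uid` for dedup only, then the legacy `agent_id`) can be sketched as a small resolution function. This is a minimal illustration under stated assumptions, not the server's actual code: `ConnectClaim`, `ResolvedIdentity`, and `resolve_identity` are hypothetical names, and the machine row id is modelled as a plain `u64`.

```rust
/// What the server sees on an incoming agent connection.
/// Hypothetical type: field names are illustrative, not from the codebase.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConnectClaim {
    /// Machine row bound to the presented `cak_` key, if the agent is keyed.
    pub key_machine_id: Option<u64>,
    /// Client-asserted hardware identity. Spoofable, so never a trust signal.
    pub machine_uid: Option<String>,
    /// Legacy random UUID persisted in the agent config file.
    pub agent_id: String,
}

/// Which identity the server should act on (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolvedIdentity {
    /// Keyed agent: the key's machine binding is authoritative.
    KeyBound { machine_id: u64 },
    /// Un-keyed agent that reported a uid: used for dedup only.
    DedupByUid { machine_uid: String },
    /// Legacy agent with no uid: falls back to the unstable agent_id.
    LegacyAgentId { agent_id: String },
}

pub fn resolve_identity(claim: &ConnectClaim) -> ResolvedIdentity {
    // 1. A valid key wins, whatever uid or agent_id the client claims.
    if let Some(machine_id) = claim.key_machine_id {
        return ResolvedIdentity::KeyBound { machine_id };
    }
    // 2. Otherwise a non-empty uid is accepted for dedup (correctness, not trust).
    match claim.machine_uid.as_deref() {
        Some(uid) if !uid.is_empty() => ResolvedIdentity::DedupByUid {
            machine_uid: uid.to_string(),
        },
        // 3. Legacy path: nothing stable to key on.
        _ => ResolvedIdentity::LegacyAgentId {
            agent_id: claim.agent_id.clone(),
        },
    }
}
```

A keyed agent that also sends a (possibly spoofed) `machine_uid` still resolves to `KeyBound`, which is the property the security stance depends on.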
## Tasks (ordered; each Coding Agent + Code Review, gates green)
### Task 1 — Agent derives + reports `machine_uid`
- New `agent/src/identity.rs`: `machine_uid()` = stable hash of `MachineGuid` (Windows; registry read),
recomputable; non-Windows fallback = a persisted random UUID. Cache in config but **recompute if
absent** (don't depend on the config file for correctness).
- Send `machine_uid` in the connect handshake (`agent/src/transport/websocket.rs:40` query) and on
`AgentStatus` (proto). Keep the legacy random `agent_id` as a migration fallback only.
- Tests: deterministic (same machine inputs → same uid); a wiped config recomputes the same uid.
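A rough sketch of the derivation and the "recompute if absent" rule follows. Several things here are assumptions rather than decisions from this plan: the hash is FNV-1a 64-bit, chosen only because it is deterministic and needs no external crate (a real implementation would more likely use a cryptographic hash such as SHA-256); the `machine-uid-v1:` prefix and the normalisation rules are illustrative; and the registry read is abstracted behind a closure so the logic stays testable on any platform. The non-Windows persisted-UUID fallback is left to the caller.

```rust
/// FNV-1a (64-bit). Deterministic across runs and platforms, but NOT
/// cryptographic; it stands in for whatever hash the real agent adopts.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Canonicalise the raw registry string so cosmetic differences
/// (case, surrounding braces, whitespace) do not change the uid.
fn normalize_guid(raw: &str) -> String {
    raw.trim()
        .trim_matches(|c: char| c == '{' || c == '}')
        .to_ascii_lowercase()
}

/// Derive a uid from a MachineGuid string. Same input, same output.
pub fn derive_machine_uid(machine_guid: &str) -> String {
    // Hypothetical versioned prefix so the recipe can change later.
    let input = format!("machine-uid-v1:{}", normalize_guid(machine_guid));
    format!("{:016x}", fnv1a64(input.as_bytes()))
}

/// Use the cached uid when the config has one; otherwise recompute from the
/// hardware source. Returns `None` only when neither is available, at which
/// point the caller would apply the persisted-random-UUID fallback.
pub fn machine_uid<F>(cached: Option<&str>, read_machine_guid: F) -> Option<String>
where
    F: Fn() -> Option<String>,
{
    if let Some(uid) = cached.filter(|s| !s.is_empty()) {
        return Some(uid.to_string());
    }
    read_machine_guid().map(|guid| derive_machine_uid(&guid))
}
```

Calling `machine_uid(None, read)` twice with the same `read` closure models a wiped config: both calls return the same value because nothing depends on persisted state.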
### Task 2 — Server schema + dedup on `machine_uid`
- Migration `008_machine_uid.sql`: add `connect_machines.machine_uid TEXT` (nullable for legacy) +
a unique index `WHERE machine_uid IS NOT NULL`. Idempotent; startup-applied.
- `upsert_machine` keys on `machine_uid` when present (`ON CONFLICT (machine_uid)`), falling back to
`agent_id` for legacy agents. Session reuse and reattach also key on `machine_uid`. For keyed agents,
the `cak_` key's machine binding stays authoritative.
- Tests: same `machine_uid` with varying `agent_id` → ONE row; legacy (no uid) path unchanged.
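The intended dedup behaviour can be modelled in memory before touching SQL. The sketch below is an assumption-laden stand-in for the database: `MachineTable` and `MachineRow` are hypothetical, rows live in a `Vec`, and ids are assigned sequentially. It deliberately does not model the case where a uid-bearing connect arrives with an `agent_id` that already exists on a different row; the real `UNIQUE` constraint on `agent_id` would need handling there. Separately, if the database is PostgreSQL (the plan does not name the engine), a partial unique index is only inferred as the conflict arbiter when the `ON CONFLICT` target repeats the index predicate, i.e. `ON CONFLICT (machine_uid) WHERE machine_uid IS NOT NULL`.

```rust
/// One row of a hypothetical in-memory `connect_machines` table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MachineRow {
    pub id: u64,
    pub agent_id: String,
    pub machine_uid: Option<String>,
    pub hostname: String,
}

/// In-memory stand-in for the table; not the real storage layer.
#[derive(Default)]
pub struct MachineTable {
    rows: Vec<MachineRow>,
}

impl MachineTable {
    pub fn len(&self) -> usize {
        self.rows.len()
    }

    /// Upsert keyed on `machine_uid` when present, else on `agent_id`.
    /// Returns the id of the row that was updated or inserted.
    pub fn upsert_machine(
        &mut self,
        agent_id: &str,
        machine_uid: Option<&str>,
        hostname: &str,
    ) -> u64 {
        let pos = match machine_uid {
            // New-style agent: match on the stable hardware identity.
            Some(uid) => self
                .rows
                .iter()
                .position(|r| r.machine_uid.as_deref() == Some(uid)),
            // Legacy agent: match on the (unstable) agent_id, as today.
            None => self.rows.iter().position(|r| r.agent_id == agent_id),
        };
        match pos {
            Some(i) => {
                // Conflict: refresh mutable fields on the existing row.
                let row = &mut self.rows[i];
                row.agent_id = agent_id.to_string();
                row.hostname = hostname.to_string();
                row.id
            }
            None => {
                // Rows are never deleted in this sketch, so len + 1 is unique.
                let id = self.rows.len() as u64 + 1;
                self.rows.push(MachineRow {
                    id,
                    agent_id: agent_id.to_string(),
                    machine_uid: machine_uid.map(str::to_owned),
                    hostname: hostname.to_string(),
                });
                id
            }
        }
    }
}
```

Two connects with the same `machine_uid` but different `agent_id` values land on one row, while legacy connects behave exactly as they do now.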
### Task 3 — Reconcile existing duplicate rows
- Prefer **age-out + operator removal over a risky backfill** (per the #7 audit note): once Tasks 4/5
land, the 14 duplicate ghost rows reap/purge naturally. If a deterministic collapse is wanted later,
map duplicates by hostname→`machine_uid` and repoint sessions/keys — but only as a separate, reviewed
migration. v1 of this plan does NOT auto-collapse.
### Task 4 — Session lifecycle reaping (`server/src/session/mod.rs`)
- `reap_stale_persistent(ttl)` on `SessionManager`: a periodic sweep (spawned at startup, every ~60s) that
removes persistent sessions that have been offline (`is_online == false`) for longer than a TTL (default
10 min, measured from `last_heartbeat_instant`).
- On reconnect, **supersede** prior same-machine (`machine_uid`) sessions so a fresh `agent_id` can't
strand the old one.
- Tests: offline-past-TTL reaped; online / within-TTL spared; same-machine reconnect supersedes; never
reap an online or viewer-attached session.
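The reaping and supersede rules can be sketched as follows. This is a simplified model, not the real `SessionManager`: the `Session` fields other than `is_online` and `last_heartbeat_instant` are assumed names, sessions are keyed by a plain `u64`, and `now` is passed in explicitly (unlike the `reap_stale_persistent(ttl)` signature above) so the logic can be tested without sleeping. The periodic ~60s task that would call it is omitted.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal session model; field names beyond `is_online` and
/// `last_heartbeat_instant` are assumptions for this sketch.
#[derive(Debug, Clone)]
pub struct Session {
    pub machine_uid: Option<String>,
    pub is_persistent: bool,
    pub is_online: bool,
    pub viewer_count: usize,
    pub last_heartbeat_instant: Instant,
}

#[derive(Default)]
pub struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    pub fn insert(&mut self, id: u64, session: Session) {
        self.sessions.insert(id, session);
    }

    pub fn contains(&self, id: u64) -> bool {
        self.sessions.contains_key(&id)
    }

    /// Remove persistent sessions that are offline, have no viewer attached,
    /// and whose last heartbeat is older than `ttl`. Returns reaped ids, sorted.
    pub fn reap_stale_persistent(&mut self, now: Instant, ttl: Duration) -> Vec<u64> {
        let mut reaped: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();
        reaped.sort_unstable();
        for id in &reaped {
            self.sessions.remove(id);
        }
        reaped
    }

    /// On reconnect, drop every other session for the same machine so a
    /// fresh agent_id cannot strand the old one. Returns removed ids, sorted.
    pub fn supersede_for_machine(&mut self, machine_uid: &str, keep_id: u64) -> Vec<u64> {
        let mut old: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(id, s)| **id != keep_id && s.machine_uid.as_deref() == Some(machine_uid))
            .map(|(id, _)| *id)
            .collect();
        old.sort_unstable();
        for id in &old {
            self.sessions.remove(id);
        }
        old
    }
}
```

Whether supersede should also spare or migrate viewer-attached sessions is not settled by the task list; this sketch simply removes them.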
### Task 5 — Operator removal API + dashboard
- `deleted_at` on `connect_sessions` (+ machines as needed); `DELETE …?purge=true` (in-memory remove +
DB soft-delete) distinct from the live-only disconnect; a bulk endpoint; per-row + multi-select removal
on the dashboard machines/sessions views. Admin-gated, audited to `events`.
- This is also the **immediate fix for the live ghost rows** — once it lands, purge the 14 duplicates.
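The difference between a live-only disconnect and a purge can be sketched like this. Everything here is a hypothetical model: `Store`, `RemoveError`, and the method names are invented for illustration, the database is a `HashMap` from session id to an optional `deleted_at` timestamp (Unix seconds), and the audit log is a plain `Vec<String>` standing in for `events`.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum RemoveError {
    Forbidden,
    NotFound,
}

/// Hypothetical stand-in for in-memory sessions plus the DB table.
#[derive(Default)]
pub struct Store {
    /// Session ids currently held in memory.
    pub live: HashSet<u64>,
    /// "DB" rows: session id -> deleted_at (None means not deleted).
    pub deleted_at: HashMap<u64, Option<u64>>,
    /// Audit trail standing in for the `events` table.
    pub events: Vec<String>,
}

impl Store {
    /// `purge == false`: drop the live session only (row stays visible).
    /// `purge == true`: drop the live session AND soft-delete the row.
    pub fn remove_session(
        &mut self,
        id: u64,
        purge: bool,
        is_admin: bool,
        now_unix: u64,
    ) -> Result<(), RemoveError> {
        if !is_admin {
            return Err(RemoveError::Forbidden);
        }
        let row = self.deleted_at.get_mut(&id).ok_or(RemoveError::NotFound)?;
        self.live.remove(&id);
        if purge && row.is_none() {
            // Soft delete: keep the row, stamp it once.
            *row = Some(now_unix);
        }
        let action = if purge { "purge" } else { "disconnect" };
        self.events.push(format!("{} session {}", action, id));
        Ok(())
    }

    /// Bulk variant: applies the same rule per id and reports each outcome.
    pub fn remove_many(
        &mut self,
        ids: &[u64],
        purge: bool,
        is_admin: bool,
        now_unix: u64,
    ) -> Vec<(u64, Result<(), RemoveError>)> {
        ids.iter()
            .map(|&id| (id, self.remove_session(id, purge, is_admin, now_unix)))
            .collect()
    }
}
```

Reporting a per-id result from the bulk call is one possible design; it lets the dashboard show partial failures in a multi-select removal instead of failing the whole batch.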
## Exit criteria
One machine = one record/session; config loss or a portable run can't create a duplicate; admins can
purge stale rows individually and in bulk; and the fleet is deduped enough to mint one `cak_` key per
real machine, which unblocks task #7 (fleet key migration) → task #5 (retire shared `AGENT_API_KEY`).
## Open questions
1. `machine_uid` recipe — `MachineGuid` alone vs. folded with a board/BIOS serial? (Proposed: MachineGuid
primary, hashed.)
2. Should `machine_uid` REPLACE `agent_id` as the primary key, or sit alongside it (legacy fallback)?
(Proposed: alongside, dedup prefers `machine_uid`; agent_id retained for legacy + transition.)
3. Reap TTL default (10 min proposed) and whether managed vs. support sessions differ.