guru-connect/docs/specs/SPEC-004-session-lifecycle-and-removal.md
Mike Swanson f8bd4d1dab spec: SPEC-004 add stable machine-derived identity as the primary fix
Address duplicate registration at the source, not just via cleanup. Root
cause now grounded: agent_id is a random UUID (config.rs:90 generate_agent_id)
persisted only in the config file, so a portable/misconfigured execution
(the Pavon desktop launcher) regenerates a fresh id each launch, defeating
both the DB upsert (ON CONFLICT agent_id) and session-reuse dedupe. Add a
deterministic machine_uid (Windows MachineGuid-based, recomputable) keyed by
registration; reaping/supersede become defense-in-depth. Security: machine_uid
is identity not authorization and must be bound to the per-machine agent key
to prevent session/record hijack. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 16:11:38 -07:00

# SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal
**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-05-30)
**Estimated Effort:** Medium
## Overview
Stop orphaned managed sessions from accumulating in the Operator Console, and give
operators a first-class way to remove stale sessions/units — per-row **and** in bulk
(multi-select mass delete). Today the Sessions view can show many dead rows that look
live, and the only per-row action is "End", which applies to a *live* session and does
nothing for an already-dead one — so junk just piles up with no way to clear it.
The durable fix is at registration: the **same machine must resolve to one stable
identity** so a repeated execution cannot mint duplicates in the first place; reaping
and manual removal then become defense-in-depth and cleanup, not the primary mechanism.
Success = (a) the same machine, run repeatedly (even from a portable/misconfigured
copy), registers to **one** record/session — no duplicates; (b) reconnecting/offline
persistent agents no longer leave behind retained ghost sessions; and (c) an admin can
select one or many session rows (and stale machine rows) and remove them from the
console.
**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all
`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
**both** a lifecycle fix and a manual-removal control.
## Root cause (confirmed in code)
The Sessions list is served from the **in-memory `SessionManager`**, not the database:
`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
(`session/mod.rs:584`). Three compounding defects let ghosts accumulate there:
0. **Machine identity is a config-file random UUID, not machine-derived.** The agent's
`agent_id` is a random UUID minted by `generate_agent_id()` (`agent/src/config.rs:90`)
on first run and persisted **only in the agent config file** (`config.rs:331`), or
taken from `GURUCONNECT_AGENT_ID` (`config.rs:366`). A portable or misconfigured
execution that cannot locate/write that config — e.g. the Pavon desktop launcher
`guruconnect-pavon-raidersreef.exe` run repeatedly from a user Desktop — regenerates
a **fresh** `agent_id` every launch. Because identity is not derived from the machine,
the same physical box presents as N different agents. The DB upsert
(`upsert_machine`, `ON CONFLICT (agent_id)`) and the session-reuse map both key on
this id, so an unstable id defeats *both* dedupe layers at the source.
1. **Reconnect-reuse is keyed on that `agent_id`.** `register_agent`
(`session/mod.rs:169`) reuses an existing session only when
`self.agents.get(&agent_id)` resolves to an `is_online == false` session. With a new
`agent_id` per launch (defect 0) the lookup misses and a **brand-new persistent
session** is created each time. The `agents` map holds only one session per agent_id,
so prior sessions become unreferenced yet remain in the `sessions` map.
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
removed entirely; persistent/managed sessions are deliberately retained
(`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session
therefore lives in memory indefinitely, displayed alongside genuinely-live ones.
Net effect: N reconnects with unstable identity → N retained, never-expiring managed
sessions, none of which the UI can clear.
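
The interaction of the three defects can be reproduced in miniature. The sketch below is a simplified stand-in, not the server's code: `MiniSessionManager`, `Session`, and `simulate_launches` are hypothetical names, and only the two-map shape (`sessions` plus `agents`) and the reuse-only-when-offline rule are taken from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical, stripped-down session record.
struct Session {
    is_online: bool,
}

/// Hypothetical miniature of the two maps described above:
/// `sessions` holds every record, `agents` maps an agent id to one session.
#[derive(Default)]
struct MiniSessionManager {
    sessions: HashMap<u64, Session>,
    agents: HashMap<String, u64>,
    next_id: u64,
}

impl MiniSessionManager {
    /// Reuse a session only when this exact `agent_id` is known and its
    /// session is offline; otherwise create a new persistent session.
    fn register_agent(&mut self, agent_id: &str) -> u64 {
        if let Some(&sid) = self.agents.get(agent_id) {
            if let Some(s) = self.sessions.get_mut(&sid) {
                if !s.is_online {
                    s.is_online = true;
                    return sid;
                }
            }
        }
        let sid = self.next_id;
        self.next_id += 1;
        self.sessions.insert(sid, Session { is_online: true });
        // Overwrites any earlier mapping; the old session stays in `sessions`.
        self.agents.insert(agent_id.to_string(), sid);
        sid
    }

    /// Persistent sessions are retained on disconnect, only flagged offline.
    fn disconnect(&mut self, session_id: u64) {
        if let Some(s) = self.sessions.get_mut(&session_id) {
            s.is_online = false;
        }
    }

    /// What the Sessions view would list: every retained record.
    fn list_sessions(&self) -> Vec<&Session> {
        self.sessions.values().collect()
    }
}

/// Run one connect/disconnect cycle per id and return how many sessions remain.
fn simulate_launches(agent_ids: &[&str]) -> usize {
    let mut mgr = MiniSessionManager::default();
    for id in agent_ids {
        let sid = mgr.register_agent(id);
        mgr.disconnect(sid);
    }
    mgr.list_sessions().len()
}
```

With a fresh id per launch, `simulate_launches(&["a", "b", "c"])` leaves three retained sessions, while a stable id, `simulate_launches(&["a", "a", "a"])`, leaves one.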
## Scope
### Included in v1
- **Stable, machine-derived identity (the primary fix):**
  - The agent computes a deterministic `machine_uid` from durable machine identifiers:
    primary source the Windows `MachineGuid`
    (`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`), optionally folded with a stable
    hardware id (board/BIOS serial), hashed to a stable string. It is **recomputable**:
    a lost/absent config self-heals to the *same* id rather than minting a new random
    one. Persist a cached copy, but never depend on the config file for correctness.
  - Registration keys on `machine_uid`: `upsert_machine`'s `ON CONFLICT` and the
    in-memory `agents`/session-reuse map both use it, so the same box converges to **one**
    machine record and **one** managed session no matter how many times it executes.
  - Carry `machine_uid` in the agent connect handshake (`transport/websocket.rs:40`
    query params) / `AgentStatus`; keep the legacy random `agent_id` only as a
    migration fallback.
- **Lifecycle reaping (defense-in-depth):**
  - Periodic background sweep that removes persistent sessions whose agent has been
    offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
    using the existing `last_heartbeat_instant`.
  - On agent reconnect, **supersede** prior retained sessions for the same machine
    (dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh
    agent_id cannot strand the old session.
  - On socket drop for a persistent agent, mark the session offline *and* eligible for
    the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory
    state with it).
- **Manual removal API (admin-gated, audited):**
  - `DELETE /api/sessions/:id?purge=true`: remove the in-memory session record
    (`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB
    row. This is distinct from the existing `disconnect_session`, which ends a *live*
    session; purge works on dead rows.
  - `POST /api/sessions/bulk` (or `DELETE` with a body) taking
    `{ ids: [...], action: "purge" | "end" }` for mass delete / bulk-end.
  - Same stale-removal for the Machines view: extend the existing
    `DELETE /api/machines/:agent_id` (`main.rs:387`) usage with a bulk variant for
    stale units.
- **Dashboard UX:**
  - Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the
    existing End in `EndSessionDialog.tsx`).
  - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End)
    on the Sessions view; mirror on `MachinesPage.tsx` for stale units.
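
The lifecycle sweep listed above could look roughly like the following. This is a sketch under assumptions: `PersistentSession` and its `viewer_count` field are hypothetical, and passing `now` in explicitly is a choice made here for testability, not something the spec prescribes. Only `is_online`, `last_heartbeat_instant`, and the TTL rule come from the text.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical subset of the fields a persistent session carries.
struct PersistentSession {
    is_online: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

/// Remove sessions that are offline, have no viewers attached, and whose last
/// heartbeat is older than `ttl`. Returns the removed ids in ascending order.
fn reap_stale_persistent(
    sessions: &mut HashMap<u64, PersistentSession>,
    now: Instant,
    ttl: Duration,
) -> Vec<u64> {
    let mut reaped: Vec<u64> = sessions
        .iter()
        .filter(|(_, s)| {
            !s.is_online
                && s.viewer_count == 0
                // Saturates to zero if the heartbeat is somehow ahead of `now`.
                && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
        })
        .map(|(id, _)| *id)
        .collect();
    reaped.sort_unstable();
    for id in &reaped {
        sessions.remove(id);
    }
    reaped
}
```

A periodic task would call this with `Instant::now()` and the configured TTL; online, viewer-attached, or recently seen sessions are left untouched.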
### Explicitly out of scope
- Surviving a full machine **reimage/clone** with the same identity. `MachineGuid`
regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged
box legitimately becomes a new `machine_uid` (and a clone collision is caught by the
auth binding below). Cross-reimage identity continuity is out of scope for v1.
- Replacing the shared `AGENT_API_KEY` with per-machine agent keys — tracked separately
(roadmap GuruRMM-Integration). SPEC-004 *assumes* that binding for its threat model
(see Security) and degrades safely without it, but does not implement it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
audit trail (CLAUDE.md DB conventions).
## Architecture
- **Agent (`agent/src/`):** new `identity` module computes `machine_uid` deterministically
(Windows `MachineGuid` primary; non-Windows fallback to a stable persisted UUID).
Replace/augment `generate_agent_id()` (`config.rs:90`) so the effective id is the
machine-derived value, with the config-file value used only as a cache. Send
`machine_uid` in the connect query string (`transport/websocket.rs:40`) and on
`AgentStatus`.
- **Relay-server (`server/src/session/mod.rs`):** key `register_agent` and the
`agents` map on `machine_uid` so the same machine reuses one session; add
`reap_stale_persistent(ttl)` to `SessionManager` + a periodic task (e.g. every 60 s)
from server startup; supersede any prior same-machine sessions on reconnect; add a
`purge`-style removal the API can call for dead rows.
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
`deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup —
never pre-applied via psql (see the 005→007 lesson).
- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind
the existing `AuthenticatedUser`/admin guard; emit audit `events` rows.
- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state,
bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
`DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
`purgeSession` / `bulkSessions`.
- **Protobuf:** add `machine_uid` to the agent identity carried on `AgentStatus` (the
connect handshake passes it as a query param; mirroring it on `AgentStatus` lets the
server reconcile mid-session). Otherwise server/dashboard only.
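
As an illustration of the agent-side derivation, the sketch below turns a `MachineGuid` string (and an optional hardware serial) into a stable id. Several things here are assumptions: the function names, the `muid-` prefix, the normalization rules, and above all the hash. FNV-1a is used only because it fits in a few lines of standard-library Rust; a real implementation would more likely use a cryptographic hash such as SHA-256, and reading the registry value itself is omitted.

```rust
/// 64-bit FNV-1a. A stand-in for a cryptographic hash, chosen only to keep
/// this sketch dependency-free; 64 bits is too short for production identity.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Derive a deterministic id from durable machine identifiers.
/// Inputs are trimmed and lowercased so formatting differences in how the
/// GUID is read do not change the result.
fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> String {
    let mut material = machine_guid.trim().to_ascii_lowercase();
    if let Some(serial) = hardware_serial {
        material.push('|');
        material.push_str(&serial.trim().to_ascii_lowercase());
    }
    // The "muid-" prefix is a hypothetical convention for this sketch.
    format!("muid-{:016x}", fnv1a_64(material.as_bytes()))
}
```

Because the value is a pure function of its inputs, a launch with no readable config recomputes the same id instead of minting a new one; the cached copy in the config file then only saves work.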
## Implementation details
- Files to touch: `agent/src/identity/` (new — `machine_uid` derivation),
`agent/src/config.rs:90,366` (effective id = machine-derived),
`agent/src/transport/websocket.rs:40` (send `machine_uid`);
`server/src/session/mod.rs:169,519,548,584` (key on `machine_uid`; reuse/reap/remove);
`server/src/relay/mod.rs:584,591` (registration path);
`server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs`
(purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
`deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
`EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`;
`dashboard/src/features/machines/MachinesPage.tsx`.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory +
soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output.
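
The purge semantics (remove in-memory, soft-delete in the DB, hide from list queries) can be sketched with plain collections standing in for `SessionManager` and `connect_sessions`. `MiniStore`, `PurgeError`, `visible_db_ids`, and the `MAX_BULK_IDS` value are hypothetical; the real cap is still an open question.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-call cap for bulk requests.
const MAX_BULK_IDS: usize = 100;

#[derive(Debug, PartialEq)]
enum PurgeError {
    TooManyIds { got: usize, max: usize },
}

#[derive(Default)]
struct MiniStore {
    /// Stand-in for the in-memory session map.
    in_memory: HashSet<u64>,
    /// Stand-in for DB rows: id -> `deleted_at` (unix seconds), None = visible.
    db_rows: HashMap<u64, Option<u64>>,
}

impl MiniStore {
    /// Remove the in-memory record and soft-delete the DB row.
    /// Returns true if either side actually changed.
    fn purge_session(&mut self, id: u64, now_unix: u64) -> bool {
        let was_in_memory = self.in_memory.remove(&id);
        let soft_deleted = match self.db_rows.get_mut(&id) {
            Some(deleted_at) if deleted_at.is_none() => {
                *deleted_at = Some(now_unix);
                true
            }
            _ => false, // unknown id, or already soft-deleted
        };
        was_in_memory || soft_deleted
    }

    /// Purge many ids at once, rejecting oversized requests up front.
    /// Returns how many ids were actually purged.
    fn bulk_purge(&mut self, ids: &[u64], now_unix: u64) -> Result<usize, PurgeError> {
        if ids.len() > MAX_BULK_IDS {
            return Err(PurgeError::TooManyIds { got: ids.len(), max: MAX_BULK_IDS });
        }
        Ok(ids.iter().filter(|id| self.purge_session(**id, now_unix)).count())
    }

    /// Equivalent of a list query filtered on `deleted_at IS NULL`.
    fn visible_db_ids(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .db_rows
            .iter()
            .filter(|(_, deleted_at)| deleted_at.is_none())
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

The row is never dropped from `db_rows`, which mirrors the intent to keep history for audit while hiding it from the console.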
## Security considerations
- **Identity is not authorization.** A client-asserted `machine_uid` is self-reported
and therefore spoofable — on its own, agent A could claim agent B's `machine_uid` to
bind to (and hijack) B's session and machine record. The `machine_uid` must be
**bound to the agent's authenticated credential**: the server accepts a given
`machine_uid` only from a connection authenticated by that machine's own agent key
(or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key
pair). This is why per-machine agent keys (roadmap) are the natural companion; until
they ship, the shared `AGENT_API_KEY` means `machine_uid` is a *correctness* improvement
(dedupe) but not yet a *trust* boundary — call this out so it isn't mistaken for one.
A clone collision (two boxes, same `MachineGuid`) surfaces here as two agents claiming
one uid and is resolved by the key binding, not by the uid alone.
- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
guard as `list_sessions`); never expose removal unauthenticated.
- Audit every removal to the `events` table (who, which session/machine, when, count
for bulk) — soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the
whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against
flapping); never reap an `is_online` or viewer-attached session.
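
One way to picture the trust-on-first-use binding from the first bullet: the server pins each `machine_uid` to the credential that first presented it and rejects the uid from any other credential. The names `UidKeyPins`, `BindDecision`, and the idea of a key fingerprint string are assumptions for illustration; persistence, key rotation, and constant-time comparison are left out.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum BindDecision {
    /// First time this uid is seen: pin it to the presenting key.
    PinnedFirstUse,
    /// The uid is presented by the key it was pinned to.
    Accepted,
    /// The uid is pinned to a different key: possible spoof or clone.
    Rejected,
}

/// Hypothetical in-memory pin table: machine_uid -> key fingerprint.
#[derive(Default)]
struct UidKeyPins {
    pins: HashMap<String, String>,
}

impl UidKeyPins {
    /// Decide whether a connection authenticated with `key_fingerprint`
    /// may bind to `machine_uid`.
    fn check(&mut self, machine_uid: &str, key_fingerprint: &str) -> BindDecision {
        match self.pins.get(machine_uid) {
            None => {
                self.pins
                    .insert(machine_uid.to_string(), key_fingerprint.to_string());
                BindDecision::PinnedFirstUse
            }
            Some(pinned) if pinned.as_str() == key_fingerprint => BindDecision::Accepted,
            Some(_) => BindDecision::Rejected,
        }
    }
}
```

While every agent shares one `AGENT_API_KEY`, all fingerprints are identical and this check never rejects anything, which is exactly why the uid is not yet a trust boundary.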
## Testing strategy
- **Unit:** `machine_uid` derivation is deterministic — same machine inputs yield the
same uid across runs, and an absent config recomputes the same value (no fresh random
id). `register_agent` for the same `machine_uid` reuses/supersedes the prior session
(no duplicate retained) even when the legacy `agent_id` differs.
`reap_stale_persistent` removes offline-past-TTL persistent sessions and spares
online/within-TTL ones. `purge_session` soft-deletes and filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, **varying `agent_id` but the
same `machine_uid`**, as the Pavon launcher did) → assert `list_sessions()` converges
to one live session and `connect_machines` holds one row, not M. A spoof attempt
(uid X presented on a connection not authenticated for X) is rejected/not bound. Purge
a dead session via API → gone from list + `deleted_at` set + audit row written. Bulk
purge of K ids removes exactly K.
- **Manual:** on the live console, reproduce against the Pavon machines, confirm the
ghost rows can be multi-selected and removed and do not reappear after the sweep.
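
The reconnect-storm integration case might be modeled as below before wiring it to the real server. `Registry`, `Connect`, and `storm` are hypothetical helpers; the only behavior taken from the spec is that registration keys on `machine_uid`, so varying `agent_id` values must converge to one session.

```rust
use std::collections::HashMap;

/// Hypothetical connect parameters carried in the handshake.
struct Connect<'a> {
    agent_id: &'a str,
    machine_uid: &'a str,
}

/// Hypothetical registry keyed on `machine_uid`:
/// machine_uid -> (session id, most recent legacy agent_id).
#[derive(Default)]
struct Registry {
    by_machine: HashMap<String, (u64, String)>,
    next_session_id: u64,
}

impl Registry {
    /// Reuse the machine's session if one exists, whatever `agent_id` says.
    fn register(&mut self, c: &Connect) -> u64 {
        if let Some(entry) = self.by_machine.get_mut(c.machine_uid) {
            entry.1 = c.agent_id.to_string(); // remember the latest legacy id
            return entry.0;
        }
        let sid = self.next_session_id;
        self.next_session_id += 1;
        self.by_machine
            .insert(c.machine_uid.to_string(), (sid, c.agent_id.to_string()));
        sid
    }
}

/// Simulate `launches` connects from one machine, each with a fresh agent_id.
/// Returns the session id handed out on every launch.
fn storm(reg: &mut Registry, machine_uid: &str, launches: usize) -> Vec<u64> {
    (0..launches)
        .map(|i| {
            let agent_id = format!("random-{}", i);
            reg.register(&Connect { agent_id: &agent_id, machine_uid })
        })
        .collect()
}
```

An assertion that every id returned by `storm` is identical, and that `by_machine` holds one entry per machine, captures the "one row, not M" expectation.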
## Effort estimate & dependencies
- **Size: Medium.** Reaping + supersede + purge/bulk + dashboard follow existing
patterns; the migration is trivial. The added agent-side `machine_uid` derivation and
threading it through the handshake/registration is the main new surface (bumps this
toward the upper end of Medium).
- **Depends on:** nothing blocking. **Pairs with** per-machine agent keys (roadmap) for
the full trust boundary on `machine_uid` — see Security; SPEC-004 degrades safely
without them.
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
live; one machine = one record/session), and complements SPEC-002 Phase 2's dashboard
hardening of the same surfaces.
## Open questions
1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
support sessions?
2. **`machine_uid` source mix** — `MachineGuid` alone, or folded with board/BIOS serial?
`MachineGuid` is stable and present everywhere but regenerates on sysprep and is
cloneable; adding a hardware serial reduces clone collisions but churns on hardware
swaps. Pick the recipe (proposed: `MachineGuid` primary, hashed).
3. **uid↔key binding model** — trust-on-first-use pinning of `machine_uid` to the agent
key, vs. requiring per-machine keys before honoring a uid. What's the interim policy
while the shared `AGENT_API_KEY` is still in use?
4. **Migration of existing rows** — legacy random-`agent_id` machine/session rows: let
them age out via the reaper + manual purge, or run a one-time reconcile that maps
known hosts to their new `machine_uid`? (Proposed: age-out + purge; no risky backfill.)
5. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
for audit while hiding it from the console; confirm operators don't expect a hard
purge.
6. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?