guru-connect/docs/specs/SPEC-004-session-lifecycle-and-removal.md

# SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal

**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-05-30)
**Estimated Effort:** Medium

## Overview

Stop orphaned managed sessions from accumulating in the Operator Console, and give
operators a first-class way to remove stale sessions/units — per-row **and** in bulk
(multi-select mass delete). Today the Sessions view can show many dead rows that look
live, and the only per-row action is "End", which applies to a *live* session and does
nothing for an already-dead one — so junk just piles up with no way to clear it.
The durable fix is at registration: the **same machine must resolve to one stable
identity** so a repeated execution cannot mint duplicates in the first place; reaping
and manual removal then become defense-in-depth and cleanup, not the primary mechanism.

Success criteria:

- (a) The same machine, run repeatedly (even from a portable or misconfigured copy),
  registers to **one** record/session, with no duplicates.
- (b) Reconnecting or offline persistent agents no longer leave retained ghost sessions
  behind.
- (c) An admin can select one or many session rows (and stale machine rows) and remove
  them from the console.

**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all
`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
**both** a lifecycle fix and a manual-removal control.

## Root cause (confirmed in code)

The Sessions list is served from the **in-memory `SessionManager`**, not the database:
`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
(`session/mod.rs:584`). Three compounding defects let ghosts accumulate there:

0. **Machine identity is a config-file random UUID, not machine-derived.** The agent's
   `agent_id` is a random UUID minted by `generate_agent_id()` (`agent/src/config.rs:90`)
   on first run and persisted **only in the agent config file** (`config.rs:331`), or
   taken from `GURUCONNECT_AGENT_ID` (`config.rs:366`). A portable or misconfigured
   execution that cannot locate/write that config — e.g. the Pavon desktop launcher
   `guruconnect-pavon-raidersreef.exe` run repeatedly from a user Desktop — regenerates
   a **fresh** `agent_id` every launch. Because identity is not derived from the machine,
   the same physical box presents as N different agents. The DB upsert
   (`upsert_machine`, `ON CONFLICT (agent_id)`) and the session-reuse map both key on
   this id, so an unstable id defeats *both* dedupe layers at the source.
1. **Reconnect-reuse is keyed on that `agent_id`.** `register_agent`
   (`session/mod.rs:169`) reuses an existing session only when
   `self.agents.get(&agent_id)` resolves to an `is_online == false` session. With a new
   `agent_id` per launch (defect 0) the lookup misses and a **brand-new persistent
   session** is created each time. The `agents` map holds only one session per
   `agent_id`, so prior sessions become unreferenced yet remain in the `sessions` map.
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
   removed entirely; persistent/managed sessions are deliberately retained
   (`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session
   therefore lives in memory indefinitely, displayed alongside genuinely-live ones.

Net effect: N reconnects with unstable identity → N retained, never-expiring managed
sessions, none of which the UI can clear.
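
The interaction between the two maps can be reproduced in a few lines. The following is a toy model only, assuming a simplified `ToySessionManager` with hypothetical field names; it is not the real `SessionManager`, but it shows why a fresh key per launch retains N sessions while a stable key converges to one.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the server's session record (hypothetical).
struct Session {
    is_online: bool,
}

/// Toy model of the two maps described above: `agents` points at one session
/// per identity key, while `sessions` retains every session ever created.
#[derive(Default)]
struct ToySessionManager {
    agents: HashMap<String, u64>,
    sessions: HashMap<u64, Session>,
    next_id: u64,
}

impl ToySessionManager {
    /// Reuse the offline session for `key` if one exists; otherwise create one.
    fn register_agent(&mut self, key: &str) -> u64 {
        if let Some(&sid) = self.agents.get(key) {
            if let Some(s) = self.sessions.get_mut(&sid) {
                if !s.is_online {
                    s.is_online = true;
                    return sid;
                }
            }
        }
        let sid = self.next_id;
        self.next_id += 1;
        self.sessions.insert(sid, Session { is_online: true });
        // Overwrites any previous mapping for this key; the old session
        // stays in `sessions` with nothing pointing at it.
        self.agents.insert(key.to_string(), sid);
        sid
    }

    /// Persistent sessions are kept on disconnect and only flagged offline.
    fn disconnect(&mut self, sid: u64) {
        if let Some(s) = self.sessions.get_mut(&sid) {
            s.is_online = false;
        }
    }
}

/// Run `launches` connect/disconnect cycles and count the retained sessions.
fn retained_after(launches: usize, key_for: impl Fn(usize) -> String) -> usize {
    let mut mgr = ToySessionManager::default();
    for i in 0..launches {
        let sid = mgr.register_agent(&key_for(i));
        mgr.disconnect(sid);
    }
    mgr.sessions.len()
}
```

Calling `retained_after(10, |i| format!("random-{}", i))` models ten launches with a regenerated id, and `retained_after(10, |_| "stable-uid".to_string())` models the same ten launches with a machine-derived key.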

## Scope

### Included in v1

- **Stable, machine-derived identity (the primary fix):**
  - The agent computes a deterministic `machine_uid` from durable machine identifiers —
    primary source the Windows `MachineGuid`
    (`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`), optionally folded with a stable
    hardware id (board/BIOS serial) — hashed to a stable string. It is **recomputable**:
    a lost/absent config self-heals to the *same* id rather than minting a new random
    one. Persist a cached copy, but never depend on the config file for correctness.
  - Registration keys on `machine_uid`: `upsert_machine`'s `ON CONFLICT` and the
    in-memory `agents`/session-reuse map both use it, so the same box converges to **one**
    machine record and **one** managed session no matter how many times it executes.
  - Carry `machine_uid` in the agent connect handshake (`transport/websocket.rs:40`
    query params) / `AgentStatus`; keep the legacy random `agent_id` only as a
    migration fallback.
- **Lifecycle reaping (defense-in-depth):**
  - Periodic background sweep that removes persistent sessions whose agent has been
    offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
    using the existing `last_heartbeat_instant`.
  - On agent reconnect, **supersede** prior retained sessions for the same machine
    (dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh
    `agent_id` cannot strand the old session.
  - On socket drop for a persistent agent, mark the session offline *and* eligible for
    the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory
    state with it).
- **Manual removal API (admin-gated, audited):**
  - `DELETE /api/sessions/:id?purge=true` — remove the in-memory session record
    (`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB
    row. Distinguish from the existing `disconnect_session` (which ends a *live*
    session) — purge works on dead rows.
  - `POST /api/sessions/bulk` (or `DELETE` with a body) taking
    `{ ids: [...], action: "purge" | "end" }` for mass delete / bulk-end.
  - Same stale-removal for the Machines view: add a bulk variant of the existing
    `DELETE /api/machines/:agent_id` route (`main.rs:387`) for stale units.
- **Dashboard UX:**
  - Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the
    existing End in `EndSessionDialog.tsx`).
  - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End)
    on the Sessions view; mirror on `MachinesPage.tsx` for stale units.
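
The `machine_uid` derivation in the first scope item could look roughly like the sketch below. It makes several assumptions: the caller has already read `MachineGuid` from the registry (the read itself is not shown), the function name `derive_machine_uid`, the `muid-` prefix, and the domain-separation string are all hypothetical, and FNV-1a is used only because it is deterministic and needs no external crate. A real implementation would more likely use a cryptographic hash such as SHA-256.

```rust
/// 64-bit FNV-1a. A stand-in for a cryptographic hash, chosen here only
/// because it is deterministic across runs and dependency-free.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Derive a stable identifier from durable machine inputs.
/// `machine_guid` is the registry value supplied by the caller;
/// `hardware_serial` is the optional board/BIOS serial.
fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> String {
    // Normalise so cosmetic differences (case, braces, whitespace)
    // do not change the resulting uid.
    let guid = machine_guid
        .trim()
        .trim_matches(|c: char| c == '{' || c == '}')
        .to_ascii_lowercase();

    // Hypothetical version tag so the recipe can change later without
    // silently colliding with old uids.
    let mut material = format!("guruconnect-machine-uid-v1\n{}", guid);

    // Fold in the serial only when one is actually present.
    if let Some(serial) = hardware_serial.map(str::trim).filter(|s| !s.is_empty()) {
        material.push('\n');
        material.push_str(&serial.to_ascii_uppercase());
    }

    format!("muid-{:016x}", fnv1a64(material.as_bytes()))
}
```

Because the function is pure, a lost config file costs nothing: the agent recomputes the same string from the same inputs, and the persisted copy is only a cache.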

### Explicitly out of scope

- Surviving a full machine **reimage/clone** with the same identity. `MachineGuid`
  regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged
  box legitimately becomes a new `machine_uid` (and a clone collision is caught by the
  auth binding below). Cross-reimage identity continuity is out of scope for v1.
- Replacing the shared `AGENT_API_KEY` with per-machine agent keys — tracked separately
  (roadmap GuruRMM-Integration). SPEC-004 *assumes* that binding for its threat model
  (see Security) and degrades safely without it, but does not implement it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
  mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
  audit trail (CLAUDE.md DB conventions).

## Architecture

- **Agent (`agent/src/`):** new `identity` module computes `machine_uid` deterministically
  (Windows `MachineGuid` primary; non-Windows fallback to a stable persisted UUID).
  Replace/augment `generate_agent_id()` (`config.rs:90`) so the effective id is the
  machine-derived value, with the config-file value used only as a cache. Send
  `machine_uid` in the connect query string (`transport/websocket.rs:40`) and on
  `AgentStatus`.
- **Relay-server (`server/src/session/mod.rs`):** key `register_agent` and the
  `agents` map on `machine_uid` so the same machine reuses one session; add
  `reap_stale_persistent(ttl)` to `SessionManager` + a periodic task (e.g. every 60 s)
  from server startup; supersede any prior same-machine sessions on reconnect; add a
  `purge`-style removal the API can call for dead rows.
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
  `deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
  a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
  Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup —
  never pre-applied via psql (see the 005→007 lesson).
- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind
  the existing `AuthenticatedUser`/admin guard; emit audit `events` rows.
- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state,
  bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
  `DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
  `purgeSession` / `bulkSessions`.
- **Protobuf:** add `machine_uid` to the agent identity carried on `AgentStatus` (the
  connect handshake passes it as a query param; mirroring it on `AgentStatus` lets the
  server reconcile mid-session). No other protobuf changes are needed.
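
A minimal sketch of the `reap_stale_persistent(ttl)` sweep mentioned above, under stated assumptions: the `Session` struct is a hypothetical subset of the real one, `is_persistent` and `viewer_count` are illustrative field names, and `now` is passed in explicitly so the logic can be tested without sleeping. Only `is_online` and `last_heartbeat_instant` come from this spec.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical subset of the session fields the sweep needs.
struct Session {
    is_persistent: bool,
    is_online: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    /// Remove persistent sessions that are offline, have no viewers, and have
    /// not sent a heartbeat for longer than `ttl`. Returns the removed ids.
    fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<u64> {
        // Collect first, then remove, so the map is not mutated while iterating.
        let stale: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();

        for id in &stale {
            self.sessions.remove(id);
        }
        stale
    }
}
```

The periodic task would then call this with `Instant::now()` on each tick (the spec suggests roughly every 60 s) and could log or audit the returned ids. Online and viewer-attached sessions are skipped unconditionally, which matches the reaping constraint in the Security section.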

## Implementation details

- Files to touch: `agent/src/identity/` (new — `machine_uid` derivation),
  `agent/src/config.rs:90,366` (effective id = machine-derived),
  `agent/src/transport/websocket.rs:40` (send `machine_uid`);
  `server/src/session/mod.rs:169,519,548,584` (key on `machine_uid`; reuse/reap/remove);
  `server/src/relay/mod.rs:584,591` (registration path);
  `server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs`
  (purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
  `deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
  `EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`;
  `dashboard/src/features/machines/MachinesPage.tsx`.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory +
  soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output.
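
The "remove in-memory + soft-delete DB" semantics, together with a per-call cap on bulk requests, can be sketched as follows. This is a toy model rather than the real handlers: `Store`, `PurgeError`, and `visible_ids` are hypothetical names, the two collections stand in for the `SessionManager` map and the `deleted_at` column, and the cap of 100 is only the example value floated in the open questions.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative cap; the real value is still an open question.
const MAX_BULK_IDS: usize = 100;

#[derive(Debug, PartialEq)]
enum PurgeError {
    Empty,
    TooMany { requested: usize, max: usize },
}

/// Toy stand-ins: `live` models the in-memory session map and
/// `db_deleted_at` models the `deleted_at` column (None = visible).
struct Store {
    live: HashSet<String>,
    db_deleted_at: HashMap<String, Option<u64>>,
}

impl Store {
    /// Purge = remove the in-memory record AND soft-delete the DB row.
    /// Returns true only if a DB row was newly soft-deleted.
    fn purge_session(&mut self, id: &str, now_unix: u64) -> bool {
        self.live.remove(id);
        if let Some(slot) = self.db_deleted_at.get_mut(id) {
            if slot.is_none() {
                *slot = Some(now_unix);
                return true;
            }
        }
        false
    }

    /// Validate the request size, then purge each id. Returns how many rows
    /// were newly soft-deleted (useful for the audit event's count).
    fn bulk_purge(&mut self, ids: &[String], now_unix: u64) -> Result<usize, PurgeError> {
        if ids.is_empty() {
            return Err(PurgeError::Empty);
        }
        if ids.len() > MAX_BULK_IDS {
            return Err(PurgeError::TooMany { requested: ids.len(), max: MAX_BULK_IDS });
        }
        let mut purged = 0;
        for id in ids {
            if self.purge_session(id, now_unix) {
                purged += 1;
            }
        }
        Ok(purged)
    }

    /// What a list query filtered on `deleted_at IS NULL` would return.
    fn visible_ids(&self) -> Vec<String> {
        let mut ids: Vec<String> = self
            .db_deleted_at
            .iter()
            .filter(|(_, deleted)| deleted.is_none())
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

In this sketch purging is idempotent: repeating a request for ids that are already soft-deleted (or unknown) succeeds and reports zero newly purged rows, which keeps retries from the dashboard harmless.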

## Security considerations

- **Identity is not authorization.** A client-asserted `machine_uid` is self-reported
  and therefore spoofable — on its own, agent A could claim agent B's `machine_uid` to
  bind to (and hijack) B's session and machine record. The `machine_uid` must be
  **bound to the agent's authenticated credential**: the server accepts a given
  `machine_uid` only from a connection authenticated by that machine's own agent key
  (or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key
  pair). This is why per-machine agent keys (roadmap) are the natural companion; until
  they ship, the shared `AGENT_API_KEY` means `machine_uid` is a *correctness* improvement
  (dedupe) but not yet a *trust* boundary — call this out so it isn't mistaken for one.
  A clone collision (two boxes, same `MachineGuid`) surfaces here as two agents claiming
  one uid and is resolved by the key binding, not by the uid alone.
- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
  guard as `list_sessions`); never expose removal unauthenticated.
- Audit every removal to the `events` table (who, which session/machine, when, count
  for bulk) — soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the
  whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against
  flapping); never reap an `is_online` or viewer-attached session.
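
The trust-on-first-use pinning of the uid↔key pair described in the first bullet could be modelled like this. It is a sketch under assumptions: `UidKeyBindings`, `BindDecision`, and the notion of a `key_fingerprint` string are hypothetical, and the bindings live in a plain in-memory map where a real server would presumably persist them.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum BindDecision {
    /// First time this uid is seen: pin it to the presenting key.
    Pinned,
    /// The uid is already pinned to this same key.
    Accepted,
    /// The uid is pinned to a different key: spoof attempt or clone collision.
    Rejected,
}

/// Maps `machine_uid` to a fingerprint of the agent key that first claimed it.
#[derive(Default)]
struct UidKeyBindings {
    pinned: HashMap<String, String>,
}

impl UidKeyBindings {
    /// Decide whether a connection authenticated with `key_fingerprint`
    /// may bind to `machine_uid`.
    fn check(&mut self, machine_uid: &str, key_fingerprint: &str) -> BindDecision {
        if let Some(existing) = self.pinned.get(machine_uid) {
            return if existing.as_str() == key_fingerprint {
                BindDecision::Accepted
            } else {
                BindDecision::Rejected
            };
        }
        self.pinned
            .insert(machine_uid.to_string(), key_fingerprint.to_string());
        BindDecision::Pinned
    }
}
```

While every agent presents the shared `AGENT_API_KEY`, all fingerprints are identical and this check can never return `Rejected`, which is the concrete reason `machine_uid` is not yet a trust boundary. With per-machine keys, a second box claiming a pinned uid is refused at this point instead of being bound to the existing session.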

## Testing strategy

- **Unit:** `machine_uid` derivation is deterministic — same machine inputs yield the
  same uid across runs, and an absent config recomputes the same value (no fresh random
  id). `register_agent` for the same `machine_uid` reuses/supersedes the prior session
  (no duplicate retained) even when the legacy `agent_id` differs.
  `reap_stale_persistent` removes offline-past-TTL persistent sessions and spares
  online/within-TTL ones. `purge_session` soft-deletes and filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, **varying `agent_id` but the
  same `machine_uid`**, as the Pavon launcher did) → assert `list_sessions()` converges
  to one live session and `connect_machines` holds one row, not M. A spoof attempt
  (uid X presented on a connection not authenticated for X) is rejected/not bound. Purge
  a dead session via API → gone from list + `deleted_at` set + audit row written. Bulk
  purge of K ids removes exactly K.
- **Manual:** on the live console, reproduce against the Pavon machines, then confirm
  that the ghost rows can be multi-selected and removed, and that they do not reappear
  after the sweep.

## Effort estimate & dependencies

- **Size: Medium.** Reaping + supersede + purge/bulk + dashboard follow existing
  patterns; the migration is trivial. The agent-side `machine_uid` derivation, and
  threading it through the handshake and registration, are the main new surface, which
  pushes this toward the upper end of Medium.
- **Depends on:** nothing blocking. **Pairs with** per-machine agent keys (roadmap) for
  the full trust boundary on `machine_uid` — see Security; SPEC-004 degrades safely
  without them.
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
  live; one machine = one record/session). It also complements SPEC-002 Phase 2's
  dashboard hardening of the same surfaces.

## Open questions

1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
   support sessions?
2. **`machine_uid` source mix** — `MachineGuid` alone, or folded with board/BIOS serial?
   `MachineGuid` is stable and present everywhere but regenerates on sysprep and is
   cloneable; adding a hardware serial reduces clone collisions but churns on hardware
   swaps. Pick the recipe (proposed: `MachineGuid` primary, hashed).
3. **uid↔key binding model** — trust-on-first-use pinning of `machine_uid` to the agent
   key, vs. requiring per-machine keys before honoring a uid. What's the interim policy
   while the shared `AGENT_API_KEY` is still in use?
4. **Migration of existing rows** — legacy random-`agent_id` machine/session rows: let
   them age out via the reaper + manual purge, or run a one-time reconcile that maps
   known hosts to their new `machine_uid`? (Proposed: age-out + purge; no risky backfill.)
5. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
   for audit while hiding it from the console; confirm operators don't expect a hard
   purge.
6. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?