guru-connect/docs/specs/SPEC-004-session-lifecycle-and-removal.md

# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal

**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-05-30)
**Estimated Effort:** Medium

## Overview

Stop orphaned managed sessions from accumulating in the Operator Console, and give
operators a first-class way to remove stale sessions/units — per-row **and** in bulk
(multi-select mass delete). Today the Sessions view can show many dead rows that look
live, and the only per-row action is "End", which applies to a *live* session and does
nothing for an already-dead one — so junk just piles up with no way to clear it.
Success = (a) reconnecting/offline persistent agents no longer leave behind retained
ghost sessions, and (b) an admin can select one or many session rows (and stale machine
rows) and remove them from the console.

**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all
`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
**both** a lifecycle fix and a manual-removal control.

## Root cause (confirmed in code)

The Sessions list is served from the **in-memory `SessionManager`**, not the database:
`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there:

1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent`
   (`session/mod.rs:169`) reuses an existing session only when
   `self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the
   agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the
   misbehaving client did), the lookup misses and a **brand-new persistent session** is
   created each time. The `agents` map holds only one session per agent_id, so prior
   sessions become unreferenced yet remain in the `sessions` map.
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
   removed entirely; persistent/managed sessions are deliberately retained
   (`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session
   therefore lives in memory indefinitely, displayed alongside genuinely-live ones.

Net effect: N reconnects with unstable identity → N retained, never-expiring managed
sessions, none of which the UI can clear.

## Scope

### Included in v1

- **Lifecycle reaping (the fix):**
  - Periodic background sweep that removes persistent sessions whose agent has been
    offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
    using the existing `last_heartbeat_instant`.
  - On agent reconnect, **supersede** prior retained sessions for the same machine
    (dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh
    agent_id cannot strand the old session.
  - On socket drop for a persistent agent, mark the session offline *and* eligible for
    the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory
    state with it).
- **Manual removal API (admin-gated, audited):**
  - `DELETE /api/sessions/:id?purge=true` — remove the in-memory session record
    (`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB
    row. Distinguish from the existing `disconnect_session` (which ends a *live*
    session) — purge works on dead rows.
  - `POST /api/sessions/bulk` (or `DELETE` with a body) taking `{ ids: [...] , action:
    "purge" | "end" }` for mass delete / bulk-end.
  - Same stale-removal for the Machines view: extend the existing
    `DELETE /api/machines/:agent_id` (`main.rs:387`) usage with a bulk variant for
    stale units.
- **Dashboard UX:**
  - Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the
    existing End in `EndSessionDialog.tsx`).
  - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End)
    on the Sessions view; mirror on `MachinesPage.tsx` for stale units.

### Explicitly out of scope

- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent
  keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
  mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
  audit trail (CLAUDE.md DB conventions).

## Architecture

- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to
  `SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup;
  extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style
  removal that the API can call for dead rows.
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
  `deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
  a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
  Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup —
  never pre-applied via psql (see the 005→007 lesson).
- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind
  the existing `AuthenticatedUser`/admin guard; emit audit `events` rows.
- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state,
  bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
  `DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
  `purgeSession` / `bulkSessions`.
- **Protobuf:** none — this is server/dashboard only.

## Implementation details

- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove);
  `server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs`
  (purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
  `deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
  `EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`;
  `dashboard/src/features/machines/MachinesPage.tsx`.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory +
  soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output.

## Security considerations

- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
  guard as `list_sessions`); never expose removal unauthenticated.
- Audit every removal to the `events` table (who, which session/machine, when, count
  for bulk) — soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the
  whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against
  flapping); never reap an `is_online` or viewer-attached session.

## Testing strategy

- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the
  prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL
  persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and
  filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same
  hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead
  session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K
  ids removes exactly K.
- **Manual:** on the live console, reproduce against the Pavon machines, confirm the
  ghost rows can be multi-selected and removed and do not reappear after the sweep.

## Effort estimate & dependencies

- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API
  and dashboard work follows existing End/Delete patterns. The migration is trivial.
- **Depends on:** nothing blocking.
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
  live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.

## Open questions

1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
   support sessions?
2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once
   that lands? v1 proposes hostname; revisit when per-machine keys ship.
3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
   for audit while hiding it from the console; confirm operators don't expect a hard
   purge.
4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?