spec: add SPEC-004 session lifecycle reaping + operator removal

Stop orphaned managed sessions accumulating in the Operator Console and let
admins remove stale sessions/units individually and in bulk. Root cause
confirmed in code: the Sessions list is the in-memory SessionManager;
register_agent reconnect-reuse keys on a stable agent_id (session/mod.rs:169)
and persistent sessions are never reaped on disconnect (session/mod.rs:519-542),
so an agent reconnecting with a fresh agent_id leaves a new retained ghost
session each time (observed: 15 sessions/0 live, ~10 orphans for one machine
after a GuruConnect-client reconnect storm). Adds TTL sweep + same-machine
supersede, admin-gated audited purge + bulk endpoints, and dashboard
multi-select removal. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-30 16:05:32 -07:00
parent abf499cb23
commit ee900c6395
2 changed files with 159 additions and 0 deletions

View File

@@ -0,0 +1,158 @@
# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal
**Status:** Proposed
**Priority:** P1
**Requested By:** Mike (2026-05-30)
**Estimated Effort:** Medium
## Overview
Stop orphaned managed sessions from accumulating in the Operator Console, and give
operators a first-class way to remove stale sessions/units — per-row **and** in bulk
(multi-select mass delete). Today the Sessions view can show many dead rows that look
live, and the only per-row action is "End", which applies to a *live* session and does
nothing for an already-dead one — so junk just piles up with no way to clear it.
Success = (a) reconnecting/offline persistent agents no longer leave behind retained
ghost sessions, and (b) an admin can select one or many session rows (and stale machine
rows) and remove them from the console.
**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all
`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
**both** a lifecycle fix and a manual-removal control.
## Root cause (confirmed in code)
The Sessions list is served from the **in-memory `SessionManager`**, not the database:
`GET /api/sessions``list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there:
1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent`
(`session/mod.rs:169`) reuses an existing session only when
`self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the
agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the
misbehaving client did), the lookup misses and a **brand-new persistent session** is
created each time. The `agents` map holds only one session per agent_id, so prior
sessions become unreferenced yet remain in the `sessions` map.
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
removed entirely; persistent/managed sessions are deliberately retained
(`session/mod.rs:519542`) and there is **no TTL sweep**. An offline managed session
therefore lives in memory indefinitely, displayed alongside genuinely-live ones.
Net effect: N reconnects with unstable identity → N retained, never-expiring managed
sessions, none of which the UI can clear.
## Scope
### Included in v1
- **Lifecycle reaping (the fix):**
- Periodic background sweep that removes persistent sessions whose agent has been
offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
using the existing `last_heartbeat_instant`.
- On agent reconnect, **supersede** prior retained sessions for the same machine
(dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh
agent_id cannot strand the old session.
- On socket drop for a persistent agent, mark the session offline *and* eligible for
the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory
state with it).
- **Manual removal API (admin-gated, audited):**
- `DELETE /api/sessions/:id?purge=true` — remove the in-memory session record
(`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB
row. Distinguish from the existing `disconnect_session` (which ends a *live*
session) — purge works on dead rows.
- `POST /api/sessions/bulk` (or `DELETE` with a body) taking `{ ids: [...] , action:
"purge" | "end" }` for mass delete / bulk-end.
- Same stale-removal for the Machines view: extend the existing
`DELETE /api/machines/:agent_id` (`main.rs:387`) usage with a bulk variant for
stale units.
- **Dashboard UX:**
- Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the
existing End in `EndSessionDialog.tsx`).
- Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End)
on the Sessions view; mirror on `MachinesPage.tsx` for stale units.
### Explicitly out of scope
- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent
keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
audit trail (CLAUDE.md DB conventions).
## Architecture
- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to
`SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup;
extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style
removal that the API can call for dead rows.
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
`deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup —
never pre-applied via psql (see the 005→007 lesson).
- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind
the existing `AuthenticatedUser`/admin guard; emit audit `events` rows.
- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state,
bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
`DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
`purgeSession` / `bulkSessions`.
- **Protobuf:** none — this is server/dashboard only.
## Implementation details
- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove);
`server/src/main.rs:376388,636,661` (routes + handlers); `server/src/db/sessions.rs`
(purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
`deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
`EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`;
`dashboard/src/features/machines/MachinesPage.tsx`.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory +
soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output.
## Security considerations
- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
guard as `list_sessions`); never expose removal unauthenticated.
- Audit every removal to the `events` table (who, which session/machine, when, count
for bulk) — soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the
whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against
flapping); never reap an `is_online` or viewer-attached session.
## Testing strategy
- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the
prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL
persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and
filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same
hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead
session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K
ids removes exactly K.
- **Manual:** on the live console, reproduce against the Pavon machines, confirm the
ghost rows can be multi-selected and removed and do not reappear after the sweep.
## Effort estimate & dependencies
- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API
and dashboard work follows existing End/Delete patterns. The migration is trivial.
- **Depends on:** nothing blocking.
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
## Open questions
1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
support sessions?
2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once
that lands? v1 proposes hostname; revisit when per-machine keys ship.
3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
for audit while hiding it from the console; confirm operators don't expect a hard
purge.
4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?