
Mike Swanson ee900c6395 spec: add SPEC-004 session lifecycle reaping + operator removal

Stop orphaned managed sessions accumulating in the Operator Console and let
admins remove stale sessions/units individually and in bulk. Root cause
confirmed in code: the Sessions list is the in-memory SessionManager;
register_agent reconnect-reuse keys on a stable agent_id (session/mod.rs:169)
and persistent sessions are never reaped on disconnect (session/mod.rs:519-542),
so an agent reconnecting with a fresh agent_id leaves a new retained ghost
session each time (observed: 15 sessions/0 live, ~10 orphans for one machine
after a GuruConnect-client reconnect storm). Adds TTL sweep + same-machine
supersede, admin-gated audited purge + bulk endpoints, and dashboard
multi-select removal. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-30 16:05:32 -07:00


SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal

Status: Proposed
Priority: P1
Requested By: Mike (2026-05-30)
Estimated Effort: Medium

Overview

Stop orphaned managed sessions from accumulating in the Operator Console, and give operators a first-class way to remove stale sessions/units — per-row and in bulk (multi-select mass delete). Today the Sessions view can show many dead rows that look live, and the only per-row action is "End", which applies to a live session and does nothing for an already-dead one — so junk just piles up with no way to clear it. Success = (a) reconnecting/offline persistent agents no longer leave behind retained ghost sessions, and (b) an admin can select one or many session rows (and stale machine rows) and remove them from the console.

Observed (live console, 2026-05-30): the Sessions view listed 15 sessions, 0 live, of which ~10 were duplicate MANAGED rows for a single machine (DESKTOP-I66IM5Q / Pavon-Raiders), each a distinct session UUID, all NOT REQUIRED consent, no viewers, no duration, all "37 minutes ago". That machine had just been cleaned of a misbehaving GuruConnect client that was reconnecting in a loop — that reconnect storm is what produced the orphans, which is exactly why this needs both a lifecycle fix and a manual-removal control.

Root cause (confirmed in code)

The Sessions list is served from the in-memory SessionManager, not the database: GET /api/sessions → list_sessions (main.rs:636) → state.sessions.list_sessions() (session/mod.rs:584). Two compounding defects let ghosts accumulate there:

1. Reconnect-reuse is keyed on a stable agent_id. register_agent (session/mod.rs:169) reuses an existing session only when self.agents.get(&agent_id) resolves to an is_online == false session. If the agent reconnects with a new agent_id (a per-process/regenerated identity, as the misbehaving client did), the lookup misses and a brand-new persistent session is created each time. The agents map holds only one session per agent_id, so prior sessions become unreferenced yet remain in the sessions map.
2. Persistent sessions are never reaped. On disconnect, only support sessions are removed entirely; persistent/managed sessions are deliberately retained (session/mod.rs:519–542) and there is no TTL sweep. An offline managed session therefore lives in memory indefinitely, displayed alongside genuinely-live ones.

Net effect: N reconnects with unstable identity → N retained, never-expiring managed sessions, none of which the UI can clear.
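The net effect can be reproduced in a toy model. The sketch below is not the code in session/mod.rs; it is a simplified stand-in that assumes only what is described above (a sessions map, an agents map keyed by agent_id, and persistent sessions retained on disconnect). The names Session, disconnect and storm, and the integer session ids, are hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for a retained persistent session.
struct Session {
    is_online: bool,
}

#[derive(Default)]
struct SessionManager {
    sessions: HashMap<u64, Session>, // session id -> session (what the Sessions view lists)
    agents: HashMap<String, u64>,    // agent_id -> session id (one entry per agent_id)
    next_id: u64,
}

impl SessionManager {
    /// Reuse the offline session registered under this agent_id, else create a new one.
    fn register_agent(&mut self, agent_id: &str) -> u64 {
        if let Some(&sid) = self.agents.get(agent_id) {
            if let Some(s) = self.sessions.get_mut(&sid) {
                if !s.is_online {
                    s.is_online = true;
                    return sid;
                }
            }
        }
        let sid = self.next_id;
        self.next_id += 1;
        self.sessions.insert(sid, Session { is_online: true });
        self.agents.insert(agent_id.to_string(), sid);
        sid
    }

    /// Persistent sessions are only flagged offline on disconnect, never removed.
    fn disconnect(&mut self, sid: u64) {
        if let Some(s) = self.sessions.get_mut(&sid) {
            s.is_online = false;
        }
    }
}

/// Run `n` connect/disconnect cycles and return how many sessions are retained.
fn storm(n: usize, stable_id: bool) -> usize {
    let mut mgr = SessionManager::default();
    for i in 0..n {
        let agent_id = if stable_id {
            "agent-0".to_string()
        } else {
            format!("agent-{i}") // a fresh identity on every reconnect
        };
        let sid = mgr.register_agent(&agent_id);
        mgr.disconnect(sid);
    }
    mgr.sessions.len()
}
```

In this model, storm(10, true) retains a single session because every reconnect hits the reuse path, while storm(10, false) retains ten, which matches the shape of the observed orphan count.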

Scope

Included in v1

Lifecycle reaping (the fix):
- Periodic background sweep that removes persistent sessions whose agent has been offline (is_online == false) longer than a TTL (default 10 min, configurable), using the existing last_heartbeat_instant.
- On agent reconnect, supersede prior retained sessions for the same machine (dedupe by hostname/machine identity, not only exact agent_id) so a fresh agent_id cannot strand the old session.
- On socket drop for a persistent agent, mark the session offline and eligible for the sweep (DB end_session already fires at relay/mod.rs:892; align in-memory state with it).
Manual removal API (admin-gated, audited):
- DELETE /api/sessions/:id?purge=true — remove the in-memory session record (SessionManager::remove_session, session/mod.rs:548) and soft-delete the DB row. Distinguish from the existing disconnect_session (which ends a live session) — purge works on dead rows.
- POST /api/sessions/bulk (or DELETE with a body) taking { ids: [...] , action: "purge" | "end" } for mass delete / bulk-end.
- Same stale-removal for the Machines view: complement the existing DELETE /api/machines/:agent_id (main.rs:387) with a bulk variant for stale units.
Dashboard UX:
- Per-row Remove action on SessionsPage.tsx for non-live rows (alongside the existing End in EndSessionDialog.tsx).
- Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End) on the Sessions view; mirror on MachinesPage.tsx for stale units.
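As an illustration of the same-machine supersede listed under lifecycle reaping, here is a minimal sketch. It assumes hostname is the dedupe key (the v1 proposal) and compares it exactly; the struct layout, integer ids and the disconnect helper are hypothetical, and the real register_agent takes more inputs than shown.

```rust
use std::collections::HashMap;

// Hypothetical, simplified session record.
struct Session {
    agent_id: String,
    hostname: String,
    is_online: bool,
}

#[derive(Default)]
struct SessionManager {
    sessions: HashMap<u64, Session>,
    agents: HashMap<String, u64>,
    next_id: u64,
}

impl SessionManager {
    fn register_agent(&mut self, agent_id: &str, hostname: &str) -> u64 {
        // Existing behaviour: same agent_id with an offline session means reuse it.
        let reused = match self.agents.get(agent_id) {
            Some(&sid) => match self.sessions.get_mut(&sid) {
                Some(s) if !s.is_online => {
                    s.is_online = true;
                    Some(sid)
                }
                _ => None,
            },
            None => None,
        };

        // Proposed: supersede every other offline session retained for this machine.
        let stale: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(id, s)| Some(**id) != reused && s.hostname == hostname && !s.is_online)
            .map(|(id, _)| *id)
            .collect();
        for id in stale {
            if let Some(old) = self.sessions.remove(&id) {
                // Drop the agents entry only if it still points at the removed session.
                if self.agents.get(&old.agent_id) == Some(&id) {
                    self.agents.remove(&old.agent_id);
                }
            }
        }

        if let Some(sid) = reused {
            return sid;
        }
        let sid = self.next_id;
        self.next_id += 1;
        self.sessions.insert(
            sid,
            Session {
                agent_id: agent_id.to_string(),
                hostname: hostname.to_string(),
                is_online: true,
            },
        );
        self.agents.insert(agent_id.to_string(), sid);
        sid
    }

    fn disconnect(&mut self, sid: u64) {
        if let Some(s) = self.sessions.get_mut(&sid) {
            s.is_online = false;
        }
    }
}
```

Only offline sessions are superseded here, so a second live agent reporting the same hostname is left alone. Whether that is the right behaviour for hostname collisions is a design choice this sketch does not settle.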

Explicitly out of scope

Stabilizing the agent's agent_id across reinstalls (overlaps "Per-machine agent keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
"Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request mentioned it; deferred unless a concrete edit field is identified.
Hard-deleting DB session history — v1 soft-deletes (deleted_at) to preserve the audit trail (CLAUDE.md DB conventions).

Architecture

Relay-server (server/src/session/mod.rs): add reap_stale_persistent(ttl) to SessionManager and spawn a periodic task (e.g. every 60 s) from server startup; extend register_agent to supersede prior same-machine sessions; add a purge-style removal that the API can call for dead rows.
DB (server/src/db/sessions.rs + migration 008/009): add deleted_at TIMESTAMPTZ to connect_sessions; add purge_session (soft-delete) and a bulk variant; get_recent_sessions and the list queries filter on deleted_at IS NULL. The migration uses an idempotent ADD COLUMN IF NOT EXISTS and is applied by sqlx::migrate!() on startup, never pre-applied via psql (see the 005→007 lesson).
API (server/src/main.rs, server/src/api/): new purge + bulk routes, all behind the existing AuthenticatedUser/admin guard; each removal writes an audit row to the events table.
Dashboard (dashboard/src/features/sessions/, .../machines/): selection state, bulk-action bar, Remove confirmation reusing the EndSessionDialog/DeleteMachineDialog patterns; dashboard/src/api/sessions.ts gains purgeSession / bulkSessions.
Protobuf: none — this is server/dashboard only.
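A minimal sketch of what reap_stale_persistent could look like, assuming hypothetical field names is_persistent and viewer_count next to the existing is_online and last_heartbeat_instant. The extra now parameter is not part of the proposed signature; it is added here only so the logic can be exercised without sleeping. The periodic task that would call this (for example every 60 s) is omitted.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical, simplified session record.
struct Session {
    is_persistent: bool,
    is_online: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    /// Remove persistent sessions that are offline, have no viewers attached,
    /// and whose last heartbeat is older than `ttl`. Returns the reaped ids.
    fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<u64> {
        let stale: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    // saturating: a heartbeat "in the future" counts as zero elapsed
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();
        for id in &stale {
            self.sessions.remove(id);
        }
        stale
    }
}
```

Returning the reaped ids lets the caller log or audit what the sweep removed. A caller would pass Duration::from_secs(600) for the proposed 10-minute default and Instant::now() for now.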

Implementation details

Files to touch: server/src/session/mod.rs:169,519,548,584 (reuse/reap/remove); server/src/main.rs:376–388,636,661 (routes + handlers); server/src/db/sessions.rs (purge + bulk + deleted_at filtering); server/migrations/ (new migration for deleted_at); dashboard/src/features/sessions/SessionsPage.tsx, EndSessionDialog.tsx, dashboard/src/api/sessions.ts; dashboard/src/features/machines/MachinesPage.tsx.
Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory + soft-delete DB. A reaped/purged session must vanish from list_sessions() output.
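The "remove in-memory + soft-delete DB" rule can be sketched as follows. This is a model only: the db map stands in for the connect_sessions table, list_recent stands in for a query with deleted_at IS NULL, and the State, DbRow and PurgeError names are hypothetical. Rejecting a purge of a live session is an assumption drawn from "purge works on dead rows", not a confirmed requirement.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

#[derive(Debug, PartialEq)]
enum PurgeError {
    NotFound,
    StillLive, // live sessions should go through End, not purge (assumption)
}

struct Session {
    is_online: bool,
}

// Stand-in for one connect_sessions row.
struct DbRow {
    deleted_at: Option<SystemTime>,
}

struct State {
    sessions: HashMap<u64, Session>, // in-memory SessionManager contents
    db: HashMap<u64, DbRow>,         // stand-in for the connect_sessions table
}

impl State {
    /// Remove the in-memory record and soft-delete the DB row.
    fn purge_session(&mut self, id: u64, now: SystemTime) -> Result<(), PurgeError> {
        if let Some(s) = self.sessions.get(&id) {
            if s.is_online {
                return Err(PurgeError::StillLive);
            }
        }
        let in_memory = self.sessions.remove(&id).is_some();
        let in_db = match self.db.get_mut(&id) {
            Some(row) => {
                // Keep the first deletion timestamp so repeated purges are idempotent.
                if row.deleted_at.is_none() {
                    row.deleted_at = Some(now);
                }
                true
            }
            None => false,
        };
        if in_memory || in_db {
            Ok(())
        } else {
            Err(PurgeError::NotFound)
        }
    }

    /// Equivalent of a list query filtered on deleted_at IS NULL.
    fn list_recent(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .db
            .iter()
            .filter(|(_, row)| row.deleted_at.is_none())
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

The row itself is never removed from db, which is what preserves the audit trail while hiding the session from the console.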

Security considerations

All purge/bulk endpoints require an authenticated admin (AuthenticatedUser, same guard as list_sessions); never expose removal unauthenticated.
Audit every removal to the events table (who, which session/machine, when, count for bulk) — soft-delete + audit, not silent hard-delete.
Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the whole fleet by accident or abuse.
Reaping must not end a session that is merely briefly offline (the TTL guards against flapping), and must never reap a session that is online or has a viewer attached.
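The bulk size limit could be enforced before any removal happens. The sketch below assumes ids arrive as strings (session UUIDs) and that the cap is passed in by the caller; the cap value itself is still an open question. BulkAction, BulkError, parse_action and validate_bulk_ids are hypothetical names, not existing server functions.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum BulkAction {
    Purge,
    End,
}

#[derive(Debug, PartialEq)]
enum BulkError {
    Empty,
    TooMany { got: usize, max: usize },
}

/// Map the request's `action` string onto a known action; anything else is rejected.
fn parse_action(raw: &str) -> Option<BulkAction> {
    match raw {
        "purge" => Some(BulkAction::Purge),
        "end" => Some(BulkAction::End),
        _ => None,
    }
}

/// Deduplicate ids (keeping first-seen order) and enforce the per-call cap.
fn validate_bulk_ids(ids: &[String], max: usize) -> Result<Vec<String>, BulkError> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for id in ids {
        if seen.insert(id.as_str()) {
            unique.push(id.clone());
        }
    }
    if unique.is_empty() {
        return Err(BulkError::Empty);
    }
    if unique.len() > max {
        return Err(BulkError::TooMany { got: unique.len(), max });
    }
    Ok(unique)
}
```

Deduplicating first means the audit row's count reflects the number of distinct sessions touched, and a request that repeats one id many times cannot be used to bypass or trip the cap.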

Testing strategy

Unit: register_agent with a new agent_id for an existing hostname supersedes the prior session (no duplicate retained). reap_stale_persistent removes offline-past-TTL persistent sessions and spares online/within-TTL ones. purge_session soft-deletes and filters out of list queries.
Integration: simulate a reconnect storm (M connects, varying agent_id, same hostname) → assert list_sessions() converges to one live session, not M. Purge a dead session via API → gone from list + deleted_at set + audit row written. Bulk purge of K ids removes exactly K.
Manual: on the live console, reproduce against the Pavon machines and confirm that the ghost rows can be multi-selected and removed, and that they do not reappear after the sweep.

Effort estimate & dependencies

Size: Medium. Reaping + supersede logic is contained to SessionManager; the API and dashboard work follows existing End/Delete patterns. The migration is trivial.
Depends on: nothing blocking.
Unblocks: a trustworthy Sessions/Machines view (dead rows no longer masquerade as live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.

Open questions

Reap TTL default — 10 min proposed; confirm. Should it differ for managed vs. support sessions?
Dedupe key on reconnect — by hostname, or by the per-machine agent key once that lands? v1 proposes hostname; revisit when per-machine keys ship.
Purge vs. keep history — soft-delete (proposed) keeps connect_sessions history for audit while hiding it from the console; confirm operators don't expect a hard purge.
Bulk-action cap — what's a sane max N per bulk call (e.g. 100)?
