Stop orphaned managed sessions accumulating in the Operator Console and let admins remove stale sessions/units individually and in bulk. Root cause confirmed in code: the Sessions list is the in-memory SessionManager; register_agent reconnect-reuse keys on a stable agent_id (session/mod.rs:169) and persistent sessions are never reaped on disconnect (session/mod.rs:519-542), so an agent reconnecting with a fresh agent_id leaves a new retained ghost session each time (observed: 15 sessions/0 live, ~10 orphans for one machine after a GuruConnect-client reconnect storm). Adds TTL sweep + same-machine supersede, admin-gated audited purge + bulk endpoints, and dashboard multi-select removal. Requested by Mike 2026-05-30. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal
Status: Proposed
Priority: P1
Requested By: Mike (2026-05-30)
Estimated Effort: Medium
Overview
Stop orphaned managed sessions from accumulating in the Operator Console, and give operators a first-class way to remove stale sessions/units — per-row and in bulk (multi-select mass delete). Today the Sessions view can show many dead rows that look live, and the only per-row action is "End", which applies to a live session and does nothing for an already-dead one — so junk just piles up with no way to clear it. Success = (a) reconnecting/offline persistent agents no longer leave behind retained ghost sessions, and (b) an admin can select one or many session rows (and stale machine rows) and remove them from the console.
Observed (live console, 2026-05-30): the Sessions view listed 15 sessions, 0
live, of which ~10 were duplicate MANAGED rows for a single machine
(DESKTOP-I66IM5Q / Pavon-Raiders), each a distinct session UUID, all
NOT REQUIRED consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect client that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
both a lifecycle fix and a manual-removal control.
Root cause (confirmed in code)
The Sessions list is served from the in-memory SessionManager, not the database:
GET /api/sessions → list_sessions (main.rs:636) → state.sessions.list_sessions()
(session/mod.rs:584). Two compounding defects let ghosts accumulate there:
- Reconnect-reuse is keyed on a stable agent_id. register_agent (session/mod.rs:169) reuses an existing session only when self.agents.get(&agent_id) resolves to an is_online == false session. If the agent reconnects with a new agent_id (a per-process/regenerated identity, as the misbehaving client did), the lookup misses and a brand-new persistent session is created each time. The agents map holds only one session per agent_id, so prior sessions become unreferenced yet remain in the sessions map.
- Persistent sessions are never reaped. On disconnect, only support sessions are removed entirely; persistent/managed sessions are deliberately retained (session/mod.rs:519–542) and there is no TTL sweep. An offline managed session therefore lives in memory indefinitely, displayed alongside genuinely-live ones.
Net effect: N reconnects with unstable identity → N retained, never-expiring managed sessions, none of which the UI can clear.
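The accumulation can be reproduced with a toy model of the two maps. This is a simplified sketch, not the real SessionManager: ToyManager, ToySession, the u64 session ids, and reconnect_storm are hypothetical names introduced for illustration. Only the agents/sessions split, the exact-agent_id lookup, and the retain-on-disconnect behaviour come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a retained session (the real struct has more fields).
struct ToySession {
    hostname: String,
    is_online: bool,
}

/// Hypothetical stand-in for SessionManager: `agents` maps agent_id -> session id,
/// `sessions` maps session id -> session record.
#[derive(Default)]
struct ToyManager {
    agents: HashMap<String, u64>,
    sessions: HashMap<u64, ToySession>,
    next_id: u64,
}

impl ToyManager {
    /// Reuse only on an exact agent_id hit whose session is offline;
    /// otherwise create a new persistent session.
    fn register_agent(&mut self, agent_id: &str, hostname: &str) -> u64 {
        if let Some(&sid) = self.agents.get(agent_id) {
            if let Some(s) = self.sessions.get_mut(&sid) {
                if !s.is_online {
                    s.is_online = true;
                    return sid;
                }
            }
        }
        let sid = self.next_id;
        self.next_id += 1;
        self.sessions.insert(
            sid,
            ToySession { hostname: hostname.to_string(), is_online: true },
        );
        // Overwrites any previous entry for this agent_id; older sessions
        // created under other agent_ids stay in `sessions`.
        self.agents.insert(agent_id.to_string(), sid);
        sid
    }

    /// Persistent sessions are retained on disconnect and only flagged offline.
    fn disconnect(&mut self, sid: u64) {
        if let Some(s) = self.sessions.get_mut(&sid) {
            s.is_online = false;
        }
    }
}

/// Simulates `n` connect/disconnect cycles from one machine and returns how
/// many offline sessions are retained for it afterwards.
fn reconnect_storm(n: usize, stable_id: bool) -> usize {
    let host = "DESKTOP-I66IM5Q";
    let mut m = ToyManager::default();
    for i in 0..n {
        let agent_id = if stable_id { "agent-0".to_string() } else { format!("agent-{i}") };
        let sid = m.register_agent(&agent_id, host);
        m.disconnect(sid);
    }
    m.sessions.values().filter(|s| s.hostname == host && !s.is_online).count()
}
```

With a stable agent_id the model converges to a single retained session; with a fresh agent_id per connect it retains one session per reconnect, which matches the ghost rows seen on the console.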
Scope
Included in v1
- Lifecycle reaping (the fix):
  - Periodic background sweep that removes persistent sessions whose agent has been offline (is_online == false) longer than a TTL (default 10 min, configurable), using the existing last_heartbeat_instant.
  - On agent reconnect, supersede prior retained sessions for the same machine (dedupe by hostname/machine identity, not only exact agent_id) so a fresh agent_id cannot strand the old session.
  - On socket drop for a persistent agent, mark the session offline and eligible for the sweep (DB end_session already fires at relay/mod.rs:892; align in-memory state with it).
- Manual removal API (admin-gated, audited):
  - DELETE /api/sessions/:id?purge=true: remove the in-memory session record (SessionManager::remove_session, session/mod.rs:548) and soft-delete the DB row. Distinguish from the existing disconnect_session (which ends a live session); purge works on dead rows.
  - POST /api/sessions/bulk (or DELETE with a body) taking { ids: [...], action: "purge" | "end" } for mass delete / bulk-end.
  - Same stale-removal for the Machines view: extend the existing DELETE /api/machines/:agent_id (main.rs:387) usage with a bulk variant for stale units.
- Dashboard UX:
  - Per-row Remove action on SessionsPage.tsx for non-live rows (alongside the existing End in EndSessionDialog.tsx).
  - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End) on the Sessions view; mirror on MachinesPage.tsx for stale units.
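The TTL sweep from the lifecycle-reaping item could look roughly like the following. It is a minimal sketch under stated assumptions: RetainedSession and its is_persistent and viewer_count fields are hypothetical, the map is keyed by a u64 rather than the real session UUID, and `now` is passed in so the function can be exercised without waiting. Only is_online, last_heartbeat_instant, and the function name reap_stale_persistent come from this spec.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical subset of the fields the sweep needs.
struct RetainedSession {
    is_persistent: bool,
    is_online: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

/// Removes persistent sessions that are offline, have no viewers attached,
/// and whose last heartbeat is older than `ttl`. Returns the removed ids,
/// sorted, so the caller can log or audit them.
fn reap_stale_persistent(
    sessions: &mut HashMap<u64, RetainedSession>,
    now: Instant,
    ttl: Duration,
) -> Vec<u64> {
    let mut stale: Vec<u64> = sessions
        .iter()
        .filter(|(_, s)| {
            s.is_persistent
                && !s.is_online
                && s.viewer_count == 0
                // saturating: a heartbeat "in the future" counts as age zero
                && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
        })
        .map(|(id, _)| *id)
        .collect();
    stale.sort_unstable();
    for id in &stale {
        sessions.remove(id);
    }
    stale
}
```

A periodic task would call this with Instant::now() and the configured TTL. Support sessions are left to the existing disconnect path, which is why the sketch skips anything that is not persistent.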
Explicitly out of scope
- Stabilizing the agent's agent_id across reinstalls (overlaps "Per-machine agent keys", roadmap GuruRMM-Integration); v1 dedupes by machine instead of requiring it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming): the request mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history: v1 soft-deletes (deleted_at) to preserve the audit trail (CLAUDE.md DB conventions).
Architecture
- Relay-server (server/src/session/mod.rs): add reap_stale_persistent(ttl) to SessionManager and spawn a periodic task (e.g. every 60 s) from server startup; extend register_agent to supersede prior same-machine sessions; add a purge-style removal that the API can call for dead rows.
- DB (server/src/db/sessions.rs + migration 008/009): add deleted_at TIMESTAMPTZ to connect_sessions; add purge_session (soft-delete) and a bulk variant; get_recent_sessions/list queries filter deleted_at IS NULL. Idempotent ADD COLUMN IF NOT EXISTS, applied by sqlx::migrate!() on startup, never pre-applied via psql (see the 005→007 lesson).
- API (server/src/main.rs, server/src/api/): new purge + bulk routes, all behind the existing AuthenticatedUser/admin guard; emit audit events rows.
- Dashboard (dashboard/src/features/sessions/, .../machines/): selection state, bulk-action bar, Remove confirmation reusing the EndSessionDialog/DeleteMachineDialog patterns; dashboard/src/api/sessions.ts gains purgeSession/bulkSessions.
- Protobuf: none; this is server/dashboard only.
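The register_agent extension described for the relay-server might be shaped like this. It is a sketch only: MachineSession, register_with_supersede, and the u64 ids are hypothetical, and two choices here are assumptions rather than decisions from this spec: only offline sessions are superseded (so two live agents that happen to share a hostname do not evict each other), and hostnames are compared by exact match.

```rust
use std::collections::HashMap;

/// Hypothetical minimal session record for the dedupe logic.
#[derive(Debug, Clone, PartialEq)]
struct MachineSession {
    agent_id: String,
    hostname: String,
    is_online: bool,
}

/// Registers a reconnecting agent. Before creating the new session, removes
/// every retained offline session for the same hostname, whatever agent_id it
/// was created under. Returns the new session id and the superseded ids.
fn register_with_supersede(
    sessions: &mut HashMap<u64, MachineSession>,
    next_id: &mut u64,
    agent_id: &str,
    hostname: &str,
) -> (u64, Vec<u64>) {
    let mut superseded: Vec<u64> = sessions
        .iter()
        .filter(|(_, s)| s.hostname == hostname && !s.is_online)
        .map(|(id, _)| *id)
        .collect();
    superseded.sort_unstable();
    for id in &superseded {
        sessions.remove(id);
    }

    let id = *next_id;
    *next_id += 1;
    sessions.insert(
        id,
        MachineSession {
            agent_id: agent_id.to_string(),
            hostname: hostname.to_string(),
            is_online: true,
        },
    );
    (id, superseded)
}
```

Returning the superseded ids lets the caller soft-delete the matching DB rows and write audit entries in the same step, so the in-memory and persisted views stay aligned.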
Implementation details
- Files to touch: server/src/session/mod.rs:169,519,548,584 (reuse/reap/remove); server/src/main.rs:376–388,636,661 (routes + handlers); server/src/db/sessions.rs (purge + bulk + deleted_at filtering); server/migrations/ (new migration for deleted_at); dashboard/src/features/sessions/SessionsPage.tsx, EndSessionDialog.tsx, dashboard/src/api/sessions.ts; dashboard/src/features/machines/MachinesPage.tsx.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory + soft-delete DB. A reaped/purged session must vanish from list_sessions() output.
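The purge contract (remove in-memory, soft-delete the row, keep history) can be modelled without a database. Everything below is a hypothetical stand-in: ToyStore, HistoryRow, and visible_history are illustration names, a BTreeMap plays the role of connect_sessions, and deleted_at is Unix seconds where the spec proposes TIMESTAMPTZ.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a connect_sessions row.
struct HistoryRow {
    deleted_at: Option<u64>,
}

/// Toy store pairing the in-memory session map with persisted history.
#[derive(Default)]
struct ToyStore {
    in_memory: HashMap<u64, String>, // session id -> hostname
    history: BTreeMap<u64, HistoryRow>,
}

impl ToyStore {
    /// Records a session in both places, as a connect would.
    fn insert(&mut self, id: u64, hostname: &str) {
        self.in_memory.insert(id, hostname.to_string());
        self.history.insert(id, HistoryRow { deleted_at: None });
    }

    /// Purge = drop the in-memory record and stamp deleted_at once.
    /// Returns true if anything changed, so repeating the call is a no-op.
    fn purge(&mut self, id: u64, now: u64) -> bool {
        let removed = self.in_memory.remove(&id).is_some();
        let mut stamped = false;
        if let Some(row) = self.history.get_mut(&id) {
            if row.deleted_at.is_none() {
                row.deleted_at = Some(now);
                stamped = true;
            }
        }
        removed || stamped
    }

    /// What the console lists: ids still held in memory, sorted.
    fn list_sessions(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self.in_memory.keys().copied().collect();
        ids.sort_unstable();
        ids
    }

    /// History rows a list query would return (deleted_at IS NULL).
    fn visible_history(&self) -> Vec<u64> {
        self.history
            .iter()
            .filter(|(_, row)| row.deleted_at.is_none())
            .map(|(id, _)| *id)
            .collect()
    }
}
```

The row itself is never removed from history, which is the property the audit trail depends on; only its visibility changes.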
Security considerations
- All purge/bulk endpoints require an authenticated admin (AuthenticatedUser, same guard as list_sessions); never expose removal unauthenticated.
- Audit every removal to the events table (who, which session/machine, when, count for bulk): soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against flapping); never reap an is_online or viewer-attached session.
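The admin gate, size cap, and audit record for a bulk call could be checked in one place before any session is touched. This sketch assumes several things the spec leaves open: the cap of 100 is the example value from the open questions, not a decision; BulkAction, BulkError, AuditEvent, and validate_bulk are hypothetical names; and the admin check is reduced to a boolean where the real handler would rely on the AuthenticatedUser guard.

```rust
/// The two bulk actions named in the request body.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BulkAction {
    Purge,
    End,
}

#[derive(Debug, PartialEq)]
enum BulkError {
    NotAdmin,
    Empty,
    TooMany { max: usize, got: usize },
}

/// Hypothetical shape of the audit record: who, what, which ids, how many.
#[derive(Debug, PartialEq)]
struct AuditEvent {
    actor: String,
    action: BulkAction,
    ids: Vec<u64>,
    count: usize,
}

/// Assumed cap; the real value is still an open question.
const MAX_BULK_IDS: usize = 100;

/// Validates a bulk request and, on success, returns the audit event to
/// write, carrying the deduplicated, sorted ids to act on.
fn validate_bulk(
    actor: &str,
    is_admin: bool,
    ids: &[u64],
    action: BulkAction,
) -> Result<AuditEvent, BulkError> {
    if !is_admin {
        return Err(BulkError::NotAdmin);
    }
    // Deduplicate so the audit count matches the rows actually affected.
    let mut unique = ids.to_vec();
    unique.sort_unstable();
    unique.dedup();
    if unique.is_empty() {
        return Err(BulkError::Empty);
    }
    if unique.len() > MAX_BULK_IDS {
        return Err(BulkError::TooMany { max: MAX_BULK_IDS, got: unique.len() });
    }
    let count = unique.len();
    Ok(AuditEvent { actor: actor.to_string(), action, ids: unique, count })
}
```

Rejecting an oversized request outright, rather than truncating it, keeps the operator's selection and the audited count in agreement.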
Testing strategy
- Unit: register_agent with a new agent_id for an existing hostname supersedes the prior session (no duplicate retained). reap_stale_persistent removes offline-past-TTL persistent sessions and spares online/within-TTL ones. purge_session soft-deletes and filters out of list queries.
- Integration: simulate a reconnect storm (M connects, varying agent_id, same hostname) → assert list_sessions() converges to one live session, not M. Purge a dead session via API → gone from list + deleted_at set + audit row written. Bulk purge of K ids removes exactly K.
- Manual: on the live console, reproduce against the Pavon machines, confirm the ghost rows can be multi-selected and removed and do not reappear after the sweep.
Effort estimate & dependencies
- Size: Medium. Reaping + supersede logic is contained to SessionManager; the API and dashboard work follows existing End/Delete patterns. The migration is trivial.
- Depends on: nothing blocking.
- Unblocks: a trustworthy Sessions/Machines view (dead rows no longer masquerade as live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
Open questions
- Reap TTL default: 10 min proposed; confirm. Should it differ for managed vs. support sessions?
- Dedupe key on reconnect: by hostname, or by the per-machine agent key once that lands? v1 proposes hostname; revisit when per-machine keys ship.
- Purge vs. keep history: soft-delete (proposed) keeps connect_sessions history for audit while hiding it from the console; confirm operators don't expect a hard purge.
- Bulk-action cap: what's a sane max N per bulk call (e.g. 100)?