diff --git a/docs/FEATURE_ROADMAP.md b/docs/FEATURE_ROADMAP.md index 47137fa..64144de 100644 --- a/docs/FEATURE_ROADMAP.md +++ b/docs/FEATURE_ROADMAP.md @@ -48,6 +48,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001]( - [x] JWT auth, Argon2id passwords, rate limiting, security headers - [x] Sessions / machines / support-codes / events - [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). ([SPEC-003](specs/SPEC-003-machine-inventory.md)) +- [ ] **Session lifecycle reaping + operator session/unit removal** — P1 — reap orphaned managed sessions (TTL sweep + supersede prior same-machine sessions on reconnect) so dead rows stop masquerading as live, and add admin-gated per-row + multi-select bulk removal of stale sessions/units in the Operator Console. Fixes ghost-session accumulation observed on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md)) - [ ] Programmatic session pre-create + viewer-token (integration contract) — P2 ## Security & Infrastructure diff --git a/docs/specs/SPEC-004-session-lifecycle-and-removal.md b/docs/specs/SPEC-004-session-lifecycle-and-removal.md new file mode 100644 index 0000000..b2ac85c --- /dev/null +++ b/docs/specs/SPEC-004-session-lifecycle-and-removal.md @@ -0,0 +1,158 @@ +# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal + +**Status:** Proposed +**Priority:** P1 +**Requested By:** Mike (2026-05-30) +**Estimated Effort:** Medium + +## Overview + +Stop orphaned managed sessions from accumulating in the Operator Console, and give +operators a first-class way to remove stale sessions/units — per-row **and** in bulk +(multi-select mass delete). Today the Sessions view can show many dead rows that look +live, and the only per-row action is "End", which applies to a *live* session and does +nothing for an already-dead one — so junk just piles up with no way to clear it. +Success = (a) reconnecting/offline persistent agents no longer leave behind retained +ghost sessions, and (b) an admin can select one or many session rows (and stale machine +rows) and remove them from the console. + +**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0 +live**, of which ~10 were duplicate `MANAGED` rows for a single machine +(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all +`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had +just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop +— that reconnect storm is what produced the orphans, which is exactly why this needs +**both** a lifecycle fix and a manual-removal control. + +## Root cause (confirmed in code) + +The Sessions list is served from the **in-memory `SessionManager`**, not the database: +`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()` +(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there: + +1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent` + (`session/mod.rs:169`) reuses an existing session only when + `self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the + agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the + misbehaving client did), the lookup misses and a **brand-new persistent session** is + created each time. The `agents` map holds only one session per agent_id, so prior + sessions become unreferenced yet remain in the `sessions` map. +2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are + removed entirely; persistent/managed sessions are deliberately retained + (`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session + therefore lives in memory indefinitely, displayed alongside genuinely-live ones. + +Net effect: N reconnects with unstable identity → N retained, never-expiring managed +sessions, none of which the UI can clear. + +## Scope + +### Included in v1 + +- **Lifecycle reaping (the fix):** + - Periodic background sweep that removes persistent sessions whose agent has been + offline (`is_online == false`) longer than a TTL (default 10 min, configurable), + using the existing `last_heartbeat_instant`. + - On agent reconnect, **supersede** prior retained sessions for the same machine + (dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh + agent_id cannot strand the old session. + - On socket drop for a persistent agent, mark the session offline *and* eligible for + the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory + state with it). +- **Manual removal API (admin-gated, audited):** + - `DELETE /api/sessions/:id?purge=true` — remove the in-memory session record + (`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB + row. Distinguish from the existing `disconnect_session` (which ends a *live* + session) — purge works on dead rows. + - `POST /api/sessions/bulk` (or `DELETE` with a body) taking `{ ids: [...] , action: + "purge" | "end" }` for mass delete / bulk-end. + - Same stale-removal for the Machines view: extend the existing + `DELETE /api/machines/:agent_id` (`main.rs:387`) usage with a bulk variant for + stale units. +- **Dashboard UX:** + - Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the + existing End in `EndSessionDialog.tsx`). + - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End) + on the Sessions view; mirror on `MachinesPage.tsx` for stale units. + +### Explicitly out of scope + +- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent + keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it. +- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request + mentioned it; deferred unless a concrete edit field is identified. +- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the + audit trail (CLAUDE.md DB conventions). + +## Architecture + +- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to + `SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup; + extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style + removal that the API can call for dead rows. +- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add + `deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and + a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`. + Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup — + never pre-applied via psql (see the 005→007 lesson). +- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind + the existing `AuthenticatedUser`/admin guard; emit audit `events` rows. +- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state, + bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/ + `DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains + `purgeSession` / `bulkSessions`. +- **Protobuf:** none — this is server/dashboard only. + +## Implementation details + +- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove); + `server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs` + (purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for + `deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`, + `EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`; + `dashboard/src/features/machines/MachinesPage.tsx`. +- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory + + soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output. + +## Security considerations + +- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same + guard as `list_sessions`); never expose removal unauthenticated. +- Audit every removal to the `events` table (who, which session/machine, when, count + for bulk) — soft-delete + audit, not silent hard-delete. +- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the + whole fleet by accident or abuse. +- Reaping must not end a session that is merely briefly offline (TTL guards against + flapping); never reap an `is_online` or viewer-attached session. + +## Testing strategy + +- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the + prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL + persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and + filters out of list queries. +- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same + hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead + session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K + ids removes exactly K. +- **Manual:** on the live console, reproduce against the Pavon machines, confirm the + ghost rows can be multi-selected and removed and do not reappear after the sweep. + +## Effort estimate & dependencies + +- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API + and dashboard work follows existing End/Delete patterns. The migration is trivial. +- **Depends on:** nothing blocking. +- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as + live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces. + +## Open questions + +1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs. + support sessions? +2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once + that lands? v1 proposes hostname; revisit when per-machine keys ship. +3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history + for audit while hiding it from the console; confirm operators don't expect a hard + purge. +4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?