spec: add SPEC-004 session lifecycle reaping + operator removal
Stop orphaned managed sessions accumulating in the Operator Console and let admins remove stale sessions/units individually and in bulk. Root cause confirmed in code: the Sessions list is the in-memory SessionManager; register_agent reconnect-reuse keys on a stable agent_id (session/mod.rs:169) and persistent sessions are never reaped on disconnect (session/mod.rs:519-542), so an agent reconnecting with a fresh agent_id leaves a new retained ghost session each time (observed: 15 sessions/0 live, ~10 orphans for one machine after a GuruConnect-client reconnect storm). Adds TTL sweep + same-machine supersede, admin-gated audited purge + bulk endpoints, and dashboard multi-select removal. Requested by Mike 2026-05-30. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -48,6 +48,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
|
||||
- [x] JWT auth, Argon2id passwords, rate limiting, security headers
|
||||
- [x] Sessions / machines / support-codes / events
|
||||
- [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). ([SPEC-003](specs/SPEC-003-machine-inventory.md))
|
||||
- [ ] **Session lifecycle reaping + operator session/unit removal** — P1 — reap orphaned managed sessions (TTL sweep + supersede prior same-machine sessions on reconnect) so dead rows stop masquerading as live, and add admin-gated per-row + multi-select bulk removal of stale sessions/units in the Operator Console. Fixes ghost-session accumulation observed on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
|
||||
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
|
||||
|
||||
## Security & Infrastructure
|
||||
|
||||
158
docs/specs/SPEC-004-session-lifecycle-and-removal.md
Normal file
158
docs/specs/SPEC-004-session-lifecycle-and-removal.md
Normal file
@@ -0,0 +1,158 @@
|
||||
# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal
|
||||
|
||||
**Status:** Proposed
|
||||
**Priority:** P1
|
||||
**Requested By:** Mike (2026-05-30)
|
||||
**Estimated Effort:** Medium
|
||||
|
||||
## Overview
|
||||
|
||||
Stop orphaned managed sessions from accumulating in the Operator Console, and give
|
||||
operators a first-class way to remove stale sessions/units — per-row **and** in bulk
|
||||
(multi-select mass delete). Today the Sessions view can show many dead rows that look
|
||||
live, and the only per-row action is "End", which applies to a *live* session and does
|
||||
nothing for an already-dead one — so junk just piles up with no way to clear it.
|
||||
Success = (a) reconnecting/offline persistent agents no longer leave behind retained
|
||||
ghost sessions, and (b) an admin can select one or many session rows (and stale machine
|
||||
rows) and remove them from the console.
|
||||
|
||||
**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
|
||||
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
|
||||
(`DESKTOP-I66IM5Q` / Pavon-Raiders), each a distinct session UUID, all
|
||||
`NOT REQUIRED` consent, no viewers, no duration, all "37 minutes ago". That machine had
|
||||
just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in a loop
|
||||
— that reconnect storm is what produced the orphans, which is exactly why this needs
|
||||
**both** a lifecycle fix and a manual-removal control.
|
||||
|
||||
## Root cause (confirmed in code)
|
||||
|
||||
The Sessions list is served from the **in-memory `SessionManager`**, not the database:
|
||||
`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
|
||||
(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there:
|
||||
|
||||
1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent`
|
||||
(`session/mod.rs:169`) reuses an existing session only when
|
||||
`self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the
|
||||
agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the
|
||||
misbehaving client did), the lookup misses and a **brand-new persistent session** is
|
||||
created each time. The `agents` map holds only one session per agent_id, so prior
|
||||
sessions become unreferenced yet remain in the `sessions` map.
|
||||
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
|
||||
removed entirely; persistent/managed sessions are deliberately retained
|
||||
(`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session
|
||||
therefore lives in memory indefinitely, displayed alongside genuinely-live ones.
|
||||
|
||||
Net effect: N reconnects with unstable identity → N retained, never-expiring managed
|
||||
sessions, none of which the UI can clear.
|
||||
|
||||
## Scope
|
||||
|
||||
### Included in v1
|
||||
|
||||
- **Lifecycle reaping (the fix):**
|
||||
- Periodic background sweep that removes persistent sessions whose agent has been
|
||||
offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
|
||||
using the existing `last_heartbeat_instant`.
|
||||
- On agent reconnect, **supersede** prior retained sessions for the same machine
|
||||
(dedupe by `hostname`/machine identity, not only exact `agent_id`) so a fresh
|
||||
agent_id cannot strand the old session.
|
||||
- On socket drop for a persistent agent, mark the session offline *and* eligible for
|
||||
the sweep (DB `end_session` already fires at `relay/mod.rs:892`; align in-memory
|
||||
state with it).
|
||||
- **Manual removal API (admin-gated, audited):**
|
||||
- `DELETE /api/sessions/:id?purge=true` — remove the in-memory session record
|
||||
(`SessionManager::remove_session`, `session/mod.rs:548`) **and** soft-delete the DB
|
||||
row. Distinguish from the existing `disconnect_session` (which ends a *live*
|
||||
session) — purge works on dead rows.
|
||||
- `POST /api/sessions/bulk` (or `DELETE` with a body) taking `{ ids: [...] , action:
|
||||
"purge" | "end" }` for mass delete / bulk-end.
|
||||
- Same stale-removal for the Machines view: extend the existing
|
||||
`DELETE /api/machines/:agent_id` (`main.rs:387`) usage with a bulk variant for
|
||||
stale units.
|
||||
- **Dashboard UX:**
|
||||
- Per-row **Remove** action on `SessionsPage.tsx` for non-live rows (alongside the
|
||||
existing End in `EndSessionDialog.tsx`).
|
||||
- Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End)
|
||||
on the Sessions view; mirror on `MachinesPage.tsx` for stale units.
|
||||
|
||||
### Explicitly out of scope
|
||||
|
||||
- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent
|
||||
keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
|
||||
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
|
||||
mentioned it; deferred unless a concrete edit field is identified.
|
||||
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
|
||||
audit trail (CLAUDE.md DB conventions).
|
||||
|
||||
## Architecture
|
||||
|
||||
- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to
|
||||
`SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup;
|
||||
extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style
|
||||
removal that the API can call for dead rows.
|
||||
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
|
||||
`deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
|
||||
a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
|
||||
Idempotent `ADD COLUMN IF NOT EXISTS`, applied by `sqlx::migrate!()` on startup —
|
||||
never pre-applied via psql (see the 005→007 lesson).
|
||||
- **API (`server/src/main.rs`, `server/src/api/`):** new purge + bulk routes, all behind
|
||||
the existing `AuthenticatedUser`/admin guard; emit audit `events` rows.
|
||||
- **Dashboard (`dashboard/src/features/sessions/`, `.../machines/`):** selection state,
|
||||
bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
|
||||
`DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
|
||||
`purgeSession` / `bulkSessions`.
|
||||
- **Protobuf:** none — this is server/dashboard only.
|
||||
|
||||
## Implementation details
|
||||
|
||||
- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove);
|
||||
`server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs`
|
||||
(purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
|
||||
`deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
|
||||
`EndSessionDialog.tsx`, `dashboard/src/api/sessions.ts`;
|
||||
`dashboard/src/features/machines/MachinesPage.tsx`.
|
||||
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory +
|
||||
soft-delete DB. A reaped/purged session must vanish from `list_sessions()` output.
|
||||
|
||||
## Security considerations
|
||||
|
||||
- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
|
||||
guard as `list_sessions`); never expose removal unauthenticated.
|
||||
- Audit every removal to the `events` table (who, which session/machine, when, count
|
||||
for bulk) — soft-delete + audit, not silent hard-delete.
|
||||
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the
|
||||
whole fleet by accident or abuse.
|
||||
- Reaping must not end a session that is merely briefly offline (TTL guards against
|
||||
flapping); never reap an `is_online` or viewer-attached session.
|
||||
|
||||
## Testing strategy
|
||||
|
||||
- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the
|
||||
prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL
|
||||
persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and
|
||||
filters out of list queries.
|
||||
- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same
|
||||
hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead
|
||||
session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K
|
||||
ids removes exactly K.
|
||||
- **Manual:** on the live console, reproduce against the Pavon machines, confirm the
|
||||
ghost rows can be multi-selected and removed and do not reappear after the sweep.
|
||||
|
||||
## Effort estimate & dependencies
|
||||
|
||||
- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API
|
||||
and dashboard work follows existing End/Delete patterns. The migration is trivial.
|
||||
- **Depends on:** nothing blocking.
|
||||
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
|
||||
live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
|
||||
support sessions?
|
||||
2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once
|
||||
that lands? v1 proposes hostname; revisit when per-machine keys ship.
|
||||
3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
|
||||
for audit while hiding it from the console; confirm operators don't expect a hard
|
||||
purge.
|
||||
4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?
|
||||
Reference in New Issue
Block a user