guru-connect/docs/specs/SPEC-004-session-lifecycle-and-removal.md
Mike Swanson f8bd4d1dab spec: SPEC-004 add stable machine-derived identity as the primary fix
Address duplicate registration at the source, not just via cleanup. Root
cause now grounded: agent_id is a random UUID (config.rs:90 generate_agent_id)
persisted only in the config file, so a portable/misconfigured execution
(the Pavon desktop launcher) regenerates a fresh id each launch, defeating
both the DB upsert (ON CONFLICT agent_id) and session-reuse dedupe. Add a
deterministic machine_uid (Windows MachineGuid-based, recomputable) keyed by
registration; reaping/supersede become defense-in-depth. Security: machine_uid
is identity not authorization and must be bound to the per-machine agent key
to prevent session/record hijack. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 16:11:38 -07:00


SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal

Status: Proposed
Priority: P1
Requested By: Mike (2026-05-30)
Estimated Effort: Medium

Overview

Stop orphaned managed sessions from accumulating in the Operator Console, and give operators a first-class way to remove stale sessions/units — per-row and in bulk (multi-select mass delete). Today the Sessions view can show many dead rows that look live, and the only per-row action is "End", which applies to a live session and does nothing for an already-dead one — so junk just piles up with no way to clear it. The durable fix is at registration: the same machine must resolve to one stable identity so a repeated execution cannot mint duplicates in the first place; reaping and manual removal then become defense-in-depth and cleanup, not the primary mechanism. Success = (a) the same machine, run repeatedly (even from a portable/misconfigured copy), registers to one record/session — no duplicates; (b) reconnecting/offline persistent agents no longer leave behind retained ghost sessions; and (c) an admin can select one or many session rows (and stale machine rows) and remove them from the console.

Observed (live console, 2026-05-30): the Sessions view listed 15 sessions, 0 live, of which ~10 were duplicate MANAGED rows for a single machine (DESKTOP-I66IM5Q / Pavon-Raiders), each a distinct session UUID, all NOT REQUIRED consent, no viewers, no duration, all "37 minutes ago". That machine had just been cleaned of a misbehaving GuruConnect client that was reconnecting in a loop — that reconnect storm is what produced the orphans, which is exactly why this needs both a lifecycle fix and a manual-removal control.

Root cause (confirmed in code)

The Sessions list is served from the in-memory SessionManager, not the database: GET /api/sessions → list_sessions (main.rs:636) → state.sessions.list_sessions() (session/mod.rs:584). Three compounding defects let ghosts accumulate there:

  1. Machine identity is a config-file random UUID, not machine-derived. The agent's agent_id is a random UUID minted by generate_agent_id() (agent/src/config.rs:90) on first run and persisted only in the agent config file (config.rs:331), or taken from GURUCONNECT_AGENT_ID (config.rs:366). A portable or misconfigured execution that cannot locate/write that config — e.g. the Pavon desktop launcher guruconnect-pavon-raidersreef.exe run repeatedly from a user Desktop — regenerates a fresh agent_id every launch. Because identity is not derived from the machine, the same physical box presents as N different agents. The DB upsert (upsert_machine, ON CONFLICT (agent_id)) and the session-reuse map both key on this id, so an unstable id defeats both dedupe layers at the source.
  2. Reconnect-reuse is keyed on that agent_id. register_agent (session/mod.rs:169) reuses an existing session only when self.agents.get(&agent_id) resolves to an is_online == false session. With a new agent_id per launch (defect 1), the lookup misses and a brand-new persistent session is created each time. The agents map holds only one session per agent_id, so prior sessions become unreferenced yet remain in the sessions map.
  3. Persistent sessions are never reaped. On disconnect, only support sessions are removed entirely; persistent/managed sessions are deliberately retained (session/mod.rs:519-542) and there is no TTL sweep. An offline managed session therefore lives in memory indefinitely, displayed alongside genuinely-live ones.

Net effect: N reconnects with unstable identity → N retained, never-expiring managed sessions, none of which the UI can clear.
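
The accumulation described above can be reproduced with a toy model. This is a minimal sketch, not the real SessionManager: the ToySessionManager type, its fields, and retained_after_storm are hypothetical names chosen for illustration, and only the reuse rule (exact agent_id match on an offline session) and the retain-on-disconnect behavior are taken from the description.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a retained managed session (hypothetical shape).
struct Session {
    is_online: bool,
}

/// Toy model of the two maps described above; not the real SessionManager.
#[derive(Default)]
struct ToySessionManager {
    next_session_id: u64,
    agents: HashMap<String, u64>,    // agent_id -> its current session id
    sessions: HashMap<u64, Session>, // every retained session
}

impl ToySessionManager {
    /// Reuse a session only if this exact agent_id maps to an offline one.
    fn register_agent(&mut self, agent_id: &str) -> u64 {
        if let Some(&sid) = self.agents.get(agent_id) {
            if let Some(session) = self.sessions.get_mut(&sid) {
                if !session.is_online {
                    session.is_online = true;
                    return sid;
                }
            }
        }
        // Lookup missed (or the session was still online): mint a new session.
        let sid = self.next_session_id;
        self.next_session_id += 1;
        self.sessions.insert(sid, Session { is_online: true });
        self.agents.insert(agent_id.to_string(), sid);
        sid
    }

    /// Persistent sessions are marked offline but never removed.
    fn disconnect(&mut self, session_id: u64) {
        if let Some(session) = self.sessions.get_mut(&session_id) {
            session.is_online = false;
        }
    }
}

/// Connect and drop once per id; return how many sessions stay in memory.
fn retained_after_storm(agent_ids: &[&str]) -> usize {
    let mut mgr = ToySessionManager::default();
    for id in agent_ids {
        let sid = mgr.register_agent(id);
        mgr.disconnect(sid);
    }
    mgr.sessions.len()
}
```

In this model, retained_after_storm(&["a", "b", "c"]) leaves three sessions behind, while retained_after_storm(&["a", "a", "a"]) leaves one, which is the difference a stable identity is meant to restore.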

Scope

Included in v1

  • Stable, machine-derived identity (the primary fix):
    • The agent computes a deterministic machine_uid from durable machine identifiers — primary source the Windows MachineGuid (HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid), optionally folded with a stable hardware id (board/BIOS serial) — hashed to a stable string. It is recomputable: a lost/absent config self-heals to the same id rather than minting a new random one. Persist a cached copy, but never depend on the config file for correctness.
    • Registration keys on machine_uid: upsert_machine's ON CONFLICT target and the in-memory agents/session-reuse map both use it, so the same box converges to one machine record and one managed session no matter how many times it executes.
    • Carry machine_uid in the agent connect handshake (transport/websocket.rs:40 query params) / AgentStatus; keep the legacy random agent_id only as a migration fallback.
  • Lifecycle reaping (defense-in-depth):
    • Periodic background sweep that removes persistent sessions whose agent has been offline (is_online == false) longer than a TTL (default 10 min, configurable), using the existing last_heartbeat_instant.
    • On agent reconnect, supersede prior retained sessions for the same machine (dedupe by hostname/machine identity, not only exact agent_id) so a fresh agent_id cannot strand the old session.
    • On socket drop for a persistent agent, mark the session offline and eligible for the sweep (DB end_session already fires at relay/mod.rs:892; align in-memory state with it).
  • Manual removal API (admin-gated, audited):
    • DELETE /api/sessions/:id?purge=true — remove the in-memory session record (SessionManager::remove_session, session/mod.rs:548) and soft-delete the DB row. Distinguish from the existing disconnect_session (which ends a live session) — purge works on dead rows.
    • POST /api/sessions/bulk (or DELETE with a body) taking { ids: [...] , action: "purge" | "end" } for mass delete / bulk-end.
    • The same stale-removal applies to the Machines view: add a bulk variant alongside the existing DELETE /api/machines/:agent_id (main.rs:387) for stale units.
  • Dashboard UX:
    • Per-row Remove action on SessionsPage.tsx for non-live rows (alongside the existing End in EndSessionDialog.tsx).
    • Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End) on the Sessions view; mirror on MachinesPage.tsx for stale units.
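
The machine_uid derivation listed under "Stable, machine-derived identity" could look roughly like the sketch below. Several things here are assumptions rather than decisions from this spec: the function names, the "guruconnect-machine-uid-v1" domain prefix, the normalization rules, and above all the hash. FNV-1a is used only because it needs no external crate; it is a placeholder, and a real implementation would more likely use a cryptographic hash such as SHA-256. Reading MachineGuid from the registry is left out, so the value is passed in as a string.

```rust
/// FNV-1a 64-bit. Dependency-free placeholder, NOT a cryptographic hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Derive a stable id from MachineGuid and an optional hardware serial.
/// Same inputs always yield the same output, so a lost config file
/// recomputes the same id instead of minting a new random one.
fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> String {
    // Normalize so casing or stray whitespace cannot change the result.
    let guid = machine_guid.trim().to_ascii_lowercase();
    let serial = hardware_serial
        .map(|s| s.trim().to_ascii_lowercase())
        .unwrap_or_default();
    // Hypothetical versioned prefix, so the recipe can change later
    // without colliding with ids produced by an older recipe.
    let material = format!("guruconnect-machine-uid-v1|{guid}|{serial}");
    format!("{:016x}", fnv1a64(material.as_bytes()))
}
```

Whether the serial is folded in at all is still open (see Open questions); passing None models the "MachineGuid alone" recipe.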

Explicitly out of scope

  • Surviving a full machine reimage/clone with the same identity. MachineGuid regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged box legitimately becomes a new machine_uid (and a clone collision is caught by the auth binding below). Cross-reimage identity continuity is out of scope for v1.
  • Replacing the shared AGENT_API_KEY with per-machine agent keys — tracked separately (roadmap GuruRMM-Integration). SPEC-004 assumes that binding for its threat model (see Security) and degrades safely without it, but does not implement it.
  • "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request mentioned it; deferred unless a concrete edit field is identified.
  • Hard-deleting DB session history — v1 soft-deletes (deleted_at) to preserve the audit trail (CLAUDE.md DB conventions).

Architecture

  • Agent (agent/src/): new identity module computes machine_uid deterministically (Windows MachineGuid primary; non-Windows fallback to a stable persisted UUID). Replace/augment generate_agent_id() (config.rs:90) so the effective id is the machine-derived value, with the config-file value used only as a cache. Send machine_uid in the connect query string (transport/websocket.rs:40) and on AgentStatus.
  • Relay-server (server/src/session/mod.rs): key register_agent and the agents map on machine_uid so the same machine reuses one session; add reap_stale_persistent(ttl) to SessionManager + a periodic task (e.g. every 60 s) from server startup; supersede any prior same-machine sessions on reconnect; add a purge-style removal the API can call for dead rows.
  • DB (server/src/db/sessions.rs + migration 008/009): add deleted_at TIMESTAMPTZ to connect_sessions; add purge_session (soft-delete) and a bulk variant; get_recent_sessions/list queries filter deleted_at IS NULL. Idempotent ADD COLUMN IF NOT EXISTS, applied by sqlx::migrate!() on startup — never pre-applied via psql (see the 005→007 lesson).
  • API (server/src/main.rs, server/src/api/): new purge + bulk routes, all behind the existing AuthenticatedUser/admin guard; each call writes audit rows to the events table.
  • Dashboard (dashboard/src/features/sessions/, .../machines/): selection state, bulk-action bar, Remove confirmation reusing the EndSessionDialog/DeleteMachineDialog patterns; dashboard/src/api/sessions.ts gains purgeSession / bulkSessions.
  • Protobuf: add machine_uid to the agent identity carried on AgentStatus (the connect handshake passes it as a query param; mirroring it on AgentStatus lets the server reconcile mid-session). No other protobuf changes are needed; the rest of the work is server/dashboard only.
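
The reap_stale_persistent(ttl) sweep proposed for the relay server might be shaped like this. It is a sketch under assumptions: the ManagedSession struct and its is_persistent and viewer_count fields are hypothetical stand-ins for the real session type (only is_online and last_heartbeat_instant are named in this spec), and the current time is passed in so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical, reduced view of a session as the reaper needs it.
struct ManagedSession {
    is_online: bool,
    is_persistent: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

/// Remove persistent sessions that have been offline longer than `ttl`.
/// Online or viewer-attached sessions are never reaped. Returns removed ids.
fn reap_stale_persistent(
    sessions: &mut HashMap<u64, ManagedSession>,
    now: Instant,
    ttl: Duration,
) -> Vec<u64> {
    let mut stale: Vec<u64> = sessions
        .iter()
        .filter(|(_, s)| {
            s.is_persistent
                && !s.is_online
                && s.viewer_count == 0
                && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
        })
        .map(|(id, _)| *id)
        .collect();
    stale.sort_unstable();
    for id in &stale {
        sessions.remove(id);
    }
    stale
}

/// Small scenario: four sessions, an 11-minute gap, a 10-minute TTL.
fn demo_reap() -> (Vec<u64>, usize) {
    let base = Instant::now();
    let mk = |online: bool, viewers: usize| ManagedSession {
        is_online: online,
        is_persistent: true,
        viewer_count: viewers,
        last_heartbeat_instant: base,
    };
    let mut sessions = HashMap::new();
    sessions.insert(1, mk(true, 0)); // online: spared
    sessions.insert(2, mk(false, 0)); // offline past TTL: reaped
    sessions.insert(3, mk(false, 1)); // viewer attached: spared
    sessions.insert(
        4,
        ManagedSession {
            // Heartbeat 5 minutes after base, so only 6 minutes stale.
            last_heartbeat_instant: base + Duration::from_secs(5 * 60),
            ..mk(false, 0)
        },
    );
    let now = base + Duration::from_secs(11 * 60);
    let reaped = reap_stale_persistent(&mut sessions, now, Duration::from_secs(10 * 60));
    (reaped, sessions.len())
}
```

A periodic task (the spec suggests roughly every 60 s) would call the function with Instant::now() and the configured TTL; taking the clock as a parameter keeps that scheduling concern separate from the eligibility rule.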

Implementation details

  • Files to touch: agent/src/identity/ (new: machine_uid derivation), agent/src/config.rs:90,366 (effective id = machine-derived), agent/src/transport/websocket.rs:40 (send machine_uid); server/src/session/mod.rs:169,519,548,584 (key on machine_uid; reuse/reap/remove); server/src/relay/mod.rs:584,591 (registration path); server/src/main.rs:376-388,636,661 (routes + handlers); server/src/db/sessions.rs (purge + bulk + deleted_at filtering); server/migrations/ (new migration for deleted_at); dashboard/src/features/sessions/SessionsPage.tsx, EndSessionDialog.tsx, dashboard/src/api/sessions.ts; dashboard/src/features/machines/MachinesPage.tsx.
  • Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory + soft-delete DB. A reaped/purged session must vanish from list_sessions() output.
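
The purge semantics above (remove in-memory, soft-delete in the DB) can be illustrated with an in-memory stand-in for both stores. Everything in this sketch is hypothetical scaffolding: the Store and SessionRow types, the use of unix seconds for deleted_at, and the boolean return value are assumptions, and a real purge_session would issue an UPDATE through sqlx instead of mutating a map.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a connect_sessions row.
struct SessionRow {
    deleted_at: Option<u64>, // unix seconds; None means still visible
}

/// In-memory session list plus a stand-in for the DB table.
struct Store {
    live: HashMap<u64, String>,     // session id -> hostname
    rows: HashMap<u64, SessionRow>, // session id -> history row
}

impl Store {
    /// Purge = drop the in-memory record and soft-delete the row.
    /// Returns false when there was nothing left to do (idempotent).
    fn purge_session(&mut self, id: u64, now_unix: u64) -> bool {
        let removed_live = self.live.remove(&id).is_some();
        let mut soft_deleted = false;
        if let Some(row) = self.rows.get_mut(&id) {
            if row.deleted_at.is_none() {
                row.deleted_at = Some(now_unix);
                soft_deleted = true;
            }
        }
        removed_live || soft_deleted
    }

    /// Mirrors the `deleted_at IS NULL` filter on list queries.
    fn visible_row_ids(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .rows
            .iter()
            .filter(|(_, row)| row.deleted_at.is_none())
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}

/// Purge session 1 twice and report what each layer still holds.
fn demo_purge() -> (bool, bool, usize, Vec<u64>, usize) {
    let mut store = Store { live: HashMap::new(), rows: HashMap::new() };
    for id in [1u64, 2] {
        store.live.insert(id, format!("host-{id}"));
        store.rows.insert(id, SessionRow { deleted_at: None });
    }
    let first = store.purge_session(1, 1_000);
    let second = store.purge_session(1, 2_000);
    (first, second, store.live.len(), store.visible_row_ids(), store.rows.len())
}
```

The row count stays at two after the purge while the visible list shrinks to one, which is the audit-preserving behavior the soft-delete is meant to give.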

Security considerations

  • Identity is not authorization. A client-asserted machine_uid is self-reported and therefore spoofable — on its own, agent A could claim agent B's machine_uid to bind to (and hijack) B's session and machine record. The machine_uid must be bound to the agent's authenticated credential: the server accepts a given machine_uid only from a connection authenticated by that machine's own agent key (or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key pair). This is why per-machine agent keys (roadmap) are the natural companion; until they ship, the shared AGENT_API_KEY means machine_uid is a correctness improvement (dedupe) but not yet a trust boundary — call this out so it isn't mistaken for one. A clone collision (two boxes, same MachineGuid) surfaces here as two agents claiming one uid and is resolved by the key binding, not by the uid alone.
  • All purge/bulk endpoints require an authenticated admin (AuthenticatedUser, same guard as list_sessions); never expose removal unauthenticated.
  • Audit every removal to the events table (who, which session/machine, when, count for bulk) — soft-delete + audit, not silent hard-delete.
  • Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the whole fleet by accident or abuse.
  • Reaping must not end a session that is merely briefly offline (TTL guards against flapping); never reap an is_online or viewer-attached session.
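
The trust-on-first-use binding mentioned in the first bullet could be sketched as follows. The UidKeyBindings type, the BindOutcome enum, and the idea of comparing key fingerprints as strings are all assumptions made for illustration; the binding model itself is still an open question, and persistence of the pinned pairs is omitted.

```rust
use std::collections::HashMap;

/// Result of presenting a (machine_uid, key) pair at registration.
#[derive(Debug, PartialEq, Eq)]
enum BindOutcome {
    PinnedFirstUse, // uid never seen: pin it to this key
    Accepted,       // uid seen before with the same key
    Rejected,       // uid already pinned to a different key
}

/// Hypothetical registry of machine_uid -> agent key fingerprint.
#[derive(Default)]
struct UidKeyBindings {
    pinned: HashMap<String, String>,
}

impl UidKeyBindings {
    /// Accept a uid only from the key it was first seen with.
    fn check(&mut self, machine_uid: &str, key_fingerprint: &str) -> BindOutcome {
        if let Some(existing) = self.pinned.get(machine_uid) {
            return if existing.as_str() == key_fingerprint {
                BindOutcome::Accepted
            } else {
                BindOutcome::Rejected
            };
        }
        self.pinned
            .insert(machine_uid.to_string(), key_fingerprint.to_string());
        BindOutcome::PinnedFirstUse
    }
}

/// First use pins, a repeat is accepted, a different key is rejected.
fn demo_binding() -> Vec<BindOutcome> {
    let mut bindings = UidKeyBindings::default();
    vec![
        bindings.check("uid-x", "key-a"),
        bindings.check("uid-x", "key-a"),
        bindings.check("uid-x", "key-b"),
    ]
}
```

With the shared AGENT_API_KEY every agent would present the same fingerprint, so this check would accept every claim; that is the reason machine_uid is only a dedupe improvement, not a trust boundary, until per-machine keys exist.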

Testing strategy

  • Unit: machine_uid derivation is deterministic — same machine inputs yield the same uid across runs, and an absent config recomputes the same value (no fresh random id). register_agent for the same machine_uid reuses/supersedes the prior session (no duplicate retained) even when the legacy agent_id differs. reap_stale_persistent removes offline-past-TTL persistent sessions and spares online/within-TTL ones. purge_session soft-deletes and filters out of list queries.
  • Integration: simulate a reconnect storm (M connects, varying agent_id but the same machine_uid, as the Pavon launcher did) → assert list_sessions() converges to one live session and connect_machines holds one row, not M. A spoof attempt (uid X presented on a connection not authenticated for X) is rejected/not bound. Purge a dead session via API → gone from list + deleted_at set + audit row written. Bulk purge of K ids removes exactly K.
  • Manual: on the live console, reproduce against the Pavon machines, confirm the ghost rows can be multi-selected and removed and do not reappear after the sweep.
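
The reconnect-storm integration case can be prototyped against a toy registry before it is written against the real server. This is a sketch only: Registry, its two maps, and simulate_storm are hypothetical names, and the upsert is modeled as a plain map insert rather than the actual upsert_machine query.

```rust
use std::collections::HashMap;

/// Toy registry keyed on machine_uid instead of the legacy agent_id.
#[derive(Default)]
struct Registry {
    next_session_id: u64,
    session_by_machine: HashMap<String, u64>, // machine_uid -> the one managed session
    machines: HashMap<String, String>,        // machine_uid -> latest legacy agent_id
}

impl Registry {
    /// Upsert the machine row, then reuse the machine's session if one exists.
    fn register(&mut self, machine_uid: &str, legacy_agent_id: &str) -> u64 {
        // Stand-in for the DB upsert: one row per machine_uid.
        self.machines
            .insert(machine_uid.to_string(), legacy_agent_id.to_string());
        if let Some(&sid) = self.session_by_machine.get(machine_uid) {
            return sid;
        }
        let sid = self.next_session_id;
        self.next_session_id += 1;
        self.session_by_machine.insert(machine_uid.to_string(), sid);
        sid
    }
}

/// M connects with a fresh legacy agent_id each time but one machine_uid.
/// Returns (session count, machine row count).
fn simulate_storm(connects: usize) -> (usize, usize) {
    let mut registry = Registry::default();
    for i in 0..connects {
        let legacy_agent_id = format!("random-agent-id-{i}");
        registry.register("uid-pavon-desktop", &legacy_agent_id);
    }
    (registry.session_by_machine.len(), registry.machines.len())
}
```

For any M of at least one, the expected result is (1, 1): one session and one machine row, regardless of how many distinct legacy ids were presented.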

Effort estimate & dependencies

  • Size: Medium. Reaping + supersede + purge/bulk + dashboard follow existing patterns; the migration is trivial. The added agent-side machine_uid derivation and threading it through the handshake/registration is the main new surface (bumps this toward the upper end of Medium).
  • Depends on: nothing blocking. Pairs with per-machine agent keys (roadmap) for the full trust boundary on machine_uid — see Security; SPEC-004 degrades safely without them.
  • Unblocks: a trustworthy Sessions/Machines view (dead rows no longer masquerade as live; one machine = one record/session), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.

Open questions

  1. Reap TTL default — 10 min proposed; confirm. Should it differ for managed vs. support sessions?
  2. machine_uid source mix: MachineGuid alone, or folded with board/BIOS serial? MachineGuid is stable and present everywhere but regenerates on sysprep and is cloneable; adding a hardware serial reduces clone collisions but churns on hardware swaps. Pick the recipe (proposed: MachineGuid primary, hashed).
  3. uid↔key binding model — trust-on-first-use pinning of machine_uid to the agent key, vs. requiring per-machine keys before honoring a uid. What's the interim policy while the shared AGENT_API_KEY is still in use?
  4. Migration of existing rows — legacy random-agent_id machine/session rows: let them age out via the reaper + manual purge, or run a one-time reconcile that maps known hosts to their new machine_uid? (Proposed: age-out + purge; no risky backfill.)
  5. Purge vs. keep history — soft-delete (proposed) keeps connect_sessions history for audit while hiding it from the console; confirm operators don't expect a hard purge.
  6. Bulk-action cap — what's a sane max N per bulk call (e.g. 100)?
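
Whatever cap question 6 settles on, the bulk endpoint's request validation might look like the sketch below. The cap of 100 is only the example value floated in the question, and MAX_BULK_IDS, BulkAction, BulkError, and validate_bulk are hypothetical names; the two accepted action strings come from the request shape proposed in Scope.

```rust
use std::collections::HashSet;

/// Placeholder cap; the real value is an open question in this spec.
const MAX_BULK_IDS: usize = 100;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BulkAction {
    Purge,
    End,
}

#[derive(Debug, PartialEq, Eq)]
enum BulkError {
    Empty,
    TooMany { got: usize, max: usize },
    UnknownAction,
}

/// Validate a bulk request: known action, de-duplicated ids, size within cap.
fn validate_bulk(ids: &[String], action: &str) -> Result<(Vec<String>, BulkAction), BulkError> {
    let action = match action {
        "purge" => BulkAction::Purge,
        "end" => BulkAction::End,
        _ => return Err(BulkError::UnknownAction),
    };
    // De-duplicate while preserving the caller's order.
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for id in ids {
        if seen.insert(id.as_str()) {
            unique.push(id.clone());
        }
    }
    if unique.is_empty() {
        return Err(BulkError::Empty);
    }
    if unique.len() > MAX_BULK_IDS {
        return Err(BulkError::TooMany { got: unique.len(), max: MAX_BULK_IDS });
    }
    Ok((unique, action))
}

/// Build `n` distinct ids plus one duplicate and return the accepted count.
fn demo_bulk(n: usize, action: &str) -> Result<usize, BulkError> {
    let mut ids: Vec<String> = (0..n).map(|i| format!("session-{i}")).collect();
    if let Some(first) = ids.first().cloned() {
        ids.push(first); // duplicate that validation should drop
    }
    validate_bulk(&ids, action).map(|(unique, _)| unique.len())
}
```

Rejecting an oversized request outright, rather than silently truncating it, keeps the audit count for a bulk call equal to what the operator actually selected.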