Address duplicate registration at the source, not just via cleanup. Root cause now grounded: agent_id is a random UUID (config.rs:90 generate_agent_id) persisted only in the config file, so a portable/misconfigured execution (the Pavon desktop launcher) regenerates a fresh id each launch, defeating both the DB upsert (ON CONFLICT agent_id) and session-reuse dedupe. Add a deterministic machine_uid (Windows MachineGuid-based, recomputable) keyed by registration; reaping/supersede become defense-in-depth. Security: machine_uid is identity not authorization and must be bound to the per-machine agent key to prevent session/record hijack. Requested by Mike 2026-05-30. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal
Status: Proposed
Priority: P1
Requested By: Mike (2026-05-30)
Estimated Effort: Medium
Overview
Stop orphaned managed sessions from accumulating in the Operator Console, and give operators a first-class way to remove stale sessions/units — per-row and in bulk (multi-select mass delete). Today the Sessions view can show many dead rows that look live, and the only per-row action is "End", which applies to a live session and does nothing for an already-dead one — so junk just piles up with no way to clear it. The durable fix is at registration: the same machine must resolve to one stable identity so a repeated execution cannot mint duplicates in the first place; reaping and manual removal then become defense-in-depth and cleanup, not the primary mechanism. Success = (a) the same machine, run repeatedly (even from a portable/misconfigured copy), registers to one record/session — no duplicates; (b) reconnecting/offline persistent agents no longer leave behind retained ghost sessions; and (c) an admin can select one or many session rows (and stale machine rows) and remove them from the console.
Observed (live console, 2026-05-30): the Sessions view listed 15 sessions, 0
live, of which ~10 were duplicate MANAGED rows for a single machine
(DESKTOP-I66IM5Q / Pavon-Raiders), each a distinct session UUID, all
NOT REQUIRED consent, no viewers, no duration, all "37 minutes ago". That machine had
just been cleaned of a misbehaving GuruConnect client that was reconnecting in a loop
— that reconnect storm is what produced the orphans, which is exactly why this needs
both a lifecycle fix and a manual-removal control.
Root cause (confirmed in code)
The Sessions list is served from the in-memory SessionManager, not the database:
GET /api/sessions → list_sessions (main.rs:636) → state.sessions.list_sessions()
(session/mod.rs:584). Three compounding defects let ghosts accumulate there:
- Machine identity is a config-file random UUID, not machine-derived. The agent's agent_id is a random UUID minted by generate_agent_id() (agent/src/config.rs:90) on first run and persisted only in the agent config file (config.rs:331), or taken from GURUCONNECT_AGENT_ID (config.rs:366). A portable or misconfigured execution that cannot locate or write that config (e.g. the Pavon desktop launcher guruconnect-pavon-raidersreef.exe run repeatedly from a user Desktop) regenerates a fresh agent_id on every launch. Because identity is not derived from the machine, the same physical box presents as N different agents. The DB upsert (upsert_machine, ON CONFLICT (agent_id)) and the session-reuse map both key on this id, so an unstable id defeats both dedupe layers at the source.
- Reconnect-reuse is keyed on that agent_id. register_agent (session/mod.rs:169) reuses an existing session only when self.agents.get(&agent_id) resolves to an is_online == false session. With a new agent_id per launch (the first defect) the lookup misses and a brand-new persistent session is created each time. The agents map holds only one session per agent_id, so prior sessions become unreferenced yet remain in the sessions map.
- Persistent sessions are never reaped. On disconnect, only support sessions are removed entirely; persistent/managed sessions are deliberately retained (session/mod.rs:519–542) and there is no TTL sweep. An offline managed session therefore lives in memory indefinitely, displayed alongside genuinely live ones.
Net effect: N reconnects with unstable identity → N retained, never-expiring managed sessions, none of which the UI can clear.
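The accumulation described above can be reproduced with a small model. This is a minimal sketch, not the server code: the Registry type, its fields, and retained_after_launches are hypothetical stand-ins for the agents and sessions maps, assuming reuse happens only for an offline session under the same agent_id and that disconnect merely marks a persistent session offline.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the in-memory session state (hypothetical names).
#[derive(Default)]
struct Registry {
    /// agent_id -> session_id; holds at most one session per agent_id.
    agents: HashMap<String, u64>,
    /// session_id -> is_online; persistent sessions are retained on disconnect.
    sessions: HashMap<u64, bool>,
    next_session_id: u64,
}

impl Registry {
    /// Reuse the offline session for this agent_id if one exists, else create one.
    fn register_agent(&mut self, agent_id: &str) -> u64 {
        if let Some(&sid) = self.agents.get(agent_id) {
            if self.sessions.get(&sid) == Some(&false) {
                self.sessions.insert(sid, true);
                return sid;
            }
        }
        let sid = self.next_session_id;
        self.next_session_id += 1;
        self.agents.insert(agent_id.to_string(), sid);
        self.sessions.insert(sid, true);
        sid
    }

    /// Persistent sessions are only marked offline, never removed.
    fn disconnect(&mut self, session_id: u64) {
        if let Some(online) = self.sessions.get_mut(&session_id) {
            *online = false;
        }
    }
}

/// Connect and disconnect once per id; return how many sessions are retained.
fn retained_after_launches(ids: &[&str]) -> usize {
    let mut reg = Registry::default();
    for id in ids {
        let sid = reg.register_agent(id);
        reg.disconnect(sid);
    }
    reg.sessions.len()
}
```

With a stable id, retained_after_launches(&["a", "a", "a"]) leaves one session; with a fresh id per launch, retained_after_launches(&["a", "b", "c"]) leaves three, which matches the ghost rows seen in the console.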
Scope
Included in v1
- Stable, machine-derived identity (the primary fix):
  - The agent computes a deterministic machine_uid from durable machine identifiers, hashed to a stable string. The primary source is the Windows MachineGuid (HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid), optionally folded with a stable hardware id (board/BIOS serial). It is recomputable: a lost or absent config self-heals to the same id rather than minting a new random one. Persist a cached copy, but never depend on the config file for correctness.
  - Registration keys on machine_uid: upsert_machine's ON CONFLICT and the in-memory agents/session-reuse map both use it, so the same box converges to one machine record and one managed session no matter how many times it executes.
  - Carry machine_uid in the agent connect handshake (transport/websocket.rs:40 query params) / AgentStatus; keep the legacy random agent_id only as a migration fallback.
- Lifecycle reaping (defense-in-depth):
  - Periodic background sweep that removes persistent sessions whose agent has been offline (is_online == false) longer than a TTL (default 10 min, configurable), using the existing last_heartbeat_instant.
  - On agent reconnect, supersede prior retained sessions for the same machine (dedupe by hostname/machine identity, not only exact agent_id) so a fresh agent_id cannot strand the old session.
  - On socket drop for a persistent agent, mark the session offline and eligible for the sweep (DB end_session already fires at relay/mod.rs:892; align in-memory state with it).
- Manual removal API (admin-gated, audited):
  - DELETE /api/sessions/:id?purge=true: remove the in-memory session record (SessionManager::remove_session, session/mod.rs:548) and soft-delete the DB row. This is distinct from the existing disconnect_session, which ends a live session; purge works on dead rows.
  - POST /api/sessions/bulk (or DELETE with a body) taking { ids: [...], action: "purge" | "end" } for mass delete / bulk-end.
  - Same stale-removal for the Machines view: extend the existing DELETE /api/machines/:agent_id (main.rs:387) usage with a bulk variant for stale units.
- Dashboard UX:
  - Per-row Remove action on SessionsPage.tsx for non-live rows (alongside the existing End in EndSessionDialog.tsx).
  - Multi-select checkboxes + a bulk-action bar (Select all / mass Remove / bulk End) on the Sessions view; mirror on MachinesPage.tsx for stale units.
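The machine_uid derivation in the first scope item could look roughly like the sketch below. It is illustrative only: derive_machine_uid, fnv1a64, and the "muid-" prefix are hypothetical names, the MachineGuid is passed in as a string rather than read from the registry, and FNV-1a is used purely because it needs no dependencies and gives the same result on every run. A real build would more likely use a cryptographic hash such as SHA-256.

```rust
/// 64-bit FNV-1a. ASSUMPTION: chosen only to keep the sketch dependency-free
/// and stable across runs; a cryptographic hash is the more likely real choice.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Derive a machine_uid from the MachineGuid and an optional hardware serial.
/// Inputs are trimmed and lowercased so that formatting differences in how
/// the values were read do not change the result.
fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> String {
    let mut material = machine_guid.trim().to_ascii_lowercase();
    if let Some(serial) = hardware_serial {
        material.push('|');
        material.push_str(&serial.trim().to_ascii_lowercase());
    }
    // Hypothetical output format: fixed prefix plus 16 hex digits.
    format!("muid-{:016x}", fnv1a64(material.as_bytes()))
}
```

Because the function has no hidden state, a launch with no config file recomputes the same value as a launch that has a cached copy, which is the self-healing property the scope item asks for. Folding in a serial changes the uid, so the source mix has to be fixed before rollout.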
Explicitly out of scope
- Surviving a full machine reimage/clone with the same identity. MachineGuid regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged box legitimately becomes a new machine_uid (and a clone collision is caught by the auth binding below). Cross-reimage identity continuity is out of scope for v1.
- Replacing the shared AGENT_API_KEY with per-machine agent keys: tracked separately (roadmap GuruRMM-Integration). SPEC-004 assumes that binding for its threat model (see Security) and degrades safely without it, but does not implement it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming). The request mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history. v1 soft-deletes (deleted_at) to preserve the audit trail (CLAUDE.md DB conventions).
Architecture
- Agent (agent/src/): new identity module computes machine_uid deterministically (Windows MachineGuid primary; non-Windows fallback to a stable persisted UUID). Replace/augment generate_agent_id() (config.rs:90) so the effective id is the machine-derived value, with the config-file value used only as a cache. Send machine_uid in the connect query string (transport/websocket.rs:40) and on AgentStatus.
- Relay-server (server/src/session/mod.rs): key register_agent and the agents map on machine_uid so the same machine reuses one session; add reap_stale_persistent(ttl) to SessionManager + a periodic task (e.g. every 60 s) from server startup; supersede any prior same-machine sessions on reconnect; add a purge-style removal the API can call for dead rows.
- DB (server/src/db/sessions.rs + migration 008/009): add deleted_at TIMESTAMPTZ to connect_sessions; add purge_session (soft-delete) and a bulk variant; get_recent_sessions/list queries filter deleted_at IS NULL. Idempotent ADD COLUMN IF NOT EXISTS, applied by sqlx::migrate!() on startup; never pre-applied via psql (see the 005→007 lesson).
- API (server/src/main.rs, server/src/api/): new purge + bulk routes, all behind the existing AuthenticatedUser/admin guard; emit audit events rows.
- Dashboard (dashboard/src/features/sessions/, .../machines/): selection state, bulk-action bar, Remove confirmation reusing the EndSessionDialog/DeleteMachineDialog patterns; dashboard/src/api/sessions.ts gains purgeSession/bulkSessions.
- Protobuf: add machine_uid to the agent identity carried on AgentStatus (the connect handshake passes it as a query param; mirroring it on AgentStatus lets the server reconcile mid-session). Otherwise server/dashboard only.
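The relay-server sweep named above, reap_stale_persistent(ttl), might be shaped like this. It is a sketch under assumptions: the Session fields other than is_online and last_heartbeat_instant (is_persistent, viewer_count) are hypothetical, session ids are plain integers instead of UUIDs, and the current time is passed in as a parameter so the logic can be exercised without waiting.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical, reduced view of a session; only fields the sweep needs.
struct Session {
    is_persistent: bool,
    is_online: bool,
    viewer_count: usize,
    last_heartbeat_instant: Instant,
}

struct SessionManager {
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    /// Remove persistent sessions that are offline, have no viewers, and
    /// whose last heartbeat is older than `ttl`. Returns the removed ids,
    /// sorted, so the caller can log or audit them.
    fn reap_stale_persistent(&mut self, now: Instant, ttl: Duration) -> Vec<u64> {
        let mut stale: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && s.viewer_count == 0
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| *id)
            .collect();
        stale.sort_unstable();
        for id in &stale {
            self.sessions.remove(id);
        }
        stale
    }
}
```

A periodic task would call this with Instant::now() and the configured TTL. Online and viewer-attached sessions are skipped by the filter, so a briefly flapping agent inside the TTL window is never reaped.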
Implementation details
- Files to touch:
  - Agent: agent/src/identity/ (new; machine_uid derivation), agent/src/config.rs:90,366 (effective id = machine-derived), agent/src/transport/websocket.rs:40 (send machine_uid).
  - Server: server/src/session/mod.rs:169,519,548,584 (key on machine_uid; reuse/reap/remove), server/src/relay/mod.rs:584,591 (registration path), server/src/main.rs:376–388,636,661 (routes + handlers), server/src/db/sessions.rs (purge + bulk + deleted_at filtering), server/migrations/ (new migration for deleted_at).
  - Dashboard: dashboard/src/features/sessions/SessionsPage.tsx, EndSessionDialog.tsx, dashboard/src/api/sessions.ts; dashboard/src/features/machines/MachinesPage.tsx.
- Keep the in-memory list authoritative for "live"; treat purge as: remove in-memory + soft-delete DB. A reaped/purged session must vanish from list_sessions() output.
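The purge semantics (remove in-memory + soft-delete DB) and a bulk variant can be sketched as follows. Everything here is a stand-in: Store, SessionRow, visible_ids, and the MAX_BULK_IDS value of 100 are hypothetical, the database table is modeled as a HashMap, and timestamps are plain unix seconds.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical cap on ids per bulk call; the real value is an open question.
const MAX_BULK_IDS: usize = 100;

/// Stand-in for a connect_sessions row; only the soft-delete column is modeled.
struct SessionRow {
    deleted_at: Option<u64>, // unix seconds; None means visible
}

struct Store {
    /// In-memory sessions (authoritative for "live").
    live: HashSet<u64>,
    /// Stand-in for the DB table.
    rows: HashMap<u64, SessionRow>,
}

impl Store {
    /// Purge one session: drop it from memory and soft-delete its row.
    /// Returns true if anything changed.
    fn purge_session(&mut self, id: u64, now: u64) -> bool {
        let removed = self.live.remove(&id);
        let soft_deleted = match self.rows.get_mut(&id) {
            Some(row) if row.deleted_at.is_none() => {
                row.deleted_at = Some(now);
                true
            }
            _ => false,
        };
        removed || soft_deleted
    }

    /// Purge many ids, rejecting oversized requests before touching any state.
    /// Returns how many sessions were actually changed.
    fn bulk_purge(&mut self, ids: &[u64], now: u64) -> Result<usize, String> {
        if ids.len() > MAX_BULK_IDS {
            return Err(format!("too many ids: {} > {}", ids.len(), MAX_BULK_IDS));
        }
        Ok(ids.iter().filter(|id| self.purge_session(**id, now)).count())
    }

    /// Ids the console should list: rows that are not soft-deleted.
    fn visible_ids(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .rows
            .iter()
            .filter(|(_, row)| row.deleted_at.is_none())
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

Duplicate ids in one request are counted once, because the second purge of the same id changes nothing. The row itself is kept with deleted_at set, so history remains available for audit while the list query hides it.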
Security considerations
- Identity is not authorization. A client-asserted machine_uid is self-reported and therefore spoofable: on its own, agent A could claim agent B's machine_uid to bind to (and hijack) B's session and machine record. The machine_uid must be bound to the agent's authenticated credential: the server accepts a given machine_uid only from a connection authenticated by that machine's own agent key (or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key pair). This is why per-machine agent keys (roadmap) are the natural companion. Until they ship, the shared AGENT_API_KEY means machine_uid is a correctness improvement (dedupe) but not yet a trust boundary; call this out so it isn't mistaken for one. A clone collision (two boxes, same MachineGuid) surfaces here as two agents claiming one uid and is resolved by the key binding, not by the uid alone.
- All purge/bulk endpoints require an authenticated admin (AuthenticatedUser, same guard as list_sessions); never expose removal unauthenticated.
- Audit every removal to the events table (who, which session/machine, when, count for bulk): soft-delete + audit, not silent hard-delete.
- Validate/limit bulk request size (cap N per call) to avoid a single call sweeping the whole fleet by accident or abuse.
- Reaping must not end a session that is merely briefly offline (TTL guards against flapping); never reap an is_online or viewer-attached session.
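The trust-on-first-use binding from the first bullet can be illustrated with a small pin table. This is a sketch of one possible policy, not a decided design: UidKeyPins, BindOutcome, and the key_id parameter (some stable identifier of the authenticated agent key) are hypothetical, and persistence of the pins is left out.

```rust
use std::collections::HashMap;

/// Result of checking a (machine_uid, key) pair against the pin table.
#[derive(Debug, PartialEq)]
enum BindOutcome {
    /// First time this uid was seen; it is now pinned to the presenting key.
    Pinned,
    /// The uid was presented by the key it is pinned to.
    Accepted,
    /// The uid is pinned to a different key; do not bind the session.
    Rejected,
}

#[derive(Default)]
struct UidKeyPins {
    /// machine_uid -> identifier of the agent key that first presented it.
    pins: HashMap<String, String>,
}

impl UidKeyPins {
    /// Check a uid against the key the connection authenticated with,
    /// pinning the pair on first sight.
    fn check(&mut self, machine_uid: &str, key_id: &str) -> BindOutcome {
        match self.pins.get(machine_uid) {
            None => {
                self.pins.insert(machine_uid.to_string(), key_id.to_string());
                BindOutcome::Pinned
            }
            Some(pinned) if pinned.as_str() == key_id => BindOutcome::Accepted,
            Some(_) => BindOutcome::Rejected,
        }
    }
}
```

While every agent authenticates with the shared AGENT_API_KEY, key_id is the same for all connections and this check always accepts, which is the sense in which machine_uid is not yet a trust boundary. With per-machine keys, a cloned box or a spoofing agent presents a different key_id and is rejected.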
Testing strategy
- Unit: machine_uid derivation is deterministic: the same machine inputs yield the same uid across runs, and an absent config recomputes the same value (no fresh random id). register_agent for the same machine_uid reuses/supersedes the prior session (no duplicate retained) even when the legacy agent_id differs. reap_stale_persistent removes offline-past-TTL persistent sessions and spares online/within-TTL ones. purge_session soft-deletes and filters out of list queries.
- Integration: simulate a reconnect storm (M connects, varying agent_id but the same machine_uid, as the Pavon launcher did) → assert list_sessions() converges to one live session and connect_machines holds one row, not M. A spoof attempt (uid X presented on a connection not authenticated for X) is rejected/not bound. Purge a dead session via API → gone from list + deleted_at set + audit row written. Bulk purge of K ids removes exactly K.
- Manual: on the live console, reproduce against the Pavon machines, confirm the ghost rows can be multi-selected and removed and do not reappear after the sweep.
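The reconnect-storm case can be prototyped against a reduced registry before wiring it to the real SessionManager. The sketch assumes registration is keyed on machine_uid and that a reconnect supersedes in place; Registry, by_machine, and storm are hypothetical names, and session ids are integers for brevity.

```rust
use std::collections::HashMap;

/// Reduced registry keyed on machine_uid instead of agent_id (hypothetical).
#[derive(Default)]
struct Registry {
    /// machine_uid -> session_id: one managed session per machine.
    by_machine: HashMap<String, u64>,
    /// session_id -> legacy agent_id last seen on that session.
    sessions: HashMap<u64, String>,
    next_session_id: u64,
}

impl Registry {
    /// Reuse the machine's existing session if there is one, recording the
    /// latest legacy agent_id on it; otherwise create a new session.
    fn register_agent(&mut self, machine_uid: &str, agent_id: &str) -> u64 {
        if let Some(&sid) = self.by_machine.get(machine_uid) {
            self.sessions.insert(sid, agent_id.to_string());
            return sid;
        }
        let sid = self.next_session_id;
        self.next_session_id += 1;
        self.by_machine.insert(machine_uid.to_string(), sid);
        self.sessions.insert(sid, agent_id.to_string());
        sid
    }
}

/// Simulate `launches` connects from one machine, each with a different
/// legacy agent_id, and return the total number of sessions held.
fn storm(reg: &mut Registry, machine_uid: &str, launches: usize) -> usize {
    for i in 0..launches {
        let agent_id = format!("random-{}", i);
        reg.register_agent(machine_uid, &agent_id);
    }
    reg.sessions.len()
}
```

Ten launches from one machine_uid leave a single session, and a second machine adds exactly one more. The real integration test would make the same assertion against list_sessions() and the connect_machines row count.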
Effort estimate & dependencies
- Size: Medium. Reaping + supersede + purge/bulk + dashboard follow existing patterns; the migration is trivial. The added agent-side machine_uid derivation and threading it through the handshake/registration is the main new surface (bumps this toward the upper end of Medium).
- Depends on: nothing blocking. Pairs with per-machine agent keys (roadmap) for the full trust boundary on machine_uid (see Security); SPEC-004 degrades safely without them.
- Unblocks: a trustworthy Sessions/Machines view (dead rows no longer masquerade as live; one machine = one record/session), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
Open questions
- Reap TTL default — 10 min proposed; confirm. Should it differ for managed vs. support sessions?
- machine_uid source mix: MachineGuid alone, or folded with board/BIOS serial? MachineGuid is stable and present everywhere but regenerates on sysprep and is cloneable; adding a hardware serial reduces clone collisions but churns on hardware swaps. Pick the recipe (proposed: MachineGuid primary, hashed).
- uid↔key binding model: trust-on-first-use pinning of machine_uid to the agent key, vs. requiring per-machine keys before honoring a uid. What's the interim policy while the shared AGENT_API_KEY is still in use?
- Migration of existing rows: for legacy random-agent_id machine/session rows, let them age out via the reaper + manual purge, or run a one-time reconcile that maps known hosts to their new machine_uid? (Proposed: age-out + purge; no risky backfill.)
- Purge vs. keep history: soft-delete (proposed) keeps connect_sessions history for audit while hiding it from the console; confirm operators don't expect a hard purge.
- Bulk-action cap: what's a sane max N per bulk call (e.g. 100)?