spec: SPEC-004 add stable machine-derived identity as the primary fix

Address duplicate registration at the source, not just via cleanup. Root
cause now grounded: agent_id is a random UUID (config.rs:90 generate_agent_id)
persisted only in the config file, so a portable/misconfigured execution
(the Pavon desktop launcher) regenerates a fresh id each launch, defeating
both the DB upsert (ON CONFLICT agent_id) and session-reuse dedupe. Add a
deterministic machine_uid (Windows MachineGuid-based, recomputable) and key
registration on it; reaping/supersede become defense-in-depth. Security: machine_uid
is identity, not authorization, and must be bound to the per-machine agent key
to prevent session/record hijack. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 16:11:38 -07:00
parent ee900c6395
commit f8bd4d1dab
2 changed files with 110 additions and 37 deletions


@@ -48,7 +48,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
 - [x] JWT auth, Argon2id passwords, rate limiting, security headers
 - [x] Sessions / machines / support-codes / events
 - [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). ([SPEC-003](specs/SPEC-003-machine-inventory.md))
-- [ ] **Session lifecycle reaping + operator session/unit removal** — P1 — reap orphaned managed sessions (TTL sweep + supersede prior same-machine sessions on reconnect) so dead rows stop masquerading as live, and add admin-gated per-row + multi-select bulk removal of stale sessions/units in the Operator Console. Fixes ghost-session accumulation observed on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
+- [ ] **Stable machine identity + session lifecycle reaping + operator removal** — P1 — give the agent a deterministic machine-derived `machine_uid` (Windows `MachineGuid`-based) so the same box can't register duplicates (root cause: `agent_id` is a config-file random UUID that a portable/misconfigured run regenerates each launch); key registration on it; add TTL reaping + same-machine supersede as defense-in-depth; and admin-gated per-row + multi-select bulk removal of stale sessions/units. Identity must be bound to the per-machine agent key (spoof guard). Fixes ghost-session accumulation seen on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
 - [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
 ## Security & Infrastructure


@@ -1,4 +1,4 @@
-# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal
+# SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal
 **Status:** Proposed
 **Priority:** P1
@@ -12,9 +12,14 @@ operators a first-class way to remove stale sessions/units — per-row **and** i
 (multi-select mass delete). Today the Sessions view can show many dead rows that look
 live, and the only per-row action is "End", which applies to a *live* session and does
 nothing for an already-dead one — so junk just piles up with no way to clear it.
-Success = (a) reconnecting/offline persistent agents no longer leave behind retained
-ghost sessions, and (b) an admin can select one or many session rows (and stale machine
-rows) and remove them from the console.
+The durable fix is at registration: the **same machine must resolve to one stable
+identity** so a repeated execution cannot mint duplicates in the first place; reaping
+and manual removal then become defense-in-depth and cleanup, not the primary mechanism.
+Success = (a) the same machine, run repeatedly (even from a portable/misconfigured
+copy), registers to **one** record/session — no duplicates; (b) reconnecting/offline
+persistent agents no longer leave behind retained ghost sessions; and (c) an admin can
+select one or many session rows (and stale machine rows) and remove them from the
+console.
 **Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
 live**, of which ~10 were duplicate `MANAGED` rows for a single machine
@@ -28,15 +33,24 @@ just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in
 The Sessions list is served from the **in-memory `SessionManager`**, not the database:
 `GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
-(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there:
-1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent`
+(`session/mod.rs:584`). Three compounding defects let ghosts accumulate there:
+0. **Machine identity is a config-file random UUID, not machine-derived.** The agent's
+`agent_id` is a random UUID minted by `generate_agent_id()` (`agent/src/config.rs:90`)
+on first run and persisted **only in the agent config file** (`config.rs:331`), or
+taken from `GURUCONNECT_AGENT_ID` (`config.rs:366`). A portable or misconfigured
+execution that cannot locate/write that config — e.g. the Pavon desktop launcher
+`guruconnect-pavon-raidersreef.exe` run repeatedly from a user Desktop — regenerates
+a **fresh** `agent_id` every launch. Because identity is not derived from the machine,
+the same physical box presents as N different agents. The DB upsert
+(`upsert_machine`, `ON CONFLICT (agent_id)`) and the session-reuse map both key on
+this id, so an unstable id defeats *both* dedupe layers at the source.
+1. **Reconnect-reuse is keyed on that `agent_id`.** `register_agent`
 (`session/mod.rs:169`) reuses an existing session only when
-`self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the
-agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the
-misbehaving client did), the lookup misses and a **brand-new persistent session** is
-created each time. The `agents` map holds only one session per agent_id, so prior
-sessions become unreferenced yet remain in the `sessions` map.
+`self.agents.get(&agent_id)` resolves to an `is_online == false` session. With a new
+`agent_id` per launch (defect 0) the lookup misses and a **brand-new persistent
+session** is created each time. The `agents` map holds only one session per agent_id,
+so prior sessions become unreferenced yet remain in the `sessions` map.
 2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
 removed entirely; persistent/managed sessions are deliberately retained
 (`session/mod.rs:519–542`) and there is **no TTL sweep**. An offline managed session
@@ -49,7 +63,20 @@ sessions, none of which the UI can clear.
 ### Included in v1
-- **Lifecycle reaping (the fix):**
+- **Stable, machine-derived identity (the primary fix):**
+- The agent computes a deterministic `machine_uid` from durable machine identifiers —
+primary source the Windows `MachineGuid`
+(`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`), optionally folded with a stable
+hardware id (board/BIOS serial) — hashed to a stable string. It is **recomputable**:
+a lost/absent config self-heals to the *same* id rather than minting a new random
+one. Persist a cached copy, but never depend on the config file for correctness.
+- Registration keys on `machine_uid`: `upsert_machine`'s `ON CONFLICT` and the
+in-memory `agents`/session-reuse map both use it, so the same box converges to **one**
+machine record and **one** managed session no matter how many times it executes.
+- Carry `machine_uid` in the agent connect handshake (`transport/websocket.rs:40`
+query params) / `AgentStatus`; keep the legacy random `agent_id` only as a
+migration fallback.
+- **Lifecycle reaping (defense-in-depth):**
 - Periodic background sweep that removes persistent sessions whose agent has been
 offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
 using the existing `last_heartbeat_instant`.
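
The `machine_uid` derivation described in the scope bullets above can be sketched roughly as follows. This is a minimal illustration, not the agent's code: `derive_machine_uid`, `normalize`, the `muid-` prefix, and the `|` separator are all hypothetical names and choices, and FNV-1a stands in for whatever hash the implementation picks (a cryptographic hash such as SHA-256 is the more likely choice). Reading `MachineGuid` from the registry is left out; the function takes the already-read value so the sketch stays dependency-free.

```rust
/// FNV-1a 64-bit: a tiny, dependency-free hash with fixed published constants.
/// Assumption: used here only as a stand-in for the real hash choice.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Normalize an identifier so casing, surrounding braces, and surrounding
/// whitespace do not change the resulting uid.
fn normalize(raw: &str) -> String {
    raw.trim()
        .trim_matches(|c| c == '{' || c == '}')
        .to_ascii_lowercase()
}

/// Hypothetical derivation: hash the normalized MachineGuid, optionally folded
/// with a hardware serial. The same inputs always produce the same uid, so a
/// lost config file recomputes the identical value instead of a new random one.
pub fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> String {
    let mut material = normalize(machine_guid);
    if let Some(serial) = hardware_serial {
        material.push('|');
        material.push_str(&normalize(serial));
    }
    format!("muid-{:016x}", fnv1a64(material.as_bytes()))
}
```

Because the function is pure, the config file can hold a cached copy without being the source of truth: on startup the agent can recompute the value and overwrite a missing or stale cache.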
@@ -77,8 +104,13 @@ sessions, none of which the UI can clear.
 ### Explicitly out of scope
-- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent
-keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
+- Surviving a full machine **reimage/clone** with the same identity. `MachineGuid`
+regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged
+box legitimately becomes a new `machine_uid` (and a clone collision is caught by the
+auth binding below). Cross-reimage identity continuity is out of scope for v1.
+- Replacing the shared `AGENT_API_KEY` with per-machine agent keys — tracked separately
+(roadmap GuruRMM-Integration). SPEC-004 *assumes* that binding for its threat model
+(see Security) and degrades safely without it, but does not implement it.
 - "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
 mentioned it; deferred unless a concrete edit field is identified.
 - Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
@@ -86,10 +118,17 @@ sessions, none of which the UI can clear.
 ## Architecture
-- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to
-`SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup;
-extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style
-removal that the API can call for dead rows.
+- **Agent (`agent/src/`):** new `identity` module computes `machine_uid` deterministically
+(Windows `MachineGuid` primary; non-Windows fallback to a stable persisted UUID).
+Replace/augment `generate_agent_id()` (`config.rs:90`) so the effective id is the
+machine-derived value, with the config-file value used only as a cache. Send
+`machine_uid` in the connect query string (`transport/websocket.rs:40`) and on
+`AgentStatus`.
+- **Relay-server (`server/src/session/mod.rs`):** key `register_agent` and the
+`agents` map on `machine_uid` so the same machine reuses one session; add
+`reap_stale_persistent(ttl)` to `SessionManager` + a periodic task (e.g. every 60 s)
+from server startup; supersede any prior same-machine sessions on reconnect; add a
+`purge`-style removal the API can call for dead rows.
 - **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
 `deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
 a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
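
The relay-server behavior in the Architecture bullets above (one session per `machine_uid`, plus a TTL reaper) might look roughly like the sketch below. It is a simplified stand-in, not the real `SessionManager`: the actual type keeps separate `agents` and `sessions` maps, and the `Session` fields, `mark_offline`, and the explicit `now` parameter (added here so the sweep can be exercised without sleeping) are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal stand-in for a persistent (managed) session.
#[derive(Debug, Clone)]
pub struct Session {
    pub session_id: u64,
    pub machine_uid: String,
    pub is_online: bool,
    pub last_heartbeat_instant: Instant,
}

/// Hypothetical, simplified manager that holds one session per `machine_uid`.
#[derive(Default)]
pub struct SessionManager {
    sessions: HashMap<String, Session>, // keyed on machine_uid
    next_id: u64,
}

impl SessionManager {
    /// Reuse the machine's existing session if there is one; otherwise create it.
    /// Repeated launches with the same uid therefore converge on one session.
    pub fn register_agent(&mut self, machine_uid: &str, now: Instant) -> u64 {
        if let Some(existing) = self.sessions.get_mut(machine_uid) {
            existing.is_online = true;
            existing.last_heartbeat_instant = now;
            return existing.session_id;
        }
        self.next_id += 1;
        let session = Session {
            session_id: self.next_id,
            machine_uid: machine_uid.to_string(),
            is_online: true,
            last_heartbeat_instant: now,
        };
        self.sessions.insert(machine_uid.to_string(), session);
        self.next_id
    }

    /// Mark a machine's session offline (e.g. on disconnect); it is retained.
    pub fn mark_offline(&mut self, machine_uid: &str) {
        if let Some(session) = self.sessions.get_mut(machine_uid) {
            session.is_online = false;
        }
    }

    /// Remove offline sessions whose last heartbeat is older than `ttl`.
    /// Online sessions are always kept. Returns how many were removed.
    pub fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> usize {
        let before = self.sessions.len();
        self.sessions.retain(|_, s| {
            s.is_online || now.saturating_duration_since(s.last_heartbeat_instant) <= ttl
        });
        before - self.sessions.len()
    }

    pub fn list_sessions(&self) -> Vec<&Session> {
        self.sessions.values().collect()
    }
}
```

In a server, the periodic task would call `reap_stale_persistent` with the configured TTL and `Instant::now()`; passing `now` in keeps the sweep logic deterministic under test.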
@@ -101,11 +140,17 @@ sessions, none of which the UI can clear.
 bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
 `DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
 `purgeSession` / `bulkSessions`.
-- **Protobuf:** none — this is server/dashboard only.
+- **Protobuf:** add `machine_uid` to the agent identity carried on `AgentStatus` (the
+connect handshake passes it as a query param; mirroring it on `AgentStatus` lets the
+server reconcile mid-session). Otherwise server/dashboard only.
 ## Implementation details
-- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove);
+- Files to touch: `agent/src/identity/` (new — `machine_uid` derivation),
+`agent/src/config.rs:90,366` (effective id = machine-derived),
+`agent/src/transport/websocket.rs:40` (send `machine_uid`);
+`server/src/session/mod.rs:169,519,548,584` (key on `machine_uid`; reuse/reap/remove);
+`server/src/relay/mod.rs:584,591` (registration path);
 `server/src/main.rs:376–388,636,661` (routes + handlers); `server/src/db/sessions.rs`
 (purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
 `deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
@@ -116,6 +161,17 @@ sessions, none of which the UI can clear.
 ## Security considerations
+- **Identity is not authorization.** A client-asserted `machine_uid` is self-reported
+and therefore spoofable — on its own, agent A could claim agent B's `machine_uid` to
+bind to (and hijack) B's session and machine record. The `machine_uid` must be
+**bound to the agent's authenticated credential**: the server accepts a given
+`machine_uid` only from a connection authenticated by that machine's own agent key
+(or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key
+pair). This is why per-machine agent keys (roadmap) are the natural companion; until
+they ship, the shared `AGENT_API_KEY` means `machine_uid` is a *correctness* improvement
+(dedupe) but not yet a *trust* boundary — call this out so it isn't mistaken for one.
+A clone collision (two boxes, same `MachineGuid`) surfaces here as two agents claiming
+one uid and is resolved by the key binding, not by the uid alone.
 - All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
 guard as `list_sessions`); never expose removal unauthenticated.
 - Audit every removal to the `events` table (who, which session/machine, when, count
@@ -127,32 +183,49 @@ sessions, none of which the UI can clear.
 ## Testing strategy
-- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the
-prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL
-persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and
-filters out of list queries.
-- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same
-hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead
-session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K
-ids removes exactly K.
+- **Unit:** `machine_uid` derivation is deterministic — same machine inputs yield the
+same uid across runs, and an absent config recomputes the same value (no fresh random
+id). `register_agent` for the same `machine_uid` reuses/supersedes the prior session
+(no duplicate retained) even when the legacy `agent_id` differs.
+`reap_stale_persistent` removes offline-past-TTL persistent sessions and spares
+online/within-TTL ones. `purge_session` soft-deletes and filters out of list queries.
+- **Integration:** simulate a reconnect storm (M connects, **varying `agent_id` but the
+same `machine_uid`**, as the Pavon launcher did) → assert `list_sessions()` converges
+to one live session and `connect_machines` holds one row, not M. A spoof attempt
+(uid X presented on a connection not authenticated for X) is rejected/not bound. Purge
+a dead session via API → gone from list + `deleted_at` set + audit row written. Bulk
+purge of K ids removes exactly K.
 - **Manual:** on the live console, reproduce against the Pavon machines, confirm the
 ghost rows can be multi-selected and removed and do not reappear after the sweep.
 ## Effort estimate & dependencies
-- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API
-and dashboard work follows existing End/Delete patterns. The migration is trivial.
-- **Depends on:** nothing blocking.
+- **Size: Medium.** Reaping + supersede + purge/bulk + dashboard follow existing
+patterns; the migration is trivial. The added agent-side `machine_uid` derivation and
+threading it through the handshake/registration is the main new surface (bumps this
+toward the upper end of Medium).
+- **Depends on:** nothing blocking. **Pairs with** per-machine agent keys (roadmap) for
+the full trust boundary on `machine_uid` — see Security; SPEC-004 degrades safely
+without them.
 - **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
-live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
+live; one machine = one record/session), and complements SPEC-002 Phase 2's dashboard
+hardening of the same surfaces.
 ## Open questions
 1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
 support sessions?
-2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once
-that lands? v1 proposes hostname; revisit when per-machine keys ship.
-3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
+2. **`machine_uid` source mix** — `MachineGuid` alone, or folded with board/BIOS serial?
+`MachineGuid` is stable and present everywhere but regenerates on sysprep and is
+cloneable; adding a hardware serial reduces clone collisions but churns on hardware
+swaps. Pick the recipe (proposed: `MachineGuid` primary, hashed).
+3. **uid↔key binding model** — trust-on-first-use pinning of `machine_uid` to the agent
+key, vs. requiring per-machine keys before honoring a uid. What's the interim policy
+while the shared `AGENT_API_KEY` is still in use?
+4. **Migration of existing rows** — legacy random-`agent_id` machine/session rows: let
+them age out via the reaper + manual purge, or run a one-time reconcile that maps
+known hosts to their new `machine_uid`? (Proposed: age-out + purge; no risky backfill.)
+5. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
 for audit while hiding it from the console; confirm operators don't expect a hard
 purge.
-4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?
+6. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?
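
The trust-on-first-use option raised in the uid↔key binding question could be sketched as below, purely to make the policy concrete. Everything here is hypothetical: `UidKeyBindings`, `BindOutcome`, and the notion of a per-machine `key_fingerprint` are illustrative names, the pin table is in-memory only, and the sketch presupposes per-machine agent keys. With the shared `AGENT_API_KEY` every agent would present the same fingerprint, so this check would not yet form a trust boundary.

```rust
use std::collections::HashMap;

/// Result of checking a claimed machine_uid against the presenting key.
#[derive(Debug, PartialEq, Eq)]
pub enum BindOutcome {
    /// First time this uid is seen: the uid/key pair is pinned.
    PinnedFirstUse,
    /// The uid was presented by the key it is pinned to.
    Accepted,
    /// The uid is pinned to a different key: spoof attempt or clone collision.
    Rejected,
}

/// Hypothetical in-memory pin table: machine_uid -> key fingerprint.
#[derive(Default)]
pub struct UidKeyBindings {
    pinned: HashMap<String, String>,
}

impl UidKeyBindings {
    /// Accept a uid only from the key it was first seen with.
    pub fn check(&mut self, machine_uid: &str, key_fingerprint: &str) -> BindOutcome {
        match self.pinned.get(machine_uid) {
            Some(pinned) if pinned == key_fingerprint => BindOutcome::Accepted,
            Some(_) => BindOutcome::Rejected,
            None => {
                self.pinned
                    .insert(machine_uid.to_string(), key_fingerprint.to_string());
                BindOutcome::PinnedFirstUse
            }
        }
    }
}
```

A rejected claim covers both cases named in the Security section: an agent asserting another machine's uid, and two cloned boxes sharing one `MachineGuid`. How a legitimate key rotation re-pins a uid is left open here, as it is in the spec.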