spec: SPEC-004 add stable machine-derived identity as the primary fix

Address duplicate registration at the source, not just via cleanup. Root
cause now grounded: agent_id is a random UUID (config.rs:90 generate_agent_id)
persisted only in the config file, so a portable/misconfigured execution
(the Pavon desktop launcher) regenerates a fresh id each launch, defeating
both the DB upsert (ON CONFLICT agent_id) and session-reuse dedupe. Add a
deterministic machine_uid (Windows MachineGuid-based, recomputable) and key
registration on it; reaping/supersede become defense-in-depth. Security:
machine_uid is identity, not authorization, and must be bound to the
per-machine agent key
to prevent session/record hijack. Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 16:11:38 -07:00
parent ee900c6395
commit f8bd4d1dab
2 changed files with 110 additions and 37 deletions

@@ -48,7 +48,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
- [x] JWT auth, Argon2id passwords, rate limiting, security headers
- [x] Sessions / machines / support-codes / events
- [ ] **Full machine inventory in the connection DB** — P2 — persist per-machine device inventory (OS+locale+install, CPU/RAM, mfr/model/serial, external WAN IP captured server-side + private LAN IP + MAC, logged-on user, idle, time zone, uptime, local-admin) on `connect_machines`, refreshed each `AgentStatus`, shown in the dashboard machine detail (ScreenConnect "Guest Info" parity). Data layer for SPEC-002 Phase 2; closes GC side of agent-IP gap (todo 7459428e). ([SPEC-003](specs/SPEC-003-machine-inventory.md))
- [ ] **Session lifecycle reaping + operator session/unit removal** — P1 — reap orphaned managed sessions (TTL sweep + supersede prior same-machine sessions on reconnect) so dead rows stop masquerading as live, and add admin-gated per-row + multi-select bulk removal of stale sessions/units in the Operator Console. Fixes ghost-session accumulation observed on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
- [ ] **Stable machine identity + session lifecycle reaping + operator removal** — P1 — give the agent a deterministic machine-derived `machine_uid` (Windows `MachineGuid`-based) so the same box can't register duplicates (root cause: `agent_id` is a config-file random UUID that a portable/misconfigured run regenerates each launch); key registration on it; add TTL reaping + same-machine supersede as defense-in-depth; and admin-gated per-row + multi-select bulk removal of stale sessions/units. Identity must be bound to the per-machine agent key (spoof guard). Fixes ghost-session accumulation seen on the live console (15 sessions / 0 live, ~10 orphans for one machine). ([SPEC-004](specs/SPEC-004-session-lifecycle-and-removal.md))
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
## Security & Infrastructure

@@ -1,4 +1,4 @@
# SPEC-004: Session Lifecycle Reaping + Operator Session/Unit Removal
# SPEC-004: Stable Machine Identity, Session Lifecycle Reaping, and Operator Removal
**Status:** Proposed
**Priority:** P1
@@ -12,9 +12,14 @@ operators a first-class way to remove stale sessions/units — per-row **and** i
(multi-select mass delete). Today the Sessions view can show many dead rows that look
live, and the only per-row action is "End", which applies to a *live* session and does
nothing for an already-dead one — so junk just piles up with no way to clear it.
Success = (a) reconnecting/offline persistent agents no longer leave behind retained
ghost sessions, and (b) an admin can select one or many session rows (and stale machine
rows) and remove them from the console.
The durable fix is at registration: the **same machine must resolve to one stable
identity** so a repeated execution cannot mint duplicates in the first place; reaping
and manual removal then become defense-in-depth and cleanup, not the primary mechanism.
Success = (a) the same machine, run repeatedly (even from a portable/misconfigured
copy), registers to **one** record/session — no duplicates; (b) reconnecting/offline
persistent agents no longer leave behind retained ghost sessions; and (c) an admin can
select one or many session rows (and stale machine rows) and remove them from the
console.
**Observed (live console, 2026-05-30):** the Sessions view listed **15 sessions, 0
live**, of which ~10 were duplicate `MANAGED` rows for a single machine
@@ -28,15 +33,24 @@ just been cleaned of a misbehaving GuruConnect *client* that was reconnecting in
The Sessions list is served from the **in-memory `SessionManager`**, not the database:
`GET /api/sessions` → `list_sessions` (`main.rs:636`) → `state.sessions.list_sessions()`
(`session/mod.rs:584`). Two compounding defects let ghosts accumulate there:
(`session/mod.rs:584`). Three compounding defects let ghosts accumulate there:
1. **Reconnect-reuse is keyed on a stable `agent_id`.** `register_agent`
0. **Machine identity is a config-file random UUID, not machine-derived.** The agent's
`agent_id` is a random UUID minted by `generate_agent_id()` (`agent/src/config.rs:90`)
on first run and persisted **only in the agent config file** (`config.rs:331`), or
taken from `GURUCONNECT_AGENT_ID` (`config.rs:366`). A portable or misconfigured
execution that cannot locate/write that config — e.g. the Pavon desktop launcher
`guruconnect-pavon-raidersreef.exe` run repeatedly from a user Desktop — regenerates
a **fresh** `agent_id` every launch. Because identity is not derived from the machine,
the same physical box presents as N different agents. The DB upsert
(`upsert_machine`, `ON CONFLICT (agent_id)`) and the session-reuse map both key on
this id, so an unstable id defeats *both* dedupe layers at the source.
1. **Reconnect-reuse is keyed on that `agent_id`.** `register_agent`
(`session/mod.rs:169`) reuses an existing session only when
`self.agents.get(&agent_id)` resolves to an `is_online == false` session. If the
agent reconnects with a *new* `agent_id` (a per-process/regenerated identity, as the
misbehaving client did), the lookup misses and a **brand-new persistent session** is
created each time. The `agents` map holds only one session per agent_id, so prior
sessions become unreferenced yet remain in the `sessions` map.
`self.agents.get(&agent_id)` resolves to an `is_online == false` session. With a new
`agent_id` per launch (defect 0) the lookup misses and a **brand-new persistent
session** is created each time. The `agents` map holds only one session per agent_id,
so prior sessions become unreferenced yet remain in the `sessions` map.
2. **Persistent sessions are never reaped.** On disconnect, only *support* sessions are
removed entirely; persistent/managed sessions are deliberately retained
(`session/mod.rs:519-542`) and there is **no TTL sweep**. An offline managed session
@@ -49,7 +63,20 @@ sessions, none of which the UI can clear.
### Included in v1
- **Lifecycle reaping (the fix):**
- **Stable, machine-derived identity (the primary fix):**
- The agent computes a deterministic `machine_uid` from durable machine identifiers —
primary source the Windows `MachineGuid`
(`HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid`), optionally folded with a stable
hardware id (board/BIOS serial) — hashed to a stable string. It is **recomputable**:
a lost/absent config self-heals to the *same* id rather than minting a new random
one. Persist a cached copy, but never depend on the config file for correctness.
- Registration keys on `machine_uid`: `upsert_machine`'s `ON CONFLICT` clause and the
in-memory `agents`/session-reuse map both use it, so the same box converges to **one**
machine record and **one** managed session no matter how many times it executes.
- Carry `machine_uid` in the agent connect handshake (`transport/websocket.rs:40`
query params) / `AgentStatus`; keep the legacy random `agent_id` only as a
migration fallback.
- **Lifecycle reaping (defense-in-depth):**
- Periodic background sweep that removes persistent sessions whose agent has been
offline (`is_online == false`) longer than a TTL (default 10 min, configurable),
using the existing `last_heartbeat_instant`.
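
The TTL sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual code: the `Session` and `SessionManager` types here are hypothetical, trimmed-down stand-ins, and `now` is passed in explicitly so the logic can be tested without sleeping. Only `reap_stale_persistent`, `is_online`, and `last_heartbeat_instant` are names taken from the spec.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical, trimmed-down stand-in for the server's session record.
#[derive(Debug, Clone)]
pub struct Session {
    pub is_online: bool,
    pub is_persistent: bool,
    /// Mirrors the `last_heartbeat_instant` field named in the spec.
    pub last_heartbeat_instant: Instant,
}

/// Hypothetical minimal manager: session id -> session.
#[derive(Default)]
pub struct SessionManager {
    pub sessions: HashMap<String, Session>,
}

impl SessionManager {
    /// Remove persistent sessions whose agent is offline and whose last
    /// heartbeat is older than `ttl`, measured at `now`.
    /// Returns the ids that were reaped, sorted for stable output.
    pub fn reap_stale_persistent(&mut self, ttl: Duration, now: Instant) -> Vec<String> {
        let mut reaped: Vec<String> = self
            .sessions
            .iter()
            .filter(|(_, s)| {
                s.is_persistent
                    && !s.is_online
                    && now.saturating_duration_since(s.last_heartbeat_instant) > ttl
            })
            .map(|(id, _)| id.clone())
            .collect();
        for id in &reaped {
            self.sessions.remove(id);
        }
        reaped.sort();
        reaped
    }
}
```

A periodic task would call this with `Instant::now()` and the configured TTL. Online sessions and sessions still within the TTL are left untouched, so a briefly disconnected agent can still reuse its session.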
@@ -77,8 +104,13 @@ sessions, none of which the UI can clear.
### Explicitly out of scope
- Stabilizing the agent's `agent_id` across reinstalls (overlaps "Per-machine agent
keys", roadmap GuruRMM-Integration) — v1 dedupes by machine instead of requiring it.
- Surviving a full machine **reimage/clone** with the same identity. `MachineGuid`
regenerates on sysprep/reimage and is duplicated by naive disk clones, so a reimaged
box legitimately becomes a new `machine_uid` (and a clone collision is caught by the
auth binding below). Cross-reimage identity continuity is out of scope for v1.
- Replacing the shared `AGENT_API_KEY` with per-machine agent keys — tracked separately
(roadmap GuruRMM-Integration). SPEC-004 *assumes* that binding for its threat model
(see Security) and degrades safely without it, but does not implement it.
- "Mass edit" beyond bulk-end / bulk-remove (e.g. bulk tagging/renaming) — the request
mentioned it; deferred unless a concrete edit field is identified.
- Hard-deleting DB session history — v1 soft-deletes (`deleted_at`) to preserve the
@@ -86,10 +118,17 @@ sessions, none of which the UI can clear.
## Architecture
- **Relay-server (`server/src/session/mod.rs`):** add `reap_stale_persistent(ttl)` to
`SessionManager` and spawn a periodic task (e.g. every 60 s) from server startup;
extend `register_agent` to supersede prior same-machine sessions; add a `purge`-style
removal that the API can call for dead rows.
- **Agent (`agent/src/`):** new `identity` module computes `machine_uid` deterministically
(Windows `MachineGuid` primary; non-Windows fallback to a stable persisted UUID).
Replace/augment `generate_agent_id()` (`config.rs:90`) so the effective id is the
machine-derived value, with the config-file value used only as a cache. Send
`machine_uid` in the connect query string (`transport/websocket.rs:40`) and on
`AgentStatus`.
- **Relay-server (`server/src/session/mod.rs`):** key `register_agent` and the
`agents` map on `machine_uid` so the same machine reuses one session; add
`reap_stale_persistent(ttl)` to `SessionManager` + a periodic task (e.g. every 60 s)
from server startup; supersede any prior same-machine sessions on reconnect; add a
`purge`-style removal the API can call for dead rows.
- **DB (`server/src/db/sessions.rs` + migration `008`/`009`):** add
`deleted_at TIMESTAMPTZ` to `connect_sessions`; add `purge_session` (soft-delete) and
a bulk variant; `get_recent_sessions`/list queries filter `deleted_at IS NULL`.
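
The relay-server change above (keying `register_agent` and the `agents` map on `machine_uid`, and superseding prior same-machine sessions) might look roughly like the sketch below. The types are hypothetical simplifications, the numeric `session_id` and the `next_id` counter are assumptions made for illustration, and the real `register_agent` signature is not shown in the spec. As a simplification, this sketch reuses the existing session whether or not it is currently offline.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down session record for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Session {
    pub session_id: u64,
    pub machine_uid: String,
    /// Legacy random id; may differ on every launch.
    pub agent_id: String,
    pub is_online: bool,
}

#[derive(Default)]
pub struct SessionManager {
    /// session_id -> session (what the Sessions list is served from).
    pub sessions: HashMap<u64, Session>,
    /// machine_uid -> current session_id (the reuse map, re-keyed).
    pub agents: HashMap<String, u64>,
    next_id: u64,
}

impl SessionManager {
    /// Register a connecting agent. The same `machine_uid` resolves to the
    /// same session regardless of the legacy `agent_id` it presents.
    pub fn register_agent(&mut self, machine_uid: &str, agent_id: &str) -> u64 {
        if let Some(&existing) = self.agents.get(machine_uid) {
            if let Some(session) = self.sessions.get_mut(&existing) {
                session.agent_id = agent_id.to_string();
                session.is_online = true;
                // Supersede: drop any other rows that claim this machine.
                self.sessions
                    .retain(|id, s| *id == existing || s.machine_uid != machine_uid);
                return existing;
            }
        }
        // No reusable session: clear orphans for this machine, then create one.
        self.sessions.retain(|_, s| s.machine_uid != machine_uid);
        self.next_id += 1;
        let id = self.next_id;
        self.sessions.insert(
            id,
            Session {
                session_id: id,
                machine_uid: machine_uid.to_string(),
                agent_id: agent_id.to_string(),
                is_online: true,
            },
        );
        self.agents.insert(machine_uid.to_string(), id);
        id
    }
}
```

Registering the same `machine_uid` three times with three different legacy `agent_id` values leaves exactly one entry in `sessions`, which is the convergence property the reconnect-storm test asserts.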
@@ -101,11 +140,17 @@ sessions, none of which the UI can clear.
bulk-action bar, Remove confirmation reusing the `EndSessionDialog`/
`DeleteMachineDialog` patterns; `dashboard/src/api/sessions.ts` gains
`purgeSession` / `bulkSessions`.
- **Protobuf:** none — this is server/dashboard only.
- **Protobuf:** add `machine_uid` to the agent identity carried on `AgentStatus` (the
connect handshake passes it as a query param; mirroring it on `AgentStatus` lets the
server reconcile mid-session). Otherwise server/dashboard only.
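
The agent-side derivation can be illustrated with a small, dependency-free sketch. Several things here are assumptions rather than the spec's design: the function name `derive_machine_uid`, the `muid-` prefix, the normalization rules, and the use of 64-bit FNV-1a as the hash (chosen only because it needs no crates; a real implementation would more likely use a cryptographic hash). Reading `MachineGuid` from the registry is deliberately left out, so the raw value is passed in as a string.

```rust
/// 64-bit FNV-1a. A stand-in hash for this sketch only.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Normalize an identifier so cosmetic differences (case, braces, padding)
/// do not change the derived uid.
fn normalize(raw: &str) -> String {
    raw.trim()
        .trim_matches(|c: char| c == '{' || c == '}')
        .to_ascii_lowercase()
}

/// Derive a deterministic uid from the machine GUID and, optionally, a
/// hardware serial. Returns `None` when no usable GUID is available, so the
/// caller can fall back (for example, to a persisted UUID on non-Windows).
pub fn derive_machine_uid(machine_guid: &str, hardware_serial: Option<&str>) -> Option<String> {
    let guid = normalize(machine_guid);
    if guid.is_empty() {
        return None;
    }
    // Label each component so different source mixes cannot alias.
    let mut material = format!("guid={}", guid);
    if let Some(serial) = hardware_serial.map(normalize).filter(|s| !s.is_empty()) {
        material.push_str("\nserial=");
        material.push_str(&serial);
    }
    Some(format!("muid-{:016x}", fnv1a_64(material.as_bytes())))
}
```

Because the output depends only on the inputs, an agent that has lost its config file recomputes the same value on the next launch. Note that folding in a hardware serial changes the uid, so the source mix has to be fixed before any uids are persisted server-side.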
## Implementation details
- Files to touch: `server/src/session/mod.rs:169,519,548,584` (reuse/reap/remove);
- Files to touch: `agent/src/identity/` (new — `machine_uid` derivation),
`agent/src/config.rs:90,366` (effective id = machine-derived),
`agent/src/transport/websocket.rs:40` (send `machine_uid`);
`server/src/session/mod.rs:169,519,548,584` (key on `machine_uid`; reuse/reap/remove);
`server/src/relay/mod.rs:584,591` (registration path);
`server/src/main.rs:376-388,636,661` (routes + handlers); `server/src/db/sessions.rs`
(purge + bulk + `deleted_at` filtering); `server/migrations/` (new migration for
`deleted_at`); `dashboard/src/features/sessions/SessionsPage.tsx`,
@@ -116,6 +161,17 @@ sessions, none of which the UI can clear.
## Security considerations
- **Identity is not authorization.** A client-asserted `machine_uid` is self-reported
and therefore spoofable — on its own, agent A could claim agent B's `machine_uid` to
bind to (and hijack) B's session and machine record. The `machine_uid` must be
**bound to the agent's authenticated credential**: the server accepts a given
`machine_uid` only from a connection authenticated by that machine's own agent key
(or, for a brand-new machine, first-seen trust-on-first-use that pins the uid↔key
pair). This is why per-machine agent keys (roadmap) are the natural companion; until
they ship, the shared `AGENT_API_KEY` means `machine_uid` is a *correctness* improvement
(dedupe) but not yet a *trust* boundary — call this out so it isn't mistaken for one.
A clone collision (two boxes, same `MachineGuid`) surfaces here as two agents claiming
one uid and is resolved by the key binding, not by the uid alone.
- All purge/bulk endpoints require an authenticated admin (`AuthenticatedUser`, same
guard as `list_sessions`); never expose removal unauthenticated.
- Audit every removal to the `events` table (who, which session/machine, when, count
@@ -127,32 +183,49 @@ sessions, none of which the UI can clear.
## Testing strategy
- **Unit:** `register_agent` with a new agent_id for an existing hostname supersedes the
prior session (no duplicate retained). `reap_stale_persistent` removes offline-past-TTL
persistent sessions and spares online/within-TTL ones. `purge_session` soft-deletes and
filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, varying agent_id, same
hostname) → assert `list_sessions()` converges to one live session, not M. Purge a dead
session via API → gone from list + `deleted_at` set + audit row written. Bulk purge of K
ids removes exactly K.
- **Unit:** `machine_uid` derivation is deterministic — same machine inputs yield the
same uid across runs, and an absent config recomputes the same value (no fresh random
id). `register_agent` for the same `machine_uid` reuses/supersedes the prior session
(no duplicate retained) even when the legacy `agent_id` differs.
`reap_stale_persistent` removes offline-past-TTL persistent sessions and spares
online/within-TTL ones. `purge_session` soft-deletes and filters out of list queries.
- **Integration:** simulate a reconnect storm (M connects, **varying `agent_id` but the
same `machine_uid`**, as the Pavon launcher did) → assert `list_sessions()` converges
to one live session and `connect_machines` holds one row, not M. A spoof attempt
(uid X presented on a connection not authenticated for X) is rejected/not bound. Purge
a dead session via API → gone from list + `deleted_at` set + audit row written. Bulk
purge of K ids removes exactly K.
- **Manual:** on the live console, reproduce against the Pavon machines, confirm the
ghost rows can be multi-selected and removed and do not reappear after the sweep.
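
The spoof-rejection case in the integration tests above presupposes some uid-to-key binding on the server. One possible shape, assuming the trust-on-first-use model from the Security section, is sketched below. `UidBindings`, `BindOutcome`, and `check_and_pin` are hypothetical names, and `key_id` stands for whatever stable identifier the server derives from the authenticated agent key (for example a fingerprint, never the raw secret).

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum BindOutcome {
    /// First time this uid is seen; the presenting key is now pinned.
    Pinned,
    /// The uid is already pinned to this same key.
    Accepted,
    /// The uid is pinned to a different key: spoof or clone collision.
    Rejected,
}

#[derive(Default)]
pub struct UidBindings {
    /// machine_uid -> identifier of the agent key that first claimed it.
    pinned: HashMap<String, String>,
}

impl UidBindings {
    /// Accept `machine_uid` only from the key it was first seen with.
    pub fn check_and_pin(&mut self, machine_uid: &str, key_id: &str) -> BindOutcome {
        match self.pinned.get(machine_uid) {
            Some(pinned_key) if pinned_key.as_str() == key_id => BindOutcome::Accepted,
            Some(_) => BindOutcome::Rejected,
            None => {
                self.pinned
                    .insert(machine_uid.to_string(), key_id.to_string());
                BindOutcome::Pinned
            }
        }
    }
}
```

While every agent still presents the shared `AGENT_API_KEY`, all connections would map to the same `key_id` and this check would accept every claim, which is consistent with the spec's point that `machine_uid` is not a trust boundary until per-machine keys exist.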
## Effort estimate & dependencies
- **Size: Medium.** Reaping + supersede logic is contained to `SessionManager`; the API
and dashboard work follows existing End/Delete patterns. The migration is trivial.
- **Depends on:** nothing blocking.
- **Size: Medium.** Reaping + supersede + purge/bulk + dashboard follow existing
patterns; the migration is trivial. The added agent-side `machine_uid` derivation and
threading it through the handshake/registration is the main new surface (bumps this
toward the upper end of Medium).
- **Depends on:** nothing blocking. **Pairs with** per-machine agent keys (roadmap) for
the full trust boundary on `machine_uid` — see Security; SPEC-004 degrades safely
without them.
- **Unblocks:** a trustworthy Sessions/Machines view (dead rows no longer masquerade as
live), and complements SPEC-002 Phase 2's dashboard hardening of the same surfaces.
live; one machine = one record/session), and complements SPEC-002 Phase 2's dashboard
hardening of the same surfaces.
## Open questions
1. **Reap TTL default** — 10 min proposed; confirm. Should it differ for managed vs.
support sessions?
2. **Dedupe key on reconnect** — by `hostname`, or by the per-machine agent key once
that lands? v1 proposes hostname; revisit when per-machine keys ship.
3. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
2. **`machine_uid` source mix** — `MachineGuid` alone, or folded with board/BIOS serial?
`MachineGuid` is stable and present everywhere but regenerates on sysprep and is
cloneable; adding a hardware serial reduces clone collisions but churns on hardware
swaps. Pick the recipe (proposed: `MachineGuid` primary, hashed).
3. **uid↔key binding model** — trust-on-first-use pinning of `machine_uid` to the agent
key, vs. requiring per-machine keys before honoring a uid. What's the interim policy
while the shared `AGENT_API_KEY` is still in use?
4. **Migration of existing rows** — legacy random-`agent_id` machine/session rows: let
them age out via the reaper + manual purge, or run a one-time reconcile that maps
known hosts to their new `machine_uid`? (Proposed: age-out + purge; no risky backfill.)
5. **Purge vs. keep history** — soft-delete (proposed) keeps `connect_sessions` history
for audit while hiding it from the console; confirm operators don't expect a hard
purge.
4. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?
6. **Bulk-action cap** — what's a sane max N per bulk call (e.g. 100)?