Files
guru-connect/specs/v2-secure-session-core/plan.md
Mike Swanson a453e7984e
Some checks failed
Build and Test / Build Server (Linux) (push) Failing after 3m20s
Build and Test / Build Agent (Windows) (push) Successful in 6m9s
Build and Test / Security Audit (push) Successful in 4m21s
Build and Test / Build Summary (push) Has been skipped
feat(server): viewer-token view-only/control split - closes CRITICAL #1
Authz-strength fix (coord todo c8916c89), code-reviewed APPROVED. Replaces the
weak "view" gate (held by every role) with a permission-tiered access mode
stamped inside the signed viewer token:
- mint: is_admin() || has_permission("control") -> CONTROL token; else
  has_permission("view") -> VIEW_ONLY token; else 403.
- enforce: the relay drops MouseEvent/KeyEvent/SpecialKey for a VIEW_ONLY token
  before forwarding (video still streams); CONTROL tokens forward under the
  Task-3 throttle. Mode is unforgeable (in the signature) and unbypassable
  (all other viewer->agent payloads hit the catch-all and are never forwarded).
A low-privilege viewer-role user can now at most watch, never control. New
ViewerAccess enum (view_only|control) on ViewerClaims; 3 unit tests.

Audit CRITICAL #1 now fully closed (mechanism in Task 3 + this authz strength).
Not cargo-check-verified locally (no toolchain) - the push triggers CI
(clippy -D warnings + build + test) which is the verification gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 19:24:32 -07:00

267 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v2 Secure Session Core — Implementation Plan
> Spec created: 2026-05-29
> Status: in progress — Tasks 1-3 DONE 2026-05-29 (Task 3 code-reviewed APPROVED). Viewer-token authz
> STRENGTH split IMPLEMENTED 2026-05-29 (self-reviewed; no Rust toolchain on this machine — not yet
> `cargo check`-verified; pending Code Review). This was the REQUIRED Phase-1-exit follow-up: the gate
> previously used `view` (held by EVERY default role incl. `viewer`) but a viewer token granted input
> CONTROL. DECIDED (Mike, 2026-05-29) + IMPLEMENTED: SPLIT VIEW_ONLY/CONTROL tokens — `view`-perm users
> get a watch-only token (relay refuses their input), admin/`control` users get a control token. See the
> "Task 3 authz-strength fix" block under Task 3 below. Resolves coord todo c8916c89 (coordinator marks
> done after review). Remaining follow-up: nothing revokes a minted viewer token on logout (bounded by
> 5-min TTL) — follow-up todo. Task 4 (rate limiting + single-use codes) next.
> CARRY-FORWARD: Task 3 MUST add a viewer-token AUTHORIZATION check (admin/permission gate) — Task 2
> fixed only the token *mechanism*; the authz gate is what actually closes audit CRITICAL #1.
> Policy DECIDED (Mike, 2026-05-29): admin-or-view-permission (`is_admin() || has_permission(...)`).
> Parent: `docs/specs/SPEC-002-v2-modernization-architecture.md` (Phase 1)
> Keystone: Tasks 14 are the "get-right-first" secure auth/session core — every audit CRITICAL/HIGH
> is closed there. Tasks 57 deliver the product capability on top. Do them in order.
## Task 0: Commit this spec
Commit the `specs/v2-secure-session-core/` directory before writing any code:
```
git add specs/v2-secure-session-core/
git commit -m "spec: add v2-secure-session-core shape spec"
```
Do not start Task 1 until this commit exists.
---
## Task 1 (KEYSTONE) [DONE 2026-05-29]: v2 schema — per-agent keys + tenancy-ready tables
> [DONE] migration `004_v2_secure_session_core.sql` + `db/agent_keys.rs` + `db/tenancy.rs` + struct/query
> updates across machines/sessions/support_codes/events/users. Code-reviewed APPROVED. Note: GC's db
> layer already uses runtime `sqlx::query()` (no macros) — the v2 "switch to runtime" was already true.
Files touched: `server/migrations/` (new v2 migration files), `server/src/db/` (rebuilt modules:
`agent_keys.rs` [new], `sessions.rs`, `machines.rs`, `support_codes.rs`, `events.rs`, `users.rs`,
`mod.rs`).
- New table `connect_agent_keys`: `id UUID`, `machine_id UUID FK`, `key_hash TEXT`, `tenant_id UUID NULL`,
`created_at`, `last_used_at`, `revoked_at TIMESTAMPTZ NULL`. Keys are `cak_`-prefixed, stored hashed
(SHA-256, mirroring v1's hash helper); plaintext returned once at issuance.
- Add nullable `tenant_id UUID` to `machines`, `sessions`, `support_codes`, `events`, `users`,
`connect_agent_keys`, defaulting to a single bootstrap tenant row. Add a `tenants` table with one seed row.
- `sessions` gains `is_managed BOOLEAN`, `source TEXT` (`'standalone'|'gururmm'`), `consent_state TEXT`
(`'not_required'|'pending'|'granted'|'denied'`), `tenant_id`.
- `support_codes`: add single-use semantics — `consumed_at TIMESTAMPTZ NULL`; widen the code column to
hold a higher-entropy human-readable code (see Task 4).
- Migrations are **idempotent** (`CREATE TABLE IF NOT EXISTS` / `ADD COLUMN IF NOT EXISTS`), applied on
server startup, recorded in `_sqlx_migrations`. New queries use runtime `sqlx::query()`.
- DB modules expose a `tenancy` helper that today resolves every call to the default tenant (Phase-4
switch point). Struct fields match columns (the v1 audit flagged 3 unmapped v003 columns — don't repeat).
Reference: `specs/native-remote-control/plan.md` Task 2 (`connect_agent_keys`); `.claude/standards/gururmm/sqlx-migrations.md`.
---
## Task 2 (KEYSTONE) [DONE 2026-05-29 — code-reviewed APPROVED]: Rebuilt auth model — plane separation + session-scoped viewer tokens
> CARRY-FORWARD TO TASK 3: viewer-token minting is gated only by `AuthenticatedUser` (authentication,
> not authorization). GC has a real `admin|operator|viewer` role + permissions model, so this is intra-
> tenant privilege escalation until a permission check is added. **The mechanism is fixed here; the authz
> check in Task 3 is what closes audit CRITICAL #1.** Metadata bug todo faf39fe0 resolved (migration 005).
> [IMPLEMENTED] `auth/agent_keys.rs` [new] (cak_ mint/SHA-256 hash/verify), `auth/jwt.rs`
> (`ViewerClaims` + `create_viewer_token`/`validate_viewer_token`, 5-min TTL, `purpose:"viewer"`),
> `auth/mod.rs` (module + re-export). Deleted the JWT-as-agent-key branch in `relay/mod.rs`
> `validate_agent_api_key` — now per-agent `cak_` key OR deprecated shared `AGENT_API_KEY` (WARNING-logged),
> never a user JWT. New endpoints: `POST/GET /api/machines/:agent_id/keys`,
> `DELETE /api/machines/:agent_id/keys/:key_id` (admin), `POST /api/sessions/:id/viewer-token` (dashboard JWT).
> db helpers added: `agent_keys::{list_for_machine,key_belongs_to_machine}`. Folded in migration
> `005_machine_metadata.sql` + Machine struct org/site/tags mapping (coord todo faf39fe0). No Rust
> toolchain on this machine — self-reviewed; not yet `cargo check`-verified.
Files touched: `server/src/auth/` (`mod.rs`, `jwt.rs`, `agent_keys.rs` [new], `token_blacklist.rs`,
`password.rs`), `server/src/api/auth.rs`, `server/src/api/sessions.rs`.
- **Delete the JWT-as-agent-key path.** Remove the `jwt_config.validate_token` branch from
`validate_agent_api_key` (`relay/mod.rs:224`); agent authentication validates a `cak_` per-agent key
(hash compare against `connect_agent_keys`, reject if `revoked_at` set) OR a support code — **never a
user JWT**.
- **Session-scoped viewer tokens:** new endpoint `POST /api/sessions/:id/viewer-token` (auth: dashboard
JWT + authorization that the user may view that session/machine) mints a short-lived (~5 min) JWT whose
claims include `session_id` and `tenant_id`. This replaces "any dashboard JWT can view any session."
- Keep Argon2id passwords; keep the blacklist but make `is_revoked()` callable from the WS layer (Task 3).
- Per-agent key issuance endpoint (admin): `POST /api/machines/:id/keys` → returns plaintext `cak_` once,
stores hash. `DELETE /api/machines/:id/keys/:key_id` sets `revoked_at`.
Reference: `relay/mod.rs:224` (`validate_agent_api_key` — the CRITICAL), `auth/mod.rs:116`
(blacklist already consulted for REST — extend to WS), `specs/native-remote-control/plan.md` Tasks 2/3/6;
`.claude/standards/security/credential-handling.md`, `.claude/standards/api/response-format.md`.
---
## Task 3 (KEYSTONE) [IMPLEMENTED 2026-05-29 — self-reviewed; no Rust toolchain on this machine, not yet `cargo check`-verified]: Secure relay WS handlers + bounded relay
> [IMPLEMENTED] Viewer WS now verifies the session-scoped VIEWER token
> (`validate_viewer_token`: sig+exp+`purpose`) + `token_blacklist.is_revoked` +
> `session_id` claim == requested session, before upgrade — raw login JWTs are no
> longer accepted (closes CRITICAL #1 mechanism + #2). Authz gate added to
> `mint_viewer_token`: `is_admin() || has_permission("view")` → 403 envelope on
> failure (closes CRITICAL #1 — uses the EXISTING `view` permission from GC's
> catalog; no new permission defined). Agent WS now binds persistent reattach to
> the authenticated machine identity: `validate_agent_api_key` returns an
> `AgentKeyAuth` enum carrying the `cak_` key's machine `agent_id` (resolved via
> new `db::machines::get_machine_by_id`); a mismatched query-string `agent_id` is
> ignored, a per-agent key whose machine can't be resolved fails closed
> (503). Frame caps set on BOTH upgrades (agent 4 MiB, viewer 64 KiB via
> `max_message_size`/`max_frame_size`) (closes WS-OOM HIGH). Viewer→agent input
> throttled to 200 events/sec/viewer via a refilling token bucket + non-blocking
> bounded `try_send` (drop/coalesce on overflow) (closes input-injection MEDIUM).
> Startup managed-session reconcile retained + clarified (persistent machines →
> offline in-memory sessions). Removed `#[allow(dead_code)]` on
> `validate_viewer_token` and `AuthenticatedUser::has_permission`. No token/secret
> logged; runtime `sqlx::query`/`query_as`. Files: `server/src/relay/mod.rs`,
> `server/src/api/sessions.rs`, `server/src/db/machines.rs`, `server/src/auth/mod.rs`,
> `server/src/auth/jwt.rs`, `server/src/main.rs`.
### Task 3 authz-strength fix — VIEW_ONLY/CONTROL token split [IMPLEMENTED 2026-05-29 — self-reviewed; no Rust toolchain on this machine, not yet `cargo check`-verified; pending Code Review]
> Closes audit CRITICAL #1 at full strength (coord todo c8916c89). The Task-3 gate
> minted a viewer token for any `is_admin() || has_permission("view")` user, but `view`
> is held by EVERY default role (incl. `viewer`) and the token granted input CONTROL —
> intra-tenant privilege escalation. Now the token carries an ACCESS MODE inside its
> signed claims and the relay enforces it:
>
> - `auth/jwt.rs`: new `ViewerAccess` enum (`ViewOnly` | `Control`, serde-renamed to
> `"view_only"`/`"control"`); `ViewerClaims` gains an `access: ViewerAccess` field;
> `create_viewer_token(..., access)` stamps it; `validate_viewer_token` returns it as
> part of the claims (sig+exp+`purpose` checks unchanged). New unit tests cover the
> round-trip, the lowercase wire form, and login-JWT rejection.
> - `auth/mod.rs`: re-export `ViewerAccess`.
> - `api/sessions.rs` (`mint_viewer_token`): TIERED mint — `is_admin() || has_permission("control")`
> → CONTROL token; else `has_permission("view")` → VIEW_ONLY token; else → 403 (standard
> envelope). Permission constants `SESSION_CONTROL_PERMISSION="control"` /
> `SESSION_VIEW_PERMISSION="view"`. Response echoes `access` (advisory; the signed claim
> is authoritative).
> - `relay/mod.rs`: `viewer_ws_handler` reads `claims.access` from the VERIFIED token and
> threads it into `handle_viewer_connection` (new `access: ViewerAccess` param). In the
> input path, a view-only token's `MouseEvent`/`KeyEvent`/`SpecialKey` are refused (a
> guarded match arm `if !access.can_control()` that silently drops + logs once-per-
> power-of-two), BEFORE the throttle/`try_send`. A control token forwards as before (with
> the Task-3 throttle). Video still streams to a view-only viewer; chat (not an injected-
> input vector) is still relayed. The mode cannot be forged — it lives in the signed token.
>
> Everything else from Task 3 (session_id-claim match, blacklist, frame caps, throttle,
> agent identity binding) is intact — this is purely additive access-mode enforcement.
>
> PHASE-2 REFINEMENT: this refuses to FORWARD input for a view-only token; it does NOT yet
> tie the viewer mode to the agent-side `SessionType.VIEW_ONLY` capture mode (the agent still
> does full capture). Deferred (deeper agent change).
Files touched: `server/src/relay/mod.rs`, `server/src/session/mod.rs`.
- **`viewer_ws_handler`** (`relay/mod.rs:242`): verify the viewer token's **signature + expiry +
blacklist + `session_id` claim == requested `session_id`** before `handle_viewer_connection`
(`relay/mod.rs:595`). Reject otherwise. (Fixes the any-JWT-joins-any-session + blacklist-bypass CRITICALs.)
- **Viewer-token AUTHORIZATION (carry-forward from Task 2 review) — this is what actually closes audit
CRITICAL #1.** The minting endpoint `POST /api/sessions/:id/viewer-token` (in `server/src/api/sessions.rs`)
must enforce a real permission predicate, not just `AuthenticatedUser`: `user.is_admin() ||
user.has_permission(<policy>)`. GC's role model (`admin|operator|viewer`) + permissions table already
exist (`server/src/auth/mod.rs`), so honoring the intra-tenant role distinction is cheap. **Policy
DECIDED (Mike, 2026-05-29): admin-or-view-permission** — `user.is_admin() || user.has_permission(<the
session-view/control permission in GC's catalog — use the real name; define one if absent>)`. Enforce
at the minting endpoint `mint_viewer_token` (the authz decision point); the WS then trusts the
session-scoped token. Multi-tenant client-access isolation stays deferred to Phase 4.
- **`agent_ws_handler`** (`relay/mod.rs:55`): authenticate via per-agent key OR support code only
(Task 2). Persistent reattach must bind to the authenticated machine identity, not a query-string
`agent_id` alone (`session/mod.rs:98`).
- **Frame caps:** set explicit `.max_message_size(...)`/`.max_frame_size(...)` on both
`WebSocketUpgrade`s; reject oversized frames before `to_vec()`/broadcast. (Fixes WS-OOM HIGH.)
- **Input throttle:** bound + rate-limit the viewer→agent input queue (`relay/mod.rs:669`); cap events/sec.
- Reconcile managed sessions from DB on startup so they aren't orphaned.
Reference: audit Pass E (`reports/2026-05-29-gc-audit.md` §"Pass 5"); `relay/mod.rs:242,55,595,669`.
---
## Task 4 (KEYSTONE): Working rate limiting + single-use support codes
Files touched: `server/src/middleware/rate_limit.rs` (rebuild — v1 is non-compiling),
`server/src/middleware/mod.rs`, `server/src/api/auth.rs` (login), `server/src/api/` (code validate),
`server/src/db/support_codes.rs`, `server/src/relay/mod.rs` (code bind).
- Rebuild a **compiling** rate-limit layer (fix the `tower_governor` generics, or a small in-memory
fixed-window limiter); re-enable `pub mod rate_limit`. Wire it to `POST /api/auth/login`,
`POST /api/auth/change-password`, and the support-code validate route.
- **Single-use codes:** consume atomically on first agent bind (accept only a `pending`, unconsumed code;
set `consumed_at`); reject a second presenter. (Fixes the reusable-code HIGH.)
- **Widen the code:** higher-entropy human-readable format (e.g. grouped base32, ~40+ bits) replacing
6-digit numeric; add per-IP lockout on repeated validate failures.
Reference: audit Pass B/E (rate limiting disabled/non-compiling; reusable codes); `middleware/mod.rs:3-11`.
---
## Task 5: Attended-mode consent
Files touched: `proto/guruconnect.proto`, `agent/src/session/mod.rs`, `agent/src/` (consent UI dialog),
`server/src/relay/mod.rs`, `server/src/session/mod.rs`.
- Add `ConsentRequest` / `ConsentResponse` to the proto (after `AdminCommand`).
- On an attended session, the agent shows a consent dialog to the end user; the server keeps the session
`consent_state = pending` and surfaces it to the technician only on `granted`. `denied`/timeout → tear down.
- Managed/unattended sessions follow per-tenant policy (default: silent for managed; `consent_state =
not_required`). Audit the consent decision to `events`.
Reference: `specs/native-remote-control/plan.md` Task 5 (Consent primitive); proto `AdminCommand` insertion point.
---
## Task 6: Native viewer — full key fidelity
Files touched: `agent/src/viewer/` (low-level hook + input capture), `agent/src/input/keyboard.rs`
(extend — salvaged), `agent/src/input/mod.rs`, `agent/src/bin/sas_service.rs` (wire — salvaged),
`proto/guruconnect.proto` (confirm `KeyEvent`/`SpecialKeyEvent` coverage).
- **Viewer capture:** install `WH_KEYBOARD_LL` while the viewer is in focused control; divert system
combos (Win key, Win+R, Alt+Tab, Ctrl+Esc) and forward as `KeyEvent` (VK + scan code + extended flag)
instead of letting the local shell act. Add a "send system keys to remote" toggle.
- **Agent injection:** scan-code `SendInput` (`KEYEVENTF_SCANCODE` + correct `KEYEVENTF_EXTENDEDKEY` for
right-Ctrl/Alt, arrows, Win, Insert/Home). Extend the salvaged `input/keyboard.rs`.
- **Ctrl+Alt+Del:** `SpecialKeyEvent.CTRL_ALT_DEL` → the salvaged SAS service (`bin/sas_service.rs`,
`SendSAS`, SYSTEM, `SoftwareSASGeneration` policy set by the managed installer).
- **Modifier hygiene:** track modifier up/down; **re-sync (release) on focus loss** to kill stuck modifiers.
Reference: SPEC-002 §4.1/§4.2; salvage ledger §2; `agent/src/input/keyboard.rs`, `agent/src/bin/sas_service.rs`.
---
## Task 7: Hardware H.264 encode + negotiated raw/Zstd fallback
Files touched: `agent/src/encoder/` (`mod.rs`, `h264.rs` [new], `raw.rs` [salvaged]),
`agent/src/capture/` (feed), `agent/src/viewer/` (decode), `proto/guruconnect.proto`
(`AgentStatus` capability, `SessionResponse` codec), `server/src/session/mod.rs` (negotiation).
- HW **H.264** via Windows Media Foundation (transparently NVENC/AMF/QuickSync) emitting the proto's
`EncodedFrame` (h264). Native viewer decodes via MF/D3D11.
- **Fallback:** salvaged raw BGRA + Zstd + dirty-rects for Win7 / no HW encoder.
- **Negotiation:** agent advertises HW-encode capability in `AgentStatus`; server selects the codec in
`SessionResponse`; default H.264, HEVC opt-in, raw fallback. One encode feeds the native viewer now
(and the Phase-2 WebRTC track later).
Reference: SPEC-002 §5; `agent/src/encoder/raw.rs` (salvaged), `proto/guruconnect.proto` (EncodedFrame already modeled).
---
## Task 8: Verification (end-to-end, observable)
- **Security (re-audit clean):** run `/gc-audit --pass=security` (and `--pass=rust`) → the three relay
CRITICALs and the rate-limiting/frame-cap/reusable-code HIGHs are gone. Manually confirm: a revoked
`cak_` key is rejected on `/ws/agent`; a viewer token for session A is rejected on session B; a
logged-out (blacklisted) viewer token is rejected on `/ws/viewer`; a user JWT is rejected as an agent key.
- **Attended flow:** generate a support code → run the one-time agent → end user sees + accepts consent →
technician's session appears only after acceptance; a denied consent tears down.
- **Key fidelity (the headline):** in a live session, confirm **Win+R opens Run on the remote**, **Ctrl+C
on remote / Ctrl+V locally** (and vice-versa, text), and **Ctrl+Alt+Del reaches the remote secure
desktop**. Confirm no stuck modifiers after alt-tabbing away and back.
- **Codec:** confirm a HW-H.264 machine negotiates h264 (check `SessionResponse`), a Win7/no-HW machine
falls back to raw+Zstd, both render correctly in the native viewer.
- **Rate limiting:** hammer `/api/auth/login` and the code-validate route → confirm throttling/lockout.
- **Migrations:** fresh DB applies the v2 migrations cleanly; `_sqlx_migrations` consistent; `tenant_id`
populated with the default tenant.