docs(audit): add inaugural gc-audit report 2026-05-29

First /gc-audit run (also a dry run validating the skill). 7 passes. 4 CRITICAL (3 relay-plane auth failures: any-JWT session hijack, viewer-WS blacklist bypass, JWT-accepted-as-agent-key; 1 functional: dashboard protobuf.ts wire-incompatible). Plus deploy.yml stub leaving prod 57 commits stale. Proposed roadmap/tech-debt deltas listed (not yet applied, pending review). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 17:46:26 -07:00
parent ccc6ba9c02
commit 486debfc52
1 changed files with 243 additions and 0 deletions
--- a/reports/2026-05-29-gc-audit.md
+++ b/reports/2026-05-29-gc-audit.md
@@ -0,0 +1,243 @@
+# GuruConnect Audit Report — 2026-05-29
+
+**Auditor:** Claude (claude-opus-4-8, via /gc-audit)
+**Passes:** API Surface, Rust Quality, TypeScript/Dashboard, Protocol Integrity, Security & Session, CI/CD Pipeline, Docs Reconciliation
+**Previous audit:** First audit (this is the inaugural /gc-audit run — also a dry run to validate the skill)
+**Note:** Gitea was down during this run (internal :3000 refused, public 502). Pass G's Gitea-runner checks are COVERAGE GAPS to re-verify; deploy-host (172.16.3.30) live checks succeeded (separate segment). Living docs were NOT auto-edited this run — proposed deltas are listed for review (dry-run posture).
+
+---
+
+## Executive Summary
+
+| Pass | Total | Critical | High | Medium | Low |
+|------|-------|---------|------|--------|-----|
+| API Surface | 18 | 0 | 4 | 4 | 0 |
+| Rust Quality | 9 | 1 | 2 | 2 | 2 |
+| TypeScript/Dashboard | 12 | 2 | 1 | 3 | 4 |
+| Protocol Integrity | 7 | 1 | 2 | 1 | 2 |
+| Security & Session | 12 | 3 | 3 | 2 | 2 |
+| CI/CD Pipeline | 11 | 0 | 3 | 1 | 1 |
+| **TOTAL (delta-deduped)** | **~45 unique** | **4** | **9** | **~10** | **~10** |
+
+**Requires immediate action (CRITICAL):**
+1. `validate_agent_api_key` accepts any valid dashboard JWT as an agent API key → viewer→agent impersonation / session-slot seizure (`server/src/relay/mod.rs:224-239`).
+2. Viewer WebSocket: any valid JWT attaches to and controls ANY session by UUID — no per-session/per-machine authorization (`server/src/relay/mod.rs:242-288`).
+3. Viewer WebSocket skips the token blacklist → revoked/logged-out tokens still grant live remote control until natural expiry (`server/src/relay/mod.rs:260`).
+4. Dashboard `lib/protobuf.ts` is a fabricated fixed-offset codec, wire-incompatible with the real protobuf the agent/server speak → React viewer cannot decode frames or send input (`dashboard/src/lib/protobuf.ts:18-150`). *Functional-critical; currently dead code (prod uses static viewer.html) so production impact is deferred until the component library is mounted.*
+
+---
+
+## Pass 1: API Surface
+
+The server exposes a substantially larger REST surface than the static dashboard consumes. **No broken calls and no method mismatches** — every static-page `fetch` resolves to a real route with the correct method. The gaps are UI-orphaned server capability.
+
+### [HIGH] Entire `/api/machines` resource is UI-orphaned; Machines tab is session-backed
+**File:** `server/src/main.rs:316-319`; `server/static/dashboard.html:818-821`
+**Detail:** `GET /api/machines` (list), `GET /api/machines/:agent_id`, `DELETE /api/machines/:agent_id`, `GET /api/machines/:agent_id/history` have no frontend caller. The dashboard "Machines" tab is populated by `loadMachines()` which actually fetches `/api/sessions`. Persistent-machine inventory, history, and deletion have no UI; if the sessions payload lacks `is_online`/`agent_version`, the tab renders empty/Unknown.
+**Recommendation:** Repoint the Machines tab to `GET /api/machines`, or document that the dashboard intentionally shows live sessions; add machine history/delete UI.
+
+### [HIGH] Full `/api/releases` CRUD has no UI
+**File:** `server/src/main.rs:326-333`
+**Detail:** `GET/POST /api/releases`, `GET/PUT/DELETE /api/releases/:version` have zero frontend callers; no release-management page exists.
+**Recommendation:** Build a releases admin page, or confirm releases are CI/API-managed and mark ops-only.
+
+### [HIGH] `/api/users/:id/clients` (set_client_access) orphaned
+**File:** `server/src/main.rs:302`
+**Detail:** users.html has a `manage_clients` permission checkbox but never assigns which clients a user may access. Server supports per-user client scoping with no UI to set it.
+**Recommendation:** Add a client-assignment UI, or confirm client-access is managed elsewhere.
+
+### [HIGH] `useRemoteSession.ts` `/ws/viewer` URL omits the required JWT token
+**File:** `dashboard/src/hooks/useRemoteSession.ts:92`
+**Detail:** The React hook builds `?session_id=...` with no `token` (and no `viewer_name`), but the viewer WS requires a JWT in the `token` query param (static viewer.html includes it). If the component library is used against this server, viewer WS auth fails.
+**Recommendation:** Include the JWT token + viewer_name query params.
+
+### [MEDIUM] `/api/sessions/:id` (GET single), `/api/changelog/:component/:version`, `/api/auth/blacklist/{stats,cleanup}` orphaned (no UI)
+**Files:** `main.rs:313`, `:337-340`, `:284-291`
+**Detail:** Session-detail GET, changelog display, and blacklist ops endpoints have no UI consumer. Most are reasonable ops/API-only endpoints; flag and document.
+
+### [INFO] Genuinely-external routes (not dead)
+`/api/version` (agent polling), `/ws/agent` (Rust agent), `/health`, `/metrics` are correctly out-of-band.
+
+**Pass A summary:** All frontend calls resolve correctly; the issue is large UI-orphaned server CRUD — the entire `/api/machines` resource (the Machines tab is actually `/api/sessions`-backed), all of `/api/releases`, changelog, and `/api/users/:id/clients` client-scoping. Highest-impact: missing machine-management UI and the `useRemoteSession` missing-token bug.
+
+---
+
+## Pass 2: Rust Code Quality
+
+### [CRITICAL] Auth-plane leak — dashboard JWT accepted as agent API key
+**File:** `server/src/relay/mod.rs:224-239`
+**Detail:** `validate_agent_api_key` returns `true` for any token passing `jwt_config.validate_token`. A viewer-plane credential (dashboard JWT) is thus a valid agent-plane credential — a leaked/low-priv JWT can register/impersonate an agent. The code comment admits "allows using dashboard token for testing."
+**Recommendation:** Drop the JWT-accept branch; validate `api_key` only against `AGENT_API_KEY` (and future per-agent keys with `role == "agent"`). *(Also surfaced by Pass E as CRITICAL with session-level impact.)*
+
+### [HIGH] Initial-admin password logged to tracing on fallback path
+**File:** `server/src/main.rs:175`
+**Detail:** Primary path writes creds to a 0600 file; if `fs::write` fails, `info!("  Password: {}", password)` emits the cleartext password to the log stream (journald/aggregators).
+**Recommendation:** On write failure, print to stdout only or abort startup; never route the secret through the structured logger.
+
+### [HIGH] Rate limiting disabled and non-compiling
+**File:** `server/src/middleware/mod.rs:3-11`, `middleware/rate_limit.rs`, `main.rs:268,303`
+**Detail:** `rate_limit` module is commented out ("not yet functional"); `rate_limit.rs` has invalid type signatures; login/change-password/code-validate routes carry only `// TODO: Add rate limiting`. Brute-force protection on auth is absent. *(Cross-ref Pass E, Pass F — SEC-2 not fixed; roadmap `[x]` partially false.)*
+
+### [MEDIUM] Stub admin endpoint returns 501
+**File:** `server/src/api/auth_logout.rs:99-131`
+**Detail:** `POST /api/auth/admin/revoke-user` is registered/reachable but always `NOT_IMPLEMENTED` — admins cannot force-revoke a compromised user's tokens.
+
+### [MEDIUM] Agent run-loop `.unwrap()` on transport
+**File:** `agent/src/session/mod.rs:312,392,465`
+**Detail:** `self.transport.as_mut().unwrap()` — in-loop invariant today, but any reconnect refactor turns it into a hard agent crash. Use `ok_or_else(...)?`.
+
+### [LOW] Support codes logged in cleartext; `/metrics` mutex `.unwrap()` panic
+**Files:** `relay/mod.rs:107,134,165,377`, `main.rs:413,415`
+
+### [INFO] Acceptable panics / correct usage
+Startup `JWT_SECRET` presence + ≥32-char enforcement (`main.rs:94-100`) is the required fail-closed behavior; static-literal `parse().unwrap()` in CORS/headers is infallible; agent `println!`/`eprintln!` are user-facing CLI output (version-info/help), not logging violations. sqlx compile-time macros used throughout — correct for GC, not flagged. Logout blacklist IS consulted per-request for REST (`auth/mod.rs:116`); JWT `exp` enforced; Argon2id confirmed; no raw `e.to_string()` leaks to clients.
+
+**Pass B summary:** Server is broadly disciplined (tracing, anyhow, Argon2id, exp enforcement, per-request blacklist on REST, clean error handling, sqlx macros correctly used). Standout problem is the CRITICAL one-way auth-plane leak; secondary: admin-password-on-fallback logging and fully-disabled rate limiting.
+
+---
+
+## Pass 3: TypeScript / Dashboard Quality
+
+### [CRITICAL] `lib/protobuf.ts` is not protobuf — fabricated fixed-offset layout, wire-incompatible
+**File:** `dashboard/src/lib/protobuf.ts:93-150` (decode), `:18-88` (encode)
+**Detail:** The server relays genuine prost-encoded protobuf `Message` bytes verbatim (`relay/mod.rs:449,454`). A real `Message{video_frame}` starts with tag byte `0x52` (field 10 << 3 | 2). The decoder instead checks `getUint8(0) === 10` and reads phantom fixed offsets, so **every real frame returns null** (canvas never renders). The static `server/static/viewer.html` parses the same wire tree correctly — the two implementations have diverged.
+**Recommendation:** Delete the hand-rolled codec; port viewer.html's varint/length-delimited parser, or use protobufjs/generated code.
+
+### [CRITICAL] Encoders emit the same fabricated layout — input events undecodable by the agent
+**File:** `dashboard/src/lib/protobuf.ts:18-46, 51-88`
+**Detail:** `encodeMouseEvent`/`encodeKeyEvent` write a flat struct with a leading message-id byte. The server decodes viewer input via `proto::Message::decode` (`relay/mod.rs:666`) and rejects these as malformed → clicks/keys silently dropped.
+
+### [HIGH] Decoder bounds-checking incomplete — short/hostile frames read OOB
+**File:** `dashboard/src/lib/protobuf.ts:93-117`
+**Detail:** Only `length < 2` guarded, then reads through offset 21. A 3–20-byte frame throws `RangeError` into an uncaught promise (`useRemoteSession.ts:60`). Validate full header length; wrap decode in try/catch.
+
+### [MEDIUM] `sequence` decoded but out-of-order frames never dropped; WS hook reconnects forever with no backoff/error surface; `fzstd` decompression is a no-op stub
+**Files:** `RemoteViewer.tsx:40-87`, `useRemoteSession.ts:105-125`, `protobuf.ts:156-162`
+
+### [LOW] Single union cast hides out-of-range `encoding`; SessionControls `<select>` lacks label; redundant cleanup
+### [INFO] BGRA→RGBA conversion correct; tsconfig `strict` + unused-symbol checks on; zero `any`/`@ts-ignore`/`console.*` in `dashboard/src/`. The React component library appears **dead/divergent** — prod serves the hand-written static viewer.
+
+**Pass C summary:** Headline defect is the fabricated, wire-incompatible protobuf layer; every real frame returns null and input is silently dropped. The component library must be removed or rebuilt against the real wire format before it is ever mounted. Otherwise TS hygiene is good (strict tsconfig, no any/console).
+
+---
+
+## Pass 4: Protocol & Wire-Format Integrity
+
+**Four-way drift matrix (abridged — full matrix from Agent D):** proto / agent(prost) / server-relay(prost) / dashboard `protobuf.ts` / static `viewer.html`. The proto schema, agent prost codec, server relay codec, and static viewer.html are **mutually consistent and wire-correct**. The `dashboard/protobuf.ts` column is uniformly **WRONG** (fabricated offsets) for video_frame(10)/mouse_event(20)/key_event(21). Field NAMES in `types/protocol.ts` map correctly (camelCase↔snake_case); the drift is entirely the wire codec.
+
+### [CRITICAL] `dashboard/protobuf.ts` wire-incompatible (independently confirmed) — see Pass C.
+
+### [HIGH] No inbound WebSocket frame-size cap before decode/relay
+**File:** `server/src/relay/mod.rs:209,278,447-454`
+**Detail:** Neither `agent_ws_handler` nor `viewer_ws_handler` sets `max_message_size`/`max_frame_size`. A hostile agent can stream max-size frames, each `to_vec()`'d and broadcast to all viewers → memory pressure/OOM.
+**Recommendation:** Set explicit small caps on both upgrades; reject oversized frames before broadcast. *(Cross-ref Pass E.)*
+
+### [HIGH] Two divergent dashboard wire implementations exist; only `viewer.html` is correct.
+
+### [MEDIUM] Version strings out of sync
+**File:** workspace `Cargo.toml:18` (`0.2.2`) vs `agent`/`server` (`0.2.0`) vs `dashboard/package.json` (`0.2.0`)
+**Detail:** agent/server pin their own version instead of `version.workspace = true`; agent embeds `0.2.0` at build. Expected pre-auto-bump (SPEC-001 §3). Use `version.workspace = true` or land the auto-bump.
+
+### [LOW] v003 machine columns written by raw SQL but not mapped into the `Machine` struct; `switch_display`(12)/`quality_settings`(40)/`update_status`(76) defined in proto but undispatched.
+
+### Migration integrity: **clean** — 001/002/003 sequential, no gaps; every table has a column-accurate `db/<name>.rs` module.
+
+**Pass D summary:** The Rust/server protocol representations are consistent and correct; the single severe defect is the fabricated `dashboard/protobuf.ts`. Secondary: no WS frame size cap (OOM), version split, and a few undispatched proto variants. Migrations clean.
+
+---
+
+## Pass 5: Security & Remote-Session Integrity
+
+The REST/dashboard plane is reasonably hardened; the **remote-control relay plane — the product's actual threat surface — holds three independent CRITICAL auth-boundary failures.**
+
+### [CRITICAL] Any valid JWT can attach to and control ANY session — no per-session authorization
+**File:** `server/src/relay/mod.rs:242-288`
+**Detail:** `viewer_ws_handler` verifies token signature/expiry but performs zero authorization binding between user and `session_id`. Any authenticated user (any role, no client grants) can connect to `/ws/viewer?session_id=<UUID>&token=<any JWT>` and receive video + send input. `/api/sessions` lists all live session UUIDs to any authenticated user, so IDs aren't secret → full remote-control hijack of any in-progress session.
+**Recommendation:** Enforce `has_permission("control")`/view + client-access scoping against the session's machine before `join_session`.
+
+### [CRITICAL] Viewer WebSocket does not check the token blacklist — revocation bypass
+**File:** `server/src/relay/mod.rs:260`
+**Detail:** Calls `validate_token` directly, never `token_blacklist.is_revoked()`. A revoked/logged-out token still grants screen view + input injection until natural expiry (up to 24h) — SEC-5 revocation does not cover the relay plane.
+**Recommendation:** Add the `is_revoked` gate (as in `auth/mod.rs:116`) to both viewer and JWT-based agent WS handlers.
+
+### [CRITICAL] Dashboard JWT accepted as agent API key — viewer→agent impersonation
+**File:** `server/src/relay/mod.rs:224-239` — see Pass B. Session-level impact: a JWT-bearer who knows a target `agent_id` can seize that machine's persistent-session slot (`session/mod.rs:98-131`, reattach keyed solely on query-string `agent_id`).
+
+### [HIGH] Support codes effectively reusable / second-agent attach
+**File:** `server/src/relay/mod.rs:102-132`, `support_codes.rs:252-260`
+**Detail:** Agent gate accepts a code whose status is `pending` OR `connected` — never single-use-consumed on first connect, so a second/attacker agent presenting the same 6-digit code is also admitted. Combined with unauthenticated `/api/codes/:code/validate`, a ~900k numeric space, and no rate limiting → brute-forceable.
+**Recommendation:** Consume the code atomically on first bind (accept only `pending`); rate-limit the validate endpoint.
+
+### [HIGH] No WS frame-size cap (OOM/DoS) — see Pass D.
+### [HIGH] Auth/validate rate limiting disabled & non-compiling (SEC-2 not fixed) — see Pass B.
+
+### [MEDIUM] No rate limit / consent gate on viewer→agent input injection
+**File:** `relay/mod.rs:669-679`, `agent/src/session/mod.rs:502-515`
+**Detail:** Input events forwarded with no throttle; agent injects each via `SendInput` with no per-second cap and no operator control-grant. Sustained full-rate input floods the target.
+
+### [MEDIUM] TLS/WSS not enforced in-app (plaintext if proxy bypassed); [LOW] JWT/support-codes in URL query strings get logged by TraceLayer/proxy.
+
+### [INFO] Settled SEC items still hold: SEC-3 (parameterized SQL), SEC-4 (partial — IP/event logging), SEC-9 (Argon2id), SEC-13 (JWT exp). SEC-5 only partially holds (blacklist not enforced on viewer WS — see CRITICAL above).
+
+**Pass E summary:** The relay plane has three CRITICAL auth failures (any-JWT session hijack, viewer-WS blacklist bypass, JWT-as-agent-key), compounded by reusable brute-forceable support codes, absent WS frame caps, absent input rate limiting/consent, and disabled rate limiting. Priority: close the JWT-as-agent-key leak and add per-session authorization + blacklist checks to the relay handlers first.
+
+---
+
+## Pass 6: CI/CD Pipeline Health
+
+### [HIGH] `deploy.yml` production deploy step is a stub → prod is 57+ commits stale
+**File:** `.gitea/workflows/deploy.yml:71-78`; live: running binary mtime 2026-01-18, git HEAD on host `1bfd476` vs submodule HEAD `ccc6ba9`
+**Detail:** Deploy step only `echo`s "Deployment command would run here"; the scp+ssh deploy.sh lines are commented out. The workflow builds/packages/uploads an artifact but never lands it. The production host has never received a CI-driven deploy and predates all v0.2.x tags + the SPEC-001 pipeline.
+**Recommendation:** Wire the real SSH deploy (deploy key in repo secrets) or document prod deploy as intentionally manual.
+
+### [HIGH] Served agent artifact stale + likely unsigned
+**Detail:** `https://connect.azcomputerguru.com/downloads/guruconnect.exe` → 200, last-modified 2026-01-18, predating SPEC-001 §2 Azure Trusted Signing → distributed exe is almost certainly unsigned.
+**Recommendation:** Run release.yml and deploy the signed exe; confirm Authenticode signature on the served binary.
+
+### [HIGH→GAP] Gitea runner registration UNVERIFIABLE (outage)
+**Detail:** Could not reach Gitea to confirm a live `ubuntu-latest` runner and the Pluto `windows-msvc` runner. **Risk:** if no Windows/Pluto runner is online, the agent build job in both build-and-test.yml and release.yml cannot run. Re-check post-outage via `GET /api/v1/repos/azcomputerguru/guru-connect/actions/runners` + `.claude/machines/pluto.md`.
+
+### [MEDIUM] SPEC-001 doc drift — "no Pluto dependency" contradicts shipped workflows
+**Detail:** SPEC-001 states all jobs run on `ubuntu-latest` with no Windows runner; shipped workflows build the agent natively on the Pluto `windows-msvc` runner. Update SPEC-001 to reflect the native-Pluto decision.
+
+### [OK] Clippy `-D warnings` + `cargo audit` are HARD-FAIL gates (the active-lock work) — committed at HEAD `ccc6ba9`, dev lock released. SPEC-001 gate target already met.
+### [OK] Live deploy host: `guruconnect.service` active, `/health` 200, `/metrics` populated, backup timer armed (next 2026-05-30 02:00 UTC), 3/3 migrations applied.
+### [INFO] cargo-audit gate runs on push/PR to main but is NOT chained into release.yml/deploy.yml (tag-triggered). prometheus.yml has `rule_files`/`alerting` commented out → alerts.yml not wired (verify on host).
+
+**Pass G summary:** Workflow code is in good shape and the SPEC-001 clippy/audit hard-fail target is already met. The real gap is the stubbed `deploy.yml` deploy step — corroborated by live state: the running binary and served agent exe are stuck at 2026-01-18 / 57 commits behind, so production runs pre-CI-era code and distributes a likely-unsigned agent. Re-check post-outage: runner registration (esp. Pluto), release-workflow secrets, Actions run history.
+
+---
+
+## Proposed FEATURE_ROADMAP.md Delta (Agent F — NOT yet applied; dry-run)
+
+**Flip `[ ]` → `[x]` (STALE-INCOMPLETE — shipped end-to-end):**
+- Code signing — Azure Trusted Signing in CI — proof: `release.yml` `build-sign-publish` (jsign TRUSTEDSIGNING, L417-433)
+- Automatic versioning — `release.yml` `version` job (conventional-commit bump, L73-187)
+- Changelog generation & API — `release.yml` git-cliff + `server/src/api/changelog.rs` + route `main.rs:337-340`
+- Feature-request workflow — `.claude/commands/gc-feature-request.md`
+- Roadmap / ADR / spec tracking — `docs/FEATURE_ROADMAP.md` + `docs/ARCHITECTURE_DECISIONS.md` + `docs/specs/SPEC-001`
+- Coord-API registration — `guruconnect` project_key in root CLAUDE.md project-keys table
+
+**Split / flag (STALE-COMPLETE):**
+- Server/API `[x]` "JWT, Argon2id, **rate limiting**, security headers" → keep `[x]` for JWT/Argon2id/headers; add new `[ ] Rate limiting — [HIGH] disabled & non-compiling` (`middleware/mod.rs:3-11`)
+
+**Annotate (PARTIAL, keep box):**
+- `[~]` React/TS web viewer → append "wire-incompatible, non-functional vs real protocol (`dashboard/src/lib/protobuf.ts`); decompressZstd is a stub"
+- `[ ]` CI security audit gate wired to release → append "audit gate exists + hard-fails in build-and-test.yml; NOT chained into release.yml/deploy.yml"
+
+**Verified accurate (no change):** all Core RC `[x]` items, GuruRMM-Integration `[ ]` items, Sessions/machines/support-codes/events `[x]`, Phase-1 hardening `[x]`, programmatic pre-create `[ ]`.
+
+## Proposed TECHNICAL_DEBT.md Delta (Agent F — NOT yet applied)
+- Move to "Completed Items" (2026-05-29): CI security audit (cargo audit) hard gate; SPEC-001 release engineering (signing/versioning/changelog) pipeline.
+- Confirmed still OPEN: #4/#19 rate-limit failed auth; #5 audit logging (`audit_logs`); #6 UI session timeout; #9 deploy SSH automation; #3 sd_notify watchdog.
+- Stale header: doc says "Phase 1, 89%, Last Updated 2026-01-18" — predates the 2026-05-29 release-engineering work.
+
+---
+
+## Recommended Action Order
+1. **Relay-plane auth (all CRITICAL):** close the JWT-as-agent-key leak (`relay/mod.rs:224-239`); add per-session authorization + token-blacklist checks to `viewer_ws_handler`.
+2. **HIGH security:** single-use support codes; WS inbound frame-size caps; implement + wire real rate limiting (login, change-password, code-validate); stop logging the fallback admin password.
+3. **HIGH pipeline:** wire the real `deploy.yml` SSH deploy; run release.yml to ship a signed, current agent; re-verify Gitea runner registration post-outage.
+4. **Dashboard:** remove or rebuild `dashboard/src/lib/protobuf.ts` against the real wire format before the React viewer is mounted; fix `useRemoteSession` missing JWT token.
+5. **MEDIUM batch:** input-injection throttle + consent gate; machines/releases UI; version `workspace = true`; SPEC-001 Pluto doc fix; chain cargo-audit into release/deploy.
+6. **Docs:** apply the roadmap + tech-debt deltas above (pending approval).