From 2fe0b903151bf9ba3cb016321a357ed9212f2371 Mon Sep 17 00:00:00 2001
From: Mike Swanson
Date: Tue, 2 Jun 2026 14:57:33 -0700
Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-02 14:57:28

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-02 14:57:28
---
 .../2026-06-02-mike-guruconnect-enrollment.md | 92 +++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 session-logs/2026-06-02-mike-guruconnect-enrollment.md

diff --git a/session-logs/2026-06-02-mike-guruconnect-enrollment.md b/session-logs/2026-06-02-mike-guruconnect-enrollment.md
new file mode 100644
index 0000000..4cb3a2a
--- /dev/null
+++ b/session-logs/2026-06-02-mike-guruconnect-enrollment.md
@@ -0,0 +1,92 @@
+# Session Log — 2026-06-02 — GuruConnect Zero-Touch Enrollment
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+## Session Summary
+
+The session began as a request to deploy the GuruConnect (GC) agent onto Howard's RMM test machines "to see how things are working," but investigation showed the managed-agent enrollment path wasn't ready, so the work pivoted into designing and building that path end-to-end. Two live test machines were identified under the `AZ Computer Guru` client (`RMM-TEST-MACHINE` / site `Howard-VM`, and `WIN-TG2STMODJG8` / site `Discovery test site`); a duplicate v0.6.38 `RMM-TEST-MACHINE` ghost was noted. The deploy was deliberately parked once it became clear that (a) the GC relay had moved to per-agent `cak_` keys with only a deprecated shared `AGENT_API_KEY` fallback, and (b) the existing embedded-config download path appends a config blob to the end of the signed `.exe`, which voids Authenticode — colliding with the CI code-signing shipped earlier in the day.
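
To make the signing collision concrete, the sketch below contrasts appending a config trailer to a signed image with delivering the same config alongside it. It is a toy model under stated assumptions: the `GCFG` magic, the 4-byte little-endian length field, and all function names are hypothetical, and an FNV-1a hash stands in for the real Authenticode digest, which is cryptographic and skips a few PE fields.

```rust
/// Toy stand-in for a file digest (64-bit FNV-1a). Real Authenticode uses a
/// cryptographic hash over most of the PE image; this only shows the principle.
fn digest(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical trailer magic; the real value is not given in this log.
const MAGIC: &[u8; 4] = b"GCFG";

/// Append-config delivery: returns the image followed by magic, length, JSON.
/// The result is no longer byte-identical to what was signed.
fn append_config(signed_exe: &[u8], json: &str) -> Vec<u8> {
    let mut out = signed_exe.to_vec();
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(json.len() as u32).to_le_bytes());
    out.extend_from_slice(json.as_bytes());
    out
}

/// Sidecar delivery: the signed bytes are returned untouched and the config
/// travels next to them (for example via a wrapper, an MSI, or a CLI flag).
fn deliver_alongside(signed_exe: &[u8], json: &str) -> (Vec<u8>, String) {
    (signed_exe.to_vec(), json.to_string())
}
```

Comparing `digest(&exe)` with the digest of `append_config(&exe, cfg)` shows the mismatch a verifier would see, while the bytes returned by `deliver_alongside` hash identically to the original.
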
+
+The first concrete deliverable was a **signed beta/test release channel**: a `channel: stable | beta` `workflow_dispatch` input added to GuruConnect's `release.yml`, where `beta` produces a signed, prerelease-tagged build (skipping the semver bump/changelog) through the identical fail-closed Azure Trusted Signing path. This closed the "every binary handed to a tester must be signed" gap. A long interview with Mike then shaped the enrollment architecture, producing **SPEC-016 (zero-touch per-site agent enrollment)**: a per-site rotatable enrollment key (long secret + short hex `vN (XXXX)` fingerprint), self-registration that mints a per-machine `cak_` bound to a hardware-salted `machine_uid`, auto-approve + new-enrollment alert, collision-gated activation, and a sign-once base binary + per-site signed wrapper (resolving SPEC-007's signing question). Three enrollment paths (per-site embedded key / flag overrides / technician-assisted) were defined, with only the per-site path in v1 and the others reserved.
+
+Implementation proceeded in reviewed, CI-gated phases. **SPEC-016 Phase A** (server backend: `POST /api/enroll`, per-site key issue/rotate/verify, `machine_uid` dedup + collision gate, `cak_` minting, rate-limiting, migration `010`) was built, reviewed (one HIGH cross-site-move hijack + one MEDIUM timing oracle fixed), and merged via PR #5. **SPEC-016 Phase B** (agent: hardware-salted `machine_uid`, first-run enroll client, DPAPI+ACL `cak_` store, run-mode wiring) was built, reviewed (six findings incl. a re-image-stability bug and a credential-store self-lockout, all fixed), and landed on `main`. During Phase B, Mike authored **SPEC-017 (end-user/sub-user remote access)** and an auto-sync fast-forwarded the whole Phase B line plus SPEC-017 onto `main` outside the PR gate; the planned PR-split became moot but the commits remained cleanly separated.
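
The Phase A enrollment decisions summarized above (per-tenant `machine_uid` dedup, default-refused cross-site enroll, collision gate) can be sketched as a small decision function. This is a simplified in-memory model, not the server code: the `Registry` type, the `EnrollOutcome` names, and the `clone_suspected` input are hypothetical, the site key is assumed to be verified already, and how a clone is told apart from a re-image is left outside the sketch.

```rust
use std::collections::HashMap;

/// Outcome of one enrollment attempt (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EnrollOutcome {
    /// Auto-approved; a per-machine credential would be minted here.
    Active,
    /// Collision gate tripped; waits for dashboard confirmation.
    Pending,
    /// The machine_uid is already bound to another site (HTTP 409).
    SiteConflict,
}

/// In-memory stand-in for the machines table, unique per tenant.
#[derive(Default)]
struct Registry {
    // (tenant_id, machine_uid) -> site_id
    machines: HashMap<(u32, String), u32>,
}

impl Registry {
    /// Assumes the per-site enrollment key was verified before this call.
    /// `clone_suspected` is a hypothetical flag supplied by the caller.
    fn enroll(
        &mut self,
        tenant_id: u32,
        site_id: u32,
        machine_uid: &str,
        clone_suspected: bool,
    ) -> EnrollOutcome {
        let key = (tenant_id, machine_uid.to_string());
        let bound_site = self.machines.get(&key).copied();
        match bound_site {
            // Refuse before any repoint or credential mint.
            Some(site) if site != site_id => EnrollOutcome::SiteConflict,
            // Known uid on the same site, flagged as a possible clone.
            Some(_) if clone_suspected => EnrollOutcome::Pending,
            // Known uid on the same site: treat as a re-image and reuse the record.
            Some(_) => EnrollOutcome::Active,
            // First sighting: register and auto-approve.
            None => {
                self.machines.insert(key, site_id);
                EnrollOutcome::Active
            }
        }
    }
}
```

Checking the site conflict first mirrors the fix ordering in the log (refuse before any repoint or mint); placing the collision gate after it is an assumption of this sketch.
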
+
+The credential-store review exposed that the managed agent runs in the **interactive user context** (`HKCU Run`), not SYSTEM, so the SYSTEM-ACL'd `cak_` store would self-lock the agent on second launch. Mike chose **Option A** (run the managed agent as a SYSTEM service), which is also required for the upcoming session-switching feature. This became **SPEC-018 (managed-agent SYSTEM service host + session broker)** — split into its own spec (per Mike) rather than folded into SPEC-013. **SPEC-018 Phase 1** (LocalSystem service install/lifecycle, SCM dispatch, run-as-SYSTEM, HKCU→service autostart swap; session broker/per-session capture worker deferred to Phase 2) was built, reviewed (APPROVE WITH NITS — connected-stop interruption + FFI panic guard fixed), re-reviewed (confirmed closed), and merged via PR #7.
+
+All code work flowed through the coordinator pattern: Coding Agent (opus) → Code Review Agent (opus) → fix pass → focused re-review → CI-green → merge → submodule pointer bump. Two transient Windows-runner CI failures occurred (jobs that never started); both were diagnosed against local builds and cleared on re-run. The session ended at a clean checkpoint: Phase A + Phase B + SPEC-018 Phase 1 merged, three specs in place, with SPEC-018 Phase 2 (capture worker) as the next critical-path item.
+
+## Key Decisions
+
+- **Beta builds get signed via a `channel` input on `release.yml`, not by signing every CI build.** Keeps signing secrets out of PR-triggered runs; `beta` rides the existing fail-closed sign path, skips bump/changelog, publishes a prerelease tag.
+- **Two-tier credential model:** a low-sensitivity, rotatable, long per-site enrollment key (embedded, never typed) gates registration; a high-sensitivity per-machine `cak_` is self-provisioned on first run. Rotating a site key blocks new enrollments from old installers but leaves enrolled agents (holding `cak_`) untouched.
+- **Enrollment key fingerprint = monotonic version + short HEX** (`v3 (7F2A)`), deliberately NOT the RMM word-code style (`GREEN-FALCON`), so GuruConnect and GuruRMM artifacts are never visually conflated.
+- **`machine_uid` salted from stable HARDWARE signals** (SMBIOS UUID primary; motherboard + disk serial fallback), NOT a per-install random salt — so it survives an OS re-image on the same hardware (re-image dedup) while separating distinct boxes. Uniqueness scoped per-tenant `(tenant_id, machine_uid)`.
+- **Collision-gated activation:** a detected `machine_uid` collision (e.g. template-cloned VMs sharing a hardware UUID) drops the endpoint to `pending` + alert, requiring dashboard confirmation — the one deliberate exception to auto-approve.
+- **Sign once + per-site signed wrapper, both `.exe` and MSI** (ScreenConnect offers both). The signed agent stays byte-identical; per-site config is delivered around the signed bytes (wrapper/MSI/CLI), never appended into the PE.
+- **Cross-site enroll is default-refused (`409 ENROLL_SITE_CONFLICT`)** in Phase A — the deliberate-move (`--reassign`) path is deferred — closing a within-tenant machine-hijack vector.
+- **Managed agent → SYSTEM service (Option A)**, chosen over patching the store ACL for user-context, because true unattended access and the upcoming session-switching feature both require SYSTEM. Spec'd as its own SPEC-018, separate from SPEC-013.
+- **Accepted the auto-sync fast-forward of `main`** rather than rewriting published history against a live auto-sync; SPEC-018 landed as its own commits, and the "separate specs" intent held via clean commit separation.
+
+## Problems Encountered
+
+- **"Entire fleet offline" in RMM** — a rendering artifact: the API returns `is_connected: null`, which the display treated as offline. Real liveness is `last_seen`; both target machines had checked in seconds earlier.
+- **Append-config voids Authenticode** — the existing `download_agent` path appends `[MAGIC][len][json]` after the signed PE, invalidating the signature. Resolved architecturally in SPEC-016: sign-once base + per-site wrapper, retire the managed append path. (The attended support-code path is filename-based and already signing-safe — left unchanged.)
+- **Phase A HIGH — cross-site machine-move hijack:** a valid site-B key + a guessed `machine_uid` could silently repoint a site-A machine and mint a `cak_`. Fixed: default-refuse different-site enroll (409) before any repoint/mint.
+- **Phase A MEDIUM — timing/enumeration oracle:** unknown `site_code` returned fast (no Argon2) vs. wrong key (Argon2), distinguishing valid sites. Fixed: dummy Argon2id verify on all reject paths.
+- **Phase B HIGH — `machine_uid` not re-image-stable:** `MachineGuid` (regenerated per OS install) was mixed into the salted digest. Fixed: derive from hardware salt only; `MachineGuid` is last-resort fallback.
+- **Phase B CRITICAL — credential-store self-lockout:** the SYSTEM+Admins ACL'd `cak_` store is unreadable by the user-context agent → bricks on 2nd launch. Root cause: managed agent runs via `HKCU Run`, not SYSTEM. Resolved by deciding Option A (SYSTEM service = SPEC-018) and keeping a fail-fast guard in the interim.
+- **SPEC-017 number collision:** Mike authored SPEC-017 (end-user access) mid-flight; the planned service-host spec became SPEC-018, and the dangling `SPEC-017` references in the Phase B fail-fast guard were corrected to `SPEC-018`.
+- **Auto-sync bypassed the PR/CI gate** — `main` was fast-forwarded up the Phase B + SPEC-017 line outside the PR merge flow. No harm (code was reviewed + green), but flagged: enable branch protection on `main`.
+- **Two transient Windows-runner CI failures** (jobs reported `failure` with empty "job not started" logs). Diagnosed as runner infra (identical code green on prior run + local build clean); cleared on re-run. Gitea's combined commit status also keeps stale failed jobs in its aggregation.
+- **SPEC-018 Phase 1 HIGH — non-cancellable connected session:** the SCM stop flag wasn't checked inside the inner session loop, so a connected service couldn't stop gracefully. Fixed with an optional shutdown param threaded into `run_with_tray` (None for non-service callers = unchanged) + a clean WS-close sentinel path. MEDIUM: agent-runtime panic unwound across the `extern "system"` service entry → wrapped in `catch_unwind` → `ServiceSpecific(1)`.
+
+## Configuration Changes
+
+GuruConnect repo (`azcomputerguru/guru-connect`, submodule `projects/msp-tools/guru-connect`):
+- **Created:** `docs/specs/SPEC-016-zero-touch-enrollment.md`, `docs/specs/SPEC-018-managed-agent-service-host.md` (SPEC-017 created by Mike). `server/migrations/010_spec016_enrollment.sql`. `server/src/auth/enrollment_keys.rs`, `server/src/api/enroll.rs`, `server/src/api/sites.rs`, `server/src/db/sites.rs`, `server/src/db/enrollment_keys.rs`. `agent/src/enroll.rs`, `agent/src/credential_store.rs`, `agent/src/service/mod.rs`.
+- **Modified:** `.gitea/workflows/release.yml` (beta channel). `docs/FEATURE_ROADMAP.md` (signing shipped, beta channel, SPEC-016/018 entries, SPEC-007 signing-question resolved). `server/src/{main.rs, auth/mod.rs, db/mod.rs, db/machines.rs, db/events.rs, middleware/rate_limit.rs, api/mod.rs}`. `agent/src/{main.rs, config.rs, install.rs, identity.rs, session/mod.rs}`. `agent/Cargo.toml` (Win32_Security_Cryptography feature; `windows-service` already present).
+- ClaudeTools parent: submodule pointer bumped repeatedly to track `guru-connect` main.
+
+## Credentials & Secrets
+
+No new credentials created. Existing vaulted credentials used (cite paths, not values):
+- Gitea API token: `services/gitea.sops.yaml` → `credentials.api.api-token` (used for PR create/merge/close + CI status polling against internal Gitea).
+- GuruRMM admin API: `infrastructure/gururmm-server.sops.yaml` → `credentials.gururmm-api.admin-email` / `admin-password`.
+- GC agent enrollment secrets (`cak_`, per-site `cek_` keys) are minted/stored by the new code — stored hashed (Argon2id) server-side; `cak_` stored DPAPI-machine-encrypted in a SYSTEM-ACL'd `%ProgramData%\GuruConnect\credentials\agent.cak` on the endpoint. None exist yet (no live enrollment performed).
+
+## Infrastructure & Servers
+
+- **GuruConnect server:** `connect.azcomputerguru.com` (NPM → port 3002 on 172.16.3.30). Relay agent plane `wss://connect.azcomputerguru.com/ws/agent`; new enrollment endpoint `https://connect.azcomputerguru.com/api/enroll` (public). PostgreSQL backend (runtime `sqlx`, no offline cache).
+- **GuruRMM API:** `http://172.16.3.30:3001` (JWT auth).
+- **Coordination API:** `http://172.16.3.30:8001/api/coord`.
+- **Gitea (internal):** `http://172.16.3.20:3000` (preferred on-network); public `git.azcomputerguru.com`.
+- **CI:** Gitea Actions `build-and-test.yml` (fmt + clippy -D warnings + Linux build + Postgres-gated tests + Windows agent build + cargo-audit). Windows agent built natively on the Pluto `windows-msvc` runner.
+- **GC test machines (GuruRMM):** `RMM-TEST-MACHINE` id `99d6d692-99e0-4359-9f9c-f43be89f49e5` (site Howard-VM, v0.6.52, live); `WIN-TG2STMODJG8` id `eee9f26d-0dbc-4b8e-8e42-3a901b4ff73a` (site Discovery test site, v0.6.52, live); stale `RMM-TEST-MACHINE` id `7d3456f5-...` (v0.6.38, ghost).
+
+## Commands & Outputs
+
+- Workflow re-trigger (Gitea has no REST rerun; rerun endpoints return 404): `POST /api/v1/repos/azcomputerguru/guru-connect/actions/workflows/build-and-test.yml/dispatches -d '{"ref":"main"}'` → HTTP 204.
+- CI status: `GET /api/v1/repos/.../commits//status` (combined; per-job `.state` is null — use `/actions/tasks` `.workflow_runs[]` filtered by `head_branch`).
+- PR merge: `POST /api/v1/repos/.../pulls//merge -d '{"Do":"merge","delete_branch_after_merge":true}'`.
+- Local agent build (definitive transient-vs-real check): `cargo check -p guruconnect` (windows-msvc) — clean in 2-7s, confirming CI Windows failures were runner infra.
+
+## Pending / Incomplete Tasks
+
+- **SPEC-018 Phase 2 (critical path):** session broker + per-session capture worker (`CreateProcessAsUserW` into the active desktop), service↔worker IPC (ACL'd named pipe), `SERVICE_CONTROL_SESSIONCHANGE` handling. Required for actual desktop capture when running as SYSTEM. X-Large; needs a Windows VM to integration-test.
+- **VM integration tests (deferred, not runnable on dev host):** service install/auto-start/stop/uninstall as LocalSystem; `cak_`-as-SYSTEM enroll→store→connect round-trip (the SPEC-016 Phase B end-to-end the guard currently blocks); HKCU→service swap with no autostart gap.
+- **Parked:** GC agent deploy on Howard's test machines (`RMM-TEST-MACHINE`, `WIN-TG2STMODJG8`) — waits on SPEC-018 Phase 2 for full function.
+- **Enable branch protection on `guru-connect` `main`** so merges go through the CI gate (auto-sync bypassed it once).
+- **Memory dedup** (coord todo): retire the 49 orphaned original memory files superseded by the morning consolidation; reconcile MEMORY.md. Flagged by both Howard (ACG-TECH03L) and GURU-KALI self-check.
+- **Self-check baseline manifest:** `/autotask` is a per-operator/local command, not a per-machine requirement — move it out of the required baseline (GURU-KALI false-positive).
+- **Stale lock to review:** `gururmm / agent/embedded.rs+server/install.rs (per-site EXE re-sign)` held by `GURU-5070/claude-main`, TTL to 23:27 — appears to be a prior-session lock on the GuruRMM equivalent of the same signing problem; verify and release if orphaned.
+
+## Reference Information
+
+- **GuruConnect main HEAD:** `11af9dff8e5ad1f8a121278be581455cc7a6bedd`. **ClaudeTools parent HEAD:** `bf7079383f5e358cbb5fa5c8bd6fc4553b29a72c`.
+- **PRs:** #5 SPEC-016 Phase A (merged, `89c3718`); #6 SPEC-016 Phase B (closed already-merged via auto-sync; head `4c49b73`); #7 SPEC-018 Phase 1 (merged, `11af9df`).
+- **Key commits (guru-connect):** beta channel `87f2295`; SPEC-016 spec `18429f6`, decisions `c286a29`; Phase B `d0b8db0`/`6a000d0`/`87c6e17`/`52477e4`/`367906b`; SPEC-017 (Mike) `4c49b73`; SPEC-018 spec `94c07c2`; ref fix `55b9c97`; SPEC-018 Phase 1 `7602b43`/`a0e0d5f`.
+- **Specs:** `docs/specs/SPEC-016-zero-touch-enrollment.md`, `SPEC-017-end-user-remote-access.md` (Mike), `SPEC-018-managed-agent-service-host.md`. Roadmap: `docs/FEATURE_ROADMAP.md`.
+- **Coord:** enrollment-spec todo `dbfe6a56` (done); memory-dedup todo (pending); component `guruconnect/server` = built (Phase A), `guruconnect/agent` = built (Phase B + SPEC-018 P1, end-to-end blocked on Phase 2).
+- **Service:** Windows service internal name `GuruConnectAgent`, launch arg `service-run`, LocalSystem auto-start, `sc failure` recovery restart/5000.
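
The service-entry panic guard noted under Problems Encountered (a runtime panic caught by `catch_unwind` and reported as a service-specific error) can be sketched with the standard library alone. This is a minimal model under stated assumptions: `run_guarded`, the numeric exit codes, and the closure-based runtime are hypothetical, and the real entry point is an `extern "system"` function called by the SCM rather than a plain Rust caller.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical code standing in for the `ServiceSpecific(1)` result.
const EXIT_PANIC: u32 = 1;
/// Hypothetical code for an ordinary runtime error (not from the log).
const EXIT_RUNTIME_ERROR: u32 = 2;

/// Runs the agent runtime and converts a panic into an exit code, so that
/// no unwind crosses the FFI boundary of the service entry point.
fn run_guarded<F>(runtime: F) -> u32
where
    F: FnOnce() -> Result<(), String>,
{
    // AssertUnwindSafe: nothing from the closure is reused after a panic,
    // so observing broken invariants is not a concern in this sketch.
    match panic::catch_unwind(AssertUnwindSafe(runtime)) {
        Ok(Ok(())) => 0,
        Ok(Err(_)) => EXIT_RUNTIME_ERROR,
        Err(_) => EXIT_PANIC,
    }
}
```

Calling `run_guarded(|| panic!("boom"))` returns the panic code instead of unwinding into the caller; the default panic hook still prints the panic message to stderr. `catch_unwind` only helps when the binary is built with unwinding panics, not `panic = "abort"`.
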