claudetools/session-logs/2026-06-02-mike-guruconnect-enrollment.md
Mike Swanson 2fe0b90315 sync: auto-sync from GURU-5070 at 2026-06-02 14:57:28
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-02 14:57:28
2026-06-02 14:57:33 -07:00

Session Log — 2026-06-02 — GuruConnect Zero-Touch Enrollment

User

  • User: Mike Swanson (mike)
  • Machine: GURU-5070
  • Role: admin

Session Summary

The session began as a request to deploy the GuruConnect (GC) agent onto Howard's RMM test machines "to see how things are working," but investigation showed the managed-agent enrollment path wasn't ready, so the work pivoted into designing and building that path end-to-end. Two live test machines were identified under the AZ Computer Guru client (RMM-TEST-MACHINE / site Howard-VM, and WIN-TG2STMODJG8 / site Discovery test site); a duplicate v0.6.38 RMM-TEST-MACHINE ghost was noted. The deploy was deliberately parked once it became clear that (a) the GC relay had moved to per-agent cak_ keys with only a deprecated shared AGENT_API_KEY fallback, and (b) the existing embedded-config download path appends a config blob to the end of the signed .exe, which voids Authenticode — colliding with the CI code-signing shipped earlier in the day.

The first concrete deliverable was a signed beta/test release channel: a channel: stable | beta workflow_dispatch input added to GuruConnect's release.yml, where beta produces a signed, prerelease-tagged build (skipping the semver bump/changelog) through the identical fail-closed Azure Trusted Signing path. This closed the "every binary handed to a tester must be signed" gap. A long interview with Mike then shaped the enrollment architecture, producing SPEC-016 (zero-touch per-site agent enrollment): a per-site rotatable enrollment key (long secret + short hex vN (XXXX) fingerprint), self-registration that mints a per-machine cak_ bound to a hardware-salted machine_uid, auto-approve + new-enrollment alert, collision-gated activation, and a sign-once base binary + per-site signed wrapper (resolving SPEC-007's signing question). Three enrollment paths (per-site embedded key / flag overrides / technician-assisted) were defined, with only the per-site path in v1 and the others reserved.
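The `vN (XXXX)` fingerprint format can be illustrated with a minimal sketch. The log does not say how the short hex part is derived, so the use of SHA-256, the 4-digit truncation, the key length, and the function names `new_enrollment_key` and `fingerprint` are all assumptions made for illustration; only the `cek_` prefix and the `v3 (7F2A)` display shape come from this log.

```python
import hashlib
import secrets


def new_enrollment_key() -> str:
    """Generate a long random per-site secret.

    The cek_ prefix follows the log; the 32-byte length is an assumption.
    """
    return "cek_" + secrets.token_urlsafe(32)


def fingerprint(key: str, version: int) -> str:
    """Return a display fingerprint shaped like 'v3 (7F2A)'.

    Assumption: the short hex part is the first 4 hex digits of SHA-256(key).
    The version is a monotonic counter bumped on every rotation.
    """
    short_hex = hashlib.sha256(key.encode("utf-8")).hexdigest()[:4].upper()
    return f"v{version} ({short_hex})"


if __name__ == "__main__":
    key = new_enrollment_key()
    print(fingerprint(key, 1))
```

The fingerprint is safe to show in a dashboard or installer name because four hex digits reveal almost nothing about a long random secret, while still letting a technician tell two key versions apart at a glance.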

Implementation proceeded in reviewed, CI-gated phases. SPEC-016 Phase A (server backend: POST /api/enroll, per-site key issue/rotate/verify, machine_uid dedup + collision gate, cak_ minting, rate-limiting, migration 010) was built, reviewed (one HIGH cross-site-move hijack + one MEDIUM timing oracle fixed), and merged via PR #5. SPEC-016 Phase B (agent: hardware-salted machine_uid, first-run enroll client, DPAPI+ACL cak_ store, run-mode wiring) was built, reviewed (six findings incl. a re-image-stability bug and a credential-store self-lockout, all fixed), and landed on main. During Phase B, Mike authored SPEC-017 (end-user/sub-user remote access) and an auto-sync fast-forwarded the whole Phase B line plus SPEC-017 onto main outside the PR gate; the planned PR-split became moot but the commits remained cleanly separated.
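The ordering of the Phase A checks (site-conflict refusal first, then the collision gate, then auto-approve) might look roughly like the following. This is a sketch under stated assumptions, not the code in `server/src/api/enroll.rs`: the names `Outcome`, `MachineRecord`, and `decide_enrollment` are hypothetical, the request is assumed to have already passed site-key verification, and how a collision is detected is left as an input because the log does not describe it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    REFUSED_SITE_CONFLICT = "ENROLL_SITE_CONFLICT"  # returned as HTTP 409 per the log
    PENDING = "pending"  # collision gate: needs dashboard confirmation, raises an alert
    ACTIVE = "active"    # auto-approved; a per-machine cak_ may be minted


@dataclass(frozen=True)
class MachineRecord:
    tenant_id: str
    site_id: str
    machine_uid: str


def decide_enrollment(
    existing: Optional[MachineRecord],
    requesting_site_id: str,
    collision_detected: bool,
) -> Outcome:
    """Decide the outcome of an enroll request whose site key already verified.

    `existing` is the record already stored for (tenant_id, machine_uid), if any.
    """
    # Refuse a different-site enroll before any repoint or credential mint.
    if existing is not None and existing.site_id != requesting_site_id:
        return Outcome.REFUSED_SITE_CONFLICT
    # The one deliberate exception to auto-approve.
    if collision_detected:
        return Outcome.PENDING
    # New machine, or a re-enroll of the same machine at the same site.
    return Outcome.ACTIVE
```

Putting the site check first means a valid key for another site can never reach the code path that repoints a machine or mints a credential.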

The credential-store review exposed that the managed agent runs in the interactive user context (HKCU Run), not SYSTEM, so the SYSTEM-ACL'd cak_ store would self-lock the agent on second launch. Mike chose Option A (run the managed agent as a SYSTEM service), which is also required for the upcoming session-switching feature. This became SPEC-018 (managed-agent SYSTEM service host + session broker) — split into its own spec (per Mike) rather than folded into SPEC-013. SPEC-018 Phase 1 (LocalSystem service install/lifecycle, SCM dispatch, run-as-SYSTEM, HKCU→service autostart swap; session broker/per-session capture worker deferred to Phase 2) was built, reviewed (APPROVE WITH NITS — connected-stop interruption + FFI panic guard fixed), re-reviewed (confirmed closed), and merged via PR #7.

All code work flowed through the coordinator pattern: Coding Agent (opus) → Code Review Agent (opus) → fix pass → focused re-review → CI-green → merge → submodule pointer bump. Two transient Windows-runner CI failures occurred (jobs that never started); both were diagnosed against local builds and cleared on re-run. The session ended at a clean checkpoint: Phase A + Phase B + SPEC-018 Phase 1 merged, three specs in place, with SPEC-018 Phase 2 (capture worker) as the next critical-path item.

Key Decisions

  • Beta builds get signed via a channel input on release.yml, not by signing every CI build. Keeps signing secrets out of PR-triggered runs; beta rides the existing fail-closed sign path, skips bump/changelog, publishes a prerelease tag.
  • Two-tier credential model: a low-sensitivity, rotatable, long per-site enrollment key (embedded, never typed) gates registration; a high-sensitivity per-machine cak_ is self-provisioned on first run. Rotating a site key blocks new enrollments from old installers but leaves enrolled agents (holding cak_) untouched.
  • Enrollment key fingerprint = monotonic version + short HEX (v3 (7F2A)), deliberately NOT the RMM word-code style (GREEN-FALCON), so GuruConnect and GuruRMM artifacts are never visually conflated.
  • machine_uid salted from stable HARDWARE signals (SMBIOS UUID primary; motherboard + disk serial fallback), NOT a per-install random salt — so it survives an OS re-image on the same hardware (re-image dedup) while separating distinct boxes. Uniqueness scoped per-tenant (tenant_id, machine_uid).
  • Collision-gated activation: a detected machine_uid collision (e.g. template-cloned VMs sharing a hardware UUID) drops the endpoint to pending + alert, requiring dashboard confirmation — the one deliberate exception to auto-approve.
  • Sign once + per-site signed wrapper, both .exe and MSI (ScreenConnect offers both). The signed agent stays byte-identical; per-site config is delivered around the signed bytes (wrapper/MSI/CLI), never appended into the PE.
  • Cross-site enroll is default-refused (409 ENROLL_SITE_CONFLICT) in Phase A — the deliberate-move (--reassign) path is deferred — closing a within-tenant machine-hijack vector.
  • Managed agent → SYSTEM service (Option A), chosen over patching the store ACL for user-context, because true unattended access and the upcoming session-switching feature both require SYSTEM. Spec'd as its own SPEC-018, separate from SPEC-013.
  • Accepted the auto-sync fast-forward of main rather than rewriting published history against a live auto-sync; SPEC-018 landed as its own commits, and the "separate specs" intent held via clean commit separation.
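The hardware-derived `machine_uid` decision above can be sketched as follows. The log names the signals (SMBIOS UUID first, motherboard plus disk serial as fallback) but not the digest or the encoding, so SHA-256, the normalization, the domain-separation labels, and the function name `derive_machine_uid` are assumptions. The log also mentions MachineGuid as a last-resort fallback, which this sketch omits and reports as an error instead.

```python
import hashlib
from typing import Optional


def derive_machine_uid(
    smbios_uuid: Optional[str],
    board_serial: Optional[str],
    disk_serial: Optional[str],
) -> str:
    """Derive a stable machine ID from hardware signals only.

    No per-install value is mixed in, so the result survives an OS re-image
    on the same hardware.
    """
    if smbios_uuid:
        # Primary signal. Normalize so formatting differences do not change the ID.
        material = "smbios:" + smbios_uuid.strip().lower()
    elif board_serial and disk_serial:
        # Fallback when the SMBIOS UUID is missing.
        material = (
            "board+disk:"
            + board_serial.strip().lower()
            + "|"
            + disk_serial.strip().lower()
        )
    else:
        raise ValueError("no stable hardware signal available")
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because uniqueness is scoped per tenant, a server-side lookup would key on the pair `(tenant_id, machine_uid)` rather than on `machine_uid` alone. Two template-cloned VMs that share an SMBIOS UUID would produce the same value here, which is the case the collision gate exists to catch.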

Problems Encountered

  • "Entire fleet offline" in RMM — a rendering artifact: the API returns is_connected: null, which the display treated as offline. Real liveness is last_seen; both target machines had checked in seconds earlier.
  • Append-config voids Authenticode — the existing download_agent path appends [MAGIC][len][json] after the signed PE, invalidating the signature. Resolved architecturally in SPEC-016: sign-once base + per-site wrapper, retire the managed append path. (The attended support-code path is filename-based and already signing-safe — left unchanged.)
  • Phase A HIGH — cross-site machine-move hijack: a valid site-B key + a guessed machine_uid could silently repoint a site-A machine and mint a cak_. Fixed: default-refuse different-site enroll (409) before any repoint/mint.
  • Phase A MEDIUM — timing/enumeration oracle: unknown site_code returned fast (no Argon2) vs. wrong key (Argon2), distinguishing valid sites. Fixed: dummy Argon2id verify on all reject paths.
  • Phase B HIGH — machine_uid not re-image-stable: MachineGuid (regenerated per OS install) was mixed into the salted digest. Fixed: derive from hardware salt only; MachineGuid is last-resort fallback.
  • Phase B CRITICAL — credential-store self-lockout: the SYSTEM+Admins ACL'd cak_ store is unreadable by the user-context agent → bricks on 2nd launch. Root cause: managed agent runs via HKCU Run, not SYSTEM. Resolved by deciding Option A (SYSTEM service = SPEC-018) and keeping a fail-fast guard in the interim.
  • SPEC-017 number collision: Mike authored SPEC-017 (end-user access) mid-flight; the planned service-host spec became SPEC-018, and the dangling SPEC-017 references in the Phase B fail-fast guard were corrected to SPEC-018.
  • Auto-sync bypassed the PR/CI gate: main was fast-forwarded up the Phase B + SPEC-017 line outside the PR merge flow. No harm (code was reviewed + green), but flagged: enable branch protection on main.
  • Two transient Windows-runner CI failures (jobs reported failure with empty "job not started" logs). Diagnosed as runner infra (identical code green on prior run + local build clean); cleared on re-run. Gitea's combined commit-status aggregation also retains stale failed jobs.
  • SPEC-018 Phase 1 HIGH — non-cancellable connected session: the SCM stop flag wasn't checked inside the inner session loop, so a connected service couldn't stop gracefully. Fixed with an optional shutdown param threaded into run_with_tray (None for non-service callers = unchanged) + a clean WS-close sentinel path. MEDIUM: agent-runtime panic unwound across the extern "system" service entry → wrapped in catch_unwind → ServiceSpecific(1).
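The timing-oracle fix from Phase A (a dummy verify on every reject path) can be illustrated with a small sketch. Python's standard library has no Argon2id, so PBKDF2-HMAC-SHA256 stands in for it here; the iteration count, the in-memory `sites` mapping, and the names `hash_key` and `verify_site_key` are assumptions for illustration, not the server's actual code.

```python
import hashlib
import hmac
import os
from typing import Dict, Tuple

_ITERATIONS = 50_000  # illustrative cost, not a recommendation


def _kdf(key: str, salt: bytes) -> bytes:
    """Slow key derivation. PBKDF2 is a stand-in for Argon2id."""
    return hashlib.pbkdf2_hmac("sha256", key.encode("utf-8"), salt, _ITERATIONS)


# A fixed dummy record so unknown sites cost the same as known ones.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = _kdf("dummy-key-never-issued", _DUMMY_SALT)


def hash_key(key: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for storing a site key."""
    salt = os.urandom(16)
    return salt, _kdf(key, salt)


def verify_site_key(
    sites: Dict[str, Tuple[bytes, bytes]], site_code: str, presented_key: str
) -> bool:
    """Verify a presented key, doing one KDF run on every path."""
    record = sites.get(site_code)
    if record is None:
        # Unknown site: still pay the KDF cost, then reject, so response
        # time does not reveal whether the site_code exists.
        hmac.compare_digest(_kdf(presented_key, _DUMMY_SALT), _DUMMY_HASH)
        return False
    salt, stored = record
    return hmac.compare_digest(_kdf(presented_key, salt), stored)
```

Both reject paths (unknown site, wrong key) run exactly one derivation and one constant-time comparison, so an attacker timing the endpoint sees the same cost either way.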

Configuration Changes

GuruConnect repo (azcomputerguru/guru-connect, submodule projects/msp-tools/guru-connect):

  • Created: docs/specs/SPEC-016-zero-touch-enrollment.md, docs/specs/SPEC-018-managed-agent-service-host.md (SPEC-017 created by Mike). server/migrations/010_spec016_enrollment.sql. server/src/auth/enrollment_keys.rs, server/src/api/enroll.rs, server/src/api/sites.rs, server/src/db/sites.rs, server/src/db/enrollment_keys.rs. agent/src/enroll.rs, agent/src/credential_store.rs, agent/src/service/mod.rs.
  • Modified: .gitea/workflows/release.yml (beta channel). docs/FEATURE_ROADMAP.md (signing shipped, beta channel, SPEC-016/018 entries, SPEC-007 signing-question resolved). server/src/{main.rs, auth/mod.rs, db/mod.rs, db/machines.rs, db/events.rs, middleware/rate_limit.rs, api/mod.rs}. agent/src/{main.rs, config.rs, install.rs, identity.rs, session/mod.rs}. agent/Cargo.toml (Win32_Security_Cryptography feature; windows-service already present).
  • ClaudeTools parent: submodule pointer bumped repeatedly to track guru-connect main.

Credentials & Secrets

No new credentials created. Existing vaulted credentials used (cite paths, not values):

  • Gitea API token: services/gitea.sops.yaml → credentials.api.api-token (used for PR create/merge/close + CI status polling against internal Gitea).
  • GuruRMM admin API: infrastructure/gururmm-server.sops.yaml → credentials.gururmm-api.admin-email / admin-password.
  • GC agent enrollment secrets (cak_, per-site cek_ keys) are minted/stored by the new code — stored hashed (Argon2id) server-side; cak_ stored DPAPI-machine-encrypted in a SYSTEM-ACL'd %ProgramData%\GuruConnect\credentials\agent.cak on the endpoint. None exist yet (no live enrollment performed).

Infrastructure & Servers

  • GuruConnect server: connect.azcomputerguru.com (NPM → port 3002 on 172.16.3.30). Relay agent plane wss://connect.azcomputerguru.com/ws/agent; new enrollment endpoint https://connect.azcomputerguru.com/api/enroll (public). PostgreSQL backend (runtime sqlx, no offline cache).
  • GuruRMM API: http://172.16.3.30:3001 (JWT auth).
  • Coordination API: http://172.16.3.30:8001/api/coord.
  • Gitea (internal): http://172.16.3.20:3000 (preferred on-network); public git.azcomputerguru.com.
  • CI: Gitea Actions build-and-test.yml (fmt + clippy -D warnings + Linux build + Postgres-gated tests + Windows agent build + cargo-audit). Windows agent built natively on the Pluto windows-msvc runner.
  • GC test machines (GuruRMM): RMM-TEST-MACHINE id 99d6d692-99e0-4359-9f9c-f43be89f49e5 (site Howard-VM, v0.6.52, live); WIN-TG2STMODJG8 id eee9f26d-0dbc-4b8e-8e42-3a901b4ff73a (site Discovery test site, v0.6.52, live); stale RMM-TEST-MACHINE id 7d3456f5-... (v0.6.38, ghost).

Commands & Outputs

  • Workflow re-trigger (Gitea has no REST rerun; rerun endpoints return 404): POST /api/v1/repos/azcomputerguru/guru-connect/actions/workflows/build-and-test.yml/dispatches -d '{"ref":"main"}' → HTTP 204.
  • CI status: GET /api/v1/repos/.../commits/<sha>/status (combined; per-job .state is null — use /actions/tasks .workflow_runs[] filtered by head_branch).
  • PR merge: POST /api/v1/repos/.../pulls/<n>/merge -d '{"Do":"merge","delete_branch_after_merge":true}'.
  • Local agent build (definitive transient-vs-real check): cargo check -p guruconnect (windows-msvc) — clean in 2-7s, confirming CI Windows failures were runner infra.
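The Gitea calls listed above could be scripted along these lines. This is a sketch that only builds the requests; the helper names `gitea_post`, `dispatch_workflow`, and `merge_pr` are hypothetical, the `Authorization: token ...` header is Gitea's token scheme, and the base URL is the internal address recorded in this log. The token itself would come from the vaulted credential, never from source.

```python
import json
import urllib.request

GITEA_BASE = "http://172.16.3.20:3000"  # internal Gitea address from this log


def gitea_post(path: str, body: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST to the Gitea API."""
    return urllib.request.Request(
        GITEA_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def dispatch_workflow(repo: str, workflow: str, ref: str, token: str) -> urllib.request.Request:
    """Re-trigger a workflow. The log notes there is no REST rerun endpoint."""
    path = f"/api/v1/repos/{repo}/actions/workflows/{workflow}/dispatches"
    return gitea_post(path, {"ref": ref}, token)


def merge_pr(repo: str, number: int, token: str) -> urllib.request.Request:
    """Merge a pull request and delete its branch afterwards."""
    path = f"/api/v1/repos/{repo}/pulls/{number}/merge"
    return gitea_post(path, {"Do": "merge", "delete_branch_after_merge": True}, token)
```

Passing a returned request to `urllib.request.urlopen` would send it; per the log, a successful dispatch answers with HTTP 204.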

Pending / Incomplete Tasks

  • SPEC-018 Phase 2 (critical path): session broker + per-session capture worker (CreateProcessAsUserW into the active desktop), service↔worker IPC (ACL'd named pipe), SERVICE_CONTROL_SESSIONCHANGE handling. Required for actual desktop capture when running as SYSTEM. X-Large; needs a Windows VM to integration-test.
  • VM integration tests (deferred, not runnable on dev host): service install/auto-start/stop/uninstall as LocalSystem; cak_-as-SYSTEM enroll→store→connect round-trip (the SPEC-016 Phase B end-to-end the guard currently blocks); HKCU→service swap with no autostart gap.
  • Parked: GC agent deploy on Howard's test machines (RMM-TEST-MACHINE, WIN-TG2STMODJG8) — waits on SPEC-018 Phase 2 for full function.
  • Enable branch protection on guru-connect main so merges go through the CI gate (auto-sync bypassed it once).
  • Memory dedup (coord todo): retire the 49 orphaned original memory files superseded by the morning consolidation; reconcile MEMORY.md. Flagged by both Howard (ACG-TECH03L) and GURU-KALI self-check.
  • Self-check baseline manifest: /autotask is a per-operator/local command, not a per-machine requirement — move it out of the required baseline (GURU-KALI false-positive).
  • Stale lock to review: gururmm / agent/embedded.rs+server/install.rs (per-site EXE re-sign) held by GURU-5070/claude-main, TTL to 23:27 — appears to be a prior-session lock on the GuruRMM equivalent of the same signing problem; verify and release if orphaned.

Reference Information

  • GuruConnect main HEAD: 11af9dff8e5ad1f8a121278be581455cc7a6bedd. ClaudeTools parent HEAD: bf7079383f5e358cbb5fa5c8bd6fc4553b29a72c.
  • PRs: #5 SPEC-016 Phase A (merged, 89c3718); #6 SPEC-016 Phase B (closed already-merged via auto-sync; head 4c49b73); #7 SPEC-018 Phase 1 (merged, 11af9df).
  • Key commits (guru-connect): beta channel 87f2295; SPEC-016 spec 18429f6, decisions c286a29; Phase B d0b8db0/6a000d0/87c6e17/52477e4/367906b; SPEC-017 (Mike) 4c49b73; SPEC-018 spec 94c07c2; ref fix 55b9c97; SPEC-018 Phase 1 7602b43/a0e0d5f.
  • Specs: docs/specs/SPEC-016-zero-touch-enrollment.md, SPEC-017-end-user-remote-access.md (Mike), SPEC-018-managed-agent-service-host.md. Roadmap: docs/FEATURE_ROADMAP.md.
  • Coord: enrollment-spec todo dbfe6a56 (done); memory-dedup todo (pending); component guruconnect/server = built (Phase A), guruconnect/agent = built (Phase B + SPEC-018 P1, end-to-end blocked on Phase 2).
  • Service: Windows service internal name GuruConnectAgent, launch arg service-run, LocalSystem auto-start, sc failure recovery restart/5000.