Session Log — 2026-06-02 — GuruConnect Zero-Touch Enrollment
User
- User: Mike Swanson (mike)
- Machine: GURU-5070
- Role: admin
Session Summary
The session began as a request to deploy the GuruConnect (GC) agent onto Howard's RMM test machines "to see how things are working," but investigation showed the managed-agent enrollment path wasn't ready, so the work pivoted into designing and building that path end-to-end. Two live test machines were identified under the AZ Computer Guru client (RMM-TEST-MACHINE / site Howard-VM, and WIN-TG2STMODJG8 / site Discovery test site); a duplicate v0.6.38 RMM-TEST-MACHINE ghost was noted. The deploy was deliberately parked once it became clear that (a) the GC relay had moved to per-agent cak_ keys with only a deprecated shared AGENT_API_KEY fallback, and (b) the existing embedded-config download path appends a config blob to the end of the signed .exe, which voids Authenticode — colliding with the CI code-signing shipped earlier in the day.
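The append-config conflict can be illustrated with a minimal sketch. It assumes a 4-byte marker and a 4-byte little-endian length, neither of which is specified in this log, and it uses a whole-file SHA-256 as a stand-in for the Authenticode digest (the real digest skips the checksum field and the certificate table, so this is a simplification). The names `MAGIC`, `append_config`, and `read_config` are hypothetical.

```python
import hashlib
import json
import struct

# Hypothetical trailer marker; the real [MAGIC] bytes are not given in this log.
MAGIC = b"GCFG"


def append_config(signed_pe: bytes, config: dict) -> bytes:
    """Append [MAGIC][len][json] after the PE bytes (the legacy layout)."""
    blob = json.dumps(config, separators=(",", ":")).encode("utf-8")
    # Assumed encoding: 4-byte little-endian length prefix.
    return signed_pe + MAGIC + struct.pack("<I", len(blob)) + blob


def read_config(data: bytes):
    """Recover the appended config, or None if no well-formed trailer exists."""
    pos = data.rfind(MAGIC)
    if pos < 0 or pos + len(MAGIC) + 4 > len(data):
        return None
    (length,) = struct.unpack_from("<I", data, pos + len(MAGIC))
    start = pos + len(MAGIC) + 4
    if start + length != len(data):
        return None  # trailer must run exactly to end of file
    return json.loads(data[start:start + length])


signed = b"MZ" + bytes(64)  # stand-in bytes, not a real signed binary
stamped = append_config(signed, {"site": "example"})

assert read_config(stamped) == {"site": "example"}
# The file on disk changed after signing, so a digest recorded at
# signing time no longer matches the bytes being verified.
assert hashlib.sha256(stamped).digest() != hashlib.sha256(signed).digest()
```

Delivering the same config beside the binary (wrapper, MSI property, or CLI flag) leaves `signed` byte-identical, which is the property the sign-once design relies on.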
The first concrete deliverable was a signed beta/test release channel: a channel: stable | beta workflow_dispatch input added to GuruConnect's release.yml, where beta produces a signed, prerelease-tagged build (skipping the semver bump/changelog) through the identical fail-closed Azure Trusted Signing path. This closed the "every binary handed to a tester must be signed" gap. A long interview with Mike then shaped the enrollment architecture, producing SPEC-016 (zero-touch per-site agent enrollment): a per-site rotatable enrollment key (long secret + short hex vN (XXXX) fingerprint), self-registration that mints a per-machine cak_ bound to a hardware-salted machine_uid, auto-approve + new-enrollment alert, collision-gated activation, and a sign-once base binary + per-site signed wrapper (resolving SPEC-007's signing question). Three enrollment paths (per-site embedded key / flag overrides / technician-assisted) were defined, with only the per-site path in v1 and the others reserved.
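As a rough illustration of the vN (XXXX) fingerprint, the sketch below pairs a monotonic version with four hex characters. How SPEC-016 actually derives the hex part is not stated here; taking the first four hex digits of SHA-256 over the secret is an assumption, as are the function names and the 32-byte secret length.

```python
import hashlib
import secrets


def new_enrollment_secret() -> str:
    # Assumed shape: "cek_" prefix plus 32 random bytes, URL-safe encoded.
    return "cek_" + secrets.token_urlsafe(32)


def fingerprint(version: int, secret: str) -> str:
    # Assumption: short hex = first 4 hex digits of SHA-256(secret), uppercased.
    short = hashlib.sha256(secret.encode("utf-8")).hexdigest()[:4].upper()
    return f"v{version} ({short})"


secret = new_enrollment_secret()
print(fingerprint(3, secret))  # same shape as the "v3 (7F2A)" style
```

Because the fingerprint is a one-way digest prefix, it can be shown in a dashboard or on an installer label without revealing the long secret; four hex digits are only a visual check, not a security boundary.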
Implementation proceeded in reviewed, CI-gated phases. SPEC-016 Phase A (server backend: POST /api/enroll, per-site key issue/rotate/verify, machine_uid dedup + collision gate, cak_ minting, rate-limiting, migration 010) was built, reviewed (one HIGH cross-site-move hijack + one MEDIUM timing oracle fixed), and merged via PR #5. SPEC-016 Phase B (agent: hardware-salted machine_uid, first-run enroll client, DPAPI+ACL cak_ store, run-mode wiring) was built, reviewed (six findings incl. a re-image-stability bug and a credential-store self-lockout, all fixed), and landed on main. During Phase B, Mike authored SPEC-017 (end-user/sub-user remote access) and an auto-sync fast-forwarded the whole Phase B line plus SPEC-017 onto main outside the PR gate; the planned PR-split became moot but the commits remained cleanly separated.
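The Phase A enroll outcomes described above can be sketched as a small decision function. This is a simplified model, not the server code: `existing` stands for the result of a lookup scoped by (tenant_id, machine_uid), and `suspected_clone` is a hypothetical input, since this log does not say how the server tells a re-image from a cloned VM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ACTIVE = "active"                            # auto-approve, mint cak_
    PENDING = "pending"                          # collision: dashboard confirmation
    SITE_CONFLICT = "409 ENROLL_SITE_CONFLICT"   # cross-site enroll refused


@dataclass
class Machine:
    site_id: str
    machine_uid: str


def decide(existing: Optional[Machine], site_id: str, suspected_clone: bool) -> Outcome:
    """Decide what an enroll request yields after the site key has verified."""
    if existing is None:
        return Outcome.ACTIVE            # first time this hardware is seen
    if existing.site_id != site_id:
        # Refuse before any repoint or mint: closes the cross-site hijack.
        return Outcome.SITE_CONFLICT
    if suspected_clone:
        return Outcome.PENDING           # the one exception to auto-approve
    return Outcome.ACTIVE                # same hardware, same site: re-image dedup


print(decide(Machine("site-a", "uid-1"), "site-b", False).value)
```

The ordering matters in this sketch: the site check runs before anything that would mutate the existing record, which mirrors the "before any repoint/mint" wording of the fix.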
The credential-store review exposed that the managed agent runs in the interactive user context (HKCU Run), not SYSTEM, so the SYSTEM-ACL'd cak_ store would self-lock the agent on second launch. Mike chose Option A (run the managed agent as a SYSTEM service), which is also required for the upcoming session-switching feature. This became SPEC-018 (managed-agent SYSTEM service host + session broker) — split into its own spec (per Mike) rather than folded into SPEC-013. SPEC-018 Phase 1 (LocalSystem service install/lifecycle, SCM dispatch, run-as-SYSTEM, HKCU→service autostart swap; session broker/per-session capture worker deferred to Phase 2) was built, reviewed (APPROVE WITH NITS — connected-stop interruption + FFI panic guard fixed), re-reviewed (confirmed closed), and merged via PR #7.
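The connected-stop fix can be pictured with a small model of the inner session loop. The agent itself is Rust; this Python sketch only shows the control flow, and `run_session`, `poll`, and the return strings are hypothetical names. The optional `shutdown` argument plays the role of the stop flag, with `None` meaning a non-service caller whose behaviour is unchanged.

```python
import threading
from typing import Callable, Optional


def run_session(poll: Callable[[], bool],
                shutdown: Optional[threading.Event] = None) -> str:
    """Inner session loop; poll() returns False when the peer closes."""
    while True:
        # The fix: check the stop flag inside the connected loop too,
        # not only between reconnect attempts.
        if shutdown is not None and shutdown.is_set():
            return "stopped"
        if not poll():
            return "disconnected"


stop = threading.Event()
frames = []


def poll() -> bool:
    frames.append(len(frames))
    if len(frames) == 3:
        stop.set()      # simulate a service stop arriving mid-session
    return True         # the peer never disconnects on its own


result = run_session(poll, stop)
assert result == "stopped" and len(frames) == 3
```

Without the check inside the loop, the `poll` above would spin forever, which is the "connected service couldn't stop gracefully" symptom in miniature.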
All code work flowed through the coordinator pattern: Coding Agent (opus) → Code Review Agent (opus) → fix pass → focused re-review → CI-green → merge → submodule pointer bump. Two transient Windows-runner CI failures occurred (jobs that never started); both were diagnosed against local builds and cleared on re-run. The session ended at a clean checkpoint: Phase A + Phase B + SPEC-018 Phase 1 merged, three specs in place, with SPEC-018 Phase 2 (capture worker) as the next critical-path item.
Key Decisions
- Beta builds get signed via a channel input on release.yml, not by signing every CI build. Keeps signing secrets out of PR-triggered runs; beta rides the existing fail-closed sign path, skips bump/changelog, publishes a prerelease tag.
- Two-tier credential model: a low-sensitivity, rotatable, long per-site enrollment key (embedded, never typed) gates registration; a high-sensitivity per-machine cak_ is self-provisioned on first run. Rotating a site key blocks new enrollments from old installers but leaves enrolled agents (holding cak_) untouched.
- Enrollment key fingerprint = monotonic version + short HEX (v3 (7F2A)), deliberately NOT the RMM word-code style (GREEN-FALCON), so GuruConnect and GuruRMM artifacts are never visually conflated.
- machine_uid salted from stable HARDWARE signals (SMBIOS UUID primary; motherboard + disk serial fallback), NOT a per-install random salt, so it survives an OS re-image on the same hardware (re-image dedup) while separating distinct boxes. Uniqueness scoped per-tenant (tenant_id, machine_uid).
- Collision-gated activation: a detected machine_uid collision (e.g. template-cloned VMs sharing a hardware UUID) drops the endpoint to pending + alert, requiring dashboard confirmation. This is the one deliberate exception to auto-approve.
- Sign once + per-site signed wrapper, both .exe and MSI (ScreenConnect offers both). The signed agent stays byte-identical; per-site config is delivered around the signed bytes (wrapper/MSI/CLI), never appended into the PE.
- Cross-site enroll is default-refused (409 ENROLL_SITE_CONFLICT) in Phase A, with the deliberate-move (--reassign) path deferred, closing a within-tenant machine-hijack vector.
- Managed agent → SYSTEM service (Option A), chosen over patching the store ACL for user-context, because true unattended access and the upcoming session-switching feature both require SYSTEM. Spec'd as its own SPEC-018, separate from SPEC-013.
- Accepted the auto-sync fast-forward of main rather than rewriting published history against a live auto-sync; SPEC-018 landed as its own commits, and the "separate specs" intent held via clean commit separation.
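The machine_uid decision above can be sketched as a priority chain over hardware signals. This is a minimal illustration under stated assumptions: the hash (SHA-256), the normalization, the requirement that both board and disk serials be present for the fallback, and the function signature are all hypothetical. Tenant scoping happens server-side and is not shown.

```python
import hashlib
from typing import Optional


def machine_uid(smbios_uuid: Optional[str],
                board_serial: Optional[str],
                disk_serial: Optional[str],
                machine_guid: Optional[str]) -> str:
    """Derive a stable ID from hardware first, OS-level identity last."""
    if smbios_uuid:
        kind, value = "smbios", smbios_uuid            # primary signal
    elif board_serial and disk_serial:
        kind, value = "board+disk", f"{board_serial}|{disk_serial}"
    elif machine_guid:
        # Last resort only: MachineGuid is regenerated per OS install,
        # so it must never be mixed in when hardware signals exist.
        kind, value = "machineguid", machine_guid
    else:
        raise ValueError("no identity signal available")
    normalized = value.strip().upper()
    return hashlib.sha256(f"{kind}:{normalized}".encode("utf-8")).hexdigest()


before = machine_uid("4C4C4544-0042-1010", None, None, "guid-before-reimage")
after = machine_uid("4c4c4544-0042-1010", None, None, "guid-after-reimage")
assert before == after  # a re-image on the same hardware keeps the same ID
```

Keeping the signal kind inside the hashed string means an ID derived from a fallback source can never collide by accident with one derived from an SMBIOS UUID of the same text.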
Problems Encountered
- "Entire fleet offline" in RMM was a rendering artifact: the API returns is_connected: null, which the display treated as offline. Real liveness is last_seen; both target machines had checked in seconds earlier.
- Append-config voids Authenticode: the existing download_agent path appends [MAGIC][len][json] after the signed PE, invalidating the signature. Resolved architecturally in SPEC-016: sign-once base + per-site wrapper, retire the managed append path. (The attended support-code path is filename-based and already signing-safe, so it was left unchanged.)
- Phase A HIGH, cross-site machine-move hijack: a valid site-B key + a guessed machine_uid could silently repoint a site-A machine and mint a cak_. Fixed: default-refuse different-site enroll (409) before any repoint/mint.
- Phase A MEDIUM, timing/enumeration oracle: unknown site_code returned fast (no Argon2) vs. wrong key (Argon2), distinguishing valid sites. Fixed: dummy Argon2id verify on all reject paths.
- Phase B HIGH, machine_uid not re-image-stable: MachineGuid (regenerated per OS install) was mixed into the salted digest. Fixed: derive from hardware salt only; MachineGuid is last-resort fallback.
- Phase B CRITICAL, credential-store self-lockout: the SYSTEM+Admins ACL'd cak_ store is unreadable by the user-context agent → bricks on 2nd launch. Root cause: managed agent runs via HKCU Run, not SYSTEM. Resolved by deciding Option A (SYSTEM service = SPEC-018) and keeping a fail-fast guard in the interim.
- SPEC-017 number collision: Mike authored SPEC-017 (end-user access) mid-flight; the planned service-host spec became SPEC-018, and the dangling SPEC-017 references in the Phase B fail-fast guard were corrected to SPEC-018.
- Auto-sync bypassed the PR/CI gate: main was fast-forwarded up the Phase B + SPEC-017 line outside the PR merge flow. No harm (code was reviewed + green), but flagged: enable branch protection on main.
- Two transient Windows-runner CI failures (jobs reported failure with empty "job not started" logs). Diagnosed as runner infra (identical code green on prior run + local build clean); cleared on re-run. Gitea's combined commit-status also keeps stale failed jobs in its aggregation.
- SPEC-018 Phase 1 HIGH, non-cancellable connected session: the SCM stop flag wasn't checked inside the inner session loop, so a connected service couldn't stop gracefully. Fixed with an optional shutdown param threaded into run_with_tray (None for non-service callers = unchanged) + a clean WS-close sentinel path. MEDIUM: agent-runtime panic unwound across the extern "system" service entry → wrapped in catch_unwind → ServiceSpecific(1).
Configuration Changes
GuruConnect repo (azcomputerguru/guru-connect, submodule projects/msp-tools/guru-connect):
- Created: docs/specs/SPEC-016-zero-touch-enrollment.md, docs/specs/SPEC-018-managed-agent-service-host.md (SPEC-017 created by Mike). server/migrations/010_spec016_enrollment.sql. server/src/auth/enrollment_keys.rs, server/src/api/enroll.rs, server/src/api/sites.rs, server/src/db/sites.rs, server/src/db/enrollment_keys.rs. agent/src/enroll.rs, agent/src/credential_store.rs, agent/src/service/mod.rs.
- Modified: .gitea/workflows/release.yml (beta channel). docs/FEATURE_ROADMAP.md (signing shipped, beta channel, SPEC-016/018 entries, SPEC-007 signing-question resolved). server/src/{main.rs, auth/mod.rs, db/mod.rs, db/machines.rs, db/events.rs, middleware/rate_limit.rs, api/mod.rs}. agent/src/{main.rs, config.rs, install.rs, identity.rs, session/mod.rs}. agent/Cargo.toml (Win32_Security_Cryptography feature; windows-service already present).
- ClaudeTools parent: submodule pointer bumped repeatedly to track guru-connect main.
Credentials & Secrets
No new credentials created. Existing vaulted credentials used (cite paths, not values):
- Gitea API token: services/gitea.sops.yaml → credentials.api.api-token (used for PR create/merge/close + CI status polling against internal Gitea).
- GuruRMM admin API: infrastructure/gururmm-server.sops.yaml → credentials.gururmm-api.admin-email/admin-password.
- GC agent enrollment secrets (cak_, per-site cek_ keys) are minted/stored by the new code: stored hashed (Argon2id) server-side; cak_ stored DPAPI-machine-encrypted in a SYSTEM-ACL'd %ProgramData%\GuruConnect\credentials\agent.cak on the endpoint. None exist yet (no live enrollment performed).
Infrastructure & Servers
- GuruConnect server: connect.azcomputerguru.com (NPM → port 3002 on 172.16.3.30). Relay agent plane wss://connect.azcomputerguru.com/ws/agent; new enrollment endpoint https://connect.azcomputerguru.com/api/enroll (public). PostgreSQL backend (runtime sqlx, no offline cache).
- GuruRMM API: http://172.16.3.30:3001 (JWT auth).
- Coordination API: http://172.16.3.30:8001/api/coord.
- Gitea (internal): http://172.16.3.20:3000 (preferred on-network); public git.azcomputerguru.com.
- CI: Gitea Actions build-and-test.yml (fmt + clippy -D warnings + Linux build + Postgres-gated tests + Windows agent build + cargo-audit). Windows agent built natively on the Pluto windows-msvc runner.
- GC test machines (GuruRMM): RMM-TEST-MACHINE id 99d6d692-99e0-4359-9f9c-f43be89f49e5 (site Howard-VM, v0.6.52, live); WIN-TG2STMODJG8 id eee9f26d-0dbc-4b8e-8e42-3a901b4ff73a (site Discovery test site, v0.6.52, live); stale RMM-TEST-MACHINE id 7d3456f5-... (v0.6.38, ghost).
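The "live" status of the test machines above rests on last_seen rather than is_connected, since the API returns is_connected: null. A display-side rule along those lines might look like the sketch below; the five-minute threshold, the function name, and the three status strings are assumptions rather than GuruRMM behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=5)  # assumed threshold, not from the API


def liveness(is_connected: Optional[bool],
             last_seen: Optional[datetime],
             now: datetime) -> str:
    """Prefer last_seen; never read a null is_connected as 'offline'."""
    if last_seen is not None:
        return "online" if now - last_seen <= STALE_AFTER else "offline"
    if is_connected is None:
        return "unknown"  # no signal at all: say so instead of guessing
    return "online" if is_connected else "offline"


now = datetime(2026, 6, 2, 12, 0, tzinfo=timezone.utc)
assert liveness(None, now - timedelta(seconds=10), now) == "online"
assert liveness(None, None, now) == "unknown"
```

Treating null as a third state keeps a missing field from being rendered as an outage.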
Commands & Outputs
- Workflow re-trigger (Gitea has no REST rerun; rerun endpoints return 404): POST /api/v1/repos/azcomputerguru/guru-connect/actions/workflows/build-and-test.yml/dispatches -d '{"ref":"main"}' → HTTP 204.
- CI status: GET /api/v1/repos/.../commits/<sha>/status (combined; per-job .state is null, so use /actions/tasks .workflow_runs[] filtered by head_branch).
- PR merge: POST /api/v1/repos/.../pulls/<n>/merge -d '{"Do":"merge","delete_branch_after_merge":true}'.
- Local agent build (definitive transient-vs-real check): cargo check -p guruconnect (windows-msvc), clean in 2-7s, confirming CI Windows failures were runner infra.
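The dispatch and merge calls above can be wrapped in small request builders. This sketch only constructs the requests and does not send them; it assumes the internal Gitea base URL listed under Infrastructure and the `Authorization: token ...` header form, and the helper names are hypothetical. The token value comes from the vault path cited earlier and is never hard-coded.

```python
import json
import urllib.request

# Internal Gitea host from this log; the repo path matches the commands above.
BASE = "http://172.16.3.20:3000/api/v1/repos/azcomputerguru/guru-connect"


def _post(path: str, body: dict, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def dispatch_request(token: str, workflow: str = "build-and-test.yml",
                     ref: str = "main") -> urllib.request.Request:
    """Re-trigger a workflow; the log reports HTTP 204 on success."""
    return _post(f"/actions/workflows/{workflow}/dispatches", {"ref": ref}, token)


def merge_request(token: str, pr_number: int) -> urllib.request.Request:
    """Merge a PR and delete its branch afterwards."""
    body = {"Do": "merge", "delete_branch_after_merge": True}
    return _post(f"/pulls/{pr_number}/merge", body, token)


req = dispatch_request("REDACTED")
print(req.get_method(), req.full_url)
```

Passing a built request to `urllib.request.urlopen(req)` would send it; keeping construction separate makes the URL and body easy to inspect before anything touches the server.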
Pending / Incomplete Tasks
- SPEC-018 Phase 2 (critical path): session broker + per-session capture worker (CreateProcessAsUserW into the active desktop), service↔worker IPC (ACL'd named pipe), SERVICE_CONTROL_SESSIONCHANGE handling. Required for actual desktop capture when running as SYSTEM. X-Large; needs a Windows VM to integration-test.
- VM integration tests (deferred, not runnable on dev host): service install/auto-start/stop/uninstall as LocalSystem; cak_-as-SYSTEM enroll→store→connect round-trip (the SPEC-016 Phase B end-to-end the guard currently blocks); HKCU→service swap with no autostart gap.
- Parked: GC agent deploy on Howard's test machines (RMM-TEST-MACHINE, WIN-TG2STMODJG8); waits on SPEC-018 Phase 2 for full function.
- Enable branch protection on guru-connect main so merges go through the CI gate (auto-sync bypassed it once).
- Memory dedup (coord todo): retire the 49 orphaned original memory files superseded by the morning consolidation; reconcile MEMORY.md. Flagged by both Howard (ACG-TECH03L) and GURU-KALI self-check.
- Self-check baseline manifest: /autotask is a per-operator/local command, not a per-machine requirement; move it out of the required baseline (GURU-KALI false-positive).
- Stale lock to review: gururmm / agent/embedded.rs+server/install.rs (per-site EXE re-sign) held by GURU-5070/claude-main, TTL to 23:27. It appears to be a prior-session lock on the GuruRMM equivalent of the same signing problem; verify and release if orphaned.
Reference Information
- GuruConnect main HEAD: 11af9dff8e5ad1f8a121278be581455cc7a6bedd. ClaudeTools parent HEAD: bf7079383f5e358cbb5fa5c8bd6fc4553b29a72c.
- PRs: #5 SPEC-016 Phase A (merged, 89c3718); #6 SPEC-016 Phase B (closed already-merged via auto-sync; head 4c49b73); #7 SPEC-018 Phase 1 (merged, 11af9df).
- Key commits (guru-connect): beta channel 87f2295; SPEC-016 spec 18429f6, decisions c286a29; Phase B d0b8db0/6a000d0/87c6e17/52477e4/367906b; SPEC-017 (Mike) 4c49b73; SPEC-018 spec 94c07c2; ref fix 55b9c97; SPEC-018 Phase 1 7602b43/a0e0d5f.
- Specs: docs/specs/SPEC-016-zero-touch-enrollment.md, SPEC-017-end-user-remote-access.md (Mike), SPEC-018-managed-agent-service-host.md. Roadmap: docs/FEATURE_ROADMAP.md.
- Coord: enrollment-spec todo dbfe6a56 (done); memory-dedup todo (pending); component guruconnect/server = built (Phase A), guruconnect/agent = built (Phase B + SPEC-018 P1, end-to-end blocked on Phase 2).
- Service: Windows service internal name GuruConnectAgent, launch arg service-run, LocalSystem auto-start, sc failure recovery restart/5000.