fix(agent): SPEC-018 review fixes — agent_id persistence, managed fallback, HKEY typing

Address the SPEC-018 Phase 1 code review (reports/2026-06-03-spec018-review.md): - Bug 2 (config.rs): stop agent_id churn on every restart. The embedded-config path always wins in Config::load, so the saved agent_id was never read back. Add Config::persisted_agent_id() and reuse a prior id from the TOML; only mint a new UUID when none exists. - Bug 1 (main.rs): remove the non-functional in-process fallback in run_permanent_agent_managed. A managed agent's cak_ store is SYSTEM-only ACL'd, so a non-elevated in-process run cannot authenticate (load_cak permission-denied, or enroll C1 read-back failure). Return an actionable "install elevated" error instead of pretending to provide an agent; update the misleading comments. - Issue 6 (startup.rs): replace the fragile transmute::<HANDLE, HKEY> with the windows crate's typed HKEY out-param; add SAFETY comments. cargo check -p guruconnect --target x86_64-pc-windows-msvc passes clean. Deferred lower-severity items tracked in #8. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 16:27:27 -07:00
parent 11af9dff8e
commit 9eaabdd6a5
5 changed files with 212 additions and 30 deletions
--- a/agent/src/config.rs
+++ b/agent/src/config.rs
@@ -377,6 +377,26 @@ impl Config {
        false
    }

+    /// Best-effort read of a previously-persisted `agent_id` from the on-disk
+    /// TOML at [`Self::config_path`].
+    ///
+    /// The embedded blob never carries an `agent_id` (it is minted at first
+    /// run), so for a managed agent the only stable source across restarts is
+    /// the TOML that a prior run wrote via [`Self::save`]. Returns `Some(id)`
+    /// only when the file exists, parses, and contains a non-empty `agent_id`;
+    /// any missing-file / read / parse error yields `None` so the caller falls
+    /// back to generating a fresh id.
+    fn persisted_agent_id() -> Option<String> {
+        let config_path = Self::config_path();
+        let contents = std::fs::read_to_string(&config_path).ok()?;
+        let parsed: Config = toml::from_str(&contents).ok()?;
+        if parsed.agent_id.is_empty() {
+            None
+        } else {
+            Some(parsed.agent_id)
+        }
+    }
+
    /// Load configuration from embedded config, file, or environment
    pub fn load() -> Result<Self> {
        // Priority 1: Try loading from embedded config
@@ -389,7 +409,12 @@ impl Config {
                api_key: embedded.api_key.unwrap_or_default(),
                enrollment_key: embedded.enrollment_key,
                site_code: embedded.site_code,
-                agent_id: generate_agent_id(),
+                // The embedded blob carries no agent_id, and load() always
+                // prefers this embedded path — so a freshly generated id would
+                // never be read back, churning the agent_id on every restart.
+                // Reuse the id a prior run persisted to the TOML if present;
+                // only mint a new one when none exists yet.
+                agent_id: Self::persisted_agent_id().unwrap_or_else(generate_agent_id),
                hostname_override: None,
                company: embedded.company,
                site: embedded.site,
@@ -401,9 +426,11 @@ impl Config {
                encoding: EncodingConfig::default(),
            };

-            // Save to file for persistence (so agent_id is preserved). The
-            // #[serde(skip)] enrollment fields are intentionally NOT written to
-            // the on-disk TOML — they are install-time material only.
+            // Persist so a freshly-minted agent_id is available to read back on
+            // the next launch (the embedded path always wins, so the TOML is the
+            // only place the stable id can live). The #[serde(skip)] enrollment
+            // fields are intentionally NOT written to the on-disk TOML — they are
+            // install-time material only.
            let _ = config.save();
            return Ok(config);
        }
--- a/agent/src/main.rs
+++ b/agent/src/main.rs
@@ -475,9 +475,11 @@ pub fn run_managed_agent_service(
 ///     agent — it exits quietly.
 ///   - On first run, install (which installs + starts the service and removes the
 ///     legacy `HKCU\…\Run` autostart), then exit and let the service carry the
-///     agent. If the service install fails (e.g. not elevated), fall back to
-///     running the agent in-process for this run so the machine is not left with no
-///     agent at all.
+///     agent. The managed install REQUIRES elevation: the per-machine credential
+///     store is SYSTEM-only, so the SPEC-016 enrollment path cannot authenticate
+///     from a non-elevated, in-process context. There is therefore no in-process
+///     fallback — if the install fails, we return an actionable error telling the
+///     operator to re-run as Administrator.
 #[cfg(windows)]
 fn run_permanent_agent_managed() -> Result<()> {
    if service::is_service_installed() {
@@ -490,10 +492,23 @@ fn run_permanent_agent_managed() -> Result<()> {

    info!("First run - installing managed agent service");
    if let Err(e) = install::install(false) {
-        warn!(
-            "Managed service install failed ({e:#}); falling back to in-process agent for this run"
+        // No in-process fallback: a managed agent authenticates with a per-machine
+        // cak_ whose credential store is ACL'd to SYSTEM only. Running the agent in
+        // this non-elevated process would either fail to read an existing cak_
+        // (permission denied against the SYSTEM-only ACL) or, on a fresh machine,
+        // fail enrollment's C1 store-and-read-back verification — leaving the
+        // machine with no working agent while pretending otherwise. Surface a clear,
+        // actionable error instead.
+        error!(
+            "Managed agent install failed ({e:#}). The managed service must be installed \
+             elevated (Administrator) — the per-machine credential store is SYSTEM-only and \
+             an in-process fallback cannot authenticate. Re-run as Administrator."
        );
-        return run_agent_mode(None);
+        return Err(anyhow::anyhow!(
+            "managed agent install failed ({e:#}); the managed service must be installed \
+             elevated (Administrator) — the per-machine credential store is SYSTEM-only and \
+             an in-process fallback cannot authenticate. Re-run as Administrator."
+        ));
    }

    info!("Managed agent service installed; handing off to the service");
--- a/agent/src/startup.rs
+++ b/agent/src/startup.rs
@@ -9,7 +9,7 @@ use tracing::{info, warn};
 use windows::core::PCWSTR;
 #[cfg(windows)]
 use windows::Win32::System::Registry::{
-    RegCloseKey, RegDeleteValueW, RegOpenKeyExW, RegSetValueExW, HKEY_CURRENT_USER, KEY_WRITE,
+    RegCloseKey, RegDeleteValueW, RegOpenKeyExW, RegSetValueExW, HKEY, HKEY_CURRENT_USER, KEY_WRITE,
    REG_SZ,
 };

@@ -42,40 +42,39 @@ pub fn add_to_startup() -> Result<()> {
        .chain(std::iter::once(0))
        .collect();

+    // SAFETY: FFI into the Win32 registry API. `key_path`/`value_name`/`value_data`
+    // are NUL-terminated wide strings that outlive the calls. `RegOpenKeyExW`
+    // writes the opened key into `hkey`; we only use it after confirming success,
+    // and always pair it with `RegCloseKey`.
    unsafe {
-        let mut hkey = windows::Win32::Foundation::HANDLE::default();
+        let mut hkey = HKEY::default();

-        // Open the Run key
+        // Open the Run key. RegOpenKeyExW takes a `*mut HKEY` out-param.
        let result = RegOpenKeyExW(
            HKEY_CURRENT_USER,
            PCWSTR(key_path.as_ptr()),
            0,
            KEY_WRITE,
-            &mut hkey as *mut _ as *mut _,
+            &mut hkey,
        );

        if result.is_err() {
            anyhow::bail!("Failed to open registry key: {:?}", result);
        }

-        let hkey_raw = std::mem::transmute::<
-            windows::Win32::Foundation::HANDLE,
-            windows::Win32::System::Registry::HKEY,
-        >(hkey);
-
        // Set the value
        let data_bytes =
            std::slice::from_raw_parts(value_data.as_ptr() as *const u8, value_data.len() * 2);

        let set_result = RegSetValueExW(
-            hkey_raw,
+            hkey,
            PCWSTR(value_name.as_ptr()),
            0,
            REG_SZ,
            Some(data_bytes),
        );

-        let _ = RegCloseKey(hkey_raw);
+        let _ = RegCloseKey(hkey);

        if set_result.is_err() {
            anyhow::bail!("Failed to set registry value: {:?}", set_result);
@@ -103,15 +102,19 @@ pub fn remove_from_startup() -> Result<()> {
        .chain(std::iter::once(0))
        .collect();

+    // SAFETY: FFI into the Win32 registry API. `key_path`/`value_name` are
+    // NUL-terminated wide strings that outlive the calls. `RegOpenKeyExW` writes
+    // the opened key into `hkey`; we only use it after confirming success, and
+    // always pair it with `RegCloseKey`.
    unsafe {
-        let mut hkey = windows::Win32::Foundation::HANDLE::default();
+        let mut hkey = HKEY::default();

        let result = RegOpenKeyExW(
            HKEY_CURRENT_USER,
            PCWSTR(key_path.as_ptr()),
            0,
            KEY_WRITE,
-            &mut hkey as *mut _ as *mut _,
+            &mut hkey,
        );

        if result.is_err() {
@@ -119,14 +122,9 @@ pub fn remove_from_startup() -> Result<()> {
            return Ok(()); // Not an error if key doesn't exist
        }

-        let hkey_raw = std::mem::transmute::<
-            windows::Win32::Foundation::HANDLE,
-            windows::Win32::System::Registry::HKEY,
-        >(hkey);
+        let delete_result = RegDeleteValueW(hkey, PCWSTR(value_name.as_ptr()));

-        let delete_result = RegDeleteValueW(hkey_raw, PCWSTR(value_name.as_ptr()));
-
-        let _ = RegCloseKey(hkey_raw);
+        let _ = RegCloseKey(hkey);

        if delete_result.is_err() {
            warn!("Registry value may not exist: {:?}", delete_result);
--- a/reports/2026-06-03-spec018-review.md
+++ b/reports/2026-06-03-spec018-review.md
@@ -0,0 +1,79 @@
+# Code Review Notes: GuruConnect SPEC-018 Phase 1 (Managed Agent as LocalSystem Service Host)
+
+**Review date:** 2026-06-03  
+**Reviewer:** Grok (meticulous code reviewer persona)  
+**Scope:** SPEC-018 Phase 1 changes (merge 11af9df, including review-fix commit a0e0d5f). Diff: `git diff 55b9c97..11af9df` (agent/src/{install.rs,main.rs,service/mod.rs,session/mod.rs}). Full reads of: `agent/src/service/mod.rs`, relevant sections + full run_* functions in `agent/src/main.rs`, `agent/src/session/mod.rs`, `agent/src/credential_store.rs`, `agent/src/install.rs`, `agent/src/startup.rs`, `agent/src/config.rs`, `agent/src/identity.rs`, `agent/src/enroll.rs`, `agent/src/transport/websocket.rs`, `agent/Cargo.toml`, and light greps/server-side enrollment in `server/src/api/enroll.rs` + rate_limit. Also cross-checked CLAUDE.md constraints, Windows service practices, auth paths, shutdown propagation, panic guard, and recent SPEC-016 enrollment/identity code.  
+**Commands used for exploration:** `git log --oneline 55b9c97..11af9df`, `git diff ... --name-only`, file reads (chunked for large), rg/grep for `\.unwrap\(\)`, `\.expect\(`, `unsafe`, `AtomicBool|shutdown|SERVICE_STOP_SENTINEL`, `println!`, `enroll|machine_uid`, etc. No code changes performed.
+
+## Summary
+The SPEC-018 Phase 1 implementation is high-quality, security-conscious, and largely correct for its narrow scope (enrollment + relay only; no capture/input/desktop in Session 0). It correctly introduces the LocalSystem service as the single autostart for managed agents (embedded config), wires cooperative shutdown via `Arc<AtomicBool>` (observed both in outer `run_agent` reconnects and inner `SessionManager::run_with_tray` connected loop via `SERVICE_STOP_SENTINEL`), uses `panic::catch_unwind(AssertUnwindSafe)` + downcast around the Tokio runtime to protect the SCM FFI boundary (mapping to `ServiceSpecific(1)` for recovery), implements idempotent SCM install/uninstall with bounded retry for `ERROR_SERVICE_MARKED_FOR_DELETE`, configures crash recovery via `sc failure`, skips tray/autostart HKCU Run for the service path (preventing double-agents and Session-0 nonsense), and documents Phase 2 seams (broker, `CreateProcessAsUserW`, `SESSIONCHANGE`, IPC) extensively in comments. Adherence to CLAUDE.md is excellent: `tracing` (no `println` in core paths), `anyhow`/`thiserror`, async, no hardcoded secrets, strict auth (managed path always requires cak_/enroll material or legacy key; support codes only for interactive `run_agent_mode`; no unauth agent paths), UUIDs/soft-delete conventions (on server), etc. High-privilege SYSTEM surface is handled with care (ACL'd credential store readable only in-context, fail-fast guards retained as safety net, best-effort status reporting).
+
+However, there are a handful of correctness bugs (especially the non-elevated managed "fallback" that doesn't actually run an agent, and unconditional `agent_id` churn for all embedded-config managed agents on every restart), several `unwrap()`s in hot paths (session loop), a stale `#[allow(dead_code)]`, and some pre-existing but now-in-scope fragile unsafe/transmute patterns in registry code touched by the autostart changes. Service lifecycle, status reporting, and install/retry logic are solid with only minor nits. Phase 2 gaps are clearly and repeatedly called out (no omissions). Light broader review of recent enrollment/identity (SPEC-016) work (which this builds directly on) found related issues in config load priority + agent_id stability and a couple of test-only unwraps. No evidence of race conditions in the AtomicBool propagation or handler blocking. Overall: mergeable with fixes for the flagged bugs; strong engineering but not flawless on edge cases for a SYSTEM RCE surface.
+
+**Issue counts (by severity):** 2 bug, 1 bug/suggestion, 4 suggestion/nit, 1 nit. Total 8. (No issues found in: auth enforcement for agents, use of tracing/anyhow, panic guard intent + sentinel contract (tests pin it), no-double-agent logic on happy path, credential ACL + C1 verification when running as SYSTEM, Windows service handler non-blocking + recovery config, or Phase 2 documentation.)
+
+## Issues
+
+### Issue 1 -- Severity: bug
+- File: agent/src/main.rs:496 (inside `run_permanent_agent_managed`, called from PermanentAgent detection at 320 and 186)
+- Description: When `install::install(false)` fails (e.g., the managed binary is launched interactively without Administrator rights), the code does `warn!(... falling back to in-process agent for this run); return run_agent_mode(None);`. The surrounding comment (and similar in 294-296) claims this ensures "the machine is not left with no agent at all." However, for a managed/PermanentAgent binary (which has embedded `enrollment_key` + `site_code` or a prior `cak_`), `run_agent_mode` → `resolve_agent_credential` will *always* fail: (a) if a `cak_` exists, `load_cak` hits `LoadCakError::Io { permission_denied: true }` and the explicit guard returns a hard error ("must run as the GuruConnect SYSTEM service"); (b) if no `cak_` yet, it calls `enroll::run_enrollment` → `credential_store::store_cak`, whose C1 read-back verification (`load_cak` immediately after write) will see the SYSTEM+Admins ACL and return the perm-denied error, causing `store_cak` and thus enrollment to fail. The "fallback" path therefore exits with error (no relay connection) in the exact case it was meant to save. This is a direct correctness gap for the high-privilege managed path.
+- Suggestion: Remove the fallback for the `has_embedded_config()` / managed case (surface a clear error requiring elevation for first-run managed install, as already stated in `install_managed_service` docs). Or introduce a true degraded fallback (e.g., force a per-user cak_ path or legacy api_key mode) with loud warnings. Update all comments and the "falling back" log. Consider a distinct code path or flag so non-elevated managed binaries don't pretend to provide agent functionality.
+- Status: open
+
+### Issue 2 -- Severity: bug
+- File: agent/src/config.rs:392 (in `Config::load` embedded branch, lines 382-408; also interacts with `detect_run_mode:255`, `has_embedded_config:285`, `read_embedded_config:290`, and PermanentAgent paths in main.rs:166+)
+- Description: Embedded config (the trigger for `RunMode::PermanentAgent` / managed installs per SPEC-016) never includes an `agent_id`. On every `load()` via the embedded priority-1 path, it does `agent_id: generate_agent_id()` (fresh v4 UUID) unconditionally, then `let _ = config.save()`. Because `load()` always prefers `read_embedded_config().is_ok()` (the exe still carries the magic blob post-install), the toml written by `save()` is *never read back* for agent_id on subsequent launches. Consequence: every service start/restart (and every interactive launch of a managed binary) produces a brand-new `agent_id`. This value is passed to `WebSocketTransport::connect(..., &self.config.agent_id, ...)` (session/mod.rs:121) and used for server-side connection identity/tracking. While `machine_uid` + the `cak_`-bound machine row are the stable dedup keys (per identity.rs and server enroll), churning `agent_id` on every restart is still observable churn in logs, sessions, and any agent_id-keyed state. The comment "Save to file for persistence (so agent_id is preserved)" is actively false for the embedded case that managed agents always take.
+- Suggestion: In the embedded construction block, first try loading an existing `agent_id` from the on-disk toml (via a helper or the file-load logic) if present and non-empty; only generate if absent. Alternatively, persist the generated agent_id into the embedded blob at installer creation time (or treat the server's minted agent_id as authoritative and stop sending client-generated one for managed agents). This is in the "recent enrollment/identity code" under light broader review.
+- Status: open
+
+### Issue 3 -- Severity: bug
+- File: agent/src/session/mod.rs:386 (recv path), 486 (chat poll), 559 (frame send) — all inside `run_with_tray` after the `if self.transport.is_none() { bail }` guard at 316 and before the connectivity check at 579-587.
+- Description: Direct `.unwrap()` on `self.transport.as_mut().unwrap()` (and `.as_ref()` in the is_connected check is safe, but the mut ones are not). These execute on *every* message, every outgoing chat, and every video frame while streaming/connected. Although current control flow (transport only set in `connect()`, never set to `None` inside the loop, `release_streaming` doesn't touch it, and the bottom-of-loop check breaks before next iteration) makes them "safe" today, this violates the review rule to flag `unwrap()` in hot paths. A future change (e.g., error path that clears transport, or making run_with_tray re-entrant, or transport becoming Option in more states) turns this into a panic of the entire agent (especially bad under the SYSTEM service). Also present in the stop-sentinel path (which does use `if let Some` for close).
+- Suggestion: Replace with `.expect("transport must be Some inside run_with_tray after the NotConnected guard and while the outer loop considers us connected")` (or better, hold the transport behind a non-Option after the initial check, or use `if let Some(t) = ... { ... } else { break; }` guards). Audit similar patterns elsewhere in the connected loop.
+- Status: open
+
+### Issue 4 -- Severity: suggestion
+- File: agent/src/main.rs:442-466 (the `catch_unwind` + match in `run_managed_agent_service`); related: service/mod.rs:103-107 (service_main), 112 (run_service body), 167 (the call site), 250-256 (comment explaining finding M)
+- Description: The required panic guard across the SCM FFI boundary (`extern "system" ffi_service_main -> service_main -> run_service -> run_managed...`) exists and correctly converts unwind to `Err(...)` → `ServiceExitCode::ServiceSpecific(1)` (so SCM recovery restarts instead of UB/abort). The downcast logic is careful (handles &str/String, falls back without re-panic). However, the `catch_unwind(AssertUnwindSafe(|| { rt.block_on(...) }))` only covers the *inner agent runtime*. Panics originating in `run_service()` itself (e.g., in initial `service_control_handler::register`, the two `set_status` calls before/after the agent loop, the `if let Err` logging, or `set_status_with_exit` for Stopped) would still unwind out of `run_service`/`service_main` across the FFI with no guard. The top-level `if let Err(e) = run_service()` in service_main cannot catch a panic.
+- Suggestion: Wrap more of `run_service()` (or the whole body after the shutdown flag setup) in an outer `catch_unwind`, or add a guard inside `service_main`. This makes the "panic guard across SCM FFI" complete rather than scoped only to the "finding M" agent block_on. (Current scope is probably sufficient in practice, as panics are unlikely in the thin status/reporting code.)
+- Status: open
+
+### Issue 5 -- Severity: suggestion
+- File: agent/src/service/mod.rs:353 (`std::thread::sleep(BACKOFF)` inside `create_service_with_retry` loop), 470-476 (`stop_if_running` 10x 500ms polling loop); also called from install.rs:327/369 and startup paths changed by SPEC-018.
+- Description: These blocking sleeps are *only* in the synchronous install/uninstall code paths (CLI `guruconnect install` / uninstall, which run under the current user's context before handing off to the service). They are not inside the SCM control `event_handler` closure (which correctly does only `store(true)` + `NoError` / `NotImplemented` with no I/O or waits — satisfying "no blocking in handler"). However, they are still blocking the main thread during (re)install, and the fixed polling is a minor source of non-determinism/race on slow SCM. The retry logic itself (gated on `deleted_existing`, bounded attempts) is a clear improvement over the prior fixed 2s sleep.
+- Suggestion: Acceptable for install paths. Consider `std::thread::sleep` → a small async sleep if these ever move under tokio, or document "install-time only." The 500ms stop poll is fine for best-effort delete prep.
+- Status: open
+
+### Issue 6 -- Severity: bug (pre-existing, but now in scope due to SPEC-018 touching autostart/install/uninstall paths)
+- File: agent/src/startup.rs:61 (`std::mem::transmute::<HANDLE, HKEY>(hkey)` after `RegOpenKeyExW`), 122-125 (identical in `remove_from_startup`), plus similar raw pointer casts and unsafe registry blocks in install.rs:152+ (protocol handler), 380+ (is_protocol_handler_registered). Also related unsafe in credential_store (DPAPI) and identity (RegGetValueW) which are part of the enrollment/identity surface reviewed.
+- Description: The transmute from a `HANDLE` (obtained via the windows crate's `RegOpenKeyExW` call site that takes a `*mut _`) to `HKEY` for subsequent `RegSetValueExW`/`RegDeleteValueW` is a fragile, non-idiomatic, and potentially unsound hack. The `windows` 0.58 crate provides properly typed `HKEY` and safe-ish wrappers; mixing `HANDLE` + transmute + `as *mut _` casts risks wrong handle values, leaks, or UB. This code predates SPEC-018 but is executed on the managed install path (which does `remove_from_startup` inside `install_managed_service` and `uninstall_managed_service`).
+- Suggestion: Refactor to use the crate's typed registry APIs directly (pass `&mut HKEY` to the open call, or use `RegOpenKeyExW` overloads that return the handle type). Add `// SAFETY:` comments with justification for any remaining FFI. This is a correctness + maintainability issue for registry paths now used by the SYSTEM service host.
+- Status: open
+
+### Issue 7 -- Severity: nit
+- File: agent/src/transport/websocket.rs:208 (`pub async fn close(&mut self) -> Result<()> { ... }`)
+- Description: The method has `#[allow(dead_code)]` (and the call site in the pre-SPEC-018 code may have been conditional). SPEC-018 now calls it unconditionally in the service-stop path: `session/mod.rs:1134` (`if let Some(transport) = self.transport.as_mut() { if let Err(e) = transport.close().await { ... } }`). The allow is now misleading (though harmless, as the fn is live).
+- Suggestion: Remove `#[allow(dead_code)]`.
+- Status: open
+
+### Issue 8 -- Severity: nit (light broader review of recent enrollment/identity code)
+- File: agent/src/enroll.rs:329,359,360,373,380 (all in `#[cfg(test)]`); also scattered test unwraps in credential_store.rs:379+ (DPAPI tests), encoder tests, etc.
+- Description: Tests use `.unwrap()` liberally (e.g., `serde_json::to_value(&req).unwrap()`, response parses, `dpapi_protect(...).expect(...)`). This is conventional and acceptable in unit tests (they are expected to succeed on the test host), but still "remaining .unwrap()". No runtime impact. (The production paths in enroll use proper error classification into `AttemptError` and never unwrap on HTTP/JSON.)
+- Suggestion: For hygiene, change test unwraps to `?` inside `#[test] fn` (which can return Result) or `.expect("test precondition")`. Not a priority.
+- Status: open
+
+**No issues found in these areas (explicitly checked):**
+- Authentication / never accepting unauthenticated agents: Managed service path always forces `config.support_code = None` and goes through `resolve_agent_credential` (which requires cak_, enrollment material, or deprecated api_key; errors hard on nothing usable). Support-code sessions are exclusively the interactive `run_agent_mode` path. Viewer WS requires JWT (out of scope here). `run_dispatcher` / `service-run` cannot be usefully invoked interactively.
+- Cooperative shutdown + graceful WS close: AtomicBool (SeqCst) set only in handler; polled at top of connected loop (every ~1-100ms) *and* in reconnect backoff (250ms); sentinel error drives clean `transport.close()` (sends WS Close frame) + `return Ok(())` with no reconnect. Tests pin the sentinel contract and "no regression for non-service paths."
+- Service lifecycle / status / recovery / no blocking in handler: Correct `StartPending` → `Running` (accept STOP|SHUTDOWN) → (on stop) `StopPending` → `Stopped` (with Win32(0) or ServiceSpecific(1)). Handler closure is trivial move + store + match. `configure_recovery` + `sc failure` for restart-on-crash. Idempotent create/delete with retry for SCM delete races. `is_service_installed` is total (never panics).
+- No double-agents / autostart hygiene (happy path): `run_permanent_agent_managed` early-exits if service installed; managed install removes HKCU Run (best-effort); service path skips `add_to_startup` and tray creation entirely. Non-managed paths untouched.
+- Credential handling as SYSTEM: `run_managed_agent_service` runs as SYSTEM → `load_cak` (and `store_cak` during enroll) succeed because ACL grants SYSTEM full control. The perm-denied guard + C1 read-back in store_cak are retained exactly as safety net for non-SYSTEM invocations of managed binaries. No secrets logged.
+- Use of tracing/anyhow/thiserror/async/clippy-adjacent style: New code uses `tracing::{error,info,warn}`, `anyhow::Result` + `?` + `.context`, `thiserror` for `LoadCakError`. No `println!` in agent runtime (only in version-info / fallback message boxes / separate sas_service bin).
+- Phase 2 gaps called out: Explicitly and repeatedly in `service/mod.rs` module-level docs (lines 24-41), inline comments (e.g., 124, 422 in main, 1085+ in session for the "no capture yet" rationale and seams for broker/SESSIONCHANGE/CreateProcessAsUser/IPC). No silent omissions.
+- Other CLAUDE.md: Single static binary target, Win7+ (PS/CIM used for identity are present), no redist deps (icacls/sc are inbox), DB conventions on server side (UUID PKs, snake_case, events for audit), etc.
+
+**Executive verdict:** SPEC-018 Phase 1 is a solid, well-defended implementation of the SYSTEM service host with correct lifecycle, shutdown, and guardrails. The two "bug" issues (non-functional fallback on install failure for managed agents; agent_id churn on every restart) are real correctness problems that should be fixed before wider managed deployment, as they affect reliability of the high-privilege path and identity stability. The unwraps and registry transmute are lower-severity but worth cleaning for a SYSTEM binary. The rest of the work (including the excellent documentation of what's deliberately left for Phase 2) meets or exceeds the project's standards and the requirements in CLAUDE.md. Recommend addressing Issues 1-3 (and 6) prior to release; the rest can be follow-ups. The changes do not introduce new auth holes or unauthenticated agent paths.
+
+**File written:** D:\GrokTools\guru-connect-review-SPEC018.md
+
+(End of review notes.)
--- a/session-logs/2026-06-03-session.md
+++ b/session-logs/2026-06-03-session.md
@@ -0,0 +1,63 @@
+# Session Log — 2026-06-03 — GuruConnect SPEC-018 review validation + fixes
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+---
+
+## Session Summary
+
+Mike forwarded a thorough external code review of GuruConnect SPEC-018 Phase 1 (managed agent as LocalSystem service host; merge 11af9df) performed by a Grok reviewer persona and written to `D:\GrokTools\guru-connect-review-SPEC018.md`. Task: look over the project and validate.
+
+Independently validated the two flagged bugs and Issue 6 by reading the actual code (not just relaying). Confirmed all three as real. Added a refinement the review missed: the non-functional managed fallback (Bug 1) *does* still work for a deprecated legacy-`api_key` managed binary, but is broken specifically for the modern SPEC-016 enrollment path — sharpening the fix.
+
+Copied the review into the project at `reports/2026-06-03-spec018-review.md`, claimed a coord lock on `guruconnect`, created branch `fix/spec018-review-bugs`, and had the Coding Agent implement the three fixes. `cargo check -p guruconnect --target x86_64-pc-windows-msvc` passes clean (no errors/warnings). Filed Gitea issue #8 for the deferred lower-severity items. Changes remain uncommitted on the branch pending Mike's PR-vs-direct-to-main decision.
+
+---
+
+## Key Decisions
+
+- **Validated, did not rubber-stamp.** Read the code at each cited location to confirm Bug 1 (main.rs:496), Bug 2 (config.rs:392), Issue 6 (startup.rs transmute) before acting.
+- **Bug 1 fix = remove the fallback, surface an elevation error** (rather than build a degraded fallback). Matches `install_managed_service` docs; the managed model is elevated-install. The deprecated legacy-key edge case also errors now — acceptable and honest.
+- **Bug 2 fix = read persisted agent_id from the TOML first**, generate only if absent — stops agent_id churn on every restart while keeping machine_uid/cak_ as the stable keys.
+- **Issue 6 fix = typed `HKEY` from the windows crate** (no `HANDLE`+transmute). `install.rs` was already typed (no change).
+- **Deferred Issues 3/4/5/7/8** (hot-path unwraps, panic-guard scope, nits) to Gitea #8 — lower severity, follow-ups.
+- **No commit yet** — branch held for human review of diffs + PR-vs-main choice.
+
+---
+
+## Configuration Changes
+
+**In submodule `projects/msp-tools/guru-connect` (branch `fix/spec018-review-bugs`, UNCOMMITTED):**
+- `agent/src/config.rs` — added `Config::persisted_agent_id()`; embedded branch now `agent_id: Self::persisted_agent_id().unwrap_or_else(generate_agent_id)`; corrected comment.
+- `agent/src/main.rs` — `run_permanent_agent_managed`: removed `run_agent_mode(None)` fallback, now `error!` + `Err(...)` requiring elevation; updated doc/inline comments.
+- `agent/src/startup.rs` — replaced `transmute::<HANDLE,HKEY>` with `HKEY::default()` + `&mut hkey`; added SAFETY comments.
+- Created `reports/2026-06-03-spec018-review.md` (copy of the external review).
+- Stray untracked `tmp-spec018.diff` left untouched (from the Grok session).
+
+---
+
+## Commands & Outputs
+
+- Validation greps/reads: `run_permanent_agent_managed` at main.rs:482, fallback at :496; `Config::load` embedded branch config.rs:382-409 (`agent_id: generate_agent_id()` unconditional, save() never read back); `resolve_agent_credential` main.rs:515 (load_cak permission_denied guard / enroll C1 read-back).
+- `cargo check -p guruconnect --target x86_64-pc-windows-msvc` → Finished clean, no warnings from the changes.
+- Coord lock id `0cfd6269-4548-46d4-8436-c829e42f79d8` (guruconnect / agent/src, ttl 2h, GURU-5070/claude-main).
+
+---
+
+## Pending / Incomplete Tasks
+
+- **Awaiting Mike's decision:** push branch + open PR (recommended, matches SPEC-018 PR #7 convention) vs. commit straight to `main`.
+- On decision: commit the 3 fixes + the review report, push, (PR/merge), then bump the parent-repo submodule pointer on next `/sync`, update the coord `guruconnect` component, and release lock `0cfd6269`.
+- Deferred hardening: Gitea **guru-connect#8** (Issues 3/4/5/7/8).
+
+---
+
+## Reference Information
+
+- External review: `D:\GrokTools\guru-connect-review-SPEC018.md` → copied to `reports/2026-06-03-spec018-review.md`.
+- Branch: `fix/spec018-review-bugs` (off `main` @ 11af9df).
+- Gitea issue: https://git.azcomputerguru.com/azcomputerguru/guru-connect/issues/8
+- Files: `agent/src/{config.rs,main.rs,startup.rs}`.