fix(agent): SPEC-016 Phase B review fixes (re-image-stable machine_uid, ACL TOCTOU, load_cak error classes, PS timeout, fail-fast guard)
H1: derive machine_uid from the durable hardware salt ALONE (SMBIOS UUID, or board+disk serial) plus a fixed namespace, so it survives an OS re-image (which regenerates MachineGuid). MachineGuid is demoted to a last-resort signal used only when no hardware salt is readable (volatile, reboot-only floor). Re-image stability proven by salted_uid_is_reimage_stable_independent_of_machine_guid. H2: in store_cak, lock the directory ACL BEFORE any secret bytes are written; the temp file is created inside the already-locked dir, then renamed. No ciphertext ever exists at an inherited/world-readable path. Ordering made an explicit precondition, not an unstated inheritance assumption. M1: load_cak now returns a LoadCakError enum distinguishing Io (incl. PermissionDenied — operational) from Decrypt (the real tamper/wrong-machine signal). Only a successful READ whose DPAPI decrypt fails hard-stops. M2: the PowerShell SMBIOS/board/disk shell-out is spawned and waited on with a 10s wall-clock bound; on timeout the child is killed and the signal is treated as missing (falls back through the chain), never panics. Keeps CREATE_NO_WINDOW -NonInteractive -NoProfile. L1: warn! breadcrumb when the salted derivation degrades to MachineGuid-only, so the server-side collision-gate operator has a clue. No secret values logged. C1: keep the SYSTEM+Administrators ACL (Option A target). store_cak now does a read-back verification immediately after writing and fails at ENROLL time if this context cannot read its own store; resolve_agent_credential fails fast with an actionable SPEC-017 message on an access-denied store instead of silently re-enrolling/bricking. Guarded comment notes this is satisfied once the SYSTEM service host lands. Deferred items (clear_cak placeholder, legacy api_key path) left as-is. Verification on x86_64-pc-windows-msvc: cargo fmt --check clean, clippy -D warnings clean, release build OK, 52 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -32,6 +32,40 @@
|
||||
|
||||
use anyhow::{anyhow, Context, Result};
|
||||
use std::path::PathBuf;
|
||||
use thiserror::Error;
|
||||
|
||||
/// Failure classes for [`load_cak`], so callers can distinguish an *operational*
|
||||
/// problem (the file exists but this process cannot open/read it — e.g. running in
|
||||
/// the wrong security context against a SYSTEM-only-ACL'd store) from the real
|
||||
/// *tamper / wrong-machine* signal (the file was read successfully but DPAPI
|
||||
/// decryption failed).
|
||||
///
|
||||
/// The distinction matters for the run-mode resolver (`main.rs`):
|
||||
/// - [`LoadCakError::Io`] is recoverable/actionable — log it and STOP (do not
|
||||
/// silently re-enroll over a store we simply can't read in this context).
|
||||
/// - [`LoadCakError::Decrypt`] is a hard tamper signal — STOP, do not re-enroll.
|
||||
#[derive(Debug, Error)]
|
||||
pub enum LoadCakError {
|
||||
/// The store path could not be resolved (e.g. `%ProgramData%` unset).
|
||||
#[error("could not resolve credential store path: {0}")]
|
||||
Path(String),
|
||||
|
||||
/// An IO/open/read error reaching the stored blob — INCLUDING
|
||||
/// `PermissionDenied` (the running context lacks rights to the SYSTEM-only
|
||||
/// store). Operational, not a tamper signal.
|
||||
#[error("credential store is present but could not be read in this context: {source}")]
|
||||
Io {
|
||||
/// Whether this was specifically an access-denied error (drives the
|
||||
/// run-mode fail-fast guard in `main.rs`).
|
||||
permission_denied: bool,
|
||||
source: std::io::Error,
|
||||
},
|
||||
|
||||
/// The blob was read successfully but DPAPI decryption FAILED — the real
|
||||
/// tamper / wrong-machine / corruption signal. A hard stop; never re-enroll.
|
||||
#[error("stored credential failed to decrypt (wrong machine, tampered, or corrupted): {0}")]
|
||||
Decrypt(String),
|
||||
}
|
||||
|
||||
/// Directory holding the protected credential file.
|
||||
fn credentials_dir() -> Result<PathBuf> {
|
||||
@@ -49,57 +83,111 @@ fn cak_path() -> Result<PathBuf> {
|
||||
|
||||
/// Persist `cak` encrypted at rest.
|
||||
///
|
||||
/// 1. Ensures the credentials directory exists and is locked down (SYSTEM +
|
||||
/// Administrators full control, inheritance removed).
|
||||
/// 2. DPAPI-machine-encrypts the plaintext.
|
||||
/// 3. Writes the ciphertext to `agent.cak` and locks the file ACL too.
|
||||
/// Ordering is security-critical (H2 — TOCTOU): the directory ACL is locked
|
||||
/// BEFORE any secret bytes touch the filesystem, and the temp file is written
|
||||
/// INSIDE the already-locked directory, so no ciphertext ever exists at a path
|
||||
/// carrying an inherited (potentially world-readable) ACL:
|
||||
///
|
||||
/// 1. `create_dir_all(dir)` — ensure the directory exists.
|
||||
/// 2. `lock_down_acl(dir)` — remove inherited ACEs and grant SYSTEM +
|
||||
/// Administrators full control, made inheritable `(OI)(CI)` so children
|
||||
/// created afterward are covered. This is an explicit precondition for the
|
||||
/// write that follows — NOT an unstated inheritance assumption.
|
||||
/// 3. DPAPI-machine-encrypt the plaintext.
|
||||
/// 4. Write the ciphertext to a temp file inside the now-locked directory, then
|
||||
/// rename over the target (atomic-ish replace).
|
||||
/// 5. `lock_down_acl(file)` — assert the file's own ACL (belt-and-suspenders; the
|
||||
/// file already inherits the directory's restrictive ACEs).
|
||||
/// 6. C1 read-back: immediately attempt [`load_cak`] to PROVE the running
|
||||
/// security context can read its own store. If it cannot (e.g. a non-SYSTEM
|
||||
/// run wrote a SYSTEM-only store it can no longer read), fail HERE at enroll
|
||||
/// time with an actionable error — rather than silently bricking on the next
|
||||
/// boot when the steady-state path tries to load it.
|
||||
///
|
||||
/// Returns an error (never logs the plaintext) on any failure so the caller can
|
||||
/// surface it / retry. A partial write is replaced atomically via a temp file +
|
||||
/// rename within the same protected directory.
|
||||
/// surface it / retry.
|
||||
pub fn store_cak(cak: &str) -> Result<()> {
|
||||
// 1 + 2: lock the directory ACL BEFORE writing any secret (H2 / TOCTOU).
|
||||
let dir = credentials_dir()?;
|
||||
std::fs::create_dir_all(&dir)
|
||||
.with_context(|| format!("failed to create credentials dir {dir:?}"))?;
|
||||
lock_down_acl(&dir).context("failed to restrict credentials directory ACL")?;
|
||||
|
||||
// 3: encrypt only after the destination directory is locked down.
|
||||
let ciphertext = dpapi_protect(cak.as_bytes()).context("DPAPI encryption of cak_ failed")?;
|
||||
|
||||
// 4: write the temp file INSIDE the already-locked directory, then rename.
|
||||
let path = cak_path()?;
|
||||
// Atomic-ish replace: write to a sibling temp file, then rename over the
|
||||
// target. Both live in the already-locked-down directory, so the temp file
|
||||
// inherits the restrictive ACL (inheritance was re-enabled for children when
|
||||
// we granted on the dir; we still belt-and-suspenders lock the final file).
|
||||
let tmp = path.with_extension("cak.tmp");
|
||||
std::fs::write(&tmp, &ciphertext)
|
||||
.with_context(|| format!("failed to write temp credential file {tmp:?}"))?;
|
||||
std::fs::rename(&tmp, &path)
|
||||
.with_context(|| format!("failed to place credential file {path:?}"))?;
|
||||
|
||||
// 5: assert the file ACL too (the file already inherits the dir's ACEs).
|
||||
lock_down_acl(&path).context("failed to restrict credential file ACL")?;
|
||||
|
||||
tracing::info!("[ENROLL] stored per-machine credential (encrypted at rest)");
|
||||
Ok(())
|
||||
// 6: C1 read-back — confirm THIS context can read back what it just wrote.
|
||||
// Catches the "wrote a SYSTEM-only store from a non-SYSTEM context" footgun at
|
||||
// enroll time instead of as a silent brick on the next launch.
|
||||
match load_cak() {
|
||||
Ok(Some(_)) => {
|
||||
tracing::info!("[ENROLL] stored per-machine credential (encrypted at rest)");
|
||||
Ok(())
|
||||
}
|
||||
Ok(None) => Err(anyhow!(
|
||||
"stored the credential but read-back returned nothing — refusing to proceed \
|
||||
with an unverifiable credential store"
|
||||
)),
|
||||
Err(LoadCakError::Io {
|
||||
permission_denied: true,
|
||||
..
|
||||
}) => Err(anyhow!(
|
||||
"[ENROLL] wrote the credential store but cannot read it back in THIS security \
|
||||
context (access denied). The store is ACL'd to SYSTEM + Administrators by \
|
||||
design; the managed agent must run as the GuruConnect SYSTEM service (see \
|
||||
SPEC-017) to read it. Refusing to leave an unreadable store behind."
|
||||
)),
|
||||
Err(e) => Err(anyhow::Error::new(e)
|
||||
.context("stored the credential but the immediate read-back verification failed")),
|
||||
}
|
||||
}
|
||||
|
||||
/// Load and decrypt the stored `cak_`, or `None` if no credential is stored yet.
|
||||
/// Load and decrypt the stored `cak_`, or `Ok(None)` if no credential is stored.
|
||||
///
|
||||
/// A present-but-undecryptable blob (e.g. copied from another machine, or
|
||||
/// corrupted) is treated as a hard error rather than `None`, so the caller does
|
||||
/// not silently re-enroll over a tampered store without noticing.
|
||||
pub fn load_cak() -> Result<Option<String>> {
|
||||
let path = cak_path()?;
|
||||
/// Error classification (M1) — the caller MUST treat these differently:
|
||||
/// - `Ok(None)` -> no store yet (NotFound or empty); enroll is fine.
|
||||
/// - [`LoadCakError::Io`] -> the store exists but is unreadable in this
|
||||
/// context (open/read error, INCLUDING access-denied). Operational; the caller
|
||||
/// logs it and STOPS — it must NOT silently re-enroll over a store it merely
|
||||
/// cannot read here.
|
||||
/// - [`LoadCakError::Decrypt`] -> the bytes were read but DPAPI decryption
|
||||
/// FAILED (wrong machine / tampered / corrupted). A hard tamper signal; STOP.
|
||||
///
|
||||
/// Only a successful READ whose decrypt fails is the tamper signal — an IO or
|
||||
/// permission error is never conflated with tamper.
|
||||
pub fn load_cak() -> std::result::Result<Option<String>, LoadCakError> {
|
||||
let path = cak_path().map_err(|e| LoadCakError::Path(e.to_string()))?;
|
||||
let ciphertext = match std::fs::read(&path) {
|
||||
Ok(bytes) => bytes,
|
||||
Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(None),
|
||||
Err(e) => return Err(e).with_context(|| format!("failed to read {path:?}")),
|
||||
Err(e) => {
|
||||
let permission_denied = e.kind() == std::io::ErrorKind::PermissionDenied;
|
||||
return Err(LoadCakError::Io {
|
||||
permission_denied,
|
||||
source: e,
|
||||
});
|
||||
}
|
||||
};
|
||||
if ciphertext.is_empty() {
|
||||
return Ok(None);
|
||||
}
|
||||
let plaintext = dpapi_unprotect(&ciphertext)
|
||||
.context("DPAPI decryption of stored cak_ failed (wrong machine or corrupted blob?)")?;
|
||||
let cak =
|
||||
String::from_utf8(plaintext).context("stored cak_ was not valid UTF-8 after decryption")?;
|
||||
// Reaching here means the READ succeeded — so a decrypt failure now IS the real
|
||||
// tamper / wrong-machine signal (never conflated with an IO/permission error).
|
||||
let plaintext =
|
||||
dpapi_unprotect(&ciphertext).map_err(|e| LoadCakError::Decrypt(e.to_string()))?;
|
||||
let cak = String::from_utf8(plaintext)
|
||||
.map_err(|e| LoadCakError::Decrypt(format!("decrypted bytes were not valid UTF-8: {e}")))?;
|
||||
if cak.is_empty() {
|
||||
return Ok(None);
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user