fix(agent): SPEC-018 Phase 1 review fixes (cancellable session loop, panic guard, service-create retry)
All checks were successful
Build and Test / Build Agent (Windows) (pull_request) Successful in 10m23s
Build and Test / Build Server (Linux) (pull_request) Successful in 14m47s
Build and Test / Security Audit (pull_request) Successful in 5m29s
Build and Test / Build Summary (pull_request) Successful in 20s

H: thread the SCM cooperative-stop flag into the connected session loop
(run_with_tray) via a new Option<&Arc<AtomicBool>> param. The flag was only
observed by the outer run_agent reconnect loop, which never runs while a
session is connected, so an SCM Stop/Shutdown left the service Running until
force-kill. The inner loop now checks it each tick, closes the WS cleanly, and
returns the SERVICE_STOP sentinel that the outer loop maps to a graceful stop.
The new param is optional: attended/viewer/interactive callers pass None and
behave exactly as before.
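
A minimal sketch of the pattern in H, not the actual run_with_tray code.
SessionExit, session_loop, and poll_session are hypothetical names used only
for illustration; the real loop drives the WebSocket and tray instead of a
closure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical outcome type standing in for the SERVICE_STOP sentinel.
#[derive(Debug, PartialEq)]
enum SessionExit {
    ServiceStop,
    Disconnected,
}

/// Hypothetical stand-in for the connected session loop.
/// `poll_session` returns false once the peer has disconnected.
fn session_loop(
    shutdown: Option<&Arc<AtomicBool>>,
    mut poll_session: impl FnMut() -> bool,
) -> SessionExit {
    loop {
        // Check the cooperative-stop flag on every tick. Callers that pass
        // None (attended/viewer/interactive) never take this branch.
        if shutdown.map_or(false, |flag| flag.load(Ordering::SeqCst)) {
            // The real loop would close the WebSocket cleanly here.
            return SessionExit::ServiceStop;
        }
        if !poll_session() {
            return SessionExit::Disconnected;
        }
        // Short idle tick; the interval here is an arbitrary choice.
        std::thread::sleep(Duration::from_millis(1));
    }
}
```

Passing None leaves the loop with a single exit path (disconnect), which is
why the optional parameter does not change behavior for existing callers.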

M: wrap the managed-agent runtime block_on in catch_unwind(AssertUnwindSafe) so
a panic in the agent future cannot unwind across the extern "system" service
entry (UB/abort). A caught panic becomes an Err -> ServiceExitCode::ServiceSpecific(1)
so SCM recovery engages cleanly.
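
A sketch of the guard in M under simplifying assumptions: a plain closure
stands in for the runtime block_on call, and run_guarded and exit_code are
hypothetical helpers, not functions from the agent. Note that catch_unwind
only catches unwinding panics; it has no effect when the crate is built with
panic=abort.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical wrapper: run `body` and turn a panic into an Err so it never
/// unwinds past this frame.
fn run_guarded<F>(body: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(body)) {
        // No panic: pass the body's own result through unchanged.
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("agent panicked: {msg}"))
        }
    }
}

/// Hypothetical mapping onto a service-specific exit code: 0 on success,
/// 1 on any error (including a caught panic).
fn exit_code(result: &Result<(), String>) -> u32 {
    match result {
        Ok(()) => 0,
        Err(_) => 1,
    }
}
```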

L1: replace the fixed 2s sleep after delete() on reinstall with a bounded retry
on CreateService returning ERROR_SERVICE_MARKED_FOR_DELETE (1072), gated on
having actually deleted a prior instance.
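
A generic sketch of the retry shape in L1, useful for checking the gating and
the worst-case wait. create_with_retry, its backoff parameter, and
worst_case_wait are hypothetical; the closure stands in for CreateService and
returns a raw OS error code on failure. The constants mirror the ones in the
diff.

```rust
use std::time::Duration;

const ERROR_SERVICE_MARKED_FOR_DELETE: i32 = 1072;
const MAX_ATTEMPTS: u32 = 5;
const BACKOFF: Duration = Duration::from_millis(400);

/// Hypothetical generic form of the bounded retry. `backoff` is a parameter
/// here only so the sketch can be exercised without real sleeps.
fn create_with_retry<T>(
    deleted_existing: bool,
    backoff: Duration,
    mut create: impl FnMut() -> Result<T, i32>,
) -> Result<T, i32> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match create() {
            Ok(value) => return Ok(value),
            // Retry only if we deleted a prior instance, the error is 1072,
            // and attempts remain.
            Err(code)
                if deleted_existing
                    && code == ERROR_SERVICE_MARKED_FOR_DELETE
                    && attempt < MAX_ATTEMPTS =>
            {
                std::thread::sleep(backoff);
            }
            // Any other error, or the final attempt, surfaces as-is.
            Err(code) => return Err(code),
        }
    }
}

/// Worst-case time spent sleeping: one backoff between consecutive attempts.
fn worst_case_wait() -> Duration {
    BACKOFF * (MAX_ATTEMPTS - 1)
}
```

With 5 attempts and a 400ms backoff there are 4 sleeps, so the worst case is
1.6s of waiting, inside the old fixed 2s ceiling.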

L2: clarify the --elevated -> force_user_install mapping (comment only).

N1: add a clap-metadata test pinning the service-run subcommand name to
SERVICE_RUN_ARG, cross-linked from the existing literal test.

N2: correct the service doc comments now that graceful stop interrupts the
connected case too.

Verified on Windows host: cargo fmt --check, clippy -D warnings, release build
(x86_64-pc-windows-msvc), and cargo test (58 passed) all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 13:57:41 -07:00
parent 7602b4346a
commit a0e0d5f1e7
3 changed files with 258 additions and 17 deletions

@@ -12,7 +12,11 @@
//! relay WSS connection.
//! 2. Report a correct service lifecycle to the SCM (`StartPending` ->
//! `Running` -> `StopPending` -> `Stopped`) and handle `Stop`/`Shutdown`
-//! gracefully (signal the agent loop to close the WS connection and exit).
+//! gracefully. The control handler sets a shared shutdown flag; the agent
+//! runtime observes it both between reconnect attempts AND inside the
+//! connected session loop (SPEC-018 finding H), so a stop received while a
+//! session is live breaks out promptly, closes the WS connection cleanly,
+//! and exits — rather than waiting for the SCM to force-kill.
//! 3. Provide install/uninstall of the service (LocalSystem, auto-start, crash
//! recovery) so managed mode uses the service as its single autostart
//! instead of the per-user `HKCU\…\Run` entry.
@@ -122,6 +126,11 @@ fn run_service() -> Result<()> {
// we intentionally do not accept SESSIONCHANGE yet.
ServiceControl::Stop | ServiceControl::Shutdown => {
info!("received {control_event:?}; signalling agent to shut down");
+// Set the cooperative-stop flag. The agent runtime observes it on
+// every idle tick of the connected session loop and between
+// reconnect attempts (SPEC-018 finding H), so it breaks out and
+// closes the WebSocket cleanly within ~100ms even if a session is
+// currently connected.
shutdown_for_handler.store(true, Ordering::SeqCst);
ServiceControlHandlerResult::NoError
}
@@ -253,6 +262,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
.context("failed to connect to the Service Control Manager (run as Administrator)")?;
// Remove any prior installation so the binary path / args are refreshed.
+let mut deleted_existing = false;
if let Ok(existing) = manager.open_service(
SERVICE_NAME,
ServiceAccess::QUERY_STATUS | ServiceAccess::STOP | ServiceAccess::DELETE,
@@ -263,9 +273,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
.delete()
.context("failed to delete the existing service before reinstall")?;
drop(existing);
-// The SCM marks a service for deletion but only removes it once all handles
-// close; a brief settle avoids a CreateService "marked for deletion" race.
-std::thread::sleep(Duration::from_secs(2));
+deleted_existing = true;
}
let service_info = ServiceInfo {
@@ -282,8 +290,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
account_password: None,
};
-let service = manager
-.create_service(&service_info, ServiceAccess::CHANGE_CONFIG)
+let service = create_service_with_retry(&manager, &service_info, deleted_existing)
.context("failed to create the GuruConnect managed agent service")?;
service
@@ -300,6 +307,56 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
Ok(())
}
+/// Create the service, retrying briefly if the SCM still has the prior instance
+/// "marked for deletion" (SPEC-018 finding L1).
+///
+/// When a service is deleted, the SCM only removes it from its database once every
+/// open handle to it closes; until then a fresh `CreateService` fails with
+/// `ERROR_SERVICE_MARKED_FOR_DELETE` (1072). The previous implementation papered
+/// over this with a fixed 2s sleep after `delete()`, which is both slower than
+/// necessary in the common case and still racy on a busy box. Instead we attempt
+/// the create immediately and, only if we just deleted an existing instance and
+/// hit 1072, retry a few times with short backoff — succeeding as soon as the SCM
+/// finishes the removal, and giving up with the real error if it never does.
+///
+/// The retry is gated on `deleted_existing`: on a clean first install there was no
+/// prior instance, so a 1072 there is unexpected and is surfaced immediately
+/// rather than masked by retries.
+fn create_service_with_retry(
+manager: &ServiceManager,
+service_info: &ServiceInfo,
+deleted_existing: bool,
+) -> Result<windows_service::service::Service, windows_service::Error> {
+// ERROR_SERVICE_MARKED_FOR_DELETE (winerror.h). The service is gone from the
+// caller's perspective but the SCM has not finished reaping it.
+const ERROR_SERVICE_MARKED_FOR_DELETE: i32 = 1072;
+// Bounded: ~5 attempts over ~2s total worst case (matches the old fixed sleep
+// ceiling) but returns the instant the SCM is ready.
+const MAX_ATTEMPTS: u32 = 5;
+const BACKOFF: Duration = Duration::from_millis(400);
+let mut attempt = 0;
+loop {
+attempt += 1;
+match manager.create_service(service_info, ServiceAccess::CHANGE_CONFIG) {
+Ok(service) => return Ok(service),
+Err(windows_service::Error::Winapi(ref io_err))
+if deleted_existing
+&& io_err.raw_os_error() == Some(ERROR_SERVICE_MARKED_FOR_DELETE)
+&& attempt < MAX_ATTEMPTS =>
+{
+warn!(
+"{SERVICE_NAME} still marked for deletion by the SCM \
+(attempt {attempt}/{MAX_ATTEMPTS}); retrying in {}ms",
+BACKOFF.as_millis()
+);
+std::thread::sleep(BACKOFF);
+}
+Err(e) => return Err(e),
+}
+}
+}
/// Configure SCM crash-recovery so the service restarts on unexpected exit.
///
/// `windows-service` 0.7 does not expose `ChangeServiceConfig2` recovery actions
@@ -429,6 +486,12 @@ mod tests {
/// `service-run` subcommand `main.rs` dispatches into [`run_dispatcher`]; a
/// mismatch would register a service the SCM could start but that would fall
/// through to normal (non-service) mode and immediately exit.
+///
+/// This pins the value of the constant itself. The companion test
+/// `tests::service_run_subcommand_matches_scm_launch_arg` in `main.rs` pins the
+/// other half — that the clap `#[command(name = "service-run")]` attribute on
+/// `Commands::ServiceRun` resolves to this same constant — so the two string
+/// literals cannot silently drift apart.
#[test]
fn service_run_arg_matches_subcommand_name() {
assert_eq!(SERVICE_RUN_ARG, "service-run");