fix(agent): SPEC-018 Phase 1 review fixes (cancellable session loop, panic guard, service-create retry)
All checks were successful
Build and Test / Build Agent (Windows) (pull_request) Successful in 10m23s
Build and Test / Build Server (Linux) (pull_request) Successful in 14m47s
Build and Test / Security Audit (pull_request) Successful in 5m29s
Build and Test / Build Summary (pull_request) Successful in 20s
H: thread the SCM cooperative-stop flag into the connected session loop (run_with_tray) via a new Option<&Arc<AtomicBool>> param. The flag was only observed by the outer run_agent reconnect loop, which never runs while a session is connected, so an SCM Stop/Shutdown left the service Running until force-kill. The inner loop now checks it each tick, closes the WS cleanly, and returns the SERVICE_STOP sentinel that the outer loop maps to a graceful stop. The new param is optional: attended/viewer/interactive callers pass None and behave exactly as before.

M: wrap the managed-agent runtime block_on in catch_unwind(AssertUnwindSafe) so a panic in the agent future cannot unwind across the extern "system" service entry (UB/abort). A caught panic becomes an Err -> ServiceExitCode::ServiceSpecific(1) so SCM recovery engages cleanly.

L1: replace the fixed 2s sleep after delete() on reinstall with a bounded retry on CreateService returning ERROR_SERVICE_MARKED_FOR_DELETE (1072), gated on having actually deleted a prior instance.

L2: clarify the --elevated -> force_user_install mapping (comment only).

N1: add a clap-metadata test pinning the service-run subcommand name to SERVICE_RUN_ARG, cross-linked from the existing literal test.

N2: correct the service doc comments now that graceful stop interrupts the connected case too.

Verified on Windows host: cargo fmt --check, clippy -D warnings, release build (x86_64-pc-windows-msvc), and cargo test (58 passed) all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
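As a rough illustration of finding H, the sketch below shows a loop that polls an optional stop flag once per tick. It is not the agent's actual `run_with_tray`: `run_session`, `SessionExit`, and the tick budget are hypothetical names invented for this example, and a real loop would close the WebSocket before returning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Why the session loop returned (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum SessionExit {
    /// The stop flag was observed; the caller should shut down gracefully.
    ServiceStop,
    /// The session ended on its own (here: the tick budget ran out).
    Finished,
}

/// Stand-in for a connected session loop. `stop` is optional so callers
/// without a service context can pass `None` and are never interrupted.
pub fn run_session(stop: Option<&Arc<AtomicBool>>, max_ticks: u32) -> SessionExit {
    for _ in 0..max_ticks {
        // Check the flag once per tick, before doing any work.
        if stop.map_or(false, |flag| flag.load(Ordering::SeqCst)) {
            // A real loop would close the WebSocket here before returning.
            return SessionExit::ServiceStop;
        }
        // Placeholder for one idle tick of session work.
        thread::sleep(Duration::from_millis(1));
    }
    SessionExit::Finished
}
```

Passing `None` leaves the loop's behavior unchanged, which is the property the commit message relies on for attended/viewer/interactive callers.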
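A minimal sketch of the panic guard described in finding M, assuming the default unwinding panic strategy (not `panic = "abort"`). `guarded_run` is a hypothetical helper invented for this example; the actual change wraps the runtime's `block_on` call and maps the error to `ServiceExitCode::ServiceSpecific(1)`, neither of which is reproduced here.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run `work`, converting a panic into an `Err` instead of letting it
/// unwind into the caller. Hypothetical helper for illustration only.
pub fn guarded_run<F>(work: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(work)) {
        // No panic: pass the closure's own result through unchanged.
        Ok(result) => result,
        // Panic caught: recover a message from the payload when possible.
        Err(payload) => {
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            Err(format!("agent panicked: {msg}"))
        }
    }
}
```

The caller then sees an ordinary `Err` and can report a failure exit code, rather than having the panic cross an FFI boundary.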
@@ -12,7 +12,11 @@
 //! relay WSS connection.
 //! 2. Report a correct service lifecycle to the SCM (`StartPending` ->
 //! `Running` -> `StopPending` -> `Stopped`) and handle `Stop`/`Shutdown`
-//! gracefully (signal the agent loop to close the WS connection and exit).
+//! gracefully. The control handler sets a shared shutdown flag; the agent
+//! runtime observes it both between reconnect attempts AND inside the
+//! connected session loop (SPEC-018 finding H), so a stop received while a
+//! session is live breaks out promptly, closes the WS connection cleanly,
+//! and exits — rather than waiting for the SCM to force-kill.
 //! 3. Provide install/uninstall of the service (LocalSystem, auto-start, crash
 //! recovery) so managed mode uses the service as its single autostart
 //! instead of the per-user `HKCU\…\Run` entry.
@@ -122,6 +126,11 @@ fn run_service() -> Result<()> {
             // we intentionally do not accept SESSIONCHANGE yet.
             ServiceControl::Stop | ServiceControl::Shutdown => {
                 info!("received {control_event:?}; signalling agent to shut down");
+                // Set the cooperative-stop flag. The agent runtime observes it on
+                // every idle tick of the connected session loop and between
+                // reconnect attempts (SPEC-018 finding H), so it breaks out and
+                // closes the WebSocket cleanly within ~100ms even if a session is
+                // currently connected.
                 shutdown_for_handler.store(true, Ordering::SeqCst);
                 ServiceControlHandlerResult::NoError
             }
@@ -253,6 +262,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
         .context("failed to connect to the Service Control Manager (run as Administrator)")?;
 
     // Remove any prior installation so the binary path / args are refreshed.
+    let mut deleted_existing = false;
     if let Ok(existing) = manager.open_service(
         SERVICE_NAME,
         ServiceAccess::QUERY_STATUS | ServiceAccess::STOP | ServiceAccess::DELETE,
@@ -263,9 +273,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
             .delete()
             .context("failed to delete the existing service before reinstall")?;
         drop(existing);
-        // The SCM marks a service for deletion but only removes it once all handles
-        // close; a brief settle avoids a CreateService "marked for deletion" race.
-        std::thread::sleep(Duration::from_secs(2));
+        deleted_existing = true;
     }
 
     let service_info = ServiceInfo {
@@ -282,8 +290,7 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
         account_password: None,
     };
 
-    let service = manager
-        .create_service(&service_info, ServiceAccess::CHANGE_CONFIG)
+    let service = create_service_with_retry(&manager, &service_info, deleted_existing)
         .context("failed to create the GuruConnect managed agent service")?;
 
     service
@@ -300,6 +307,56 @@ pub fn install_service(exe_path: &std::path::Path) -> Result<()> {
     Ok(())
 }
 
+/// Create the service, retrying briefly if the SCM still has the prior instance
+/// "marked for deletion" (SPEC-018 finding L1).
+///
+/// When a service is deleted, the SCM only removes it from its database once every
+/// open handle to it closes; until then a fresh `CreateService` fails with
+/// `ERROR_SERVICE_MARKED_FOR_DELETE` (1072). The previous implementation papered
+/// over this with a fixed 2s sleep after `delete()`, which is both slower than
+/// necessary in the common case and still racy on a busy box. Instead we attempt
+/// the create immediately and, only if we just deleted an existing instance and
+/// hit 1072, retry a few times with short backoff — succeeding as soon as the SCM
+/// finishes the removal, and giving up with the real error if it never does.
+///
+/// The retry is gated on `deleted_existing`: on a clean first install there was no
+/// prior instance, so a 1072 there is unexpected and is surfaced immediately
+/// rather than masked by retries.
+fn create_service_with_retry(
+    manager: &ServiceManager,
+    service_info: &ServiceInfo,
+    deleted_existing: bool,
+) -> Result<windows_service::service::Service, windows_service::Error> {
+    // ERROR_SERVICE_MARKED_FOR_DELETE (winerror.h). The service is gone from the
+    // caller's perspective but the SCM has not finished reaping it.
+    const ERROR_SERVICE_MARKED_FOR_DELETE: i32 = 1072;
+    // Bounded: ~5 attempts over ~2s total worst case (matches the old fixed sleep
+    // ceiling) but returns the instant the SCM is ready.
+    const MAX_ATTEMPTS: u32 = 5;
+    const BACKOFF: Duration = Duration::from_millis(400);
+
+    let mut attempt = 0;
+    loop {
+        attempt += 1;
+        match manager.create_service(service_info, ServiceAccess::CHANGE_CONFIG) {
+            Ok(service) => return Ok(service),
+            Err(windows_service::Error::Winapi(ref io_err))
+                if deleted_existing
+                    && io_err.raw_os_error() == Some(ERROR_SERVICE_MARKED_FOR_DELETE)
+                    && attempt < MAX_ATTEMPTS =>
+            {
+                warn!(
+                    "{SERVICE_NAME} still marked for deletion by the SCM \
+                     (attempt {attempt}/{MAX_ATTEMPTS}); retrying in {}ms",
+                    BACKOFF.as_millis()
+                );
+                std::thread::sleep(BACKOFF);
+            }
+            Err(e) => return Err(e),
+        }
+    }
+}
+
 /// Configure SCM crash-recovery so the service restarts on unexpected exit.
 ///
 /// `windows-service` 0.7 does not expose `ChangeServiceConfig2` recovery actions
@@ -429,6 +486,12 @@ mod tests {
     /// `service-run` subcommand `main.rs` dispatches into [`run_dispatcher`]; a
     /// mismatch would register a service the SCM could start but that would fall
     /// through to normal (non-service) mode and immediately exit.
+    ///
+    /// This pins the value of the constant itself. The companion test
+    /// `tests::service_run_subcommand_matches_scm_launch_arg` in `main.rs` pins the
+    /// other half — that the clap `#[command(name = "service-run")]` attribute on
+    /// `Commands::ServiceRun` resolves to this same constant — so the two string
+    /// literals cannot silently drift apart.
     #[test]
     fn service_run_arg_matches_subcommand_name() {
         assert_eq!(SERVICE_RUN_ARG, "service-run");