# SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows) **Status:** Proposed **Priority:** P2 **Requested By:** Mike (2026-05-30) **Estimated Effort:** Large (phaseable) ## Overview Make every error in GuruConnect carry information a human can act on: a single structured API error envelope with a stable code and a **correlation id** that also appears in the server logs, consistent contextual logging on the server and agent, an end to silent error swallowing, and a dashboard that shows the *real* cause (with the id to quote to support) instead of a hardcoded generic line. Success = given any error a tech sees, they (or we) can determine what failed and why — and a user-reported "it broke" comes with a correlation id that pulls the exact server log line. This is a cross-cutting quality initiative. **A parallel feature request governs GuruRMM** (both are Rust/Axum); keep the conventions consistent where practical, but this spec governs GuruConnect only. ## Current state (researched — this is the problem) - **Inconsistent API error shapes.** Bare `(StatusCode, &str)` tuples (`server/src/api/mod.rs:101,106`), a per-file `ErrorResponse` struct (`server/src/api/auth.rs:36`), and **two different JSON envelopes** in the wild — `{error}` and `{detail, error_code, status_code}` — which the dashboard already has to union to cope (`dashboard/src/api/client.ts:48-52`). No single error type. - **No correlation/request id.** Only `TraceLayer::new_for_http()` (`server/src/main.rs:471`); nothing ties a response a user saw to a log line. - **37 `let _ =` swallows** in `server/src/` — many on fallible DB ops: `create_session` (`relay/mod.rs:594`), `end_session` (`:686`), `log_event` (`:132,191,303,604`), `mark_machine_offline` (`:687`), `link_session_to_code` (`:640`). This is the exact pattern that hid the **migration-005 missing-columns bug** — `update_machine_metadata` failed silently for an unknown period. - **Generic dashboard message.** `MachinesPage.tsx:202` shows "The GuruConnect server did not respond. Check the relay status, then retry." regardless of cause (auth expired? 500? network? CORS?), even though `ApiError` (`client.ts:12`) already carries status + code. - **Thin agent errors.** The GuruConnect-client auth-failure-loop incident (2026-05-30) produced no useful local signal about *why* auth failed or *which* server it was hitting. ## Scope ### Included in v1 - **One structured API error type.** A server `AppError` enum (`thiserror`) with a single `IntoResponse` producing one envelope: `{ error_code: string, message: string, detail?: object, correlation_id: string }` + consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-file `ErrorResponse`; converge the two envelopes to one. - **Correlation-id middleware.** Generate (or accept inbound `x-request-id`) a request id, put it on the tracing span, echo it in the response header **and** in every `AppError` body, so a reported id greps directly to the log. - **Logging convention.** Errors logged via `tracing` include the operation + relevant identifiers (`agent_id`/`machine_uid`/`session_id`) + the underlying error chain (`anyhow` context / `thiserror` source) — never a bare "failed". Use `#[instrument]` span fields so context is attached automatically. - **Kill silent swallows.** Sweep all `let _ = ()`: convert to `if let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }`. Where fire-and-forget is genuinely correct (e.g. best-effort socket close on an already-dead connection), keep it but add a one-line comment stating why. A clippy lint / CI grep guards against new bare `let _ =` on `Result`. - **Dashboard surfaces the real error.** Render `ApiError.code` + `message` + `correlation_id` ("quote this to support: …"); distinguish network-down (fetch threw) vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic at `MachinesPage.tsx:202`; reuse one error component across pages. - **Agent error detail.** Connection/auth/capture failures log *what* failed and *why* — the target server URL, the relay's stated reason, the auth mode attempted — so a misbehaving agent is self-diagnosing (directly addresses the auth-loop incident). ### Explicitly out of scope - Centralized log aggregation / external error tracking (Sentry-style) — local structured logs + correlation id only in v1. - User-facing localized error catalog / i18n. - Changing protobuf-level agent↔relay error frames beyond adding context to existing disconnect reasons (a deeper protocol error model is a later item). - Rewriting business logic — this is error *plumbing/observability*, not behavior change. ## Architecture - **Relay-server:** new `server/src/error.rs` (`AppError` + `IntoResponse` + status map); handlers return `Result`; correlation-id middleware in `server/src/main.rs` (layer beside `TraceLayer`, :471). `relay/mod.rs` swallows fixed in place with context logging. - **Agent:** `anyhow` context added at error boundaries in `agent/src/transport/` and `agent/src/session/`; connection/auth failures `error!`-logged with server URL + reason. - **Dashboard:** `dashboard/src/api/client.ts` keeps/extends `ApiError` to carry `correlation_id`; a shared `ErrorBanner`/`ErrorState` component; pages stop hardcoding messages. - **Protobuf/DB:** none required (existing `Disconnect.reason` string can carry richer text; no schema change). ## Implementation details - Files to touch: `server/src/error.rs` (new); `server/src/api/*.rs` (adopt `AppError`, drop `ErrorResponse`/tuple returns — `mod.rs:101,106`, `auth.rs:36`); `server/src/main.rs` (correlation-id middleware, :471); `server/src/relay/mod.rs` (swallow sweep — :132,191, 303,594,604,640,686,687,695, …); `agent/src/transport/*`, `agent/src/session/mod.rs` (error context); `dashboard/src/api/client.ts` (correlation id), shared error component, `dashboard/src/features/machines/MachinesPage.tsx:202` (use it). - Stable `error_code` values are an enum (e.g. `AUTH_EXPIRED`, `MACHINE_NOT_FOUND`, `DB_UNAVAILABLE`, `VALIDATION_FAILED`) — documented and reused, not free-text. ## Security considerations - **Don't leak internals to clients.** The structured body carries a safe `message` + code; the full error chain / SQL / stack stays in the **server log** keyed by correlation id. A client gets "DB_UNAVAILABLE (id abc123)", not the connection string. - Correlation id is a random opaque token (no embedded PII/sequence that leaks volume). - Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username exists), while the *log* records the specific reason — valuable internally, safe externally. - The swallow sweep must not start logging secrets (api keys, tokens) that were previously silently dropped — scrub identifiers, never log key material. ## Testing strategy - **Unit:** each `AppError` variant maps to the right status + code + envelope; correlation id present on every error body and matches the response header. - **Integration:** a handler error returns the structured envelope with a correlation id that appears in the captured log; an inbound `x-request-id` is honored end-to-end. A forced DB failure on a former `let _ =` path now emits a `warn!/error!` (assert via log capture) instead of vanishing. - **Dashboard:** 401 renders "session expired / re-login"; 500 renders code + correlation id; fetch-throw renders "can't reach server" — three distinct states, not one generic. - **CI guard:** lint/grep fails the build on a new bare `let _ =` over a `Result` in `server/`. ## Effort estimate & dependencies - **Size: Large but phaseable.** Phase 1 (highest value): `AppError` + correlation-id middleware + the `let _ =` swallow sweep. Phase 2: dashboard error rendering. Phase 3: agent error context. Each ships independently. - **Depends on:** nothing blocking. Synergizes with **SPEC-004** (the agent-auth-loop diagnosis would have been trivial with agent error detail) and improves every other spec's debuggability. - **Unblocks:** faster incident triage across the product; a real correlation-id workflow for support. ## Open questions 1. **Envelope migration.** Converge to the single new envelope and update the dashboard union in lockstep, or keep accepting both for a deprecation window? (Dashboard already unions both — low risk to converge.) 2. **`error_code` taxonomy.** Agree the initial enum of stable codes and their HTTP status mapping. 3. **Correlation id source.** Generate server-side always, or trust an inbound `x-request-id` from the dashboard/NPM when present (and from where is it trusted)? 4. **Swallow policy granularity.** Blanket "log all `let _ =` on Result", or an allowlist for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment + keep best-effort socket closes.