Cross-cutting error-quality initiative: one structured AppError envelope (stable error_code + message + correlation_id) replacing the current ad-hoc mix (bare (StatusCode,&str) tuples, per-file ErrorResponse, two JSON envelopes the dashboard already unions); correlation-id middleware tied to tracing spans + response header so a reported id greps the log; contextual error logging with identifiers + error chain; sweep the 37 server `let _ =` swallows (the pattern that silently hid migration-005's missing columns); dashboard renders the real cause + correlation id (drop the hardcoded generic at MachinesPage.tsx:202); agent logs why/where auth/connection failed (the auth-loop incident gave no local signal). Phaseable; Large. Parallel RMM request keeps conventions aligned. Requested by Mike 2026-05-30. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8.8 KiB
SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows)
Status: Proposed Priority: P2 Requested By: Mike (2026-05-30) Estimated Effort: Large (phaseable)
Overview
Make every error in GuruConnect carry information a human can act on: a single structured API error envelope with a stable code and a correlation id that also appears in the server logs, consistent contextual logging on the server and agent, an end to silent error swallowing, and a dashboard that shows the real cause (with the id to quote to support) instead of a hardcoded generic line. Success = given any error a tech sees, they (or we) can determine what failed and why — and a user-reported "it broke" comes with a correlation id that pulls the exact server log line.
This is a cross-cutting quality initiative. A parallel feature request governs GuruRMM (both are Rust/Axum); keep the conventions consistent where practical, but this spec governs GuruConnect only.
Current state (researched — this is the problem)
- Inconsistent API error shapes. Bare
(StatusCode, &str)tuples (server/src/api/mod.rs:101,106), a per-fileErrorResponsestruct (server/src/api/auth.rs:36), and two different JSON envelopes in the wild —{error}and{detail, error_code, status_code}— which the dashboard already has to union to cope (dashboard/src/api/client.ts:48-52). No single error type. - No correlation/request id. Only
TraceLayer::new_for_http()(server/src/main.rs:471); nothing ties a response a user saw to a log line. - 37
let _ =swallows inserver/src/— many on fallible DB ops:create_session(relay/mod.rs:594),end_session(:686),log_event(:132,191,303,604),mark_machine_offline(:687),link_session_to_code(:640). This is the exact pattern that hid the migration-005 missing-columns bug —update_machine_metadatafailed silently for an unknown period. - Generic dashboard message.
MachinesPage.tsx:202shows "The GuruConnect server did not respond. Check the relay status, then retry." regardless of cause (auth expired? 500? network? CORS?), even thoughApiError(client.ts:12) already carries status + code. - Thin agent errors. The GuruConnect-client auth-failure-loop incident (2026-05-30) produced no useful local signal about why auth failed or which server it was hitting.
Scope
Included in v1
- One structured API error type. A server
AppErrorenum (thiserror) with a singleIntoResponseproducing one envelope:{ error_code: string, message: string, detail?: object, correlation_id: string }+ consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-fileErrorResponse; converge the two envelopes to one. - Correlation-id middleware. Generate (or accept inbound
x-request-id) a request id, put it on the tracing span, echo it in the response header and in everyAppErrorbody, so a reported id greps directly to the log. - Logging convention. Errors logged via
tracinginclude the operation + relevant identifiers (agent_id/machine_uid/session_id) + the underlying error chain (anyhowcontext /thiserrorsource) — never a bare "failed". Use#[instrument]span fields so context is attached automatically. - Kill silent swallows. Sweep all
let _ = <fallible>(): convert toif let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }. Where fire-and-forget is genuinely correct (e.g. best-effort socket close on an already-dead connection), keep it but add a one-line comment stating why. A clippy lint / CI grep guards against new barelet _ =onResult. - Dashboard surfaces the real error. Render
ApiError.code+message+correlation_id("quote this to support: …"); distinguish network-down (fetch threw) vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic atMachinesPage.tsx:202; reuse one error component across pages. - Agent error detail. Connection/auth/capture failures log what failed and why — the target server URL, the relay's stated reason, the auth mode attempted — so a misbehaving agent is self-diagnosing (directly addresses the auth-loop incident).
Explicitly out of scope
- Centralized log aggregation / external error tracking (Sentry-style) — local structured logs + correlation id only in v1.
- User-facing localized error catalog / i18n.
- Changing protobuf-level agent↔relay error frames beyond adding context to existing disconnect reasons (a deeper protocol error model is a later item).
- Rewriting business logic — this is error plumbing/observability, not behavior change.
Architecture
- Relay-server: new
server/src/error.rs(AppError+IntoResponse+ status map); handlers returnResult<T, AppError>; correlation-id middleware inserver/src/main.rs(layer besideTraceLayer, :471).relay/mod.rsswallows fixed in place with context logging. - Agent:
anyhowcontext added at error boundaries inagent/src/transport/andagent/src/session/; connection/auth failureserror!-logged with server URL + reason. - Dashboard:
dashboard/src/api/client.tskeeps/extendsApiErrorto carrycorrelation_id; a sharedErrorBanner/ErrorStatecomponent; pages stop hardcoding messages. - Protobuf/DB: none required (existing
Disconnect.reasonstring can carry richer text; no schema change).
Implementation details
- Files to touch:
server/src/error.rs(new);server/src/api/*.rs(adoptAppError, dropErrorResponse/tuple returns —mod.rs:101,106,auth.rs:36);server/src/main.rs(correlation-id middleware, :471);server/src/relay/mod.rs(swallow sweep — :132,191, 303,594,604,640,686,687,695, …);agent/src/transport/*,agent/src/session/mod.rs(error context);dashboard/src/api/client.ts(correlation id), shared error component,dashboard/src/features/machines/MachinesPage.tsx:202(use it). - Stable
error_codevalues are an enum (e.g.AUTH_EXPIRED,MACHINE_NOT_FOUND,DB_UNAVAILABLE,VALIDATION_FAILED) — documented and reused, not free-text.
Security considerations
- Don't leak internals to clients. The structured body carries a safe
message+ code; the full error chain / SQL / stack stays in the server log keyed by correlation id. A client gets "DB_UNAVAILABLE (id abc123)", not the connection string. - Correlation id is a random opaque token (no embedded PII/sequence that leaks volume).
- Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username exists), while the log records the specific reason — valuable internally, safe externally.
- The swallow sweep must not start logging secrets (api keys, tokens) that were previously silently dropped — scrub identifiers, never log key material.
Testing strategy
- Unit: each
AppErrorvariant maps to the right status + code + envelope; correlation id present on every error body and matches the response header. - Integration: a handler error returns the structured envelope with a correlation id
that appears in the captured log; an inbound
x-request-idis honored end-to-end. A forced DB failure on a formerlet _ =path now emits awarn!/error!(assert via log capture) instead of vanishing. - Dashboard: 401 renders "session expired / re-login"; 500 renders code + correlation id; fetch-throw renders "can't reach server" — three distinct states, not one generic.
- CI guard: lint/grep fails the build on a new bare
let _ =over aResultinserver/.
Effort estimate & dependencies
- Size: Large but phaseable. Phase 1 (highest value):
AppError+ correlation-id middleware + thelet _ =swallow sweep. Phase 2: dashboard error rendering. Phase 3: agent error context. Each ships independently. - Depends on: nothing blocking. Synergizes with SPEC-004 (the agent-auth-loop diagnosis would have been trivial with agent error detail) and improves every other spec's debuggability.
- Unblocks: faster incident triage across the product; a real correlation-id workflow for support.
Open questions
- Envelope migration. Converge to the single new envelope and update the dashboard union in lockstep, or keep accepting both for a deprecation window? (Dashboard already unions both — low risk to converge.)
error_codetaxonomy. Agree the initial enum of stable codes and their HTTP status mapping.- Correlation id source. Generate server-side always, or trust an inbound
x-request-idfrom the dashboard/NPM when present (and from where is it trusted)? - Swallow policy granularity. Blanket "log all
let _ =on Result", or an allowlist for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment + keep best-effort socket closes.