Files
guru-connect/docs/specs/SPEC-008-valuable-error-messages.md
Mike Swanson 65eff5cf50 spec: add SPEC-008 valuable error messages
Cross-cutting error-quality initiative: one structured AppError envelope
(stable error_code + message + correlation_id) replacing the current ad-hoc
mix (bare (StatusCode,&str) tuples, per-file ErrorResponse, two JSON envelopes
the dashboard already unions); correlation-id middleware tied to tracing spans
+ response header so a reported id greps the log; contextual error logging with
identifiers + error chain; sweep the 37 server `let _ =` swallows (the pattern
that silently hid migration-005's missing columns); dashboard renders the real
cause + correlation id (drop the hardcoded generic at MachinesPage.tsx:202);
agent logs why/where auth/connection failed (the auth-loop incident gave no
local signal). Phaseable; Large. Parallel RMM request keeps conventions aligned.
Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 16:30:07 -07:00

8.8 KiB

SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows)

Status: Proposed Priority: P2 Requested By: Mike (2026-05-30) Estimated Effort: Large (phaseable)

Overview

Make every error in GuruConnect carry information a human can act on: a single structured API error envelope with a stable code and a correlation id that also appears in the server logs, consistent contextual logging on the server and agent, an end to silent error swallowing, and a dashboard that shows the real cause (with the id to quote to support) instead of a hardcoded generic line. Success = given any error a tech sees, they (or we) can determine what failed and why — and a user-reported "it broke" comes with a correlation id that pulls the exact server log line.

This is a cross-cutting quality initiative. A parallel feature request governs GuruRMM (both are Rust/Axum); keep the conventions consistent where practical, but this spec governs GuruConnect only.

Current state (researched — this is the problem)

  • Inconsistent API error shapes. Bare (StatusCode, &str) tuples (server/src/api/mod.rs:101,106), a per-file ErrorResponse struct (server/src/api/auth.rs:36), and two different JSON envelopes in the wild — {error} and {detail, error_code, status_code} — which the dashboard already has to union to cope (dashboard/src/api/client.ts:48-52). No single error type.
  • No correlation/request id. Only TraceLayer::new_for_http() (server/src/main.rs:471); nothing ties a response a user saw to a log line.
  • 37 let _ = swallows in server/src/ — many on fallible DB ops: create_session (relay/mod.rs:594), end_session (:686), log_event (:132,191,303,604), mark_machine_offline (:687), link_session_to_code (:640). This is the exact pattern that hid the migration-005 missing-columns bugupdate_machine_metadata failed silently for an unknown period.
  • Generic dashboard message. MachinesPage.tsx:202 shows "The GuruConnect server did not respond. Check the relay status, then retry." regardless of cause (auth expired? 500? network? CORS?), even though ApiError (client.ts:12) already carries status + code.
  • Thin agent errors. The GuruConnect-client auth-failure-loop incident (2026-05-30) produced no useful local signal about why auth failed or which server it was hitting.

Scope

Included in v1

  • One structured API error type. A server AppError enum (thiserror) with a single IntoResponse producing one envelope: { error_code: string, message: string, detail?: object, correlation_id: string } + consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-file ErrorResponse; converge the two envelopes to one.
  • Correlation-id middleware. Generate (or accept inbound x-request-id) a request id, put it on the tracing span, echo it in the response header and in every AppError body, so a reported id greps directly to the log.
  • Logging convention. Errors logged via tracing include the operation + relevant identifiers (agent_id/machine_uid/session_id) + the underlying error chain (anyhow context / thiserror source) — never a bare "failed". Use #[instrument] span fields so context is attached automatically.
  • Kill silent swallows. Sweep all let _ = <fallible>(): convert to if let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }. Where fire-and-forget is genuinely correct (e.g. best-effort socket close on an already-dead connection), keep it but add a one-line comment stating why. A clippy lint / CI grep guards against new bare let _ = on Result.
  • Dashboard surfaces the real error. Render ApiError.code + message + correlation_id ("quote this to support: …"); distinguish network-down (fetch threw) vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic at MachinesPage.tsx:202; reuse one error component across pages.
  • Agent error detail. Connection/auth/capture failures log what failed and why — the target server URL, the relay's stated reason, the auth mode attempted — so a misbehaving agent is self-diagnosing (directly addresses the auth-loop incident).

Explicitly out of scope

  • Centralized log aggregation / external error tracking (Sentry-style) — local structured logs + correlation id only in v1.
  • User-facing localized error catalog / i18n.
  • Changing protobuf-level agent↔relay error frames beyond adding context to existing disconnect reasons (a deeper protocol error model is a later item).
  • Rewriting business logic — this is error plumbing/observability, not behavior change.

Architecture

  • Relay-server: new server/src/error.rs (AppError + IntoResponse + status map); handlers return Result<T, AppError>; correlation-id middleware in server/src/main.rs (layer beside TraceLayer, :471). relay/mod.rs swallows fixed in place with context logging.
  • Agent: anyhow context added at error boundaries in agent/src/transport/ and agent/src/session/; connection/auth failures error!-logged with server URL + reason.
  • Dashboard: dashboard/src/api/client.ts keeps/extends ApiError to carry correlation_id; a shared ErrorBanner/ErrorState component; pages stop hardcoding messages.
  • Protobuf/DB: none required (existing Disconnect.reason string can carry richer text; no schema change).

Implementation details

  • Files to touch: server/src/error.rs (new); server/src/api/*.rs (adopt AppError, drop ErrorResponse/tuple returns — mod.rs:101,106, auth.rs:36); server/src/main.rs (correlation-id middleware, :471); server/src/relay/mod.rs (swallow sweep — :132,191, 303,594,604,640,686,687,695, …); agent/src/transport/*, agent/src/session/mod.rs (error context); dashboard/src/api/client.ts (correlation id), shared error component, dashboard/src/features/machines/MachinesPage.tsx:202 (use it).
  • Stable error_code values are an enum (e.g. AUTH_EXPIRED, MACHINE_NOT_FOUND, DB_UNAVAILABLE, VALIDATION_FAILED) — documented and reused, not free-text.

Security considerations

  • Don't leak internals to clients. The structured body carries a safe message + code; the full error chain / SQL / stack stays in the server log keyed by correlation id. A client gets "DB_UNAVAILABLE (id abc123)", not the connection string.
  • Correlation id is a random opaque token (no embedded PII/sequence that leaks volume).
  • Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username exists), while the log records the specific reason — valuable internally, safe externally.
  • The swallow sweep must not start logging secrets (api keys, tokens) that were previously silently dropped — scrub identifiers, never log key material.

Testing strategy

  • Unit: each AppError variant maps to the right status + code + envelope; correlation id present on every error body and matches the response header.
  • Integration: a handler error returns the structured envelope with a correlation id that appears in the captured log; an inbound x-request-id is honored end-to-end. A forced DB failure on a former let _ = path now emits a warn!/error! (assert via log capture) instead of vanishing.
  • Dashboard: 401 renders "session expired / re-login"; 500 renders code + correlation id; fetch-throw renders "can't reach server" — three distinct states, not one generic.
  • CI guard: lint/grep fails the build on a new bare let _ = over a Result in server/.

Effort estimate & dependencies

  • Size: Large but phaseable. Phase 1 (highest value): AppError + correlation-id middleware + the let _ = swallow sweep. Phase 2: dashboard error rendering. Phase 3: agent error context. Each ships independently.
  • Depends on: nothing blocking. Synergizes with SPEC-004 (the agent-auth-loop diagnosis would have been trivial with agent error detail) and improves every other spec's debuggability.
  • Unblocks: faster incident triage across the product; a real correlation-id workflow for support.

Open questions

  1. Envelope migration. Converge to the single new envelope and update the dashboard union in lockstep, or keep accepting both for a deprecation window? (Dashboard already unions both — low risk to converge.)
  2. error_code taxonomy. Agree the initial enum of stable codes and their HTTP status mapping.
  3. Correlation id source. Generate server-side always, or trust an inbound x-request-id from the dashboard/NPM when present (and from where is it trusted)?
  4. Swallow policy granularity. Blanket "log all let _ = on Result", or an allowlist for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment + keep best-effort socket closes.