spec: add SPEC-008 valuable error messages

Cross-cutting error-quality initiative: one structured AppError envelope
(stable error_code + message + correlation_id) replacing the current ad-hoc
mix (bare (StatusCode,&str) tuples, per-file ErrorResponse, two JSON envelopes
the dashboard already unions); correlation-id middleware tied to tracing spans
+ response header so a reported id greps the log; contextual error logging with
identifiers + error chain; sweep the 37 server `let _ =` swallows (the pattern
that silently hid migration-005's missing columns); dashboard renders the real
cause + correlation id (drop the hardcoded generic at MachinesPage.tsx:202);
agent logs why/where auth/connection failed (the auth-loop incident gave no
local signal). Phaseable; Large. Parallel RMM request keeps conventions aligned.
Requested by Mike 2026-05-30.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-30 16:30:07 -07:00
parent 008d2bf30b
commit 65eff5cf50
2 changed files with 153 additions and 0 deletions

View File

@@ -53,6 +53,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
- [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005.
- [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). ([SPEC-006](specs/SPEC-006-universal-machine-search.md))
- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
- [ ] **Valuable error messages (structured errors + no silent swallows)** — P2 — one structured API error envelope with stable codes + a correlation id that also lands in the logs; contextual tracing on server/agent; sweep the 37 `let _ =` swallows (the pattern that hid the migration-005 bug); dashboard surfaces the real cause + id instead of a generic line. ([SPEC-008](specs/SPEC-008-valuable-error-messages.md))
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
## Security & Infrastructure

View File

@@ -0,0 +1,152 @@
# SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows)
**Status:** Proposed
**Priority:** P2
**Requested By:** Mike (2026-05-30)
**Estimated Effort:** Large (phaseable)
## Overview
Make every error in GuruConnect carry information a human can act on: a single structured
API error envelope with a stable code and a **correlation id** that also appears in the
server logs, consistent contextual logging on the server and agent, an end to silent
error swallowing, and a dashboard that shows the *real* cause (with the id to quote to
support) instead of a hardcoded generic line. Success = given any error a tech sees, they
(or we) can determine what failed and why — and a user-reported "it broke" comes with a
correlation id that pulls the exact server log line.
This is a cross-cutting quality initiative. **A parallel feature request governs GuruRMM**
(both are Rust/Axum); keep the conventions consistent where practical, but this spec
governs GuruConnect only.
## Current state (researched — this is the problem)
- **Inconsistent API error shapes.** Bare `(StatusCode, &str)` tuples
(`server/src/api/mod.rs:101,106`), a per-file `ErrorResponse` struct
(`server/src/api/auth.rs:36`), and **two different JSON envelopes** in the wild —
`{error}` and `{detail, error_code, status_code}` — which the dashboard already has to
union to cope (`dashboard/src/api/client.ts:48-52`). No single error type.
- **No correlation/request id.** Only `TraceLayer::new_for_http()`
(`server/src/main.rs:471`); nothing ties a response a user saw to a log line.
- **37 `let _ =` swallows** in `server/src/` — many on fallible DB ops:
`create_session` (`relay/mod.rs:594`), `end_session` (`:686`), `log_event`
(`:132,191,303,604`), `mark_machine_offline` (`:687`),
`link_session_to_code` (`:640`). This is the exact pattern that hid the **migration-005
missing-columns bug** — `update_machine_metadata` failed silently for an unknown period.
- **Generic dashboard message.** `MachinesPage.tsx:202` shows "The GuruConnect server did
not respond. Check the relay status, then retry." regardless of cause (auth expired? 500?
network? CORS?), even though `ApiError` (`client.ts:12`) already carries status + code.
- **Thin agent errors.** The GuruConnect-client auth-failure-loop incident (2026-05-30)
produced no useful local signal about *why* auth failed or *which* server it was hitting.
## Scope
### Included in v1
- **One structured API error type.** A server `AppError` enum (`thiserror`) with a single
`IntoResponse` producing one envelope:
`{ error_code: string, message: string, detail?: object, correlation_id: string }` +
consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-file
`ErrorResponse`; converge the two envelopes to one.
- **Correlation-id middleware.** Generate (or accept inbound `x-request-id`) a request id,
put it on the tracing span, echo it in the response header **and** in every `AppError`
body, so a reported id greps directly to the log.
- **Logging convention.** Errors logged via `tracing` include the operation + relevant
identifiers (`agent_id`/`machine_uid`/`session_id`) + the underlying error chain
(`anyhow` context / `thiserror` source) — never a bare "failed". Use `#[instrument]`
span fields so context is attached automatically.
- **Kill silent swallows.** Sweep all `let _ = <fallible>()`: convert to
`if let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }`. Where fire-and-forget
is genuinely correct (e.g. best-effort socket close on an already-dead connection),
keep it but add a one-line comment stating why. A clippy lint / CI grep guards against
new bare `let _ =` on `Result`.
- **Dashboard surfaces the real error.** Render `ApiError.code` + `message` +
`correlation_id` ("quote this to support: …"); distinguish network-down (fetch threw)
vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic at
`MachinesPage.tsx:202`; reuse one error component across pages.
- **Agent error detail.** Connection/auth/capture failures log *what* failed and *why*
the target server URL, the relay's stated reason, the auth mode attempted — so a
misbehaving agent is self-diagnosing (directly addresses the auth-loop incident).
### Explicitly out of scope
- Centralized log aggregation / external error tracking (Sentry-style) — local structured
logs + correlation id only in v1.
- User-facing localized error catalog / i18n.
- Changing protobuf-level agent↔relay error frames beyond adding context to existing
disconnect reasons (a deeper protocol error model is a later item).
- Rewriting business logic — this is error *plumbing/observability*, not behavior change.
## Architecture
- **Relay-server:** new `server/src/error.rs` (`AppError` + `IntoResponse` + status map);
handlers return `Result<T, AppError>`; correlation-id middleware in
`server/src/main.rs` (layer beside `TraceLayer`, :471). `relay/mod.rs` swallows fixed in
place with context logging.
- **Agent:** `anyhow` context added at error boundaries in `agent/src/transport/` and
`agent/src/session/`; connection/auth failures `error!`-logged with server URL + reason.
- **Dashboard:** `dashboard/src/api/client.ts` keeps/extends `ApiError` to carry
`correlation_id`; a shared `ErrorBanner`/`ErrorState` component; pages stop hardcoding
messages.
- **Protobuf/DB:** none required (existing `Disconnect.reason` string can carry richer
text; no schema change).
## Implementation details
- Files to touch: `server/src/error.rs` (new); `server/src/api/*.rs` (adopt `AppError`,
drop `ErrorResponse`/tuple returns — `mod.rs:101,106`, `auth.rs:36`); `server/src/main.rs`
(correlation-id middleware, :471); `server/src/relay/mod.rs` (swallow sweep — :132,191,
303,594,604,640,686,687,695, …); `agent/src/transport/*`, `agent/src/session/mod.rs`
(error context); `dashboard/src/api/client.ts` (correlation id), shared error component,
`dashboard/src/features/machines/MachinesPage.tsx:202` (use it).
- Stable `error_code` values are an enum (e.g. `AUTH_EXPIRED`, `MACHINE_NOT_FOUND`,
`DB_UNAVAILABLE`, `VALIDATION_FAILED`) — documented and reused, not free-text.
## Security considerations
- **Don't leak internals to clients.** The structured body carries a safe `message` + code;
the full error chain / SQL / stack stays in the **server log** keyed by correlation id.
A client gets "DB_UNAVAILABLE (id abc123)", not the connection string.
- Correlation id is a random opaque token (no embedded PII/sequence that leaks volume).
- Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username
exists), while the *log* records the specific reason — valuable internally, safe
externally.
- The swallow sweep must not start logging secrets (api keys, tokens) that were previously
silently dropped — scrub identifiers, never log key material.
## Testing strategy
- **Unit:** each `AppError` variant maps to the right status + code + envelope;
correlation id present on every error body and matches the response header.
- **Integration:** a handler error returns the structured envelope with a correlation id
that appears in the captured log; an inbound `x-request-id` is honored end-to-end. A
forced DB failure on a former `let _ =` path now emits a `warn!/error!` (assert via log
capture) instead of vanishing.
- **Dashboard:** 401 renders "session expired / re-login"; 500 renders code + correlation
id; fetch-throw renders "can't reach server" — three distinct states, not one generic.
- **CI guard:** lint/grep fails the build on a new bare `let _ =` over a `Result` in
`server/`.
## Effort estimate & dependencies
- **Size: Large but phaseable.** Phase 1 (highest value): `AppError` + correlation-id
middleware + the `let _ =` swallow sweep. Phase 2: dashboard error rendering. Phase 3:
agent error context. Each ships independently.
- **Depends on:** nothing blocking. Synergizes with **SPEC-004** (the agent-auth-loop
diagnosis would have been trivial with agent error detail) and improves every other spec's
debuggability.
- **Unblocks:** faster incident triage across the product; a real correlation-id workflow
for support.
## Open questions
1. **Envelope migration.** Converge to the single new envelope and update the dashboard
union in lockstep, or keep accepting both for a deprecation window? (Dashboard already
unions both — low risk to converge.)
2. **`error_code` taxonomy.** Agree the initial enum of stable codes and their HTTP status
mapping.
3. **Correlation id source.** Generate server-side always, or trust an inbound
`x-request-id` from the dashboard/NPM when present (and from where is it trusted)?
4. **Swallow policy granularity.** Blanket "log all `let _ =` on Result", or an allowlist
for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment +
keep best-effort socket closes.