spec: add SPEC-008 valuable error messages
Cross-cutting error-quality initiative: one structured AppError envelope (stable error_code + message + correlation_id) replacing the current ad-hoc mix (bare (StatusCode,&str) tuples, per-file ErrorResponse, two JSON envelopes the dashboard already unions); correlation-id middleware tied to tracing spans + response header so a reported id greps the log; contextual error logging with identifiers + error chain; sweep the 37 server `let _ =` swallows (the pattern that silently hid migration-005's missing columns); dashboard renders the real cause + correlation id (drop the hardcoded generic at MachinesPage.tsx:202); agent logs why/where auth/connection failed (the auth-loop incident gave no local signal). Phaseable; Large. Parallel RMM request keeps conventions aligned. Requested by Mike 2026-05-30. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -53,6 +53,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001](
|
|||||||
- [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005.
|
- [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005.
|
||||||
- [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). ([SPEC-006](specs/SPEC-006-universal-machine-search.md))
|
- [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). ([SPEC-006](specs/SPEC-006-universal-machine-search.md))
|
||||||
- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
|
- [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md))
|
||||||
|
- [ ] **Valuable error messages (structured errors + no silent swallows)** — P2 — one structured API error envelope with stable codes + a correlation id that also lands in the logs; contextual tracing on server/agent; sweep the 37 `let _ =` swallows (the pattern that hid the migration-005 bug); dashboard surfaces the real cause + id instead of a generic line. ([SPEC-008](specs/SPEC-008-valuable-error-messages.md))
|
||||||
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
|
- [ ] Programmatic session pre-create + viewer-token (integration contract) — P2
|
||||||
|
|
||||||
## Security & Infrastructure
|
## Security & Infrastructure
|
||||||
|
|||||||
152
docs/specs/SPEC-008-valuable-error-messages.md
Normal file
152
docs/specs/SPEC-008-valuable-error-messages.md
Normal file
@@ -0,0 +1,152 @@
|
|||||||
|
# SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows)
|
||||||
|
|
||||||
|
**Status:** Proposed
|
||||||
|
**Priority:** P2
|
||||||
|
**Requested By:** Mike (2026-05-30)
|
||||||
|
**Estimated Effort:** Large (phaseable)
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Make every error in GuruConnect carry information a human can act on: a single structured
|
||||||
|
API error envelope with a stable code and a **correlation id** that also appears in the
|
||||||
|
server logs, consistent contextual logging on the server and agent, an end to silent
|
||||||
|
error swallowing, and a dashboard that shows the *real* cause (with the id to quote to
|
||||||
|
support) instead of a hardcoded generic line. Success = given any error a tech sees, they
|
||||||
|
(or we) can determine what failed and why — and a user-reported "it broke" comes with a
|
||||||
|
correlation id that pulls the exact server log line.
|
||||||
|
|
||||||
|
This is a cross-cutting quality initiative. **A parallel feature request governs GuruRMM**
|
||||||
|
(both are Rust/Axum); keep the conventions consistent where practical, but this spec
|
||||||
|
governs GuruConnect only.
|
||||||
|
|
||||||
|
## Current state (researched — this is the problem)
|
||||||
|
|
||||||
|
- **Inconsistent API error shapes.** Bare `(StatusCode, &str)` tuples
|
||||||
|
(`server/src/api/mod.rs:101,106`), a per-file `ErrorResponse` struct
|
||||||
|
(`server/src/api/auth.rs:36`), and **two different JSON envelopes** in the wild —
|
||||||
|
`{error}` and `{detail, error_code, status_code}` — which the dashboard already has to
|
||||||
|
union to cope (`dashboard/src/api/client.ts:48-52`). No single error type.
|
||||||
|
- **No correlation/request id.** Only `TraceLayer::new_for_http()`
|
||||||
|
(`server/src/main.rs:471`); nothing ties a response a user saw to a log line.
|
||||||
|
- **37 `let _ =` swallows** in `server/src/` — many on fallible DB ops:
|
||||||
|
`create_session` (`relay/mod.rs:594`), `end_session` (`:686`), `log_event`
|
||||||
|
(`:132,191,303,604`), `mark_machine_offline` (`:687`),
|
||||||
|
`link_session_to_code` (`:640`). This is the exact pattern that hid the **migration-005
|
||||||
|
missing-columns bug** — `update_machine_metadata` failed silently for an unknown period.
|
||||||
|
- **Generic dashboard message.** `MachinesPage.tsx:202` shows "The GuruConnect server did
|
||||||
|
not respond. Check the relay status, then retry." regardless of cause (auth expired? 500?
|
||||||
|
network? CORS?), even though `ApiError` (`client.ts:12`) already carries status + code.
|
||||||
|
- **Thin agent errors.** The GuruConnect-client auth-failure-loop incident (2026-05-30)
|
||||||
|
produced no useful local signal about *why* auth failed or *which* server it was hitting.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
### Included in v1
|
||||||
|
|
||||||
|
- **One structured API error type.** A server `AppError` enum (`thiserror`) with a single
|
||||||
|
`IntoResponse` producing one envelope:
|
||||||
|
`{ error_code: string, message: string, detail?: object, correlation_id: string }` +
|
||||||
|
consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-file
|
||||||
|
`ErrorResponse`; converge the two envelopes to one.
|
||||||
|
- **Correlation-id middleware.** Generate (or accept inbound `x-request-id`) a request id,
|
||||||
|
put it on the tracing span, echo it in the response header **and** in every `AppError`
|
||||||
|
body, so a reported id greps directly to the log.
|
||||||
|
- **Logging convention.** Errors logged via `tracing` include the operation + relevant
|
||||||
|
identifiers (`agent_id`/`machine_uid`/`session_id`) + the underlying error chain
|
||||||
|
(`anyhow` context / `thiserror` source) — never a bare "failed". Use `#[instrument]`
|
||||||
|
span fields so context is attached automatically.
|
||||||
|
- **Kill silent swallows.** Sweep all `let _ = <fallible>()`: convert to
|
||||||
|
`if let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }`. Where fire-and-forget
|
||||||
|
is genuinely correct (e.g. best-effort socket close on an already-dead connection),
|
||||||
|
keep it but add a one-line comment stating why. A clippy lint / CI grep guards against
|
||||||
|
new bare `let _ =` on `Result`.
|
||||||
|
- **Dashboard surfaces the real error.** Render `ApiError.code` + `message` +
|
||||||
|
`correlation_id` ("quote this to support: …"); distinguish network-down (fetch threw)
|
||||||
|
vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic at
|
||||||
|
`MachinesPage.tsx:202`; reuse one error component across pages.
|
||||||
|
- **Agent error detail.** Connection/auth/capture failures log *what* failed and *why* —
|
||||||
|
the target server URL, the relay's stated reason, the auth mode attempted — so a
|
||||||
|
misbehaving agent is self-diagnosing (directly addresses the auth-loop incident).
|
||||||
|
|
||||||
|
### Explicitly out of scope
|
||||||
|
|
||||||
|
- Centralized log aggregation / external error tracking (Sentry-style) — local structured
|
||||||
|
logs + correlation id only in v1.
|
||||||
|
- User-facing localized error catalog / i18n.
|
||||||
|
- Changing protobuf-level agent↔relay error frames beyond adding context to existing
|
||||||
|
disconnect reasons (a deeper protocol error model is a later item).
|
||||||
|
- Rewriting business logic — this is error *plumbing/observability*, not behavior change.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
- **Relay-server:** new `server/src/error.rs` (`AppError` + `IntoResponse` + status map);
|
||||||
|
handlers return `Result<T, AppError>`; correlation-id middleware in
|
||||||
|
`server/src/main.rs` (layer beside `TraceLayer`, :471). `relay/mod.rs` swallows fixed in
|
||||||
|
place with context logging.
|
||||||
|
- **Agent:** `anyhow` context added at error boundaries in `agent/src/transport/` and
|
||||||
|
`agent/src/session/`; connection/auth failures `error!`-logged with server URL + reason.
|
||||||
|
- **Dashboard:** `dashboard/src/api/client.ts` keeps/extends `ApiError` to carry
|
||||||
|
`correlation_id`; a shared `ErrorBanner`/`ErrorState` component; pages stop hardcoding
|
||||||
|
messages.
|
||||||
|
- **Protobuf/DB:** none required (existing `Disconnect.reason` string can carry richer
|
||||||
|
text; no schema change).
|
||||||
|
|
||||||
|
## Implementation details
|
||||||
|
|
||||||
|
- Files to touch: `server/src/error.rs` (new); `server/src/api/*.rs` (adopt `AppError`,
|
||||||
|
drop `ErrorResponse`/tuple returns — `mod.rs:101,106`, `auth.rs:36`); `server/src/main.rs`
|
||||||
|
(correlation-id middleware, :471); `server/src/relay/mod.rs` (swallow sweep — :132,191,
|
||||||
|
303,594,604,640,686,687,695, …); `agent/src/transport/*`, `agent/src/session/mod.rs`
|
||||||
|
(error context); `dashboard/src/api/client.ts` (correlation id), shared error component,
|
||||||
|
`dashboard/src/features/machines/MachinesPage.tsx:202` (use it).
|
||||||
|
- Stable `error_code` values are an enum (e.g. `AUTH_EXPIRED`, `MACHINE_NOT_FOUND`,
|
||||||
|
`DB_UNAVAILABLE`, `VALIDATION_FAILED`) — documented and reused, not free-text.
|
||||||
|
|
||||||
|
## Security considerations
|
||||||
|
|
||||||
|
- **Don't leak internals to clients.** The structured body carries a safe `message` + code;
|
||||||
|
the full error chain / SQL / stack stays in the **server log** keyed by correlation id.
|
||||||
|
A client gets "DB_UNAVAILABLE (id abc123)", not the connection string.
|
||||||
|
- Correlation id is a random opaque token (no embedded PII/sequence that leaks volume).
|
||||||
|
- Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username
|
||||||
|
exists), while the *log* records the specific reason — valuable internally, safe
|
||||||
|
externally.
|
||||||
|
- The swallow sweep must not start logging secrets (api keys, tokens) that were previously
|
||||||
|
silently dropped — scrub identifiers, never log key material.
|
||||||
|
|
||||||
|
## Testing strategy
|
||||||
|
|
||||||
|
- **Unit:** each `AppError` variant maps to the right status + code + envelope;
|
||||||
|
correlation id present on every error body and matches the response header.
|
||||||
|
- **Integration:** a handler error returns the structured envelope with a correlation id
|
||||||
|
that appears in the captured log; an inbound `x-request-id` is honored end-to-end. A
|
||||||
|
forced DB failure on a former `let _ =` path now emits a `warn!/error!` (assert via log
|
||||||
|
capture) instead of vanishing.
|
||||||
|
- **Dashboard:** 401 renders "session expired / re-login"; 500 renders code + correlation
|
||||||
|
id; fetch-throw renders "can't reach server" — three distinct states, not one generic.
|
||||||
|
- **CI guard:** lint/grep fails the build on a new bare `let _ =` over a `Result` in
|
||||||
|
`server/`.
|
||||||
|
|
||||||
|
## Effort estimate & dependencies
|
||||||
|
|
||||||
|
- **Size: Large but phaseable.** Phase 1 (highest value): `AppError` + correlation-id
|
||||||
|
middleware + the `let _ =` swallow sweep. Phase 2: dashboard error rendering. Phase 3:
|
||||||
|
agent error context. Each ships independently.
|
||||||
|
- **Depends on:** nothing blocking. Synergizes with **SPEC-004** (the agent-auth-loop
|
||||||
|
diagnosis would have been trivial with agent error detail) and improves every other spec's
|
||||||
|
debuggability.
|
||||||
|
- **Unblocks:** faster incident triage across the product; a real correlation-id workflow
|
||||||
|
for support.
|
||||||
|
|
||||||
|
## Open questions
|
||||||
|
|
||||||
|
1. **Envelope migration.** Converge to the single new envelope and update the dashboard
|
||||||
|
union in lockstep, or keep accepting both for a deprecation window? (Dashboard already
|
||||||
|
unions both — low risk to converge.)
|
||||||
|
2. **`error_code` taxonomy.** Agree the initial enum of stable codes and their HTTP status
|
||||||
|
mapping.
|
||||||
|
3. **Correlation id source.** Generate server-side always, or trust an inbound
|
||||||
|
`x-request-id` from the dashboard/NPM when present (and from where is it trusted)?
|
||||||
|
4. **Swallow policy granularity.** Blanket "log all `let _ =` on Result", or an allowlist
|
||||||
|
for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment +
|
||||||
|
keep best-effort socket closes.
|
||||||
Reference in New Issue
Block a user