diff --git a/docs/FEATURE_ROADMAP.md b/docs/FEATURE_ROADMAP.md index d36fc15..ebf7161 100644 --- a/docs/FEATURE_ROADMAP.md +++ b/docs/FEATURE_ROADMAP.md @@ -53,6 +53,7 @@ Bringing GC to parity with GuruRMM's release engineering. Full plan: [SPEC-001]( - [ ] Machines "by Company" tree nav with per-company counts — P3 — left-nav grouping sidebar (screenshot parity). Follow-up sub-item of SPEC-005. - [ ] **Universal machine search ("everything is searchable")** — P2 — server-side `?q=` on `/api/machines` matching case-insensitive substring across ALL attributes (OS, logged-on user, external/private IP, company, site, tag, serial, MAC, version, …), pg_trgm GIN-indexed; multi-term AND + optional field-scoped syntax (`os:`, `user:`, `ip:`). Replaces the hostname-only client filter. Depends on SPEC-003 (attrs must be persisted). ([SPEC-006](specs/SPEC-006-universal-machine-search.md)) - [ ] **Managed-agent installer builder ("Build Installer")** — P2 — dashboard wizard to build a pre-labeled persistent-agent installer (Name/Company/Site/Department/Device Type/Tag/Type) with Download / Copy URL / Send Link, reusing the existing embed-config download path; adds department + device_type to EmbeddedConfig/AgentStatus so labels persist at install time. Pairs with revocable per-machine keys; signature-vs-appended-config is the key open question. ([SPEC-007](specs/SPEC-007-managed-agent-installer-builder.md)) +- [ ] **Valuable error messages (structured errors + no silent swallows)** — P2 — one structured API error envelope with stable codes + a correlation id that also lands in the logs; contextual tracing on server/agent; sweep the 37 `let _ =` swallows (the pattern that hid the migration-005 bug); dashboard surfaces the real cause + id instead of a generic line. ([SPEC-008](specs/SPEC-008-valuable-error-messages.md)) - [ ] Programmatic session pre-create + viewer-token (integration contract) — P2 ## Security & Infrastructure diff --git a/docs/specs/SPEC-008-valuable-error-messages.md b/docs/specs/SPEC-008-valuable-error-messages.md new file mode 100644 index 0000000..aeb186a --- /dev/null +++ b/docs/specs/SPEC-008-valuable-error-messages.md @@ -0,0 +1,152 @@ +# SPEC-008: Valuable Error Messages (Structured Errors + No Silent Swallows) + +**Status:** Proposed +**Priority:** P2 +**Requested By:** Mike (2026-05-30) +**Estimated Effort:** Large (phaseable) + +## Overview + +Make every error in GuruConnect carry information a human can act on: a single structured +API error envelope with a stable code and a **correlation id** that also appears in the +server logs, consistent contextual logging on the server and agent, an end to silent +error swallowing, and a dashboard that shows the *real* cause (with the id to quote to +support) instead of a hardcoded generic line. Success = given any error a tech sees, they +(or we) can determine what failed and why — and a user-reported "it broke" comes with a +correlation id that pulls the exact server log line. + +This is a cross-cutting quality initiative. **A parallel feature request governs GuruRMM** +(both are Rust/Axum); keep the conventions consistent where practical, but this spec +governs GuruConnect only. + +## Current state (researched — this is the problem) + +- **Inconsistent API error shapes.** Bare `(StatusCode, &str)` tuples + (`server/src/api/mod.rs:101,106`), a per-file `ErrorResponse` struct + (`server/src/api/auth.rs:36`), and **two different JSON envelopes** in the wild — + `{error}` and `{detail, error_code, status_code}` — which the dashboard already has to + union to cope (`dashboard/src/api/client.ts:48-52`). No single error type. +- **No correlation/request id.** Only `TraceLayer::new_for_http()` + (`server/src/main.rs:471`); nothing ties a response a user saw to a log line. +- **37 `let _ =` swallows** in `server/src/` — many on fallible DB ops: + `create_session` (`relay/mod.rs:594`), `end_session` (`:686`), `log_event` + (`:132,191,303,604`), `mark_machine_offline` (`:687`), + `link_session_to_code` (`:640`). This is the exact pattern that hid the **migration-005 + missing-columns bug** — `update_machine_metadata` failed silently for an unknown period. +- **Generic dashboard message.** `MachinesPage.tsx:202` shows "The GuruConnect server did + not respond. Check the relay status, then retry." regardless of cause (auth expired? 500? + network? CORS?), even though `ApiError` (`client.ts:12`) already carries status + code. +- **Thin agent errors.** The GuruConnect-client auth-failure-loop incident (2026-05-30) + produced no useful local signal about *why* auth failed or *which* server it was hitting. + +## Scope + +### Included in v1 + +- **One structured API error type.** A server `AppError` enum (`thiserror`) with a single + `IntoResponse` producing one envelope: + `{ error_code: string, message: string, detail?: object, correlation_id: string }` + + consistent HTTP status mapping per variant. Replace the ad-hoc tuples and per-file + `ErrorResponse`; converge the two envelopes to one. +- **Correlation-id middleware.** Generate (or accept inbound `x-request-id`) a request id, + put it on the tracing span, echo it in the response header **and** in every `AppError` + body, so a reported id greps directly to the log. +- **Logging convention.** Errors logged via `tracing` include the operation + relevant + identifiers (`agent_id`/`machine_uid`/`session_id`) + the underlying error chain + (`anyhow` context / `thiserror` source) — never a bare "failed". Use `#[instrument]` + span fields so context is attached automatically. +- **Kill silent swallows.** Sweep all `let _ = ()`: convert to + `if let Err(e) = … { warn!/error!(error = %e, op = "…", id = …) }`. Where fire-and-forget + is genuinely correct (e.g. best-effort socket close on an already-dead connection), + keep it but add a one-line comment stating why. A clippy lint / CI grep guards against + new bare `let _ =` on `Result`. +- **Dashboard surfaces the real error.** Render `ApiError.code` + `message` + + `correlation_id` ("quote this to support: …"); distinguish network-down (fetch threw) + vs auth-expired (401 → re-login) vs server-500. Remove the hardcoded generic at + `MachinesPage.tsx:202`; reuse one error component across pages. +- **Agent error detail.** Connection/auth/capture failures log *what* failed and *why* — + the target server URL, the relay's stated reason, the auth mode attempted — so a + misbehaving agent is self-diagnosing (directly addresses the auth-loop incident). + +### Explicitly out of scope + +- Centralized log aggregation / external error tracking (Sentry-style) — local structured + logs + correlation id only in v1. +- User-facing localized error catalog / i18n. +- Changing protobuf-level agent↔relay error frames beyond adding context to existing + disconnect reasons (a deeper protocol error model is a later item). +- Rewriting business logic — this is error *plumbing/observability*, not behavior change. + +## Architecture + +- **Relay-server:** new `server/src/error.rs` (`AppError` + `IntoResponse` + status map); + handlers return `Result`; correlation-id middleware in + `server/src/main.rs` (layer beside `TraceLayer`, :471). `relay/mod.rs` swallows fixed in + place with context logging. +- **Agent:** `anyhow` context added at error boundaries in `agent/src/transport/` and + `agent/src/session/`; connection/auth failures `error!`-logged with server URL + reason. +- **Dashboard:** `dashboard/src/api/client.ts` keeps/extends `ApiError` to carry + `correlation_id`; a shared `ErrorBanner`/`ErrorState` component; pages stop hardcoding + messages. +- **Protobuf/DB:** none required (existing `Disconnect.reason` string can carry richer + text; no schema change). + +## Implementation details + +- Files to touch: `server/src/error.rs` (new); `server/src/api/*.rs` (adopt `AppError`, + drop `ErrorResponse`/tuple returns — `mod.rs:101,106`, `auth.rs:36`); `server/src/main.rs` + (correlation-id middleware, :471); `server/src/relay/mod.rs` (swallow sweep — :132,191, + 303,594,604,640,686,687,695, …); `agent/src/transport/*`, `agent/src/session/mod.rs` + (error context); `dashboard/src/api/client.ts` (correlation id), shared error component, + `dashboard/src/features/machines/MachinesPage.tsx:202` (use it). +- Stable `error_code` values are an enum (e.g. `AUTH_EXPIRED`, `MACHINE_NOT_FOUND`, + `DB_UNAVAILABLE`, `VALIDATION_FAILED`) — documented and reused, not free-text. + +## Security considerations + +- **Don't leak internals to clients.** The structured body carries a safe `message` + code; + the full error chain / SQL / stack stays in the **server log** keyed by correlation id. + A client gets "DB_UNAVAILABLE (id abc123)", not the connection string. +- Correlation id is a random opaque token (no embedded PII/sequence that leaks volume). +- Auth errors stay deliberately coarse to the client (e.g. don't reveal whether a username + exists), while the *log* records the specific reason — valuable internally, safe + externally. +- The swallow sweep must not start logging secrets (api keys, tokens) that were previously + silently dropped — scrub identifiers, never log key material. + +## Testing strategy + +- **Unit:** each `AppError` variant maps to the right status + code + envelope; + correlation id present on every error body and matches the response header. +- **Integration:** a handler error returns the structured envelope with a correlation id + that appears in the captured log; an inbound `x-request-id` is honored end-to-end. A + forced DB failure on a former `let _ =` path now emits a `warn!/error!` (assert via log + capture) instead of vanishing. +- **Dashboard:** 401 renders "session expired / re-login"; 500 renders code + correlation + id; fetch-throw renders "can't reach server" — three distinct states, not one generic. +- **CI guard:** lint/grep fails the build on a new bare `let _ =` over a `Result` in + `server/`. + +## Effort estimate & dependencies + +- **Size: Large but phaseable.** Phase 1 (highest value): `AppError` + correlation-id + middleware + the `let _ =` swallow sweep. Phase 2: dashboard error rendering. Phase 3: + agent error context. Each ships independently. +- **Depends on:** nothing blocking. Synergizes with **SPEC-004** (the agent-auth-loop + diagnosis would have been trivial with agent error detail) and improves every other spec's + debuggability. +- **Unblocks:** faster incident triage across the product; a real correlation-id workflow + for support. + +## Open questions + +1. **Envelope migration.** Converge to the single new envelope and update the dashboard + union in lockstep, or keep accepting both for a deprecation window? (Dashboard already + unions both — low risk to converge.) +2. **`error_code` taxonomy.** Agree the initial enum of stable codes and their HTTP status + mapping. +3. **Correlation id source.** Generate server-side always, or trust an inbound + `x-request-id` from the dashboard/NPM when present (and from where is it trusted)? +4. **Swallow policy granularity.** Blanket "log all `let _ =` on Result", or an allowlist + for the genuinely best-effort socket sends? Proposed: log all DB/state ops; comment + + keep best-effort socket closes.