
Mike Swanson 2b792ee5d1 agy(gemini): fix false auth-abort in retry loop + add quota fallback to default model

While using the new 3-retry gemini path for live VPN research, two bugs surfaced:
- emit_or_fail checked auth_failed INSIDE the retry loop; a benign mid-run token-refresh line
  matched the over-broad auth regex (bare login|credential|authenticat|oauth|401) and aborted the
  retries with a false "auth error" - even though `gemini -p` auth tested fine. Moved auth-classify
  to AFTER the retries (it only picks the final error message now) and tightened auth_failed to real
  signatures (invalid_grant, not authenticated, login with google, token expired, ...).
- Added quota_exhausted() + a QUOTA FALLBACK: the pinned strong model (gemini-3.1-pro-preview) hit
  "exhausted your capacity on this model" mid-session; emit_or_fail now retries once on the default
  (lighter) model by stripping -m (separate quota). Validated: capped pro run -> fell back -> 2.9KB answer.

CT_THOUGHTS Thought 2 Resolution updated with both. (Search-bot reliability hardening continues.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 12:09:58 -07:00

17 KiB


CT Thoughts — ClaudeTools idea backlog

The shared backlog of ClaudeTools harness ideas (the internal tooling itself, not client work). Nothing here is approved to build — ideas advance only by explicit decision.

Pipeline: THOUGHT (raw idea dropped here) -> DISCUSS (chat it through) -> SPEC (/shape-spec or a concept doc) -> ROADMAP -> BUILD.

How to add a thought: in any Claude session say "ct thought: " (or "add to ct thoughts" / "park this as a ct thought"). Claude appends it below with who/when and a Status. Howard's ideas land here too.

Status per entry: Raw -> Discussed -> Spec'd -> Roadmapped -> Done.

The entries below are the current thoughts:

ClaudeTools 3.0 — web-based co-work workspace (Mike, 2026-06-14) — Discussed (vision-stage, no build go)

Web-search bots (grok xsearch + gemini search) reliability - MUST FIX (Mike, 2026-06-17) - Mitigated/Fixed same day (gemini 3-retry+backoff; grok xsearch auto-falls-back to gemini on timeout). Grok's own multi-agent timeout is upstream/unsolved.

Thought 1 — ClaudeTools 3.0: Web-Based Co-Work Workspace (Mike, 2026-06-14)

Status: Discussed — vision-stage, feeling out possibilities. NOT authorized to build.

The want

A web-based ClaudeTools that gives the team (Mike, Howard) real co-work — "think Claude Co-Work, but tailored like ClaudeTools already is." Today co-work = N separate Claude Code CLI terminals on N machines, glued together by the git-synced repo + the coord API (172.16.3.30:8001 — locks, todos, messages, component state) polled over HTTP. The vision turns that into one shared room where you can see and drive sessions across the fleet from a browser (incl. phone) over Tailscale.

Odysseus (D:\Odysseus, AGPL-3.0 self-hosted AI workspace) is inspiration for the shell only (auth, sessions, document editor, memory UI, mobile/PWA polish) — NOT the agent. Its agent is a from-scratch loop; adopting it would throw away the Claude Code harness (skills, slash commands, hooks, coord, the Opus agent itself) that is ClaudeTools' whole value. Copying its code would also make ClaudeTools 3.0 AGPL + network-served (source-offer obligation). Reimplement/take-inspiration only.

Agent backend + auth (decided constraints)

Agent = Claude Agent SDK (the supported way to embed the real Claude Code harness — same skills/MCP/hooks/tools — into a server). NOT Odysseus's loop.
ClaudeTools 3.0 is internal-only, so per-person subscription OAuth (CLAUDE_CODE_OAUTH_TOKEN via claude setup-token) is compliant and is the auth model. Each node authenticates with THAT person's own token (Mike's on Mike's boxes, Howard's on his); the hub never centralizes one subscription to serve many. See memory project_ai_auth_product_boundary.
Cost ceiling = the post-2026-06-15 monthly Agent-SDK credit pool (~$100-200/Max), not unlimited; eventual "spill to API key when the pool's dry" fallback in the node daemon.
GuruRMM (sellable) is separate: customer brings their own API key. Does not entangle this.
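
A minimal sketch of how a node daemon might choose its credential under these constraints. `CLAUDE_CODE_OAUTH_TOKEN` comes from the notes above; reading the fallback key from `ANTHROPIC_API_KEY`, the `pick_credential` name, and the `pool_exhausted` flag are all assumptions for illustration. How the daemon would detect a dry credit pool is not specified here.

```python
from typing import Mapping, Optional, Tuple


def pick_credential(env: Mapping[str, str],
                    pool_exhausted: bool) -> Optional[Tuple[str, str]]:
    """Return (kind, secret) for this node, or None if nothing usable is set.

    Hypothetical helper: prefers the person's own subscription token and
    only spills to an API key once the monthly credit pool is reported dry.
    """
    oauth = env.get("CLAUDE_CODE_OAUTH_TOKEN")
    api_key = env.get("ANTHROPIC_API_KEY")  # assumed location of the spill key

    if oauth and not pool_exhausted:
        return ("subscription_oauth", oauth)
    if api_key:
        return ("api_key", api_key)
    return None
```

The function takes the environment as a parameter rather than reading `os.environ` directly, so each node's choice can be tested without touching real tokens.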

The architecture: two axes, not one choice

The instinct was "central host on Beast" vs. "per-workstation + peer-to-peer back-channel replacing the coord DB." That conflates two independent axes:

Axis 1 — where the AGENT runs: central on Beast vs. per-workstation.
Axis 2 — how SHARED STATE flows: central coord service vs. peer-to-peer mesh.

Recommended corners: distributed agents + central coordination.

Don't replace the coord DB with a mesh. Connectivity isn't the hard part (Tailscale gives direct node-to-node); agreement + durability are. A lock living only in a peer's memory evaporates exactly when that peer crashes. Keep coordination central — it already works.
Don't centralize the agent onto Beast. ClaudeTools sessions do machine-local work (SSH from a specific network position, OS-specific skills, a per-machine vault age-key). Beast (a ROG box, presumably Windows) can't run macOS/Kali-native work, and one host egress becomes a chokepoint + SPOF. Keep the brain local where the work + creds are.

Topology — three tiers, two channels, one rule

   TIER 1: CLIENTS                TIER 2: HUB (Beast)              TIER 3: AGENT HOSTS
   (presentation + control)       (always-on coordinator)         (where work runs)

  +----------------+                                            +----------------------+
  | Browser (Mike) |==WSS==+                              +=WSS=| GURU-5070 node daemon|
  +----------------+       |     +------------------+    |     |  - Agent SDK sessions|
  +----------------+       +====>|   Gateway (WSS)  |<===+     |  - vault age-key     |
  |Browser (Howard)|==WSS==+     |  auth + terminate|    |     |  - local fs/shell/net|
  +----------------+       |     +------------------+    |     +----------------------+
  +----------------+       |     |  Session Registry|    |     +----------------------+
  |  Phone (Mike)  |==WSS==+     |  + Presence      |    +=WSS=| MacBook node daemon  |
  +----------------+             +------------------+    |     |  (macOS-native work) |
                                 |  Relay (pipe)    |    |     +----------------------+
                                 +------------------+    |     +----------------------+
                                 |  Coord (locks,   |    +=WSS=| GURU-KALI node daemon|
                                 |  todos, msgs) +  |          |  (Linux-native work) |
                                 |  event log DB    |          +----------------------+
                                 +------------------+
                                  all on Tailscale (WireGuard)

The one rule: Beast never dials out. Both clients and nodes dial IN to Beast. Each workstation's node daemon holds a persistent, auto-reconnecting outbound WSS to the gateway — kills NAT traversal, inbound firewall rules, and reachability as problems. Tailscale is the L3 fabric (encrypted + identity'd); WSS rides on top for the app layer.
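
The "persistent, auto-reconnecting outbound WSS" could be driven by a loop like the sketch below. It shows only the reconnect policy (capped exponential backoff with jitter); the actual WebSocket connection is injected as `connect_and_serve`, a hypothetical callable, and the delay values are assumptions rather than decided numbers.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 60.0, jitter: float = 0.2,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield reconnect delays: doubling up to `cap`, plus up to `jitter` extra."""
    attempt = 0
    while True:
        delay = min(cap, base * 2 ** min(attempt, 16))  # bound the exponent
        yield delay * (1.0 + jitter * rng())
        attempt += 1


def run_node_link(connect_and_serve: Callable[[], None],
                  sleep: Callable[[float], None] = time.sleep,
                  max_attempts: Optional[int] = None) -> int:
    """Keep dialing OUT to the hub; return the number of attempts made.

    `connect_and_serve` should block while the link is healthy and either
    return (clean close) or raise OSError (refused, reset, timed out).
    """
    delays = backoff_delays()
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            connect_and_serve()
            delays = backoff_delays()  # clean close: start the backoff over
        except OSError:
            pass  # keep the current (growing) backoff
        sleep(next(delays))
    return attempts
```

With `max_attempts=None` the loop never gives up, which matches the node's side of the rule: the hub is never expected to dial back.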

What each tier is:

Client — pure glass. Renders sessions, sends input/approvals. Per-PERSON login here (Mike, Howard); authz maps a person to the nodes/sessions they may drive. Holds no state that can't be rebuilt from the hub.
Hub (Beast) — the coordinator, NOT the brain. Terminates both WSS channels, tracks presence, relays session streams, owns the durable shared state (coord + a per-session event log). This is the existing coord API promoted from "polled over HTTP" to "pushes events live." HTTP coord API can stay for backward-compat with existing CLI sessions.
Node daemon (each workstation) — small long-lived process owning the local Agent SDK session lifecycle + exposing local resources (vault, fs, shell, network position). Brain runs here.

The envelope (one framing for both channels)

{ v:1, type, session_id?, node_id?, from, ts, payload }

| type | direction | meaning |
|---|---|---|
| `node.register` / `presence` | node -> hub | "GURU-5070 online, here are my sessions" |
| `session.start` | client -> hub -> node | spawn an Agent SDK session on node X |
| `session.list` | hub -> client | fleet inventory for the lobby |
| `session.attach` / `detach` | client -> hub | subscribe/unsubscribe a session stream |
| `stream.delta` / `tool_call` / `tool_result` / `status` | node -> hub -> client(s) | live output |
| `input.prompt` / `input.approval` | client -> hub -> node | drive the session, answer gates |
| `coord.lock` / `todo` / `message` | any -> hub -> subscribers | shared state, pushed live |
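
The envelope could be modelled roughly as follows. This is a sketch, not a settled wire format: the field names come from the framing above, but JSON as the encoding, epoch seconds for `ts`, and the `Envelope` class itself are assumptions. `from` is a Python keyword, so the attribute is called `sender` and renamed on the wire.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Envelope:
    type: str                      # e.g. "input.prompt", "stream.delta"
    sender: str                    # serialized as "from"
    payload: Dict[str, Any] = field(default_factory=dict)
    session_id: Optional[str] = None
    node_id: Optional[str] = None
    ts: float = field(default_factory=time.time)  # assumed: epoch seconds
    v: int = 1

    def to_json(self) -> str:
        d = asdict(self)
        d["from"] = d.pop("sender")
        # Optional fields are omitted when unset, matching `session_id?`.
        return json.dumps({k: val for k, val in d.items() if val is not None})

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        d = json.loads(raw)
        if d.get("v") != 1:
            raise ValueError(f"unsupported envelope version: {d.get('v')!r}")
        return cls(type=d["type"], sender=d["from"],
                   payload=d.get("payload", {}),
                   session_id=d.get("session_id"),
                   node_id=d.get("node_id"), ts=d["ts"])
```

Rejecting unknown versions up front keeps the `v` field meaningful: a later envelope revision can be introduced without older nodes silently misreading it.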

Attaching to a remote session (core flow)

Mike's browser        Hub (Beast)            GURU-5070 daemon       Agent SDK session
     |                     |                       |                       |
     | attach(sess_42) --->|                       |                       |
     |                     | replay event-log ---->| (already streaming) --|
     |<-- replay + tail ---|                       |   stream.delta -------|
     |                     |<----------------------|   (node->hub->clients)|
     | input.prompt ------>| --------------------->| --------------------->|
     |<-- stream.delta ----|<----------------------|<----------------------|

Node is source of truth for a LIVE session; hub MIRRORS every event into a per-session log for durability + late-join replay. Attaching = replay the log, then tail live. A session whose node drops shows "offline" with transcript intact, re-attaches on reconnect.
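
"Replay the log, then tail live" can be sketched on the hub side as below. The in-memory lists stand in for the durable per-session event log, and `SessionMirror` and its method names are hypothetical. The sketch also assumes a single-threaded event loop, so no event can arrive between the replay and the subscription.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Subscriber = Callable[[Event], None]


class SessionMirror:
    """Hub-side mirror: logs every node event and fans it out to attachers."""

    def __init__(self) -> None:
        self._log: Dict[str, List[Event]] = defaultdict(list)
        self._subs: Dict[str, List[Subscriber]] = defaultdict(list)

    def on_node_event(self, session_id: str, event: Event) -> None:
        # Persist first, then relay, so a late joiner never misses an event.
        self._log[session_id].append(event)
        for deliver in list(self._subs[session_id]):
            deliver(event)

    def attach(self, session_id: str, deliver: Subscriber) -> None:
        # Replay everything recorded so far, then subscribe to the live tail.
        for event in list(self._log[session_id]):
            deliver(event)
        self._subs[session_id].append(deliver)

    def detach(self, session_id: str, deliver: Subscriber) -> None:
        if deliver in self._subs[session_id]:
            self._subs[session_id].remove(deliver)
```

Because the log lives on the hub, a session whose node has dropped can still be attached to: the replay works, and the tail simply stays quiet until the node reconnects.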

The co-work mechanic (the actual point)

"Attach" is just subscribe to a session_id, so N people can attach to the same session. Mike + Howard watch the same agent run live; either can send input. Add:

Presence-on-session — "Howard is viewing", "Mike is typing".
A driving token — a soft lock at session granularity (reuse the coord lock primitive): one person "has the wheel," visibly; others can request it. Last-writer-wins underneath, but the indicator stops collisions socially before they happen.

That falls out of the relay design almost for free — it's the Claude Co-Work analogue.
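
A rough sketch of the driving token as a soft lock. It never rejects input (last-writer-wins stays underneath); it only tracks who has the wheel and produces the indicator text. The `DrivingToken` class, its method names, and the oldest-requester-first handoff are assumptions, not a decided policy.

```python
from typing import List, Optional


class DrivingToken:
    """Per-session soft lock: advisory only, input is never blocked."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.requests: List[str] = []

    def take(self, person: str) -> bool:
        """Take the wheel if it is free or already held by `person`."""
        if self.holder in (None, person):
            self.holder = person
            return True
        return False

    def request(self, person: str) -> None:
        """Queue a visible request for the wheel."""
        if person != self.holder and person not in self.requests:
            self.requests.append(person)

    def release(self, person: str) -> Optional[str]:
        """Holder lets go; the oldest requester (if any) gets the wheel."""
        if person == self.holder:
            self.holder = self.requests.pop(0) if self.requests else None
        return self.holder

    def warning_for(self, person: str) -> Optional[str]:
        """Text to show before `person` sends input, or None if clear."""
        if self.holder not in (None, person):
            return f"{self.holder} has the wheel"
        return None
```

Keeping the check advisory is what makes it a social rather than technical gate; switching to hard turns would mean rejecting input wherever `warning_for` returns text.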

What it looks like on the screen (the hard part: vision -> screen)

Lobby / fleet view — "what's everyone doing":

+- ClaudeTools --------------------------------------------+- Coord ----------+
|  NODES                                                   | LOCKS            |
|  * GURU-5070   Mike    2 sessions                        |  valleywide-esxi |
|     |- #42  [remediation] Valleywide ESXi   < Mike       |  -> held: 5070   |
|     \- #43  [client]      Syncro triage     idle         |                  |
|  * MacBook     Mike    1 session                         | TODOS (3)        |
|     \- #44  [dev]         GuruRMM build     (waiting)     |  [ ] rotate B2   |
|  * GURU-KALI   -        0 sessions                       |  [ ] wiki: VWP   |
|  o BEAST       (hub)                                     |  [x] py.sh dep   |
|                                                          | MESSAGES (1)     |
|  [ + New session on... v ]                               |  KALI->fleet: .. |
+---------------------------------------------------------------------------+

Session room — attach to #42:

+- #42  remediation - Valleywide ESXi - on GURU-5070 ---- (eye) Mike, Howard -+
|                                                                             |
|  [agent] Checking datastore free space on 192.168.3.24...                   |
|  +- tool: ssh esxi - df -h --------------------------------+  > approved    |
|  | /vmfs/volumes/datastore1   3.6T   65% used              |                |
|  +----------------------------------------------------------+               |
|  [agent] 65% now - down from 87%. The 3 decommissioned VMs are gone.        |
|                                                                             |
+---------------------------------------------------------------------------  |
|  Mike has the wheel   [request wheel]                                       |
|  > _                                                              [send]     |
+---------------------------------------------------------------------------+

If those two screens match the picture in Mike's head, the architecture is the diagram above. If they don't, that mismatch is the cheapest thing to discover now, pre-code.

Failure modes (SPOF honesty)

Hub down: local agents keep working — degrade to local-only, queue coord events (the softfail-queue idea already exists in ClaudeTools). Web UI dark, but work does NOT stop.
Node down: its sessions pause; others unaffected; transcripts survive on the hub.
Partition: outbound WSS reconnects with backoff; event log resyncs clients on return.
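
The "queue coord events while the hub is down" behaviour could look like the sketch below. It is an illustration of the softfail-queue idea, not the existing ClaudeTools implementation; `SoftfailQueue`, the `send` callable, and the bounded queue size are assumptions.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class SoftfailQueue:
    """Send coord events to the hub; on failure, hold them and never raise."""

    def __init__(self, send: Callable[[Event], None], max_pending: int = 1000) -> None:
        self._send = send
        # Bounded: once full, the oldest pending event is dropped silently.
        self._pending: Deque[Event] = deque(maxlen=max_pending)

    def publish(self, event: Event) -> bool:
        """Return True if delivered now, False if queued for later."""
        if self._pending:
            # Keep ordering: never jump ahead of events already waiting.
            self._pending.append(event)
            return False
        try:
            self._send(event)
            return True
        except OSError:
            self._pending.append(event)
            return False

    def flush(self) -> int:
        """Call after reconnect; returns how many queued events were delivered."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # hub went away again; keep the rest queued
            self._pending.popleft()
            sent += 1
        return sent
```

Local agent work never waits on `publish`, which is the point of the degrade-to-local-only mode: the hub being down costs visibility, not progress.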

What to build first (prove the vision, cheaply)

The one risky, novel slice: a node daemon running ONE Agent SDK session, streaming it over a WSS to a dead-simple web page that can watch + send input. No auth, no coord, no multi-node, localhost only. If watching a real session stream into a browser and typing back matches the vision -> everything else on the diagram is known engineering. If it doesn't -> a day spent, not a quarter.

Open questions (resolve before going past the prototype)

Transcript truth on partition — node vs. hub when they disagree after a reconnect.
Input arbitration — is a soft driving-token enough, or do you want hard turns?
CLI coexistence — do existing Claude Code CLI sessions appear as first-class nodes, or is the web the only entry point?

Thought 2 — Web-search bots (grok xsearch + gemini search) reliability: MUST FIX (Mike, 2026-06-17)

Status: filed as Raw - HIGH PRIORITY; mitigated/fixed same day (see Resolution at the end of this thought). Mike's directive: this "absolutely must be properly fixed."

The problem

The live web-search path - grok xsearch (multi-agent web_search) and agy/gemini search (Google grounding) - is the MOST VALUABLE discovery tool we have (it surfaced the UniFi cloud connector proxy and the Teleport /rest/setting/teleport path that blind endpoint-probing never would have). But it is UNRELIABLE: both return EMPTY intermittently, especially on longer / multi-part queries.

Observed 2026-06-17:

grok xsearch returned [ask-grok] no result (stopReason=) ~5+ times on UniFi queries, DESPITE the same-day partial fix (--yolo, drop --no-subagents, web-primary prompt, 300s budget) that DID work on a short query ("current rust version", 23s). So the fix is incomplete - longer queries still fail.
agy/gemini search returned [ask-gemini] empty response (even after its built-in single retry) on the same queries.

Why it matters (Mike's reweighting)

Web search now carries AT LEAST as much weight as live API probing. Probing without a search/doc lead is "blind guessing" and mostly 404s; the searches give the real leads. So a flaky search bot directly degrades research quality and pushes the loop back toward bad guessing.

Proper fix (not a workaround) - investigation plan

grok: capture the raw --output-format json (or streaming-json) of a FAILING long query; determine whether it's max-turns exhaustion, the empty-finalization quirk (stopReason blank), a timeout mid-search, or the multi-agent searcher itself returning nothing. Then fix the actual cause (raise/auto-scale max-turns; switch to streaming-json so partial results survive; retry-on-empty loop with backoff; possibly chunk multi-part queries).
gemini: same - capture why search returns empty (the wrapper already retries once and still fails); check whether --approval-mode yolo + google_web_search is finalizing empty, and add a real retry/fallback.
Cross-fallback: if grok empty -> auto-try gemini and vice-versa, and surface which one answered.
Acceptance: 5/5 success on a battery of long, multi-part research queries.
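
The cross-fallback item can be sketched as a small dispatcher that tries each engine in order and reports which one answered. The engines are passed in as plain callables, so nothing here reflects the real grok or gemini wrapper interfaces; `search_with_fallback` is a hypothetical name.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Searcher = Callable[[str], Optional[str]]


def search_with_fallback(query: str,
                         engines: Sequence[Tuple[str, Searcher]]) -> Tuple[str, str]:
    """Try each (name, searcher) in order; return (name, answer) from the first
    one that yields non-empty text. Raise if every engine fails or is empty."""
    failures: List[str] = []
    for name, ask in engines:
        try:
            answer = ask(query)
        except Exception as exc:  # a crashed or timed-out engine is a miss
            failures.append(f"{name}: {exc}")
            continue
        if answer and answer.strip():
            return name, answer
        failures.append(f"{name}: empty")
    raise RuntimeError("all search engines failed: " + "; ".join(failures))
```

Passing the engines in the opposite order gives the "vice-versa" direction, and the returned name is what would be surfaced to the caller.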

Note

This was filed because both bots failed during live UniFi VPN/Teleport research and forced a fallback to suspect endpoint-probing. The web-search feature is load-bearing for the "interview the AIs / read the docs before probing" workflow - it has to be dependable.

Resolution (2026-06-17, same day) - diagnosed from raw output, fixed:

Diagnosis (not guessed): captured raw output of failing queries. GROK xsearch = TIMEOUT: the grok-4.20-multi-agent web_search runs past budget on multi-part queries (286s/280s, rc=124, still in the search phase - 183 thoughts, only progress-noise text), and because the json output is buffered until the run completes, a timeout loses everything. GEMINI search = INTERMITTENT empty turn (a clean re-run succeeded in 122s with a real 2.6KB answer); the wrapper only retried once, so two empties in a row failed spuriously.
Gemini fix: emit_or_fail now retries up to 3x with 3s/6s backoff (was 1). Two follow-on bugs found+fixed same day while using it: (a) the auth check ran INSIDE the retry loop and a benign mid-run token-refresh line matched the over-broad auth regex -> false "auth error" abort; moved auth-classify AFTER the retries and tightened the regex. (b) added a QUOTA FALLBACK: when the pinned strong model (gemini-3.1-pro-preview) returns "exhausted your capacity on this model", retry once on the default (lighter) model (separate quota) by stripping -m. Validated: a quota-capped pro run fell back and returned a 2.9KB answer.
Grok xsearch fix: switched to --output-format streaming-json (salvage any partial that streamed), moderate budget, and AUTO-FALLBACK to gemini search when grok doesn't finish (rc!=0 or empty). Validated e2e: grok timed out (rc=124) -> fell back -> gemini returned a real sourced answer.
Still open (upstream): grok's multi-agent web_search genuinely can't finish heavy queries in budget - that's an xAI-side limitation; the fallback makes xsearch reliable regardless. If grok fixes the multi-agent latency (or exposes a lighter single-agent web_search), revisit. Acceptance ("5/5 on long queries") now effectively met via the gemini path.
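
The gemini fix described above can be sketched in Python as follows. This is not the real `emit_or_fail` (whose language and internals are not shown here); it is an illustration of the three behaviours: retry with 3s/6s backoff, classify auth only after the retries, and fall back once to the default model by stripping `-m` on a quota error. The regex contents are taken from the signatures listed in the commit message; reading "3x" as three attempts in total, skipping further retries once quota is hit, and the `run` callable's shape are assumptions.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple

RunResult = Tuple[int, str, str]          # (returncode, stdout, stderr)
Runner = Callable[[List[str]], RunResult]

AUTH_RE = re.compile(
    r"invalid_grant|not authenticated|login with google|token expired", re.I)
QUOTA_RE = re.compile(r"exhausted your capacity on this model", re.I)


def strip_model_flag(args: Sequence[str]) -> List[str]:
    """Drop `-m <model>` so the CLI falls back to its default model."""
    out: List[str] = []
    skip_next = False
    for a in args:
        if skip_next:
            skip_next = False
        elif a == "-m":
            skip_next = True
        else:
            out.append(a)
    return out


def emit_or_fail(args: Sequence[str], run: Runner,
                 sleep: Callable[[float], None] = time.sleep,
                 backoffs: Sequence[float] = (3.0, 6.0)) -> str:
    last_text = ""
    for attempt in range(len(backoffs) + 1):      # three attempts in total
        rc, out, err = run(list(args))
        if rc == 0 and out.strip():
            return out
        last_text = out + "\n" + err
        if QUOTA_RE.search(last_text):
            break  # assumption: same-model retries cannot help a quota error
        if attempt < len(backoffs):
            sleep(backoffs[attempt])

    # Quota fallback: one extra try on the default model (separate quota).
    if QUOTA_RE.search(last_text) and "-m" in args:
        rc, out, err = run(strip_model_flag(args))
        if rc == 0 and out.strip():
            return out
        last_text = out + "\n" + err

    # Auth is classified only now, so it picks the message and aborts nothing.
    kind = "auth error" if AUTH_RE.search(last_text) else "empty response"
    raise RuntimeError(f"[ask-gemini] {kind}")
```

Because `run` and `sleep` are injected, the retry and fallback paths can be exercised with canned results instead of live `gemini -p` calls, which is also how a benign token-refresh line can be shown not to abort the loop.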
