While using the new 3-retry gemini path for live VPN research, two bugs surfaced:

- emit_or_fail checked auth_failed INSIDE the retry loop; a benign mid-run token-refresh
  line matched the over-broad auth regex (bare login|credential|authenticat|oauth|401)
  and aborted the retries with a false "auth error" - even though `gemini -p` auth tested
  fine. Moved auth-classify to AFTER the retries (it only picks the final error message
  now) and tightened auth_failed to real signatures (invalid_grant, not authenticated,
  login with google, token expired, ...).
- Added quota_exhausted() + a QUOTA FALLBACK: the pinned strong model
  (gemini-3.1-pro-preview) hit "exhausted your capacity on this model" mid-session;
  emit_or_fail now retries once on the default (lighter) model by stripping -m (separate
  quota). Validated: capped pro run -> fell back -> 2.9KB answer.

CT_THOUGHTS Thought 2 Resolution updated with both.

(Search-bot reliability hardening continues.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

283 lines
17 KiB
Markdown
# CT Thoughts — ClaudeTools idea backlog

> The shared backlog of ClaudeTools harness ideas (the internal tooling itself, not
> client work). Nothing here is approved to build — ideas advance only by explicit
> decision.
>
> **Pipeline:** THOUGHT (raw idea dropped here) -> DISCUSS (chat it through) ->
> SPEC (`/shape-spec` or a concept doc) -> ROADMAP -> BUILD.
>
> **How to add a thought:** in any Claude session say "ct thought: <idea>" (or
> "add to ct thoughts" / "park this as a ct thought"). Claude appends it below with
> who/when and a Status. Howard's ideas land here too.
>
> **Status per entry:** Raw -> Discussed -> Spec'd -> Roadmapped -> Done.
>
> The entries below are the current thoughts:
> 1. ClaudeTools 3.0 — web-based co-work workspace (Mike, 2026-06-14) — **Discussed (vision-stage, no build go)**
> 2. Web-search bots (grok xsearch + gemini search) reliability - MUST FIX (Mike, 2026-06-17) - **Mitigated/Fixed same day (gemini 3-retry+backoff; grok xsearch auto-falls-back to gemini on timeout). Grok's own multi-agent timeout is upstream/unsolved.**

---

## Thought 1 — ClaudeTools 3.0: Web-Based Co-Work Workspace (Mike, 2026-06-14)

**Status: Discussed — vision-stage, feeling out possibilities. NOT authorized to build.**

### The want

A web-based ClaudeTools that gives the team (Mike, Howard) real co-work — "think Claude
Co-Work, but tailored like ClaudeTools already is." Today co-work = N separate Claude Code
CLI terminals on N machines, glued together by the git-synced repo + the coord API
(`172.16.3.30:8001` — locks, todos, messages, component state) polled over HTTP. The
vision turns that into one shared room where you can see and drive sessions across the
fleet from a browser (incl. phone) over Tailscale.

Odysseus (`D:\Odysseus`, AGPL-3.0 self-hosted AI workspace) is **inspiration for the shell
only** (auth, sessions, document editor, memory UI, mobile/PWA polish) — NOT the agent.
Its agent is a from-scratch loop; adopting it would throw away the Claude Code harness
(skills, slash commands, hooks, coord, the Opus agent itself) that is ClaudeTools' whole
value. Copying its code would also make ClaudeTools 3.0 AGPL + network-served (source-offer
obligation). Reimplement/take-inspiration only.

### Agent backend + auth (decided constraints)

- Agent = **Claude Agent SDK** (the supported way to embed the real Claude Code harness —
  same skills/MCP/hooks/tools — into a server). NOT Odysseus's loop.
- ClaudeTools 3.0 is **internal-only**, so per-person **subscription OAuth**
  (`CLAUDE_CODE_OAUTH_TOKEN` via `claude setup-token`) is compliant and is the auth model.
  Each node authenticates with THAT person's own token (Mike's on Mike's boxes, Howard's on
  his); the hub never centralizes one subscription to serve many. See memory
  `project_ai_auth_product_boundary`.
- Cost ceiling = the post-2026-06-15 monthly Agent-SDK credit pool (~$100-200/Max), not
  unlimited; eventual "spill to API key when the pool's dry" fallback in the node daemon.
- GuruRMM (sellable) is separate: customer brings their own API key. Does not entangle this.

### The architecture: two axes, not one choice

The instinct was "central host on Beast" vs. "per-workstation + peer-to-peer back-channel
replacing the coord DB." That conflates two independent axes:
- **Axis 1 — where the AGENT runs:** central on Beast vs. per-workstation.
- **Axis 2 — how SHARED STATE flows:** central coord service vs. peer-to-peer mesh.

Recommended corners: **distributed agents + central coordination.**
- Don't replace the coord DB with a mesh. Connectivity isn't the hard part (Tailscale gives
  direct node-to-node); **agreement + durability are.** A lock living only in a peer's memory
  evaporates exactly when that peer crashes. Keep coordination central — it already works.
- Don't centralize the agent onto Beast. ClaudeTools sessions do machine-local work (SSH from
  a specific network position, OS-specific skills, a per-machine vault age-key). Beast (a ROG
  box, presumably Windows) can't run macOS/Kali-native work, and one host egress becomes a
  chokepoint + SPOF. Keep the brain local where the work + creds are.

### Topology — three tiers, two channels, one rule

```
TIER 1: CLIENTS                TIER 2: HUB (Beast)           TIER 3: AGENT HOSTS
(presentation + control)       (always-on coordinator)       (where work runs)

+----------------+                                           +----------------------+
| Browser (Mike) |==WSS==+                             +=WSS=| GURU-5070 node daemon|
+----------------+       |     +------------------+    |     |  - Agent SDK sessions|
+----------------+       +====>| Gateway (WSS)    |<===+     |  - vault age-key     |
|Browser (Howard)|==WSS==+     |  auth + terminate|    |     |  - local fs/shell/net|
+----------------+       |     +------------------+    |     +----------------------+
+----------------+       |     |  Session Registry|    |     +----------------------+
| Phone (Mike)   |==WSS==+     |  + Presence      |    +=WSS=| MacBook node daemon  |
+----------------+             +------------------+    |     |  (macOS-native work) |
                               |  Relay (pipe)    |    |     +----------------------+
                               +------------------+    |     +----------------------+
                               |  Coord (locks,   |    +=WSS=| GURU-KALI node daemon|
                               |  todos, msgs) +  |          |  (Linux-native work) |
                               |  event log DB    |          +----------------------+
                               +------------------+
                            all on Tailscale (WireGuard)
```

**The one rule: Beast never dials out. Both clients and nodes dial IN to Beast.** Each
workstation's node daemon holds a persistent, auto-reconnecting **outbound** WSS to the
gateway — kills NAT traversal, inbound firewall rules, and reachability as problems.
Tailscale is the L3 fabric (encrypted + identity'd); WSS rides on top for the app layer.

**What each tier is:**
- **Client** — pure glass. Renders sessions, sends input/approvals. Per-PERSON login here
  (Mike, Howard); authz maps a person to the nodes/sessions they may drive. Holds no state
  that can't be rebuilt from the hub.
- **Hub (Beast)** — the coordinator, NOT the brain. Terminates both WSS channels, tracks
  presence, relays session streams, owns the durable shared state (coord + a per-session
  event log). This is the existing coord API promoted from "polled over HTTP" to "pushes
  events live." HTTP coord API can stay for backward-compat with existing CLI sessions.
- **Node daemon (each workstation)** — small long-lived process owning the local Agent SDK
  session lifecycle + exposing local resources (vault, fs, shell, network position). Brain
  runs here.

### The envelope (one framing for both channels)

```
{ v:1, type, session_id?, node_id?, from, ts, payload }
```

| type | direction | meaning |
|---|---|---|
| `node.register` / `presence` | node -> hub | "GURU-5070 online, here are my sessions" |
| `session.start` | client -> hub -> node | spawn an Agent SDK session on node X |
| `session.list` | hub -> client | fleet inventory for the lobby |
| `session.attach` / `detach` | client -> hub | subscribe/unsubscribe a session stream |
| `stream.delta` / `tool_call` / `tool_result` / `status` | node -> hub -> client(s) | live output |
| `input.prompt` / `input.approval` | client -> hub -> node | drive the session, answer gates |
| `coord.lock` / `todo` / `message` | any -> hub -> subscribers | shared state, pushed live |

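A minimal Python sketch of how the envelope could be modeled and validated. It assumes JSON on the wire and `ts` as epoch seconds; neither is specified above. The `Envelope` class and its `sender` attribute are illustrative names (`sender` maps to the `from` key, which is a reserved word in Python).

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

ENVELOPE_VERSION = 1  # the "v:1" in the framing above


@dataclass
class Envelope:
    type: str                      # e.g. "session.attach", "stream.delta"
    sender: str                    # serialized as "from"
    payload: Dict[str, Any] = field(default_factory=dict)
    session_id: Optional[str] = None   # optional, per "session_id?"
    node_id: Optional[str] = None      # optional, per "node_id?"
    ts: float = field(default_factory=time.time)  # assumption: epoch seconds
    v: int = ENVELOPE_VERSION

    def to_json(self) -> str:
        """Serialize, leaving out optional fields that are unset."""
        wire: Dict[str, Any] = {
            "v": self.v, "type": self.type, "from": self.sender,
            "ts": self.ts, "payload": self.payload,
        }
        if self.session_id is not None:
            wire["session_id"] = self.session_id
        if self.node_id is not None:
            wire["node_id"] = self.node_id
        return json.dumps(wire)

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        """Parse and reject frames with a wrong version or missing keys."""
        wire = json.loads(raw)
        if wire.get("v") != ENVELOPE_VERSION:
            raise ValueError(f"unsupported envelope version: {wire.get('v')!r}")
        missing = [k for k in ("type", "from", "ts", "payload") if k not in wire]
        if missing:
            raise ValueError(f"envelope missing keys: {missing}")
        return cls(
            type=wire["type"], sender=wire["from"], payload=wire["payload"],
            session_id=wire.get("session_id"), node_id=wire.get("node_id"),
            ts=wire["ts"], v=wire["v"],
        )
```

Because both channels share one framing, the hub could route on `type` and `session_id` alone without knowing whether a frame came from a client or a node.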
### Attaching to a remote session (core flow)

```
Mike's browser        Hub (Beast)          GURU-5070 daemon        Agent SDK session
     |                     |                       |                       |
     | attach(sess_42) --->|                       |                       |
     |                     | replay event-log ---->| (already streaming) --|
     |<-- replay + tail ---|                       |   stream.delta -------|
     |                     |<----------------------|   (node->hub->clients)|
     | input.prompt ------>| --------------------->| --------------------->|
     |<-- stream.delta ----|<----------------------|<----------------------|
```

Node is source of truth for a LIVE session; hub MIRRORS every event into a per-session log
for durability + late-join replay. Attaching = replay the log, then tail live. A session
whose node drops shows "offline" with transcript intact, re-attaches on reconnect.

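The "replay the log, then tail live" rule can be sketched in a few lines of Python. This is an in-memory, single-threaded illustration only: `SessionMirror` and its method names are hypothetical, and a real hub would persist the log and make replay plus subscribe atomic with respect to incoming events.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = dict
Subscriber = Callable[[Event], None]


class SessionMirror:
    """Hub-side mirror: one append-only log plus live subscribers per session."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[Event]] = defaultdict(list)
        self._subs: Dict[str, List[Subscriber]] = defaultdict(list)

    def record(self, session_id: str, event: Event) -> None:
        """Called for every event a node forwards: append first, then fan out."""
        self._logs[session_id].append(event)
        for deliver in list(self._subs[session_id]):
            deliver(event)

    def attach(self, session_id: str, deliver: Subscriber) -> None:
        """Late join: replay everything logged so far, then receive live events."""
        for event in list(self._logs[session_id]):
            deliver(event)
        self._subs[session_id].append(deliver)

    def detach(self, session_id: str, deliver: Subscriber) -> None:
        """Stop tailing; the log itself is untouched."""
        if deliver in self._subs[session_id]:
            self._subs[session_id].remove(deliver)
```

Since the log outlives the subscribers, a session whose node has dropped can still be attached to and read; it simply receives no new events until the node reconnects.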
### The co-work mechanic (the actual point)

"Attach" is just *subscribe to a session_id*, so **N people can attach to the same session.**
Mike + Howard watch the same agent run live; either can send input. Add:
- **Presence-on-session** — "Howard is viewing", "Mike is typing".
- **A driving token** — a soft lock at session granularity (reuse the coord lock primitive):
  one person "has the wheel," visibly; others can request it. Last-writer-wins underneath,
  but the indicator stops collisions socially before they happen.

That falls out of the relay design almost for free — it's the Claude Co-Work analogue.

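One way the soft driving token could behave, as a hedged Python sketch. `DrivingToken` is a hypothetical stand-in, not the existing coord lock primitive, and handing the wheel to the next requester on release is an assumption. The key property from the text is kept: input is never rejected (last-writer-wins), the token only drives a visible warning.

```python
from typing import List, Optional


class DrivingToken:
    """Advisory "who has the wheel" marker for one session."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.waiting: List[str] = []

    def take(self, person: str) -> bool:
        """Take the wheel if it is free or already yours."""
        if self.holder is None or self.holder == person:
            self.holder = person
            if person in self.waiting:
                self.waiting.remove(person)
            return True
        return False

    def request(self, person: str) -> None:
        """Queue a visible request for the wheel."""
        if person != self.holder and person not in self.waiting:
            self.waiting.append(person)

    def release(self, person: str) -> Optional[str]:
        """Release the wheel; assumption: it passes to the next requester."""
        if person == self.holder:
            self.holder = self.waiting.pop(0) if self.waiting else None
        return self.holder

    def check_input(self, person: str) -> dict:
        """Soft lock: input is always accepted, the UI only gets a warning flag."""
        return {"accepted": True, "warn": self.holder not in (None, person)}
```

If hard turns are wanted later (open question 2), only `check_input` would need to change from warning to rejecting.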
### What it looks like on the screen (the hard part: vision -> screen)

**Lobby / fleet view** — "what's everyone doing":
```
+- ClaudeTools -------------------------------------------+- Coord ----------+
| NODES                                                   | LOCKS            |
| * GURU-5070   Mike    2 sessions                        | valleywide-esxi  |
|   |- #42 [remediation] Valleywide ESXi     < Mike       |  -> held: 5070   |
|   \- #43 [client] Syncro triage            idle         |                  |
| * MacBook     Mike    1 session                         | TODOS (3)        |
|   \- #44 [dev] GuruRMM build               (waiting)    | [ ] rotate B2    |
| * GURU-KALI   -       0 sessions                        | [ ] wiki: VWP    |
| o BEAST (hub)                                           | [x] py.sh dep    |
|                                                         | MESSAGES (1)     |
| [ + New session on... v ]                               | KALI->fleet: ..  |
+----------------------------------------------------------------------------+
```

**Session room** — attach to #42:
```
+- #42 remediation - Valleywide ESXi - on GURU-5070 ---- (eye) Mike, Howard -+
|                                                                            |
| [agent] Checking datastore free space on 192.168.3.24...                   |
| +- tool: ssh esxi - df -h ---------------------------------+ > approved    |
| | /vmfs/volumes/datastore1   3.6T   65% used               |               |
| +----------------------------------------------------------+               |
| [agent] 65% now - down from 87%. The 3 decommissioned VMs are gone.        |
|                                                                            |
+----------------------------------------------------------------------------+
| Mike has the wheel                                         [request wheel] |
| > _                                                                 [send] |
+----------------------------------------------------------------------------+
```

If those two screens match the picture in Mike's head, the architecture is the diagram
above. If they don't, that mismatch is the cheapest thing to discover now, pre-code.

### Failure modes (SPOF honesty)

- **Hub down:** local agents keep working — degrade to local-only, queue coord events (the
  softfail-queue idea already exists in ClaudeTools). Web UI dark, but work does NOT stop.
- **Node down:** its sessions pause; others unaffected; transcripts survive on the hub.
- **Partition:** outbound WSS reconnects with backoff; event log resyncs clients on return.

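A rough Python sketch of the two node-side behaviours named above: reconnect delays with capped exponential backoff, and a queue that holds coord events while the hub is unreachable. The function and class names, the base/cap values, and the `send` callback signature are all assumptions for illustration; this is not the existing ClaudeTools softfail queue.

```python
import random
from collections import deque
from typing import Callable, Deque, Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6,
                   jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield capped exponential delays; jitter adds up to that fraction on top."""
    rng = rng or random.Random()
    for n in range(attempts):
        delay = min(cap, base * 2 ** n)
        yield delay + rng.uniform(0, jitter * delay)


class SoftfailQueue:
    """Hold coord events while the hub is down; flush in order when it returns."""

    def __init__(self, send: Callable[[dict], bool]) -> None:
        self._send = send              # returns True when the hub accepted the event
        self._pending: Deque[dict] = deque()

    def emit(self, event: dict) -> None:
        """Queue the event and try to deliver immediately; never raises on hub-down."""
        self._pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Send queued events oldest-first, stopping at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A node daemon would call `flush()` right after each successful reconnect, so local work continues during an outage and shared state catches up afterwards.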
### What to build first (prove the vision, cheaply)

The one risky, novel slice: a node daemon running ONE Agent SDK session, streaming it over
a WSS to a dead-simple web page that can watch + send input. No auth, no coord, no
multi-node, localhost only. If watching a real session stream into a browser and typing back
matches the vision -> everything else on the diagram is known engineering. If it doesn't ->
a day spent, not a quarter.

### Open questions (resolve before going past the prototype)

1. **Transcript truth on partition** — node vs. hub when they disagree after a reconnect.
2. **Input arbitration** — is a soft driving-token enough, or do you want hard turns?
3. **CLI coexistence** — do existing Claude Code CLI sessions appear as first-class nodes,
   or is the web the only entry point?

---

## Thought 2 — Web-search bots (grok xsearch + gemini search) reliability: MUST FIX (Mike, 2026-06-17)

**Status: Mitigated/Fixed same day (see Resolution below). Filed as Raw - HIGH PRIORITY; Mike's directive: this "absolutely must be properly fixed."**

### The problem

The live web-search path - `grok xsearch` (multi-agent web_search) and `agy/gemini search`
(Google grounding) - is the MOST VALUABLE discovery tool we have (it surfaced the UniFi cloud
connector proxy and the Teleport `/rest/setting/teleport` path that blind endpoint-probing never
would have). But it is UNRELIABLE: both return EMPTY intermittently, especially on longer /
multi-part queries.

Observed 2026-06-17:
- `grok xsearch` returned `[ask-grok] no result (stopReason=)` ~5+ times on UniFi queries,
  DESPITE the same-day partial fix (--yolo, drop --no-subagents, web-primary prompt, 300s budget)
  that DID work on a short query ("current rust version", 23s). So the fix is incomplete - longer
  queries still fail.
- `agy/gemini search` returned `[ask-gemini] empty response` (even after its built-in single retry)
  on the same queries.

### Why it matters (Mike's reweighting)

Web search now carries AT LEAST as much weight as live API probing. Probing without a search/doc
lead is "blind guessing" and mostly 404s; the searches give the real leads. So a flaky search bot
directly degrades research quality and pushes the loop back toward bad guessing.

### Proper fix (not a workaround) - investigation plan

1. **grok:** capture the raw `--output-format json` (or `streaming-json`) of a FAILING long query;
   determine whether it's max-turns exhaustion, the empty-finalization quirk (stopReason blank), a
   timeout mid-search, or the multi-agent searcher itself returning nothing. Then fix the actual cause
   (raise/auto-scale max-turns; switch to `streaming-json` so partial results survive; retry-on-empty
   loop with backoff; possibly chunk multi-part queries).
2. **gemini:** same - capture why `search` returns empty (the wrapper already retries once and still
   fails); check whether `--approval-mode yolo` + `google_web_search` is finalizing empty, and add a
   real retry/fallback.
3. **Cross-fallback:** if grok empty -> auto-try gemini and vice-versa, and surface which one answered.
4. Acceptance: 5/5 success on a battery of long, multi-part research queries.

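The cross-fallback in step 3 could look roughly like the following Python sketch. `search_with_fallback` and the `Searcher` callable type are hypothetical; each searcher stands in for a wrapper around one CLI and is assumed to return the answer text, return an empty string on an empty turn, or raise on timeout or non-zero exit.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a searcher takes the query and returns answer text (or ""/None).
Searcher = Callable[[str], Optional[str]]


def search_with_fallback(query: str,
                         searchers: Sequence[Tuple[str, Searcher]]) -> Tuple[str, str]:
    """Try each bot in order; return (bot_name, answer) for the first real answer."""
    failures: List[str] = []
    for name, run in searchers:
        try:
            answer = run(query)
        except Exception as exc:  # e.g. a timeout surfaced by the wrapper
            failures.append(f"{name}: {exc}")
            continue
        if answer and answer.strip():
            return name, answer   # surfaces which bot answered
        failures.append(f"{name}: empty response")
    raise RuntimeError("all search bots failed: " + "; ".join(failures))
```

Passing the list in the opposite order gives the "vice-versa" direction without any extra code.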
### Note

This was filed because both bots failed during live UniFi VPN/Teleport research and forced a fallback
to suspect endpoint-probing. The web-search feature is load-bearing for the "interview the AIs / read
the docs before probing" workflow - it has to be dependable.

### Resolution (2026-06-17, same day) - diagnosed from raw output, fixed:

- **Diagnosis (not guessed):** captured raw output of failing queries. GROK xsearch = TIMEOUT: the
  grok-4.20-multi-agent web_search runs past budget on multi-part queries (286s/280s, rc=124, still in
  the search phase - 183 thoughts, only progress-noise text), and buffered `json` => total loss. GEMINI
  search = INTERMITTENT empty turn (a clean re-run succeeded in 122s with a real 2.6KB answer); the
  wrapper only retried once, so two empties in a row failed spuriously.
- **Gemini fix:** `emit_or_fail` now retries up to 3x with 3s/6s backoff (was 1). Two follow-on bugs
  found+fixed same day while using it: (a) the auth check ran INSIDE the retry loop and a benign mid-run
  token-refresh line matched the over-broad auth regex -> false "auth error" abort; moved auth-classify
  AFTER the retries and tightened the regex. (b) added a QUOTA FALLBACK: when the pinned strong model
  (gemini-3.1-pro-preview) returns "exhausted your capacity on this model", retry once on the default
  (lighter) model (separate quota) by stripping -m. Validated: a quota-capped pro run fell back and
  returned a 2.9KB answer.
- **Grok xsearch fix:** switched to `--output-format streaming-json` (salvage any partial that streamed),
  moderate budget, and **AUTO-FALLBACK to gemini search** when grok doesn't finish (rc!=0 or empty).
  Validated e2e: grok timed out (rc=124) -> fell back -> gemini returned a real sourced answer.
- **Still open (upstream):** grok's multi-agent web_search genuinely can't finish heavy queries in
  budget - that's an xAI-side limitation; the fallback makes xsearch reliable regardless. If grok fixes
  the multi-agent latency (or exposes a lighter single-agent web_search), revisit. Acceptance ("5/5 on
  long queries") now effectively met via the gemini path.
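The shape of the gemini fix can be illustrated with a Python sketch. This is not the wrapper's actual code: `emit_or_fail_sketch`, `strip_model_flag` and the injected `run` callable are hypothetical, the auth patterns are only the signatures named in the fix (the real list is longer), and the exact ordering of quota fallback versus retries is an assumption. It reads "3x with 3s/6s backoff" as three attempts with two waits.

```python
import re
import time
from typing import Callable, List, Tuple

# Tight auth signatures; checked only AFTER the retries, on the final error.
AUTH_RE = re.compile(
    r"invalid_grant|not authenticated|login with google|token expired", re.I)
QUOTA_RE = re.compile(r"exhausted your capacity on this model", re.I)
BACKOFFS = (3, 6)  # seconds to wait between the three attempts

# Assumption: run(argv) executes the CLI and returns (rc, stdout, stderr).
Runner = Callable[[List[str]], Tuple[int, str, str]]


def strip_model_flag(argv: List[str]) -> List[str]:
    """Drop "-m <model>" so the CLI falls back to its default model."""
    out: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False
        elif arg == "-m":
            skip_next = True
        else:
            out.append(arg)
    return out


def emit_or_fail_sketch(argv: List[str], run: Runner,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    last_err = ""
    for attempt in range(len(BACKOFFS) + 1):
        rc, out, err = run(argv)
        if rc == 0 and out.strip():
            return out
        last_err = err
        if QUOTA_RE.search(err) and "-m" in argv:
            # Quota fallback: one retry on the default model, then stop.
            rc, out, err = run(strip_model_flag(argv))
            if rc == 0 and out.strip():
                return out
            last_err = err
            break
        if attempt < len(BACKOFFS):
            sleep(BACKOFFS[attempt])
    # Classification only picks the final message; it never aborts a retry.
    kind = "auth error" if AUTH_RE.search(last_err) else "empty response"
    raise RuntimeError(f"[ask-gemini] {kind}")
```

Because `run` and `sleep` are injected, the retry and fallback paths can be exercised with scripted results instead of live CLI calls.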