claudetools/.claude/COORDINATION_PROTOCOL.md
Mike Swanson 73573800b0 feat: coord API — no-auth, DB softfail 503, agent tracking protocol
- coord routers: removed JWT auth requirement (internal-only endpoints)
- error_handler: SQLAlchemy OperationalError/DisconnectionError → 503
  with Retry-After: 30 header instead of 500
- /health: live DB probe (SELECT 1) instead of static response
- CLAUDE.md: "Live State Tracking" section with full agent protocol
  for all projects — session start, lock claim/release, component
  state updates, softfail + local queue catch-up
- COORDINATION_PROTOCOL.md: softfail/catch-up section + server-side
  503 behavior documented

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 08:45:33 -07:00


# Coordination Protocol
Cross-session coordination uses the ClaudeTools API at `http://172.16.3.30:8001/api/coord/`. This replaces PROJECT_STATE.md files.
No auth token required for coordination endpoints — they are internal-only on the 172.16.3.30 private network. Pass `session_id` in the request body or as a query parameter to identify the calling session (e.g., `DESKTOP-0O8A1RL/claude-main`).
---
## When a Lock Is Required
- Editing or creating source code files
- Git commit or push
- SSH command that modifies a server (deploy, install, config change, service restart)
- Database schema change or data migration
- Build pipeline modification
Reading files, planning, and answering questions do NOT require a lock.
---
## Lock Lifecycle
**Step 1 — Check for conflicts**
```
GET /api/coord/locks?project_key=<key>&resource=<resource>
```
- Active lock present: stop, report to user, ask how to proceed.
- Lock `acquired_at` more than 2 hours ago: treat it as stale, note it, release it (Step 4 below), then proceed.
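The age check in Step 1 can be sketched as below. This is a minimal sketch, not the API's own logic: `classify_lock` is a hypothetical helper, and it assumes `acquired_at` is an ISO 8601 timestamp carrying a UTC offset or a trailing `Z` (the response format is not specified in this document). It looks only at age and ignores the "no activity update" condition, because no field for that is documented here.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=2)  # threshold taken from the stale lock rule


def classify_lock(acquired_at: str, now: datetime) -> str:
    """Return "stale" if the lock is older than 2 hours, else "active".

    Assumes acquired_at is ISO 8601 with an offset or "Z" suffix
    (an assumption; adjust to the real response format).
    """
    # fromisoformat() rejects "Z" on older Python versions, so normalize it.
    acquired = datetime.fromisoformat(acquired_at.replace("Z", "+00:00"))
    return "stale" if now - acquired > STALE_AFTER else "active"


if __name__ == "__main__":
    print(classify_lock("2026-05-12T15:30:00Z", datetime.now(timezone.utc)))
```

Passing `now` in explicitly keeps the check deterministic and easy to test.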
**Step 2 — Claim your lock**
```
POST /api/coord/locks
{
"project_key": "gururmm",
"session_id": "DESKTOP-0O8A1RL/claude-main",
"resource": "server/src/api/credentials.rs",
"description": "Adding credential endpoints",
"ttl_hours": 2
}
```
Response: `{ "id": "<uuid>", ... }` — save the `id` for release.
`ttl_hours`: use 2 for normal work; 0 for no expiry (use sparingly).
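For reference, the claim request could be assembled with the Python standard library roughly as follows. `build_claim_request` is a hypothetical helper name; the URL and field names come from the example above. The function only builds the request, so sending it and handling errors remain the caller's job.

```python
import json
import urllib.request

BASE_URL = "http://172.16.3.30:8001"  # coordination API host from this document


def build_claim_request(project_key, session_id, resource, description, ttl_hours=2):
    """Build (but do not send) the POST /api/coord/locks request."""
    payload = {
        "project_key": project_key,
        "session_id": session_id,
        "resource": resource,
        "description": description,
        "ttl_hours": ttl_hours,  # 2 for normal work; 0 means no expiry
    }
    return urllib.request.Request(
        BASE_URL + "/api/coord/locks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       lock_id = json.load(resp)["id"]  # save the id for release
```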
**Step 3 — Do the work**
**Step 4 — Release the lock**
```
DELETE /api/coord/locks/<id>?session_id=<session_id>
```
Release on completion AND on failure. Only the claiming session may release.
**Stale lock rule:** A lock with `acquired_at` older than 2 hours and no activity update is abandoned. Release it, then proceed.
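One way to guarantee the Step 4 release on both completion and failure is a context manager. The sketch below is illustrative only: `held_lock`, `claim`, and `release` are hypothetical names, with `claim` and `release` standing in for thin wrappers around the POST and DELETE calls above.

```python
from contextlib import contextmanager


@contextmanager
def held_lock(claim, release):
    """Claim a lock, yield its id, and always release it afterwards.

    claim() returns the lock id; release(lock_id) releases it.
    Both are caller-supplied callables (hypothetical wrappers).
    """
    lock_id = claim()
    try:
        yield lock_id
    finally:
        release(lock_id)  # runs whether the work succeeded or raised
```

Usage: `with held_lock(claim_fn, release_fn) as lock_id: ...` wraps Step 3, so an exception during the work still triggers the release.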
---
## Component States
Record the current status of named system components so all sessions share a live view.
**Upsert a component state:**
```
PUT /api/coord/components
{
"project_key": "gururmm",
"component": "server",
"state": "deployed",
"version": "0.3.0",
"notes": "Deployed 2026-05-12; credential store live",
"updated_by": "DESKTOP-0O8A1RL/claude-main"
}
```
Valid states (convention — not enforced): `building`, `built`, `deploying`, `deployed`, `degraded`, `unknown`
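Because the state names are a convention rather than an enforced enum, a client may prefer to flag unusual values instead of rejecting them. A minimal sketch, assuming a hypothetical helper `component_payload` that returns the JSON body together with a flag:

```python
import json

# State names listed above; a convention, not enforced by the API.
CONVENTIONAL_STATES = {"building", "built", "deploying", "deployed", "degraded", "unknown"}


def component_payload(project_key, component, state, version, notes, updated_by):
    """Return (json_body, is_conventional) for PUT /api/coord/components."""
    body = {
        "project_key": project_key,
        "component": component,
        "state": state,
        "version": version,
        "notes": notes,
        "updated_by": updated_by,
    }
    # The flag lets the caller warn on typos without blocking the update.
    return json.dumps(body), state in CONVENTIONAL_STATES
```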
**Read all component states for a project:**
```
GET /api/coord/components?project_key=gururmm
```
---
## Workflows and Work Items
Use workflows to track multi-step initiatives that span sessions or days.
**Create a workflow:**
```
POST /api/coord/workflows
{
"project_key": "gururmm",
"name": "Network Discovery Phase 1",
"description": "TCP probe scanner + DB layer + API + dashboard",
"status": "planning",
"created_by": "DESKTOP-0O8A1RL/claude-main"
}
```
**Add work items to a workflow:**
```
POST /api/coord/work-items
{
"workflow_id": "<uuid>",
"project_key": "gururmm",
"title": "Write migrations 017-019 for discovery tables",
"status": "pending",
"priority": 10
}
```
**Update work item status:**
```
PATCH /api/coord/work-items/<id>
{ "status": "completed" }
```
Workflow statuses: `planning`, `active`, `blocked`, `completed`, `cancelled`
Work item statuses: `pending`, `in_progress`, `blocked`, `completed`, `cancelled`
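A status update can be checked against the list above before it is sent. This document does not say whether the server rejects other values, so the client-side check below is an assumption, and `build_status_update` is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "http://172.16.3.30:8001"  # coordination API host from this document
WORK_ITEM_STATUSES = ("pending", "in_progress", "blocked", "completed", "cancelled")


def build_status_update(item_id, status):
    """Build (but do not send) PATCH /api/coord/work-items/<id>."""
    if status not in WORK_ITEM_STATUSES:
        raise ValueError(f"unknown work item status: {status!r}")
    return urllib.request.Request(
        f"{BASE_URL}/api/coord/work-items/{item_id}",
        data=json.dumps({"status": status}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```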
---
## Inter-Session Messages
Send targeted messages between sessions or broadcast to a project.
**Send a message:**
```
POST /api/coord/messages
{
"from_session": "DESKTOP-0O8A1RL/claude-main",
"to_session": "HOWARD-HOME/claude-main",
"project_key": "gururmm",
"subject": "macOS build pipeline ready for wiring",
"body": "build-agents.sh updated. Section marked TODO-MACOS. Wire in from your end."
}
```
Omit `to_session` to broadcast the message to the project. (JSON does not allow inline comments, so keep the request body comment-free.)
**Check for unread messages (do this at session start):**
```
GET /api/coord/messages?to_session=<session_id>&unread_only=true
```
Display each unread message prominently:
```
============================================================
MESSAGE FROM <from_session> — <subject>
============================================================
<body>
============================================================
```
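The banner above can be produced by a small formatter. This is a sketch with a hypothetical function name, `format_message`; the rule width of 60 characters is assumed from the template, and `\u2014` is the dash used between the sender and the subject.

```python
RULE = "=" * 60  # width assumed to match the banner template


def format_message(from_session: str, subject: str, body: str) -> str:
    """Render one unread message in the banner layout shown above."""
    header = f"MESSAGE FROM {from_session} \u2014 {subject}"
    return "\n".join([RULE, header, RULE, body, RULE])


if __name__ == "__main__":
    print(format_message("HOWARD-HOME/claude-main", "Build ready", "See build-agents.sh."))
```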
**Mark as read:**
```
PUT /api/coord/messages/<id>/read
```
---
## Status Overview
Quick snapshot of everything active:
```
GET /api/coord/status
```
Returns: active locks, recent component state changes, active workflows, unread message count.
---
## Session Cleanup
When a session ends cleanly, release all its locks:
```
DELETE /api/coord/locks?session_id=<session_id>&release_all=true
```
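Session IDs such as `DESKTOP-0O8A1RL/claude-main` contain a `/`, so it is safer to percent-encode them when building query strings. A minimal sketch, assuming the server decodes query parameters in the usual way; `release_all_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://172.16.3.30:8001"  # coordination API host from this document


def release_all_url(session_id: str) -> str:
    """Build the URL for DELETE /api/coord/locks with release_all=true."""
    # urlencode() percent-encodes "/" in the session id as %2F.
    query = urlencode({"session_id": session_id, "release_all": "true"})
    return f"{BASE_URL}/api/coord/locks?{query}"
```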
---
## project_key Slugs
| Slug | Project |
|------|---------|
| `gururmm` | GuruRMM server + dashboard |
| `claudetools` | ClaudeTools API + coordination system |
| `dataforth-dos` | Dataforth DOS project |
Slugs are free-form: add new ones as needed. `project_key` is NOT a foreign key into the projects table.
---
## Softfail and Catch-Up
The coordination API must never block work. If it is unavailable:
**On any network error, timeout, or 5xx response (except a 503 with `Retry-After`, covered below):**
1. Log the failed call to `.claude/coord-queue.jsonl` (one JSON object per line):
```json
{"ts":"2026-05-12T15:30:00Z","method":"PUT","path":"/api/coord/components","body":{"project_key":"gururmm","component":"server","state":"deployed","version":"0.3.0","notes":"...","updated_by":"DESKTOP-0O8A1RL/claude-main"}}
```
2. Continue working. Do not retry immediately.
**On 503 with `Retry-After` header:**
Wait the number of seconds given in the header, then retry once. If the retry also fails, queue the call as in step 1 above and continue working.
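The softfail rules above reduce to a small decision function plus a queue-line formatter. The sketch below is one reading of those rules, not a reference implementation: `next_action` and `queue_entry` are hypothetical names, and responses below 500 fall outside the softfail rules, so they are simply passed through.

```python
import json
from datetime import datetime, timezone


def next_action(status, retry_after, already_retried):
    """Decide what to do after a coordination call.

    status is the HTTP status code, or None for a network error or timeout.
    Returns "proceed", "retry", or "queue".
    """
    if status is not None and status < 500:
        return "proceed"  # not a softfail case; handling is out of scope here
    if status == 503 and retry_after is not None and not already_retried:
        return "retry"  # wait retry_after seconds, then retry once
    return "queue"  # log the call to the queue file and continue working


def queue_entry(method, path, body, now):
    """Format one line for .claude/coord-queue.jsonl (now must be in UTC)."""
    entry = {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "method": method,
        "path": path,
        "body": body,
    }
    return json.dumps(entry, separators=(",", ":"))


if __name__ == "__main__":
    print(queue_entry("PUT", "/api/coord/components", {"state": "deployed"},
                      datetime.now(timezone.utc)))
```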
**Catch-up (session start and after `/sync`):**
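```bash
# Replay the queue; entries that fail again are kept for the next catch-up.
queue=.claude/coord-queue.jsonl
if [ -s "$queue" ]; then   # file exists and is non-empty
  failed=$(mktemp)
  while IFS= read -r line; do
    method=$(printf '%s' "$line" | jq -r .method)
    path=$(printf '%s' "$line" | jq -r .path)
    body=$(printf '%s' "$line" | jq -c .body)   # compact JSON for the request body
    # --fail makes curl exit non-zero on HTTP errors (4xx/5xx)
    curl -s --fail -o /dev/null -X "$method" "http://172.16.3.30:8001$path" \
      -H "Content-Type: application/json" -d "$body" || printf '%s\n' "$line" >> "$failed"
  done < "$queue"
  # Remove the file only if all calls succeeded; otherwise keep the failures
  if [ -s "$failed" ]; then mv "$failed" "$queue"; else rm -f "$queue" "$failed"; fi
fi
```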
The queue file lives in `.claude/coord-queue.jsonl` (gitignored — local to each workstation).
---
## API Softfail Behavior (Server Side)
When the MariaDB database is unavailable:
- Coord endpoints return `503 Service Unavailable` with header `Retry-After: 30`
- Response body: `{"detail": "Database unavailable. Retry after 30 seconds.", "retry_after": 30}`
- `GET /health` reflects DB status: `{"status":"degraded","database":"disconnected"}`
This behavior is implemented in the API server and does not need to be coded by agents.
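On the client side, the wait time can be read from either the header or the response body. A minimal sketch, assuming `Retry-After` carries a number of seconds (the HTTP-date form of the header is not handled) and falling back to the documented 30 seconds; `retry_delay` is a hypothetical helper.

```python
import json


def retry_delay(headers, body_text, default=30):
    """Return the seconds to wait after a 503 from a coord endpoint.

    headers is a plain dict of response headers; body_text is the raw body.
    Prefers the Retry-After header, then the JSON "retry_after" field.
    """
    value = headers.get("Retry-After")
    if value is not None and value.strip().isdigit():
        return int(value)
    try:
        payload = json.loads(body_text)
    except ValueError:  # body missing or not valid JSON
        return default
    seconds = payload.get("retry_after") if isinstance(payload, dict) else None
    return seconds if isinstance(seconds, int) else default
```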
---
## Migration Note
`projects/*/PROJECT_STATE.md` files are ARCHIVED — read-only historical reference. Do not edit them. Use this API for all live coordination going forward.