- coord routers: removed JWT auth requirement (internal-only endpoints)
- error_handler: SQLAlchemy OperationalError/DisconnectionError → 503 with Retry-After: 30 header instead of 500
- /health: live DB probe (SELECT 1) instead of static response
- CLAUDE.md: "Live State Tracking" section with full agent protocol for all projects: session start, lock claim/release, component state updates, softfail + local queue catch-up
- COORDINATION_PROTOCOL.md: softfail/catch-up section + server-side 503 behavior documented

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Coordination Protocol

Cross-session coordination uses the ClaudeTools API at `http://172.16.3.30:8001/api/coord/`. This replaces PROJECT_STATE.md files.

No auth token is required for coordination endpoints; they are internal-only on the 172.16.3.30 private network. Pass `session_id` in the request body or as a query parameter to identify the calling session (e.g., `DESKTOP-0O8A1RL/claude-main`).

---

## When a Lock Is Required

- Editing or creating source code files
- Git commit or push
- SSH command that modifies a server (deploy, install, config change, service restart)
- Database schema change or data migration
- Build pipeline modification

Reading files, planning, and answering questions do NOT require a lock.

---

## Lock Lifecycle

**Step 1: Check for conflicts**
```
GET /api/coord/locks?project_key=<key>&resource=<resource>
```
- Active lock present: stop, report to user, ask how to proceed.
- Lock `acquired_at` > 2 hours ago: note it, release it (Step 4 below), then proceed.

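The 2-hour stale check in Step 1 can be sketched in Python as follows. This is a minimal illustration, not part of the API: the helper name `is_stale` is hypothetical, and it assumes `acquired_at` is an ISO 8601 UTC timestamp (the format used for `ts` in the queue example later in this document). The protocol does not name the field that carries activity updates, so the sketch checks age only.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=2)  # threshold taken from the stale lock rule


def is_stale(acquired_at: str, now: datetime | None = None) -> bool:
    """Return True if the lock was acquired more than 2 hours before `now`.

    Assumption: acquired_at looks like "2026-05-12T15:30:00Z" (ISO 8601, UTC).
    """
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
    acquired = datetime.fromisoformat(acquired_at.replace("Z", "+00:00"))
    if acquired.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        acquired = acquired.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - acquired > STALE_AFTER
```

Calling `is_stale(lock["acquired_at"])` compares against the current UTC time; passing `now` explicitly makes the check reproducible.
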
**Step 2: Claim your lock**
```
POST /api/coord/locks
{
  "project_key": "gururmm",
  "session_id": "DESKTOP-0O8A1RL/claude-main",
  "resource": "server/src/api/credentials.rs",
  "description": "Adding credential endpoints",
  "ttl_hours": 2
}
```
Response: `{ "id": "<uuid>", ... }`. Save the `id` for release.

`ttl_hours`: use 2 for normal work; 0 for no expiry (use sparingly).

**Step 3: Do the work**

**Step 4: Release the lock**
```
DELETE /api/coord/locks/<id>?session_id=<session_id>
```
Release on completion AND on failure. Only the claiming session may release.

**Stale lock rule:** A lock with `acquired_at` older than 2 hours and no activity update is abandoned. Release it, then proceed.

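Steps 2 through 4 map naturally onto a context manager, which makes the release-on-failure rule hard to forget. The sketch below is one possible client-side shape, not the ClaudeTools implementation: `coord_lock` and `http_call` are hypothetical names, the 10-second timeout is an arbitrary choice, and the response is assumed to be JSON with an `id` field as shown in Step 2. Network errors propagate unchanged here, so softfail handling would have to wrap these calls. Wrapping the work in `with coord_lock(...) as lock_id:` ensures the `DELETE` runs whether the body finishes or raises.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request
from contextlib import contextmanager

BASE_URL = "http://172.16.3.30:8001"  # host given in this document


def http_call(method: str, path: str, body: dict | None = None) -> dict:
    """Send one JSON request to the coordination API and decode the reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


@contextmanager
def coord_lock(project_key, session_id, resource, description,
               ttl_hours=2, call=http_call):
    """Claim a lock (Step 2), yield its id, and always release it (Step 4)."""
    lock = call("POST", "/api/coord/locks", {
        "project_key": project_key,
        "session_id": session_id,
        "resource": resource,
        "description": description,
        "ttl_hours": ttl_hours,
    })
    try:
        yield lock["id"]
    finally:
        # Runs on normal exit and on exception: "completion AND failure".
        query = urllib.parse.urlencode({"session_id": session_id})
        call("DELETE", f"/api/coord/locks/{lock['id']}?{query}")
```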
---

## Component States

Record the current status of named system components so all sessions share a live view.

**Upsert a component state:**
```
PUT /api/coord/components
{
  "project_key": "gururmm",
  "component": "server",
  "state": "deployed",
  "version": "0.3.0",
  "notes": "Deployed 2026-05-12; credential store live",
  "updated_by": "DESKTOP-0O8A1RL/claude-main"
}
```

Valid states (convention, not enforced): `building`, `built`, `deploying`, `deployed`, `degraded`, `unknown`

**Read all component states for a project:**
```
GET /api/coord/components?project_key=gururmm
```

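Since the state list is a convention rather than a server-side constraint, a client can check it before sending. The following sketch builds the `PUT /api/coord/components` body and warns on an unconventional state instead of rejecting it. `component_payload` is a hypothetical helper name, and treating `version` and `notes` as optional is an assumption.

```python
from __future__ import annotations

import warnings

# The conventional states listed above; the server does not enforce them.
CONVENTIONAL_STATES = {
    "building", "built", "deploying", "deployed", "degraded", "unknown",
}


def component_payload(project_key: str, component: str, state: str,
                      updated_by: str, version: str | None = None,
                      notes: str | None = None) -> dict:
    """Build the request body for a component state upsert."""
    if state not in CONVENTIONAL_STATES:
        # Warn rather than raise, because the list is only a convention.
        warnings.warn(f"unconventional component state: {state!r}")
    payload = {
        "project_key": project_key,
        "component": component,
        "state": state,
        "updated_by": updated_by,
    }
    if version is not None:
        payload["version"] = version
    if notes is not None:
        payload["notes"] = notes
    return payload
```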
---

## Workflows and Work Items

Use workflows to track multi-step initiatives that span sessions or days.

**Create a workflow:**
```
POST /api/coord/workflows
{
  "project_key": "gururmm",
  "name": "Network Discovery Phase 1",
  "description": "TCP probe scanner + DB layer + API + dashboard",
  "status": "planning",
  "created_by": "DESKTOP-0O8A1RL/claude-main"
}
```

**Add work items to a workflow:**
```
POST /api/coord/work-items
{
  "workflow_id": "<uuid>",
  "project_key": "gururmm",
  "title": "Write migrations 017-019 for discovery tables",
  "status": "pending",
  "priority": 10
}
```

**Update work item status:**
```
PATCH /api/coord/work-items/<id>
{ "status": "completed" }
```

Workflow statuses: `planning`, `active`, `blocked`, `completed`, `cancelled`
Work item statuses: `pending`, `in_progress`, `blocked`, `completed`, `cancelled`

---

## Inter-Session Messages

Send targeted messages between sessions or broadcast to a project.

**Send a message:**
```
POST /api/coord/messages
{
  "from_session": "DESKTOP-0O8A1RL/claude-main",
  "to_session": "HOWARD-HOME/claude-main",
  "project_key": "gururmm",
  "subject": "macOS build pipeline ready for wiring",
  "body": "build-agents.sh updated. Section marked TODO-MACOS. Wire in from your end."
}
```
Omit `to_session` to broadcast to the project.

**Check for unread messages (do this at session start):**
```
GET /api/coord/messages?to_session=<session_id>&unread_only=true
```

Display each unread message prominently:
```
============================================================
MESSAGE FROM <from_session> — <subject>
============================================================
<body>
============================================================
```
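
A small formatter keeps the banner consistent across sessions. This sketch assumes each message returned by the `GET` call carries the same `from_session`, `subject`, and `body` fields used when sending; `format_message` is a hypothetical name. The dash in the header line is written as the escape `\u2014` so that it matches the template above.

```python
# Rule line copied from the display template above.
RULE = "============================================================"


def format_message(msg: dict) -> str:
    """Render one unread message in the banner layout shown above."""
    # Assumption: the API returns from_session, subject, and body keys.
    header = f"MESSAGE FROM {msg['from_session']} \u2014 {msg['subject']}"
    return "\n".join([RULE, header, RULE, msg["body"], RULE])
```

Printing `format_message(m)` for each item in the unread list, before any other output, covers the session-start check.
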

**Mark as read:**
```
PUT /api/coord/messages/<id>/read
```

---

## Status Overview

Quick snapshot of everything active:
```
GET /api/coord/status
```
Returns: active locks, recent component state changes, active workflows, unread message count.

---

## Session Cleanup

When a session ends cleanly, release all its locks:
```
DELETE /api/coord/locks?session_id=<session_id>&release_all=true
```

---

## project_key Slugs

| Slug | Project |
|------|---------|
| `gururmm` | GuruRMM server + dashboard |
| `claudetools` | ClaudeTools API + coordination system |
| `dataforth-dos` | Dataforth DOS project |

Slugs are free-form; add new ones as needed. `project_key` does NOT foreign-key to the projects table.

---

## Softfail and Catch-Up

The coordination API must never block work. If it is unavailable:

**On any network error, timeout, or 5xx response:**
1. Log the failed call to `.claude/coord-queue.jsonl` (one JSON object per line):
   ```json
   {"ts":"2026-05-12T15:30:00Z","method":"PUT","path":"/api/coord/components/gururmm/server","body":{"state":"deployed","version":"0.3.0","notes":"...","updated_by":"DESKTOP-0O8A1RL/claude-main"}}
   ```
2. Continue working. Do not retry immediately.

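Step 1 of the list above amounts to appending one JSON line per failed call. A minimal sketch, assuming the entry fields shown in the example (`ts`, `method`, `path`, `body`); the function name `queue_failed_call` and its parameters are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

QUEUE_FILE = Path(".claude/coord-queue.jsonl")  # location given in this document


def queue_failed_call(method, path, body, queue_file=QUEUE_FILE, now=None):
    """Append one failed API call to the local queue as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    entry = {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "method": method,
        "path": path,
        "body": body,
    }
    # Create .claude/ on first use so the append cannot fail on a fresh checkout.
    queue_file.parent.mkdir(parents=True, exist_ok=True)
    with queue_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, separators=(",", ":")) + "\n")
    return entry
```

Because the file is opened in append mode, earlier queued calls are preserved and replay order matches the order of failure.
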
**On 503 with `Retry-After` header:**
This is the one exception to the no-retry rule above. Wait the number of seconds given in `Retry-After`, then retry once. If the retry also fails, queue the call.

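The retry-once rule can be expressed as a small decision function. In this sketch `send` is a hypothetical callable that performs one request and returns `(status, headers)`, raising `OSError` (which covers `urllib.error.URLError` and timeouts) on network failure; `sleep` is injectable so the logic can be exercised without waiting. It assumes `Retry-After` is given in whole seconds, as in the server behavior described in this document.

```python
import time


def needs_queue(send, sleep=time.sleep) -> bool:
    """Return True if the call failed and should be written to the local queue."""
    for attempt in (1, 2):
        try:
            status, headers = send()
        except OSError:
            # Network error or timeout: queue it, no immediate retry.
            return True
        if status < 500:
            # Not a server-side failure, so nothing to queue.
            return False
        # Assumption: headers is a mapping keyed exactly as "Retry-After".
        retry_after = headers.get("Retry-After", "")
        if attempt == 1 and status == 503 and retry_after.isdigit():
            sleep(int(retry_after))  # wait, then make the single retry
            continue
        return True
    return True
```

A caller would typically pair this with whatever routine appends to `.claude/coord-queue.jsonl` when the result is `True`.
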
**Catch-up (session start and after `/sync`):**
```bash
# Replay queued calls if coord-queue.jsonl exists and is non-empty
queue=.claude/coord-queue.jsonl
if [ -s "$queue" ]; then
  failed=0
  while IFS= read -r line; do
    method=$(printf '%s' "$line" | jq -r .method)
    path=$(printf '%s' "$line" | jq -r .path)
    body=$(printf '%s' "$line" | jq -c .body)
    # -f makes curl exit non-zero on HTTP errors so failures are detected
    curl -sf -o /dev/null -X "$method" "http://172.16.3.30:8001$path" \
      -H "Content-Type: application/json" -d "$body" || failed=1
  done < "$queue"
  # Remove the file only if all calls succeeded
  [ "$failed" -eq 0 ] && rm -f "$queue"
fi
```

The queue file lives in `.claude/coord-queue.jsonl` (gitignored, local to each workstation).

---

## API Softfail Behavior (Server Side)

When the MariaDB database is unavailable:
- Coord endpoints return `503 Service Unavailable` with header `Retry-After: 30`
- Response body: `{"detail": "Database unavailable. Retry after 30 seconds.", "retry_after": 30}`
- `GET /health` reflects DB status: `{"status":"degraded","database":"disconnected"}`

This behavior is implemented in the API server and does not need to be coded by agents.

---

## Migration Note

`projects/*/PROJECT_STATE.md` files are ARCHIVED and serve only as a read-only historical reference. Do not edit them. Use this API for all live coordination going forward.