claudetools/.claude/ARCHITECTURE_OVERVIEW.md

# MSP Mode Architecture Overview

**Version:** 1.0.0
**Last Updated:** 2026-01-16
**Status:** Design Phase

---

## Executive Summary

MSP Mode is a custom Claude Code implementation that tracks client work, maintains context across sessions and machines, and provides structured access to historical MSP data through an agent-based architecture.

**Core Principle:** All modes (MSP, Development, Normal) delegate data-heavy work to specialized agents to preserve the main Claude instance's context window.

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    User (Technician)                         │
│              Multiple Machines (Laptop, Desktop)             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────────────────────┐
│              Claude Code (Main Instance)                     │
│  • Conversation & User Interaction                          │
│  • Decision Making & Mode Management                        │
│  • Agent Orchestration                                      │
└────────────┬───────────────────────┬────────────────────────┘
             │                       │
             ↓                       ↓
┌────────────────────┐    ┌──────────────────────────────────┐
│  13 Specialized    │    │    REST API (FastAPI)            │
│     Agents         │────│    Jupiter Server                │
│  • Context Mgmt    │    │    https://msp-api.azcomputerguru │
│  • Data Processing │    └──────────┬───────────────────────┘
│  • Integration     │               │
└────────────────────┘               ↓
                           ┌──────────────────────┐
                           │  MariaDB Database    │
                           │  msp_tracking        │
                           │  36 Tables           │
                           └──────────────────────┘
```

---

## 13 Specialized Agents

### 1. Machine Detection Agent
**Launched:** Session start (FIRST - before all other agents)
**Purpose:** Identify current machine and load capabilities

**Tasks:**
- Execute `hostname`, `whoami`, detect platform
- Generate machine fingerprint (SHA256)
- Query machines table for existing record
- Load VPN access, Docker, PowerShell version, MCPs, Skills
- Update last_seen timestamp

**Returns:** Machine context (machine_id, capabilities, limitations)

**Context Saved:** ~97% (machine profile loaded, only key capabilities returned)

---

### 2. Environment Context Agent
**Launched:** Before making command suggestions or infrastructure operations
**Purpose:** Check environmental constraints to avoid known failures

**Tasks:**
- Query infrastructure environmental_notes
- Read environmental_insights for client/infrastructure
- Check failure_patterns for similar operations
- Validate command compatibility with environment
- Return constraints and recommendations

**Returns:** Environmental context + compatibility warnings

**Example:** "D2TESTNAS: Manual WINS install (no native service), ReadyNAS OS, SMB1 only"

**Context Saved:** ~96% (processes failure history, returns summary)

---

### 3. Context Recovery Agent
**Launched:** Session start (`/msp` command)
**Purpose:** Load relevant client context

**Tasks:**
- Query previous sessions (last 5)
- Retrieve open pending tasks
- Get recently used credentials
- Fetch infrastructure topology

**Returns:** Concise context summary (< 300 words)

**API Calls:** 4-5 parallel GET requests

**Context Saved:** ~95% (processes MB of data, returns summary)
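
The parallel lookups can be sketched with `asyncio.gather`. This is a minimal illustration, not the actual agent: only `/api/v1/sessions` appears elsewhere in this document, so the other paths and all query parameters are hypothetical, and `fetch` stands in for whatever HTTP client the agent ends up using.

```python
import asyncio
from typing import Awaitable, Callable, Dict
from urllib.parse import quote

# Hypothetical routes and parameters; the real ones belong to the API spec.
CONTEXT_ENDPOINTS: Dict[str, str] = {
    "sessions": "/api/v1/sessions?client={client}&limit=5",
    "pending_tasks": "/api/v1/pending-tasks?client={client}&status=open",
    "credentials": "/api/v1/credentials/recent?client={client}",
    "infrastructure": "/api/v1/infrastructure?client={client}",
}

Fetch = Callable[[str], Awaitable[object]]


async def load_client_context(fetch: Fetch, client: str) -> Dict[str, object]:
    """Issue every context request concurrently and key the results by name."""
    names = list(CONTEXT_ENDPOINTS)
    paths = [CONTEXT_ENDPOINTS[name].format(client=quote(client)) for name in names]
    # gather() preserves argument order, so results line up with names.
    results = await asyncio.gather(*(fetch(path) for path in paths))
    return dict(zip(names, results))
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise with a stub.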

---

### 4. Work Categorization Agent
**Launched:** Periodically during session or on-demand
**Purpose:** Analyze and categorize recent work

**Tasks:**
- Parse conversation transcript
- Extract commands, files, systems, technologies
- Detect category (infrastructure, troubleshooting, etc.)
- Generate dense description
- Auto-tag work items

**Returns:** Structured work_item object (JSON)

**Context Saved:** ~90% (processes conversation, returns structured data)

---

### 5. Session Summary Agent
**Launched:** Session end (`/msp end` or mode switch)
**Purpose:** Generate comprehensive session summary

**Tasks:**
- Analyze all work_items from session
- Calculate time allocation per category
- Generate dense markdown summary
- Structure data for API storage
- Create billable hours calculation

**Returns:** Summary + API-ready payload

**Context Saved:** ~92% (processes full session, returns summary)
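
As a rough sketch of the time-allocation and billable-hours steps, assuming each work item carries a category, a duration in minutes, and a billable flag (these field names are illustrative, not the final schema):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class WorkItem:
    category: str   # e.g. "infrastructure", "troubleshooting"
    minutes: int    # time spent on this item
    billable: bool  # defaults to true in MSP Mode, false elsewhere


def summarize_session(work_items: Iterable[WorkItem]) -> Dict[str, object]:
    """Aggregate work items into totals, billable hours, and percent per category."""
    minutes_by_category: Dict[str, int] = defaultdict(int)
    billable_minutes = 0
    for item in work_items:
        minutes_by_category[item.category] += item.minutes
        if item.billable:
            billable_minutes += item.minutes

    total = sum(minutes_by_category.values())
    allocation = (
        {cat: round(100 * mins / total, 1) for cat, mins in minutes_by_category.items()}
        if total
        else {}
    )
    return {
        "total_minutes": total,
        "billable_hours": round(billable_minutes / 60, 2),
        "allocation_pct": allocation,
    }
```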

---

### 6. Credential Retrieval Agent
**Launched:** When credential needed
**Purpose:** Securely retrieve and decrypt credentials

**Tasks:**
- Query credentials API
- Decrypt credential value
- Log access to audit trail
- Return only credential value

**Returns:** Single credential string

**API Calls:** 2 (retrieve + audit log)

**Context Saved:** ~98% (credential + minimal metadata)

---

### 7. Credential Storage Agent
**Launched:** When new credential discovered
**Purpose:** Encrypt and store credential securely

**Tasks:**
- Validate credential data
- Encrypt with AES-256-GCM
- Link to client/service/infrastructure
- Store via API
- Create audit log entry

**Returns:** credential_id confirmation

**Context Saved:** ~99% (only ID returned)

---

### 8. Historical Search Agent
**Launched:** On-demand (user asks about past work)
**Purpose:** Search and summarize historical sessions

**Tasks:**
- Query sessions database with filters
- Parse matching sessions
- Extract key outcomes
- Generate concise summary

**Returns:** Brief summary of findings

**Example:** "Found 3 backup sessions: [dates] - [outcomes]"

**Context Saved:** ~95% (processes potentially 100s of sessions)

---

### 9. Integration Workflow Agent
**Launched:** Multi-step integration requests
**Purpose:** Execute complex workflows with external tools

**Tasks:**
- Search external ticketing systems (SyncroMSP)
- Generate work summaries
- Update tickets with comments
- Pull reports from backup systems
- Attach files to tickets
- Track all integrations in database

**Returns:** Workflow completion summary

**API Calls:** 5-10+ external + internal calls

**Context Saved:** ~90% (handles large files, API responses)

---

### 10. Problem Pattern Matching Agent
**Launched:** When user describes an error/issue
**Purpose:** Find similar historical problems

**Tasks:**
- Parse error description
- Search problem_solutions table
- Extract relevant solutions
- Rank by similarity

**Returns:** Top 3 similar problems with solutions

**Context Saved:** ~94% (searches all problems, returns matches)
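
One simple way to rank by similarity is plain string matching from the standard library. The sketch below uses `difflib.SequenceMatcher` as a stand-in; the design does not specify a similarity measure, and the `description` key is an assumed field of `problem_solutions` rows.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def top_matches(error_description: str, known_problems: List[Dict[str, str]], limit: int = 3) -> List[Dict[str, str]]:
    """Return the `limit` known problems whose description best matches the error text."""
    query = error_description.lower()
    scored = [
        (SequenceMatcher(None, query, problem["description"].lower()).ratio(), problem)
        for problem in known_problems
    ]
    # Highest similarity first; sorted() is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [problem for _, problem in scored[:limit]]
```

A production version would more likely use full-text search or embeddings; the point here is only the shape of the step: score every candidate, sort, return the top three.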

---

### 11. Database Query Agent
**Launched:** Complex reporting or analytics requests
**Purpose:** Execute complex database queries

**Tasks:**
- Build SQL queries with filters/joins
- Execute query via API
- Process result set
- Generate summary statistics
- Format for presentation

**Returns:** Summary statistics + key findings

**Example:** "Dataforth - Q4 2025: 45 sessions, 120 hours, $12,000 billed"

**Context Saved:** ~93% (processes large result sets)

---

### 12. Failure Analysis Agent
**Launched:** When commands/operations fail, or periodically
**Purpose:** Learn from failures to prevent future mistakes

**Tasks:**
- Log all command/operation failures with full context
- Analyze failure patterns across sessions
- Identify environmental constraints
- Update infrastructure environmental_notes
- Generate/update environmental_insights
- Create actionable resolutions

**Returns:** Updated insights, environmental constraints

**Context Saved:** ~94% (analyzes failures, returns key learnings)

---

### 13. Integration Search Agent
**Launched:** Searching external systems
**Purpose:** Query SyncroMSP, MSP Backups, etc.

**Tasks:**
- Authenticate with external API
- Execute search query
- Parse results
- Summarize findings

**Returns:** Concise list of matches

**API Calls:** 1-3 external API calls

**Context Saved:** ~90% (handles API pagination, large response)

---

## Mode Behaviors

### MSP Mode (`/msp`)
**Purpose:** Track client work with comprehensive context

**Activation Flow:**
1. Machine Detection Agent identifies current machine
2. Environment Context Agent loads environmental constraints
3. Context Recovery Agent loads client history
4. Session created with machine_id, client_id, project_id
5. Real-time work tracking begins

**Auto-Tracking:**
- Work items categorized automatically
- Commands logged with failure tracking
- File changes tracked
- Problems and solutions captured
- Credentials accessed (audit logged)
- Infrastructure changes documented

**Billability:** Default true (client work)

**Session End:**
- Session Summary Agent generates dense summary
- Stores to database via API
- Optional: Link to external tickets (SyncroMSP)
- Optional: Log billable hours to PSA

---

### Development Mode (`/dev`)
**Purpose:** Track development projects (TBD)

**Differences from MSP:**
- Focus on code/features vs client issues
- Git integration
- Project-based (not client-based)
- Billability default: false

**Status:** To be fully defined

---

### Normal Mode (`/normal`)
**Purpose:** General work, research, learning

**Characteristics:**
- No client_id or project_id assignment
- Lighter tracking than MSP mode
- Captures decisions, findings, learnings
- Billability default: false

**Use Cases:**
- Research and exploration
- General questions
- Internal infrastructure work (non-client)
- Learning/experimentation
- Documentation

**Knowledge Retention:**
- Preserves context from previous modes
- Only clears client/project assignment
- Queryable knowledge base

---

## Storage Strategy

### SQL Database (MariaDB)
**Location:** Jupiter (172.16.3.20)
**Database:** `msp_tracking`
**Tables:** 36 total

**Rationale:**
- Structured queries ("show all work for Client X in January")
- Relational data (clients → projects → sessions → credentials)
- Fast indexing even with years of data
- No merge conflicts (single source of truth)
- Time tracking and billing calculations
- Report generation capabilities

**Categories:**
1. Core MSP Tracking (6 tables) - includes `machines`
2. Client & Infrastructure (7 tables)
3. Credentials & Security (4 tables)
4. Work Details (6 tables)
5. Failure Analysis & Insights (3 tables)
6. Tagging & Categorization (3 tables)
7. System & Audit (2 tables)
8. External Integrations (3 tables)
9. Junction Tables (2 tables)

**Estimated Storage:** 1-2 GB per year (compressed)

---

## Machine Detection System

### Auto-Detection on Session Start

**Fingerprint Generation:**
```javascript
fingerprint = SHA256(hostname + "|" + username + "|" + platform + "|" + home_directory)
// Example: SHA256("ACG-M-L5090|MikeSwanson|win32|C:\Users\MikeSwanson")
```
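
The same fingerprint can be sketched in Python, the language of the planned API. This is illustrative only: the function names are hypothetical, and it assumes the four fields are joined with `|` and encoded as UTF-8 before hashing, as the pseudocode above suggests.

```python
import getpass
import hashlib
import socket
import sys
from pathlib import Path


def machine_fingerprint(hostname: str, username: str, platform_name: str, home_directory: str) -> str:
    """Return the SHA-256 hex digest of the pipe-joined machine attributes."""
    raw = "|".join([hostname, username, platform_name, home_directory])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def current_machine_fingerprint() -> str:
    """Fingerprint the machine this code is running on (result varies per host)."""
    return machine_fingerprint(
        socket.gethostname(),
        getpass.getuser(),
        sys.platform,       # "win32", "darwin", or "linux"
        str(Path.home()),
    )
```

Because the digest depends on exact string values, any normalization (hostname case, trailing path separators) would need to be fixed before fingerprints are stored, or the same machine could register twice.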

**Capabilities Tracked:**
- VPN access (per client profiles)
- Docker availability
- PowerShell/shell version
- Available MCPs (claude-in-chrome, filesystem, etc.)
- Available Skills (pdf, commit, review-pr, etc.)
- OS-specific package managers
- Preferred shell (powershell, zsh, bash, cmd)

**Benefits:**
- Never suggest Docker commands on machines without Docker
- Never suggest VPN-required access from non-VPN machines
- Use version-compatible syntax for PowerShell/tools
- Check MCP/Skill availability before calling
- Track which sessions were done on which machines
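
A capability check before suggesting a command might look like the following sketch. The `MachineCapabilities` fields and the `blocked_reasons` helper are hypothetical names chosen for illustration; the real capability record comes from the `machines` table.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class MachineCapabilities:
    has_docker: bool = False
    vpn_profiles: FrozenSet[str] = frozenset()  # clients reachable over VPN
    mcps: FrozenSet[str] = frozenset()          # e.g. {"filesystem"}


def blocked_reasons(
    caps: MachineCapabilities,
    needs_docker: bool = False,
    needs_vpn: Optional[str] = None,
    needs_mcp: Optional[str] = None,
) -> List[str]:
    """Return the reasons a suggestion should be withheld; an empty list means it is safe."""
    reasons: List[str] = []
    if needs_docker and not caps.has_docker:
        reasons.append("Docker is not available on this machine")
    if needs_vpn is not None and needs_vpn not in caps.vpn_profiles:
        reasons.append(f"no VPN profile for {needs_vpn}")
    if needs_mcp is not None and needs_mcp not in caps.mcps:
        reasons.append(f"MCP {needs_mcp} is not installed")
    return reasons
```

Returning reasons rather than a bare boolean lets the main instance explain why a suggestion was withheld.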

---

## OS-Specific Command Selection

### Platform Detection
**Machine Detection Agent provides:**
- `platform`: "win32", "darwin", "linux"
- `preferred_shell`: "powershell", "zsh", "bash", "cmd"
- `package_manager_commands`: {"install": "choco install {pkg}", ...}

### Command Mapping Examples

| Task | Windows | macOS | Linux |
|------|---------|-------|-------|
| List files | `Get-ChildItem` | `ls -la` | `ls -la` |
| Process list | `Get-Process` | `ps aux` | `ps aux` |
| IP config | `ipconfig` | `ifconfig` | `ip addr` |
| Package install | `choco install` | `brew install` | `apt install` |
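
The table above maps naturally onto a lookup keyed by task and platform. A minimal sketch follows; the task keys and the `command_for` helper are illustrative names, and the `{pkg}` placeholder follows the `package_manager_commands` example.

```python
from typing import Dict

# Task -> platform -> command template, mirroring the mapping table.
COMMANDS: Dict[str, Dict[str, str]] = {
    "list_files": {"win32": "Get-ChildItem", "darwin": "ls -la", "linux": "ls -la"},
    "process_list": {"win32": "Get-Process", "darwin": "ps aux", "linux": "ps aux"},
    "ip_config": {"win32": "ipconfig", "darwin": "ifconfig", "linux": "ip addr"},
    "package_install": {
        "win32": "choco install {pkg}",
        "darwin": "brew install {pkg}",
        "linux": "apt install {pkg}",
    },
}


def command_for(task: str, platform: str, **params: str) -> str:
    """Look up the command template for a task on a platform and fill in its parameters."""
    try:
        template = COMMANDS[task][platform]
    except KeyError as exc:
        raise ValueError(f"no command mapping for task={task!r} on platform={platform!r}") from exc
    return template.format(**params)
```

For example, `command_for("package_install", "darwin", pkg="jq")` yields `brew install jq`. Note that `apt` is only one of several Linux package managers, so a real mapping would key on the detected package manager rather than on the platform alone.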

**Benefits:**
- Fewer cross-platform errors
- Suggested commands match the current platform
- Shell syntax matches current environment
- Package manager suggestions platform-appropriate

---

## Failure Logging & Learning System

### Self-Improving Architecture

**Workflow:**
1. Environment Context Agent pre-checks constraints before a command is suggested
2. Command executes on infrastructure
3. If failure occurs: Detailed logging to `commands_run`
4. Failure Analysis Agent identifies patterns
5. Creates `failure_patterns` entry
6. Updates `environmental_insights`
7. Future suggestions avoid this failure

**Example Learning Cycle:**
```
Problem: Suggested "Get-LocalUser" on Server 2008
Failure: Command not recognized (PowerShell 2.0 only)

Logged:
- commands_run: success=false, error_message, failure_category
- failure_patterns: "PS 5.1+ cmdlets on Server 2008" → use WMI
- environmental_insights: "Server 2008: PowerShell 2.0 limitations"
- infrastructure.environmental_notes: updated

Future Behavior:
- Environment Context Agent checks before suggesting
- Main Claude suggests WMI alternatives automatically
- Never repeats this mistake
```
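
The pre-check in this cycle can be sketched as a lookup against stored patterns. The `FailurePattern` fields and prefix-matching rule below are assumptions made for illustration; the real `failure_patterns` schema may match on something richer than a command prefix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailurePattern:
    environment: str      # e.g. "Server 2008"
    command_prefix: str   # e.g. "Get-LocalUser"
    recommendation: str   # e.g. "use WMI"


def check_command(command: str, environment: str, patterns: List[FailurePattern]) -> List[str]:
    """Return recommendations from known failure patterns that match this command and environment."""
    normalized = command.strip().lower()
    return [
        pattern.recommendation
        for pattern in patterns
        if pattern.environment == environment
        and normalized.startswith(pattern.command_prefix.lower())
    ]
```

An empty result means no known failure applies, so the command can be suggested as-is.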

**Database Tables:**
- `commands_run` - Every command with success/failure
- `operation_failures` - Non-command failures
- `failure_patterns` - Aggregated patterns
- `environmental_insights` - Generated insights per infrastructure

**Benefits:**
- Self-improving system (each failure makes it smarter)
- Reduced user friction (no repeated corrections)
- Institutional knowledge capture
- Proactive problem prevention

---

## Technology Stack

### API Framework: FastAPI (Python)
**Rationale:**
- Async performance for concurrent requests
- Auto-generated OpenAPI/Swagger docs
- Type safety with Pydantic models
- SQLAlchemy ORM for complex queries
- Built-in background tasks
- Industry-standard testing (pytest)
- Alembic for database migrations

### Authentication: JWT Tokens
**Rationale:**
- Stateless (signature and expiry are validated without a DB lookup; only the revocation list needs a check)
- Claims-based (permissions, scopes, expiration)
- Refresh token pattern for long-term access
- Multiple clients/machines supported
- Short-lived tokens minimize compromise risk

**Token Types:**
- Access Token: 1 hour expiration
- Refresh Token: 30 days expiration
- Agent Tokens: Session-scoped, auto-issued
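
To make the expiry handling concrete, here is a standard-library sketch of HS256 token issuing and verification. It deliberately does not use python-jose (the library named for deployment) and is not its API; the claim layout, helper names, and choice of HS256 are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional

ACCESS_TTL = 60 * 60             # 1 hour
REFRESH_TTL = 30 * 24 * 60 * 60  # 30 days


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(subject: str, scopes: List[str], ttl_seconds: int, secret: bytes, now: Optional[float] = None) -> str:
    """Build a signed JWT with sub, scope, iat, and exp claims."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subject, "scope": " ".join(scopes), "iat": issued_at, "exp": issued_at + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8")) for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check signature and expiry; return the claims or raise ValueError."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

Verification needs only the shared secret and the clock, which is what makes access tokens stateless.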

### Configuration Storage: Gitea (Private Repo)
**Rationale:**
- Multi-machine sync
- Version controlled
- Single source of truth
- Token rotation = one commit, all machines sync
- Encrypted token values (git-crypt)

**Repo:** `azcomputerguru/msp-config`

**File Structure:**
```
msp-api-config.json
├── api_url (https://msp-api.azcomputerguru.com)
├── refresh_token (encrypted)
└── database_schema_version (for migration tracking)
```

### Deployment: Docker Container
**Container:** `msp-api`
**Server:** Jupiter (172.16.3.20)

**Components:**
- FastAPI application (Python 3.11+)
- SQLAlchemy + Alembic (ORM and migrations)
- JWT auth library (python-jose)
- Pydantic validation
- Gunicorn process manager with Uvicorn workers (ASGI)
- Health checks endpoint
- Mounted logs: `/var/log/msp-api/`

**Reverse Proxy:** Nginx with Let's Encrypt SSL

---

## External Integrations (Future)

### Planned Integrations

**SyncroMSP (PSA/RMM):**
- Ticket search and linking
- Auto-post session summaries
- Time tracking synchronization

**MSP Backups:**
- Pull backup status reports
- Check backup failures
- Export statistics

**Zapier:**
- Webhook triggers
- Bi-directional automation
- Multi-step workflows

**Future:**
- Autotask, ConnectWise (PSA)
- Datto RMM
- IT Glue (Documentation)
- Microsoft Teams (notifications)

### Integration Architecture

**Database Tables:**
- `external_integrations` - Track all integration actions
- `integration_credentials` - OAuth/API keys (encrypted)
- `ticket_links` - Session-to-ticket relationships

**Agent:** Integration Workflow Agent handles multi-step workflows

**Example Workflow:**
```
User: "Update Dataforth ticket with today's work and attach backup report"

Integration Workflow Agent:
1. Search SyncroMSP for ticket
2. Generate work summary from session
3. Update ticket with comment
4. Pull backup report from MSP Backups
5. Attach report to ticket
6. Log all actions to database

Returns: "✓ Updated ticket #12345, attached report"
```

---

## Security Architecture

### Encryption
- **Credentials:** AES-256-GCM at rest
- **Transport:** HTTPS only (TLS 1.2+)
- **Tokens:** Encrypted in Gitea config
- **Key Management:** Environment variable or vault

### Authentication
- JWT-based with scopes (msp:read, msp:write, msp:admin)
- Token rotation supported
- Revocation list for compromised tokens
- Agent-specific tokens (session-scoped)

### Audit Logging
- All credential access → `credential_audit_log`
- All API requests → `api_audit_log`
- All agent actions logged with parent session
- User ID, IP address, timestamp recorded

### Input Validation
- Pydantic models validate all inputs
- SQL injection prevention (SQLAlchemy ORM)
- Rate limiting (100 req/min, stricter for credentials)
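
The 100 req/min limit could be enforced in-process with a sliding window, sketched below. This is one possible approach and an assumption of this example; the design also lists rate limiting at the Nginx layer, and the class name and per-key scheme here are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A stricter limiter for credential endpoints would be a second instance with a lower `limit`. State lives in one process, so multiple workers would need a shared store instead.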

---

## Agent Communication Pattern

```
User: "Show me all work for Dataforth in January"
  ↓
Main Claude: Understands request, validates parameters
  ↓
Launches Database Query Agent: "Query Dataforth sessions in January 2026"
  ↓
Agent:
  - Queries API: GET /api/v1/sessions?client=Dataforth&date_from=2026-01-01
  - Processes 15 sessions
  - Extracts key info: dates, categories, billable hours, outcomes
  - Generates concise summary
  ↓
Agent Returns:
  "Dataforth - January 2026:
   15 sessions, 38.5 billable hours
   Main projects: DOS machines (8 sessions), Network migration (5), M365 (2)
   Categories: Infrastructure (60%), Troubleshooting (25%), Config (15%)
   Key outcomes: Completed UPDATE.BAT v2.0, migrated DNS to UDM"
  ↓
Main Claude: Presents summary to user, ready for follow-up questions
```

**Context Saved:** The agent processed 500+ rows of data; the main Claude instance received only a 200-word summary.
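
The agent's two mechanical steps, building the request and condensing the rows, can be sketched as follows. The query parameters mirror the example request above; the `billable_hours` field and both helper names are assumptions for illustration.

```python
from typing import Dict, List
from urllib.parse import urlencode


def sessions_url(base_url: str, client: str, date_from: str) -> str:
    """Build the sessions query URL with properly encoded parameters."""
    query = urlencode({"client": client, "date_from": date_from})
    return f"{base_url}/api/v1/sessions?{query}"


def summarize_sessions(client: str, label: str, sessions: List[Dict[str, float]]) -> str:
    """Reduce session rows to the one-line headline returned to the main instance."""
    hours = sum(session["billable_hours"] for session in sessions)
    return f"{client} - {label}: {len(sessions)} sessions, {hours:g} billable hours"
```

Only the string returned by `summarize_sessions` crosses back into the main context; the raw rows stay inside the agent.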

---

## Infrastructure Design

### Jupiter Server Components

**Docker Container:** `msp-api`
- FastAPI application
- SQLAlchemy + Alembic
- JWT authentication
- Gunicorn with Uvicorn workers
- Health checks
- Prometheus metrics (optional)

**MariaDB Database:** `msp_tracking`
- Connection pooling (SQLAlchemy)
- Automated backups (critical MSP data)
- Schema versioned with Alembic
- 36 tables, indexed for performance

**Nginx Reverse Proxy:**
- HTTPS with Let's Encrypt
- Rate limiting
- Access logs
- Serves `msp-api.azcomputerguru.com` and proxies requests to the `msp-api` container

---

## Local Machine Structure

```
D:\ClaudeTools\
├── .claude/
│   ├── commands/
│   │   ├── msp.md (MSP Mode slash command)
│   │   ├── dev.md (Development Mode)
│   │   └── normal.md (Normal Mode)
│   ├── msp-api-config.json (synced from Gitea)
│   ├── API_SPEC.md (API specification for this system)
│   └── ARCHITECTURE_OVERVIEW.md (you are here)
├── MSP-MODE-SPEC.md (master specification)
└── .git/ (synced to Gitea)
```

---

## Benefits Summary

### Context Preservation
- Main Claude stays focused on conversation
- Agents handle data processing (90-99% context saved)
- User gets concise results without context pollution

### Scalability
- Multiple agents run in parallel
- Each agent has full context window for its task
- Complex operations don't consume main context
- Designed for team expansion (multiple technicians)

### Information Density
- Agents process raw data, return summaries
- Dense storage format (more info, fewer words)
- Queryable historical knowledge base
- Cross-session and cross-machine context

### Self-Improvement
- Every failure logged and analyzed
- Environmental constraints learned automatically
- Suggestions become smarter over time
- Never repeat the same mistake

### User Experience
- Auto-categorization (minimal user input)
- Machine-aware suggestions (capability-based)
- Platform-specific commands (no cross-platform errors)
- Proactive warnings about limitations
- Seamless multi-machine operation

---

## Implementation Status

- [OK] Architecture designed
- [OK] Database schema (36 tables)
- [OK] Agent types defined (13 agents)
- [OK] API endpoints specified
- ⏳ FastAPI implementation
- ⏳ Database deployment on Jupiter
- ⏳ JWT authentication flow
- ⏳ Agent token system
- ⏳ Machine detection implementation
- ⏳ MSP Mode slash command
- ⏳ External integrations

---

## Design Principles

1. **Agent-Based Execution** - Preserve main context at all costs
2. **Information Density** - Brief but complete data capture
3. **Self-Improvement** - Learn from every failure
4. **Multi-Machine Support** - Seamless cross-device operation
5. **Security First** - Encrypted credentials, audit logging
6. **Scalability** - Designed for team growth
7. **Separation of Concerns** - Main instance = conversation, Agents = data

---

## Next Steps

1. Deploy MariaDB schema on Jupiter
2. Implement FastAPI endpoints
3. Build JWT authentication system
4. Create agent token mechanism
5. Implement Machine Detection Agent
6. Build MSP Mode slash command
7. Test agent coordination patterns
8. Deploy to production (msp-api.azcomputerguru.com)

---

## Version History

**v1.0.0 (2026-01-16):**
- Initial architecture documentation
- 13 specialized agents defined
- Machine detection system
- OS-specific command selection
- Failure logging and learning system
- External integrations design
- Complete technology stack