MSP Mode Architecture Overview
Version: 1.0.0
Last Updated: 2026-01-16
Status: Design Phase
Executive Summary
MSP Mode is a custom Claude Code implementation that tracks client work, maintains context across sessions and machines, and provides structured access to historical MSP data through an agent-based architecture.
Core Principle: All modes (MSP, Development, Normal) use specialized agents to preserve the main Claude instance's context space.
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                      User (Technician)                      │
│             Multiple Machines (Laptop, Desktop)             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────────────────────┐
│                 Claude Code (Main Instance)                 │
│  • Conversation & User Interaction                          │
│  • Decision Making & Mode Management                        │
│  • Agent Orchestration                                      │
└────────────┬───────────────────────┬────────────────────────┘
             │                       │
             ↓                       ↓
┌────────────────────┐    ┌──────────────────────────────────┐
│  13 Specialized    │    │  REST API (FastAPI)              │
│  Agents            │────│  Jupiter Server                  │
│  • Context Mgmt    │    │  https://msp-api.azcomputerguru  │
│  • Data Processing │    └──────────┬───────────────────────┘
│  • Integration     │               │
└────────────────────┘               ↓
                          ┌──────────────────────┐
                          │  MariaDB Database    │
                          │  msp_tracking        │
                          │  36 Tables           │
                          └──────────────────────┘
13 Specialized Agents
1. Machine Detection Agent
Launched: Session start (FIRST - before all other agents)
Purpose: Identify current machine and load capabilities
Tasks:
- Execute hostname, whoami, detect platform
- Generate machine fingerprint (SHA256)
- Query machines table for existing record
- Load VPN access, Docker, PowerShell version, MCPs, Skills
- Update last_seen timestamp
Returns: Machine context (machine_id, capabilities, limitations)
Context Saved: ~97% (machine profile loaded, only key capabilities returned)
2. Environment Context Agent
Launched: Before making command suggestions or infrastructure operations
Purpose: Check environmental constraints to avoid known failures
Tasks:
- Query infrastructure environmental_notes
- Read environmental_insights for client/infrastructure
- Check failure_patterns for similar operations
- Validate command compatibility with environment
- Return constraints and recommendations
Returns: Environmental context + compatibility warnings
Example: "D2TESTNAS: Manual WINS install (no native service), ReadyNAS OS, SMB1 only"
Context Saved: ~96% (processes failure history, returns summary)
3. Context Recovery Agent
Launched: Session start (/msp command)
Purpose: Load relevant client context
Tasks:
- Query previous sessions (last 5)
- Retrieve open pending tasks
- Get recently used credentials
- Fetch infrastructure topology
Returns: Concise context summary (< 300 words)
API Calls: 4-5 parallel GET requests
Context Saved: ~95% (processes MB of data, returns summary)
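The parallel GET requests could be issued roughly as follows. This is a minimal sketch under stated assumptions, not the agent's actual implementation: `fetch` is a hypothetical stand-in that returns canned data instead of calling the REST API, and the resource names are illustrative labels for the four tasks listed above.

```python
import asyncio


# Hypothetical stand-in for one HTTP GET against the MSP API.
# A real agent would perform a network request here.
async def fetch(resource: str, client: str) -> dict:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"resource": resource, "client": client, "items": []}


async def load_client_context(client: str) -> dict:
    # Assumed resource names, one per task in the list above.
    resources = ["sessions", "pending_tasks", "credentials", "infrastructure"]
    # Start every request concurrently and wait for all of them to finish.
    results = await asyncio.gather(*(fetch(r, client) for r in resources))
    return {r["resource"]: r["items"] for r in results}


context = asyncio.run(load_client_context("Dataforth"))
print(sorted(context))
```

Because the requests are independent, gathering them keeps the total wait close to the slowest single call rather than the sum of all calls.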
4. Work Categorization Agent
Launched: Periodically during session or on-demand
Purpose: Analyze and categorize recent work
Tasks:
- Parse conversation transcript
- Extract commands, files, systems, technologies
- Detect category (infrastructure, troubleshooting, etc.)
- Generate dense description
- Auto-tag work items
Returns: Structured work_item object (JSON)
Context Saved: ~90% (processes conversation, returns structured data)
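One possible shape for the structured work_item object is sketched below. The field names are assumptions derived from the task list (commands, systems, tags, category, description); the actual schema is defined by the API specification, not by this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WorkItem:
    # Hypothetical fields; the real work_item schema may differ.
    category: str
    description: str
    commands: list[str] = field(default_factory=list)
    systems: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to the JSON form that would be returned to the main instance.
        return json.dumps(asdict(self), sort_keys=True)


item = WorkItem(
    category="infrastructure",
    description="Migrated DNS to UDM",
    systems=["UDM"],
    tags=["dns", "migration"],
)
payload = json.loads(item.to_json())
print(payload["category"], payload["tags"])
```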
5. Session Summary Agent
Launched: Session end (/msp end or mode switch)
Purpose: Generate comprehensive session summary
Tasks:
- Analyze all work_items from session
- Calculate time allocation per category
- Generate dense markdown summary
- Structure data for API storage
- Create billable hours calculation
Returns: Summary + API-ready payload
Context Saved: ~92% (processes full session, returns summary)
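The time allocation and billable hours steps can be illustrated with a short aggregation. This is a sketch only: the `minutes` and `billable` keys are hypothetical, and items are assumed billable unless marked otherwise, mirroring the MSP Mode default.

```python
from collections import defaultdict


def summarize_time(work_items: list[dict]) -> dict:
    """Aggregate minutes per category and total billable hours."""
    minutes_by_category: dict[str, int] = defaultdict(int)
    billable_minutes = 0
    for item in work_items:
        minutes_by_category[item["category"]] += item["minutes"]
        # Assumption: work is billable unless explicitly flagged otherwise.
        if item.get("billable", True):
            billable_minutes += item["minutes"]
    total = sum(minutes_by_category.values())
    allocation = (
        {c: round(100 * m / total) for c, m in minutes_by_category.items()}
        if total
        else {}
    )
    return {
        "allocation_pct": allocation,
        "billable_hours": round(billable_minutes / 60, 2),
    }


items = [
    {"category": "infrastructure", "minutes": 90},
    {"category": "troubleshooting", "minutes": 30},
]
summary = summarize_time(items)
print(summary)
```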
6. Credential Retrieval Agent
Launched: When credential needed
Purpose: Securely retrieve and decrypt credentials
Tasks:
- Query credentials API
- Decrypt credential value
- Log access to audit trail
- Return only credential value
Returns: Single credential string
API Calls: 2 (retrieve + audit log)
Context Saved: ~98% (credential + minimal metadata)
7. Credential Storage Agent
Launched: When new credential discovered
Purpose: Encrypt and store credential securely
Tasks:
- Validate credential data
- Encrypt with AES-256-GCM
- Link to client/service/infrastructure
- Store via API
- Create audit log entry
Returns: credential_id confirmation
Context Saved: ~99% (only ID returned)
8. Historical Search Agent
Launched: On-demand (user asks about past work)
Purpose: Search and summarize historical sessions
Tasks:
- Query sessions database with filters
- Parse matching sessions
- Extract key outcomes
- Generate concise summary
Returns: Brief summary of findings
Example: "Found 3 backup sessions: [dates] - [outcomes]"
Context Saved: ~95% (processes potentially hundreds of sessions)
9. Integration Workflow Agent
Launched: Multi-step integration requests
Purpose: Execute complex workflows with external tools
Tasks:
- Search external ticketing systems (SyncroMSP)
- Generate work summaries
- Update tickets with comments
- Pull reports from backup systems
- Attach files to tickets
- Track all integrations in database
Returns: Workflow completion summary
API Calls: 5-10+ external + internal calls
Context Saved: ~90% (handles large files, API responses)
10. Problem Pattern Matching Agent
Launched: When user describes an error/issue
Purpose: Find similar historical problems
Tasks:
- Parse error description
- Search problem_solutions table
- Extract relevant solutions
- Rank by similarity
Returns: Top 3 similar problems with solutions
Context Saved: ~94% (searches all problems, returns matches)
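The "rank by similarity" step is not specified further in this document. As one simple possibility, the sketch below ranks stored problems by text similarity using the standard library's `difflib`; this is a swapped-in technique chosen for illustration, and the `description` and `solution` keys are hypothetical.

```python
from difflib import SequenceMatcher


def top_matches(description: str, known_problems: list[dict], limit: int = 3) -> list[dict]:
    """Return the `limit` stored problems most similar to the description."""

    def score(problem: dict) -> float:
        # Ratio in [0, 1]; 1.0 means the two strings are identical.
        return SequenceMatcher(
            None, description.lower(), problem["description"].lower()
        ).ratio()

    return sorted(known_problems, key=score, reverse=True)[:limit]


problems = [
    {"description": "Backup job fails with access denied", "solution": "Fix share ACL"},
    {"description": "DNS resolution fails after migration", "solution": "Update forwarders"},
    {"description": "Printer offline after reboot", "solution": "Restart spooler"},
    {"description": "VPN disconnects every hour", "solution": "Adjust rekey interval"},
]
matches = top_matches("DNS resolution fails after migration", problems)
print([m["description"] for m in matches])
```

A production version would more likely use full-text or embedding search in the database; plain string similarity is only meant to show the ranking and top-3 cut-off.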
11. Database Query Agent
Launched: Complex reporting or analytics requests
Purpose: Execute complex database queries
Tasks:
- Build SQL queries with filters/joins
- Execute query via API
- Process result set
- Generate summary statistics
- Format for presentation
Returns: Summary statistics + key findings
Example: "Dataforth - Q4 2025: 45 sessions, 120 hours, $12,000 billed"
Context Saved: ~93% (processes large result sets)
12. Failure Analysis Agent
Launched: When commands/operations fail, or periodically
Purpose: Learn from failures to prevent future mistakes
Tasks:
- Log all command/operation failures with full context
- Analyze failure patterns across sessions
- Identify environmental constraints
- Update infrastructure environmental_notes
- Generate/update environmental_insights
- Create actionable resolutions
Returns: Updated insights, environmental constraints
Context Saved: ~94% (analyzes failures, returns key learnings)
13. Integration Search Agent
Launched: Searching external systems
Purpose: Query SyncroMSP, MSP Backups, etc.
Tasks:
- Authenticate with external API
- Execute search query
- Parse results
- Summarize findings
Returns: Concise list of matches
API Calls: 1-3 external API calls
Context Saved: ~90% (handles API pagination, large responses)
Mode Behaviors
MSP Mode (/msp)
Purpose: Track client work with comprehensive context
Activation Flow:
- Machine Detection Agent identifies current machine
- Environment Context Agent loads environmental constraints
- Context Recovery Agent loads client history
- Session created with machine_id, client_id, project_id
- Real-time work tracking begins
Auto-Tracking:
- Work items categorized automatically
- Commands logged with failure tracking
- File changes tracked
- Problems and solutions captured
- Credentials accessed (audit logged)
- Infrastructure changes documented
Billability: Default true (client work)
Session End:
- Session Summary Agent generates dense summary
- Stores to database via API
- Optional: Link to external tickets (SyncroMSP)
- Optional: Log billable hours to PSA
Development Mode (/dev)
Purpose: Track development projects (TBD)
Differences from MSP:
- Focus on code/features vs client issues
- Git integration
- Project-based (not client-based)
- Billability default: false
Status: To be fully defined
Normal Mode (/normal)
Purpose: General work, research, learning
Characteristics:
- No client_id or project_id assignment
- Lighter tracking than MSP mode
- Captures decisions, findings, learnings
- Billability default: false
Use Cases:
- Research and exploration
- General questions
- Internal infrastructure work (non-client)
- Learning/experimentation
- Documentation
Knowledge Retention:
- Preserves context from previous modes
- Only clears client/project assignment
- Queryable knowledge base
Storage Strategy
SQL Database (MariaDB)
Location: Jupiter (172.16.3.20)
Database: msp_tracking
Tables: 36 total
Rationale:
- Structured queries ("show all work for Client X in January")
- Relational data (clients → projects → sessions → credentials)
- Fast indexing even with years of data
- No merge conflicts (single source of truth)
- Time tracking and billing calculations
- Report generation capabilities
Categories:
- Core MSP Tracking (6 tables) - includes machines
- Client & Infrastructure (7 tables)
- Credentials & Security (4 tables)
- Work Details (6 tables)
- Failure Analysis & Insights (3 tables)
- Tagging & Categorization (3 tables)
- System & Audit (2 tables)
- External Integrations (3 tables)
- Junction Tables (2 tables)
Estimated Storage: 1-2 GB per year (compressed)
Machine Detection System
Auto-Detection on Session Start
Fingerprint Generation:
fingerprint = SHA256(hostname + "|" + username + "|" + platform + "|" + home_directory)
// Example: SHA256("ACG-M-L5090|MikeSwanson|win32|C:\Users\MikeSwanson")
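The fingerprint formula above translates directly into Python. The sketch below is one way to write it; the function names are hypothetical, and the choice of `socket.gethostname()`, `getpass.getuser()`, `sys.platform`, and `Path.home()` as data sources is an assumption about how the four inputs would be collected.

```python
import getpass
import hashlib
import socket
import sys
from pathlib import Path


def machine_fingerprint(hostname: str, username: str, platform_name: str, home_directory: str) -> str:
    """SHA256 over the four pipe-separated identity fields, as a hex string."""
    raw = "|".join([hostname, username, platform_name, home_directory])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def current_machine_fingerprint() -> str:
    # sys.platform yields values such as "win32", "darwin", or "linux".
    return machine_fingerprint(
        socket.gethostname(),
        getpass.getuser(),
        sys.platform,
        str(Path.home()),
    )


# Same inputs as the example above; the result is a 64-character hex digest.
fp = machine_fingerprint("ACG-M-L5090", "MikeSwanson", "win32", "C:\\Users\\MikeSwanson")
print(len(fp))
```

The fingerprint is deterministic for a given machine and user, so it can serve as the lookup key when querying the machines table.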
Capabilities Tracked:
- VPN access (per client profiles)
- Docker availability
- PowerShell/shell version
- Available MCPs (claude-in-chrome, filesystem, etc.)
- Available Skills (pdf, commit, review-pr, etc.)
- OS-specific package managers
- Preferred shell (powershell, zsh, bash, cmd)
Benefits:
- Never suggest Docker commands on machines without Docker
- Never suggest VPN-required access from non-VPN machines
- Use version-compatible syntax for PowerShell/tools
- Check MCP/Skill availability before calling
- Track which sessions were done on which machines
OS-Specific Command Selection
Platform Detection
Machine Detection Agent provides:
- platform: "win32", "darwin", "linux"
- preferred_shell: "powershell", "zsh", "bash", "cmd"
- package_manager_commands: {"install": "choco install {pkg}", ...}
Command Mapping Examples
| Task | Windows | macOS | Linux |
|---|---|---|---|
| List files | Get-ChildItem | ls -la | ls -la |
| Process list | Get-Process | ps aux | ps aux |
| IP config | ipconfig | ifconfig | ip addr |
| Package install | choco install | brew install | apt install |
Benefits:
- No cross-platform errors
- Commands always work on current platform
- Shell syntax matches current environment
- Package manager suggestions platform-appropriate
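The mapping table can be represented as a plain lookup keyed by task and platform. This is a minimal sketch: the task keys and the `command_for` helper are hypothetical names, and the commands are taken from the table above.

```python
# Commands from the mapping table, keyed by task and then by platform.
COMMANDS = {
    "list_files": {"win32": "Get-ChildItem", "darwin": "ls -la", "linux": "ls -la"},
    "process_list": {"win32": "Get-Process", "darwin": "ps aux", "linux": "ps aux"},
    "ip_config": {"win32": "ipconfig", "darwin": "ifconfig", "linux": "ip addr"},
    "package_install": {
        "win32": "choco install {pkg}",
        "darwin": "brew install {pkg}",
        "linux": "apt install {pkg}",
    },
}


def command_for(task: str, platform: str, **params: str) -> str:
    """Look up the platform-specific command, filling in any placeholders."""
    try:
        template = COMMANDS[task][platform]
    except KeyError:
        raise ValueError(f"no command for task={task!r} on platform={platform!r}") from None
    return template.format(**params)


print(command_for("ip_config", "linux"))
print(command_for("package_install", "darwin", pkg="jq"))
```

Raising on an unknown task or platform, rather than falling back to a default, keeps the agent from suggesting a command that was never mapped for the current machine.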
Failure Logging & Learning System
Self-Improving Architecture
Workflow:
- Command executes on infrastructure
- Environment Context Agent pre-checked constraints
- If failure occurs: Detailed logging to commands_run
- Failure Analysis Agent identifies patterns
- Creates failure_patterns entry
- Updates environmental_insights
- Future suggestions avoid this failure
Example Learning Cycle:
Problem: Suggested "Get-LocalUser" on Server 2008
Failure: Command not recognized (PowerShell 2.0 only)
Logged:
- commands_run: success=false, error_message, failure_category
- failure_patterns: "PowerShell 5.1+ cmdlets on Server 2008" → use WMI
- environmental_insights: "Server 2008: PowerShell 2.0 limitations"
- infrastructure.environmental_notes: updated
Future Behavior:
- Environment Context Agent checks before suggesting
- Main Claude suggests WMI alternatives automatically
- Never repeats this mistake
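The pre-check performed by the Environment Context Agent might look like the sketch below. It assumes a simplified in-memory view of failure_patterns; the `FailurePattern` fields and the prefix-matching rule are hypothetical and only illustrate the idea of consulting recorded failures before suggesting a command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePattern:
    # Hypothetical, simplified projection of a failure_patterns row.
    environment: str
    command_prefix: str
    recommendation: str


PATTERNS = [
    FailurePattern(
        environment="Server 2008",
        command_prefix="Get-LocalUser",
        recommendation="Use WMI instead (PowerShell 2.0 only)",
    ),
]


def check_command(environment: str, command: str, patterns=PATTERNS) -> list[str]:
    """Return recommendations for any known failure matching this command."""
    return [
        p.recommendation
        for p in patterns
        if p.environment == environment and command.startswith(p.command_prefix)
    ]


print(check_command("Server 2008", "Get-LocalUser -Name admin"))
print(check_command("Server 2022", "Get-LocalUser -Name admin"))
```

An empty result means no recorded failure applies, so the command can be suggested as-is.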
Database Tables:
- commands_run - Every command with success/failure
- operation_failures - Non-command failures
- failure_patterns - Aggregated patterns
- environmental_insights - Generated insights per infrastructure
Benefits:
- Self-improving system (each failure makes it smarter)
- Reduced user friction (no repeated corrections)
- Institutional knowledge capture
- Proactive problem prevention
Technology Stack
API Framework: FastAPI (Python)
Rationale:
- Async performance for concurrent requests
- Auto-generated OpenAPI/Swagger docs
- Type safety with Pydantic models
- SQLAlchemy ORM for complex queries
- Built-in background tasks
- Industry-standard testing (pytest)
- Alembic for database migrations
Authentication: JWT Tokens
Rationale:
- Stateless (no DB lookup to validate)
- Claims-based (permissions, scopes, expiration)
- Refresh token pattern for long-term access
- Multiple clients/machines supported
- Short-lived tokens minimize compromise risk
Token Types:
- Access Token: 1 hour expiration
- Refresh Token: 30 days expiration
- Agent Tokens: Session-scoped, auto-issued
Configuration Storage: Gitea (Private Repo)
Rationale:
- Multi-machine sync
- Version controlled
- Single source of truth
- Token rotation = one commit, all machines sync
- Encrypted token values (git-crypt)
Repo: azcomputerguru/msp-config
File Structure:
msp-api-config.json
├── api_url (https://msp-api.azcomputerguru.com)
├── refresh_token (encrypted)
└── database_schema_version (for migration tracking)
Deployment: Docker Container
Container: msp-api
Server: Jupiter (172.16.3.20)
Components:
- FastAPI application (Python 3.11+)
- SQLAlchemy + Alembic (ORM and migrations)
- JWT auth library (python-jose)
- Pydantic validation
- Gunicorn/Uvicorn ASGI server
- Health checks endpoint
- Mounted logs: /var/log/msp-api/
Reverse Proxy: Nginx with Let's Encrypt SSL
External Integrations (Future)
Planned Integrations
SyncroMSP (PSA/RMM):
- Ticket search and linking
- Auto-post session summaries
- Time tracking synchronization
MSP Backups:
- Pull backup status reports
- Check backup failures
- Export statistics
Zapier:
- Webhook triggers
- Bi-directional automation
- Multi-step workflows
Future:
- Autotask, ConnectWise (PSA)
- Datto RMM
- IT Glue (Documentation)
- Microsoft Teams (notifications)
Integration Architecture
Database Tables:
- external_integrations - Track all integration actions
- integration_credentials - OAuth/API keys (encrypted)
- ticket_links - Session-to-ticket relationships
Agent: Integration Workflow Agent handles multi-step workflows
Example Workflow:
User: "Update Dataforth ticket with today's work and attach backup report"
Integration Workflow Agent:
1. Search SyncroMSP for ticket
2. Generate work summary from session
3. Update ticket with comment
4. Pull backup report from MSP Backups
5. Attach report to ticket
6. Log all actions to database
Returns: "[OK] Updated ticket #12345, attached report"
Security Architecture
Encryption
- Credentials: AES-256-GCM at rest
- Transport: HTTPS only (TLS 1.2+)
- Tokens: Encrypted in Gitea config
- Key Management: Environment variable or vault
Authentication
- JWT-based with scopes (msp:read, msp:write, msp:admin)
- Token rotation supported
- Revocation list for compromised tokens
- Agent-specific tokens (session-scoped)
Audit Logging
- All credential access → credential_audit_log
- All API requests → api_audit_log
- All agent actions logged with parent session
- User ID, IP address, timestamp recorded
Input Validation
- Pydantic models validate all inputs
- SQL injection prevention (SQLAlchemy ORM)
- Rate limiting (100 req/min, stricter for credentials)
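The rate limit of 100 requests per minute could be enforced with a sliding-window counter such as the one sketched below. The document does not say which algorithm or layer (Nginx or the application) applies the limit, so the class name, the per-key design, and the explicit `now` parameter are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# A stricter limiter for credential endpoints could reuse the same class.
limiter = SlidingWindowLimiter(limit=2, window_seconds=60)
decisions = [limiter.allow("tech-1", t) for t in (0, 1, 2, 60, 60.5)]
print(decisions)
```

Passing the timestamp in explicitly keeps the limiter deterministic and easy to test; a caller would normally supply `time.monotonic()`.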
Agent Communication Pattern
User: "Show me all work for Dataforth in January"
↓
Main Claude: Understands request, validates parameters
↓
Launches Database Query Agent: "Query Dataforth sessions in January 2026"
↓
Agent:
- Queries API: GET /api/v1/sessions?client=Dataforth&date_from=2026-01-01
- Processes 15 sessions
- Extracts key info: dates, categories, billable hours, outcomes
- Generates concise summary
↓
Agent Returns:
"Dataforth - January 2026:
15 sessions, 38.5 billable hours
Main projects: DOS machines (8 sessions), Network migration (5), M365 (2)
Categories: Infrastructure (60%), Troubleshooting (25%), Config (15%)
Key outcomes: Completed UPDATE.BAT v2.0, migrated DNS to UDM"
↓
Main Claude: Presents summary to user, ready for follow-up questions
Context Saved: Agent processed 500+ rows of data, main Claude only received 200-word summary.
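The agent's side of this exchange, building the query and condensing the result, can be sketched as follows. The helper names and the `billable_hours` field are hypothetical, and the session rows are passed in directly rather than fetched, so the example runs without a live API.

```python
from urllib.parse import urlencode


def sessions_url(base_url: str, client: str, date_from: str) -> str:
    """Build the GET URL for the sessions query shown above."""
    query = urlencode({"client": client, "date_from": date_from})
    return f"{base_url}/api/v1/sessions?{query}"


def summarize_sessions(client: str, sessions: list[dict]) -> str:
    """Reduce raw session rows to the one-line summary returned to the main instance."""
    hours = sum(s["billable_hours"] for s in sessions)
    return f"{client}: {len(sessions)} sessions, {hours:g} billable hours"


url = sessions_url("https://msp-api.azcomputerguru.com", "Dataforth", "2026-01-01")
summary_line = summarize_sessions(
    "Dataforth",
    [{"billable_hours": 2.5}, {"billable_hours": 1.0}],
)
print(url)
print(summary_line)
```

Only the summary string crosses back to the main instance, which is how the raw rows stay out of its context.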
Infrastructure Design
Jupiter Server Components
Docker Container: msp-api
- FastAPI application
- SQLAlchemy + Alembic
- JWT authentication
- Gunicorn/Uvicorn
- Health checks
- Prometheus metrics (optional)
MariaDB Database: msp_tracking
- Connection pooling (SQLAlchemy)
- Automated backups (critical MSP data)
- Schema versioned with Alembic
- 36 tables, indexed for performance
Nginx Reverse Proxy:
- HTTPS with Let's Encrypt
- Rate limiting
- Access logs
- Serves msp-api.azcomputerguru.com and proxies requests to the msp-api container
Local Machine Structure
D:\ClaudeTools\
├── .claude/
│ ├── commands/
│ │ ├── msp.md (MSP Mode slash command)
│ │ ├── dev.md (Development Mode)
│ │ └── normal.md (Normal Mode)
│ ├── msp-api-config.json (synced from Gitea)
│ ├── API_SPEC.md (this system)
│ └── ARCHITECTURE_OVERVIEW.md (you are here)
├── MSP-MODE-SPEC.md (master specification)
└── .git/ (synced to Gitea)
Benefits Summary
Context Preservation
- Main Claude stays focused on conversation
- Agents handle data processing (90-99% context saved)
- User gets concise results without context pollution
Scalability
- Multiple agents run in parallel
- Each agent has full context window for its task
- Complex operations don't consume main context
- Designed for team expansion (multiple technicians)
Information Density
- Agents process raw data, return summaries
- Dense storage format (more info, fewer words)
- Queryable historical knowledge base
- Cross-session and cross-machine context
Self-Improvement
- Every failure logged and analyzed
- Environmental constraints learned automatically
- Suggestions become smarter over time
- Never repeat the same mistake
User Experience
- Auto-categorization (minimal user input)
- Machine-aware suggestions (capability-based)
- Platform-specific commands (no cross-platform errors)
- Proactive warnings about limitations
- Seamless multi-machine operation
Implementation Status
- [OK] Architecture designed
- [OK] Database schema (36 tables)
- [OK] Agent types defined (13 agents)
- [OK] API endpoints specified
- [PENDING] FastAPI implementation
- [PENDING] Database deployment on Jupiter
- [PENDING] JWT authentication flow
- [PENDING] Agent token system
- [PENDING] Machine detection implementation
- [PENDING] MSP Mode slash command
- [PENDING] External integrations
Design Principles
- Agent-Based Execution - Preserve main context at all costs
- Information Density - Brief but complete data capture
- Self-Improvement - Learn from every failure
- Multi-Machine Support - Seamless cross-device operation
- Security First - Encrypted credentials, audit logging
- Scalability - Designed for team growth
- Separation of Concerns - Main instance = conversation, Agents = data
Next Steps
- Deploy MariaDB schema on Jupiter
- Implement FastAPI endpoints
- Build JWT authentication system
- Create agent token mechanism
- Implement Machine Detection Agent
- Build MSP Mode slash command
- Test agent coordination patterns
- Deploy to production (msp-api.azcomputerguru.com)
Version History
v1.0.0 (2026-01-16):
- Initial architecture documentation
- 13 specialized agents defined
- Machine detection system
- OS-specific command selection
- Failure logging and learning system
- External integrations design
- Complete technology stack