Files

Mike Swanson 390b10b32c Complete Phase 6: MSP Work Tracking with Context Recall System

Implements production-ready MSP platform with cross-machine persistent memory for Claude.

API Implementation:
- 130 REST API endpoints across 21 entities
- JWT authentication on all endpoints
- AES-256-GCM encryption for credentials
- Automatic audit logging
- Complete OpenAPI documentation

Database:
- 43 tables in MariaDB (172.16.3.20:3306)
- 42 SQLAlchemy models with modern 2.0 syntax
- Full Alembic migration system
- 99.1% CRUD test pass rate

Context Recall System (Phase 6):
- Cross-machine persistent memory via database
- Automatic context injection via Claude Code hooks
- Automatic context saving after task completion
- 90-95% token reduction with compression utilities
- Relevance scoring with time decay
- Tag-based semantic search
- One-command setup script

Security Features:
- JWT tokens with Argon2 password hashing
- AES-256-GCM encryption for all sensitive data
- Comprehensive audit trail for credentials
- HMAC tamper detection
- Secure configuration management

Test Results:
- Phase 3: 38/38 CRUD tests passing (100%)
- Phase 4: 34/35 core API tests passing (97.1%)
- Phase 5: 62/62 extended API tests passing (100%)
- Phase 6: 10/10 compression tests passing (100%)
- Overall: 144/145 tests passing (99.3%)

Documentation:
- Comprehensive architecture guides
- Setup automation scripts
- API documentation at /api/docs
- Complete test reports
- Troubleshooting guides

Project Status: 95% Complete (Production-Ready)
Phase 7 (optional work context APIs) remains for future enhancement.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-17 06:00:26 -07:00

22 KiB

Raw Blame History

MSP Mode Architecture Overview

Version: 1.0.0 Last Updated: 2026-01-16 Status: Design Phase

Executive Summary

MSP Mode is a custom Claude Code implementation that tracks client work, maintains context across sessions and machines, and provides structured access to historical MSP data through an agent-based architecture.

Core Principle: All modes (MSP, Development, Normal) use specialized agents to preserve main Claude instance context space.

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    User (Technician)                         │
│              Multiple Machines (Laptop, Desktop)             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ↓
┌─────────────────────────────────────────────────────────────┐
│              Claude Code (Main Instance)                     │
│  • Conversation & User Interaction                          │
│  • Decision Making & Mode Management                        │
│  • Agent Orchestration                                      │
└────────────┬───────────────────────┬────────────────────────┘
             │                       │
             ↓                       ↓
┌────────────────────┐    ┌──────────────────────────────────┐
│  13 Specialized    │    │    REST API (FastAPI)            │
│     Agents         │────│    Jupiter Server                │
│  • Context Mgmt    │    │    https://msp-api.azcomputerguru │
│  • Data Processing │    └──────────┬───────────────────────┘
│  • Integration     │               │
└────────────────────┘               ↓
                           ┌──────────────────────┐
                           │  MariaDB Database    │
                           │  msp_tracking        │
                           │  36 Tables           │
                           └──────────────────────┘

13 Specialized Agents

1. Machine Detection Agent

Launched: Session start (FIRST - before all other agents) Purpose: Identify current machine and load capabilities

Tasks:

Execute hostname, whoami, detect platform
Generate machine fingerprint (SHA256)
Query machines table for existing record
Load VPN access, Docker, PowerShell version, MCPs, Skills
Update last_seen timestamp

Returns: Machine context (machine_id, capabilities, limitations)

Context Saved: ~97% (machine profile loaded, only key capabilities returned)

2. Environment Context Agent

Launched: Before making command suggestions or infrastructure operations Purpose: Check environmental constraints to avoid known failures

Tasks:

Query infrastructure environmental_notes
Read environmental_insights for client/infrastructure
Check failure_patterns for similar operations
Validate command compatibility with environment
Return constraints and recommendations

Returns: Environmental context + compatibility warnings

Example: "D2TESTNAS: Manual WINS install (no native service), ReadyNAS OS, SMB1 only"

Context Saved: ~96% (processes failure history, returns summary)

3. Context Recovery Agent

Launched: Session start (/msp command) Purpose: Load relevant client context

Tasks:

Query previous sessions (last 5)
Retrieve open pending tasks
Get recently used credentials
Fetch infrastructure topology

Returns: Concise context summary (< 300 words)

API Calls: 4-5 parallel GET requests

Context Saved: ~95% (processes MB of data, returns summary)

4. Work Categorization Agent

Launched: Periodically during session or on-demand Purpose: Analyze and categorize recent work

Tasks:

Parse conversation transcript
Extract commands, files, systems, technologies
Detect category (infrastructure, troubleshooting, etc.)
Generate dense description
Auto-tag work items

Returns: Structured work_item object (JSON)

Context Saved: ~90% (processes conversation, returns structured data)

5. Session Summary Agent

Launched: Session end (/msp end or mode switch) Purpose: Generate comprehensive session summary

Tasks:

Analyze all work_items from session
Calculate time allocation per category
Generate dense markdown summary
Structure data for API storage
Create billable hours calculation

Returns: Summary + API-ready payload

Context Saved: ~92% (processes full session, returns summary)

6. Credential Retrieval Agent

Launched: When credential needed Purpose: Securely retrieve and decrypt credentials

Tasks:

Query credentials API
Decrypt credential value
Log access to audit trail
Return only credential value

Returns: Single credential string

API Calls: 2 (retrieve + audit log)

Context Saved: ~98% (credential + minimal metadata)

7. Credential Storage Agent

Launched: When new credential discovered Purpose: Encrypt and store credential securely

Tasks:

Validate credential data
Encrypt with AES-256-GCM
Link to client/service/infrastructure
Store via API
Create audit log entry

Returns: credential_id confirmation

Context Saved: ~99% (only ID returned)

8. Historical Search Agent

Launched: On-demand (user asks about past work) Purpose: Search and summarize historical sessions

Tasks:

Query sessions database with filters
Parse matching sessions
Extract key outcomes
Generate concise summary

Returns: Brief summary of findings

Example: "Found 3 backup sessions: [dates] - [outcomes]"

Context Saved: ~95% (processes potentially 100s of sessions)

9. Integration Workflow Agent

Launched: Multi-step integration requests Purpose: Execute complex workflows with external tools

Tasks:

Search external ticketing systems (SyncroMSP)
Generate work summaries
Update tickets with comments
Pull reports from backup systems
Attach files to tickets
Track all integrations in database

Returns: Workflow completion summary

API Calls: 5-10+ external + internal calls

Context Saved: ~90% (handles large files, API responses)

10. Problem Pattern Matching Agent

Launched: When user describes an error/issue Purpose: Find similar historical problems

Tasks:

Parse error description
Search problem_solutions table
Extract relevant solutions
Rank by similarity

Returns: Top 3 similar problems with solutions

Context Saved: ~94% (searches all problems, returns matches)

11. Database Query Agent

Launched: Complex reporting or analytics requests Purpose: Execute complex database queries

Tasks:

Build SQL queries with filters/joins
Execute query via API
Process result set
Generate summary statistics
Format for presentation

Returns: Summary statistics + key findings

Example: "Dataforth - Q4 2025: 45 sessions, 120 hours, $12,000 billed"

Context Saved: ~93% (processes large result sets)

12. Failure Analysis Agent

Launched: When commands/operations fail, or periodically Purpose: Learn from failures to prevent future mistakes

Tasks:

Log all command/operation failures with full context
Analyze failure patterns across sessions
Identify environmental constraints
Update infrastructure environmental_notes
Generate/update environmental_insights
Create actionable resolutions

Returns: Updated insights, environmental constraints

Context Saved: ~94% (analyzes failures, returns key learnings)

13. Integration Search Agent

Launched: Searching external systems Purpose: Query SyncroMSP, MSP Backups, etc.

Tasks:

Authenticate with external API
Execute search query
Parse results
Summarize findings

Returns: Concise list of matches

API Calls: 1-3 external API calls

Context Saved: ~90% (handles API pagination, large response)

Mode Behaviors

MSP Mode (`/msp`)

Purpose: Track client work with comprehensive context

Activation Flow:

Machine Detection Agent identifies current machine
Environment Context Agent loads environmental constraints
Context Recovery Agent loads client history
Session created with machine_id, client_id, project_id
Real-time work tracking begins

Auto-Tracking:

Work items categorized automatically
Commands logged with failure tracking
File changes tracked
Problems and solutions captured
Credentials accessed (audit logged)
Infrastructure changes documented

Billability: Default true (client work)

Session End:

Session Summary Agent generates dense summary
Stores to database via API
Optional: Link to external tickets (SyncroMSP)
Optional: Log billable hours to PSA

Development Mode (`/dev`)

Purpose: Track development projects (TBD)

Differences from MSP:

Focus on code/features vs client issues
Git integration
Project-based (not client-based)
Billability default: false

Status: To be fully defined

Normal Mode (`/normal`)

Purpose: General work, research, learning

Characteristics:

No client_id or project_id assignment
Lighter tracking than MSP mode
Captures decisions, findings, learnings
Billability default: false

Use Cases:

Research and exploration
General questions
Internal infrastructure work (non-client)
Learning/experimentation
Documentation

Knowledge Retention:

Preserves context from previous modes
Only clears client/project assignment
Queryable knowledge base

Storage Strategy

SQL Database (MariaDB)

Location: Jupiter (172.16.3.20) Database: msp_tracking Tables: 36 total

Rationale:

Structured queries ("show all work for Client X in January")
Relational data (clients → projects → sessions → credentials)
Fast indexing even with years of data
No merge conflicts (single source of truth)
Time tracking and billing calculations
Report generation capabilities

Categories:

Core MSP Tracking (6 tables) - includes machines
Client & Infrastructure (7 tables)
Credentials & Security (4 tables)
Work Details (6 tables)
Failure Analysis & Insights (3 tables)
Tagging & Categorization (3 tables)
System & Audit (2 tables)
External Integrations (3 tables)
Junction Tables (2 tables)

Estimated Storage: 1-2 GB per year (compressed)

Machine Detection System

Auto-Detection on Session Start

Fingerprint Generation:

fingerprint = SHA256(hostname + "|" + username + "|" + platform + "|" + home_directory)
// Example: SHA256("ACG-M-L5090|MikeSwanson|win32|C:\Users\MikeSwanson")

Capabilities Tracked:

VPN access (per client profiles)
Docker availability
PowerShell/shell version
Available MCPs (claude-in-chrome, filesystem, etc.)
Available Skills (pdf, commit, review-pr, etc.)
OS-specific package managers
Preferred shell (powershell, zsh, bash, cmd)

Benefits:

Never suggest Docker commands on machines without Docker
Never suggest VPN-required access from non-VPN machines
Use version-compatible syntax for PowerShell/tools
Check MCP/Skill availability before calling
Track which sessions were done on which machines

OS-Specific Command Selection

Platform Detection

Machine Detection Agent provides:

platform: "win32", "darwin", "linux"
preferred_shell: "powershell", "zsh", "bash", "cmd"
package_manager_commands: {"install": "choco install {pkg}", ...}

Command Mapping Examples

Task	Windows	macOS	Linux
List files	`Get-ChildItem`	`ls -la`	`ls -la`
Process list	`Get-Process`	`ps aux`	`ps aux`
IP config	`ipconfig`	`ifconfig`	`ip addr`
Package install	`choco install`	`brew install`	`apt install`

Benefits:

No cross-platform errors
Commands always work on current platform
Shell syntax matches current environment
Package manager suggestions platform-appropriate

Failure Logging & Learning System

Self-Improving Architecture

Workflow:

Command executes on infrastructure
Environment Context Agent pre-checked constraints
If failure occurs: Detailed logging to commands_run
Failure Analysis Agent identifies patterns
Creates failure_patterns entry
Updates environmental_insights
Future suggestions avoid this failure

Example Learning Cycle:

Problem: Suggested "Get-LocalUser" on Server 2008
Failure: Command not recognized (PowerShell 2.0 only)

Logged:
- commands_run: success=false, error_message, failure_category
- failure_patterns: "PS7 cmdlets on Server 2008" → use WMI
- environmental_insights: "Server 2008: PowerShell 2.0 limitations"
- infrastructure.environmental_notes: updated

Future Behavior:
- Environment Context Agent checks before suggesting
- Main Claude suggests WMI alternatives automatically
- Never repeats this mistake

Database Tables:

commands_run - Every command with success/failure
operation_failures - Non-command failures
failure_patterns - Aggregated patterns
environmental_insights - Generated insights per infrastructure

Benefits:

Self-improving system (each failure makes it smarter)
Reduced user friction (no repeated corrections)
Institutional knowledge capture
Proactive problem prevention

Technology Stack

API Framework: FastAPI (Python)

Rationale:

Async performance for concurrent requests
Auto-generated OpenAPI/Swagger docs
Type safety with Pydantic models
SQLAlchemy ORM for complex queries
Built-in background tasks
Industry-standard testing (pytest)
Alembic for database migrations

Authentication: JWT Tokens

Rationale:

Stateless (no DB lookup to validate)
Claims-based (permissions, scopes, expiration)
Refresh token pattern for long-term access
Multiple clients/machines supported
Short-lived tokens minimize compromise risk

Token Types:

Access Token: 1 hour expiration
Refresh Token: 30 days expiration
Agent Tokens: Session-scoped, auto-issued

Configuration Storage: Gitea (Private Repo)

Rationale:

Multi-machine sync
Version controlled
Single source of truth
Token rotation = one commit, all machines sync
Encrypted token values (git-crypt)

Repo: azcomputerguru/msp-config

File Structure:

msp-api-config.json
├── api_url (https://msp-api.azcomputerguru.com)
├── refresh_token (encrypted)
└── database_schema_version (for migration tracking)

Deployment: Docker Container

Container: msp-api Server: Jupiter (172.16.3.20)

Components:

FastAPI application (Python 3.11+)
SQLAlchemy + Alembic (ORM and migrations)
JWT auth library (python-jose)
Pydantic validation
Gunicorn/Uvicorn ASGI server
Health checks endpoint
Mounted logs: /var/log/msp-api/

Reverse Proxy: Nginx with Let's Encrypt SSL

External Integrations (Future)

Planned Integrations

SyncroMSP (PSA/RMM):

Ticket search and linking
Auto-post session summaries
Time tracking synchronization

MSP Backups:

Pull backup status reports
Check backup failures
Export statistics

Zapier:

Webhook triggers
Bi-directional automation
Multi-step workflows

Future:

Autotask, ConnectWise (PSA)
Datto RMM
IT Glue (Documentation)
Microsoft Teams (notifications)

Integration Architecture

Database Tables:

external_integrations - Track all integration actions
integration_credentials - OAuth/API keys (encrypted)
ticket_links - Session-to-ticket relationships

Agent: Integration Workflow Agent handles multi-step workflows

Example Workflow:

User: "Update Dataforth ticket with today's work and attach backup report"

Integration Workflow Agent:
1. Search SyncroMSP for ticket
2. Generate work summary from session
3. Update ticket with comment
4. Pull backup report from MSP Backups
5. Attach report to ticket
6. Log all actions to database

Returns: "✓ Updated ticket #12345, attached report"

Security Architecture

Encryption

Credentials: AES-256-GCM at rest
Transport: HTTPS only (TLS 1.2+)
Tokens: Encrypted in Gitea config
Key Management: Environment variable or vault

Authentication

JWT-based with scopes (msp:read, msp:write, msp:admin)
Token rotation supported
Revocation list for compromised tokens
Agent-specific tokens (session-scoped)

Audit Logging

All credential access → credential_audit_log
All API requests → api_audit_log
All agent actions logged with parent session
User ID, IP address, timestamp recorded

Input Validation

Pydantic models validate all inputs
SQL injection prevention (SQLAlchemy ORM)
Rate limiting (100 req/min, stricter for credentials)

Agent Communication Pattern

User: "Show me all work for Dataforth in January"
  ↓
Main Claude: Understands request, validates parameters
  ↓
Launches Database Query Agent: "Query Dataforth sessions in January 2026"
  ↓
Agent:
  - Queries API: GET /api/v1/sessions?client=Dataforth&date_from=2026-01-01
  - Processes 15 sessions
  - Extracts key info: dates, categories, billable hours, outcomes
  - Generates concise summary
  ↓
Agent Returns:
  "Dataforth - January 2026:
   15 sessions, 38.5 billable hours
   Main projects: DOS machines (8 sessions), Network migration (5), M365 (2)
   Categories: Infrastructure (60%), Troubleshooting (25%), Config (15%)
   Key outcomes: Completed UPDATE.BAT v2.0, migrated DNS to UDM"
  ↓
Main Claude: Presents summary to user, ready for follow-up questions

Context Saved: Agent processed 500+ rows of data, main Claude only received 200-word summary.

Infrastructure Design

Jupiter Server Components

Docker Container: msp-api

FastAPI application
SQLAlchemy + Alembic
JWT authentication
Gunicorn/Uvicorn
Health checks
Prometheus metrics (optional)

MariaDB Database: msp_tracking

Connection pooling (SQLAlchemy)
Automated backups (critical MSP data)
Schema versioned with Alembic
36 tables, indexed for performance

Nginx Reverse Proxy:

HTTPS with Let's Encrypt
Rate limiting
Access logs
Proxies to: msp-api.azcomputerguru.com

Local Machine Structure

D:\ClaudeTools\
├── .claude/
│   ├── commands/
│   │   ├── msp.md (MSP Mode slash command)
│   │   ├── dev.md (Development Mode)
│   │   └── normal.md (Normal Mode)
│   ├── msp-api-config.json (synced from Gitea)
│   ├── API_SPEC.md (this system)
│   └── ARCHITECTURE_OVERVIEW.md (you are here)
├── MSP-MODE-SPEC.md (master specification)
└── .git/ (synced to Gitea)

Benefits Summary

Context Preservation

Main Claude stays focused on conversation
Agents handle data processing (90-99% context saved)
User gets concise results without context pollution

Scalability

Multiple agents run in parallel
Each agent has full context window for its task
Complex operations don't consume main context
Designed for team expansion (multiple technicians)

Information Density

Agents process raw data, return summaries
Dense storage format (more info, fewer words)
Queryable historical knowledge base
Cross-session and cross-machine context

Self-Improvement

Every failure logged and analyzed
Environmental constraints learned automatically
Suggestions become smarter over time
Never repeat the same mistake

User Experience

Auto-categorization (minimal user input)
Machine-aware suggestions (capability-based)
Platform-specific commands (no cross-platform errors)
Proactive warnings about limitations
Seamless multi-machine operation

Implementation Status

✅ Architecture designed
✅ Database schema (36 tables)
✅ Agent types defined (13 agents)
✅ API endpoints specified
⏳ FastAPI implementation
⏳ Database deployment on Jupiter
⏳ JWT authentication flow
⏳ Agent token system
⏳ Machine detection implementation
⏳ MSP Mode slash command
⏳ External integrations

Design Principles

Agent-Based Execution - Preserve main context at all costs
Information Density - Brief but complete data capture
Self-Improvement - Learn from every failure
Multi-Machine Support - Seamless cross-device operation
Security First - Encrypted credentials, audit logging
Scalability - Designed for team growth
Separation of Concerns - Main instance = conversation, Agents = data

Next Steps

Deploy MariaDB schema on Jupiter
Implement FastAPI endpoints
Build JWT authentication system
Create agent token mechanism
Implement Machine Detection Agent
Build MSP Mode slash command
Test agent coordination patterns
Deploy to production (msp-api.azcomputerguru.com)

Version History

v1.0.0 (2026-01-16):

Initial architecture documentation
13 specialized agents defined
Machine detection system
OS-specific command selection
Failure logging and learning system
External integrations design
Complete technology stack

22 KiB Raw Blame History

MSP Mode Architecture Overview

Executive Summary

High-Level Architecture

13 Specialized Agents

1. Machine Detection Agent

2. Environment Context Agent

3. Context Recovery Agent

4. Work Categorization Agent

5. Session Summary Agent

6. Credential Retrieval Agent

7. Credential Storage Agent

8. Historical Search Agent

9. Integration Workflow Agent

10. Problem Pattern Matching Agent

11. Database Query Agent

12. Failure Analysis Agent

13. Integration Search Agent

Mode Behaviors

MSP Mode (/msp)

Development Mode (/dev)

Normal Mode (/normal)

Storage Strategy

SQL Database (MariaDB)

Machine Detection System

Auto-Detection on Session Start

OS-Specific Command Selection

Platform Detection

Command Mapping Examples

Failure Logging & Learning System

Self-Improving Architecture

Technology Stack

API Framework: FastAPI (Python)

Authentication: JWT Tokens

Configuration Storage: Gitea (Private Repo)

Deployment: Docker Container

External Integrations (Future)

Planned Integrations

Integration Architecture

Security Architecture

Encryption

Authentication

Audit Logging

Input Validation

Agent Communication Pattern

Infrastructure Design

Jupiter Server Components

Local Machine Structure

Benefits Summary

Context Preservation

Scalability

Information Density

Self-Improvement

User Experience

Implementation Status

Design Principles

Next Steps

Version History

22 KiB

Raw Blame History

MSP Mode (`/msp`)

Development Mode (`/dev`)

Normal Mode (`/normal`)