Learning & Context Schema

MSP Mode Database Schema - Self-Learning System

Status: Designed 2026-01-15
Database: msp_tracking (MariaDB on Jupiter)

Overview

The Learning & Context subsystem enables MSP Mode to learn from every failure, build environmental awareness, and prevent recurring mistakes. This self-improving system captures failure patterns, generates actionable insights, and proactively checks environmental constraints before making suggestions.

Core Principle: Every failure is a learning opportunity. Agents must never make the same mistake twice.

Related Documentation:

MSP-MODE-SPEC.md - Full system specification
ARCHITECTURE_OVERVIEW.md - Agent architecture
SCHEMA_CREDENTIALS.md - Security tables
API_SPEC.md - API endpoints

Tables Summary

| Table | Purpose | Auto-Generated |
|-------|---------|----------------|
| `environmental_insights` | Generated insights per client/infrastructure | Yes |
| `problem_solutions` | Issue tracking with root cause and resolution | Partial |
| `failure_patterns` | Aggregated failure analysis and learnings | Yes |
| `operation_failures` | Non-command failures (API, file ops, network) | Yes |

Total: 4 tables

Specialized Agents:

Failure Analysis Agent - Analyzes failures, identifies patterns, generates insights
Environment Context Agent - Pre-checks environmental constraints before operations
Problem Pattern Matching Agent - Searches historical solutions for similar issues

Table Schemas

`environmental_insights`

Auto-generated insights about client infrastructure constraints, limitations, and quirks. Used by Environment Context Agent to prevent failures before they occur.

CREATE TABLE environmental_insights (
    id UUID PRIMARY KEY DEFAULT UUID(), -- UUID column type requires MariaDB 10.7+
    client_id UUID REFERENCES clients(id) ON DELETE CASCADE,
    infrastructure_id UUID REFERENCES infrastructure(id) ON DELETE CASCADE,

    -- Insight classification
    insight_category VARCHAR(100) NOT NULL CHECK(insight_category IN (
        'command_constraints', 'service_configuration', 'version_limitations',
        'custom_installations', 'network_constraints', 'permissions',
        'compatibility', 'performance', 'security'
    )),
    insight_title VARCHAR(500) NOT NULL,
    insight_description TEXT NOT NULL, -- markdown formatted

    -- Examples and documentation
    examples TEXT, -- JSON array of command/config examples
    affected_operations TEXT, -- JSON array: ["user_management", "service_restart"]

    -- Source and verification
    source_pattern_id UUID REFERENCES failure_patterns(id) ON DELETE SET NULL,
    confidence_level VARCHAR(20) CHECK(confidence_level IN ('confirmed', 'likely', 'suspected')),
    verification_count INTEGER DEFAULT 1, -- how many times verified
    last_verified TIMESTAMP,

    -- Priority (1-10, higher = more important to avoid)
    priority INTEGER DEFAULT 5 CHECK(priority BETWEEN 1 AND 10),

    -- Status
    is_active BOOLEAN DEFAULT true, -- false if pattern no longer applies
    superseded_by UUID REFERENCES environmental_insights(id), -- if replaced by better insight

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_insights_client (client_id),
    INDEX idx_insights_infrastructure (infrastructure_id),
    INDEX idx_insights_category (insight_category),
    INDEX idx_insights_priority (priority),
    INDEX idx_insights_active (is_active)
);
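
The CHECK constraints above can also be mirrored in application code, so that a malformed payload is rejected before it reaches the database. The following is a minimal Python sketch, assuming insights arrive as plain dictionaries; `validate_insight` is a hypothetical helper and not part of the documented API.

```python
# Hypothetical helper: mirrors the CHECK constraints of environmental_insights.
ALLOWED_CATEGORIES = {
    "command_constraints", "service_configuration", "version_limitations",
    "custom_installations", "network_constraints", "permissions",
    "compatibility", "performance", "security",
}
ALLOWED_CONFIDENCE = {"confirmed", "likely", "suspected"}


def validate_insight(insight: dict) -> list:
    """Return a list of constraint violations (empty if the payload is valid)."""
    errors = []
    if insight.get("insight_category") not in ALLOWED_CATEGORIES:
        errors.append("insight_category is missing or not an allowed value")
    for field in ("insight_title", "insight_description"):
        if not insight.get(field):
            errors.append(f"{field} is required (NOT NULL)")
    if len(insight.get("insight_title") or "") > 500:
        errors.append("insight_title exceeds VARCHAR(500)")
    # confidence_level is nullable, so it is only checked when present.
    confidence = insight.get("confidence_level")
    if confidence is not None and confidence not in ALLOWED_CONFIDENCE:
        errors.append("confidence_level is not an allowed value")
    # priority defaults to 5 in the schema when omitted.
    priority = insight.get("priority", 5)
    if not isinstance(priority, int) or not 1 <= priority <= 10:
        errors.append("priority must be an integer between 1 and 10")
    return errors
```

An empty list means the payload satisfies the constraints that the table itself would enforce; foreign keys are not checked here.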

Real-World Examples:

D2TESTNAS - Custom WINS Installation:

{
  "infrastructure_id": "d2testnas-uuid",
  "client_id": "dataforth-uuid",
  "insight_category": "custom_installations",
  "insight_title": "WINS Service: Manual Samba installation (no native ReadyNAS service)",
  "insight_description": "**Installation:** Manually installed via Samba nmbd, not a native ReadyNAS service.\n\n**Constraints:**\n- No GUI service manager for WINS\n- Cannot use standard service management commands\n- Configuration via `/etc/frontview/samba/smb.conf.overrides`\n\n**Correct commands:**\n- Check status: `ssh root@192.168.0.9 'ps aux | grep nmbd'`\n- View config: `ssh root@192.168.0.9 'cat /etc/frontview/samba/smb.conf.overrides | grep wins'`\n- Restart: `ssh root@192.168.0.9 'service nmbd restart'`",
  "examples": [
    "ps aux | grep nmbd",
    "cat /etc/frontview/samba/smb.conf.overrides | grep wins",
    "service nmbd restart"
  ],
  "affected_operations": ["service_management", "wins_configuration"],
  "confidence_level": "confirmed",
  "verification_count": 3,
  "priority": 9
}

AD2 - PowerShell Version Constraints:

{
  "infrastructure_id": "ad2-uuid",
  "client_id": "dataforth-uuid",
  "insight_category": "version_limitations",
  "insight_title": "Server 2022: PowerShell 5.1 command compatibility",
  "insight_description": "**PowerShell Version:** 5.1 (default)\n\n**Compatible:** Modern cmdlets work (Get-LocalUser, Get-LocalGroup)\n\n**Not available:** PowerShell 7 specific features\n\n**Remote execution:** Use Invoke-Command for remote operations",
  "examples": [
    "Get-LocalUser",
    "Get-LocalGroup",
    "Invoke-Command -ComputerName AD2 -ScriptBlock { Get-LocalUser }"
  ],
  "confidence_level": "confirmed",
  "verification_count": 5,
  "priority": 6
}

Server 2008 - PowerShell 2.0 Limitations:

{
  "infrastructure_id": "old-server-2008-uuid",
  "insight_category": "version_limitations",
  "insight_title": "Server 2008: PowerShell 2.0 command compatibility",
  "insight_description": "**PowerShell Version:** 2.0 only\n\n**Avoid:** Get-LocalUser, Get-LocalGroup, New-LocalUser (not available in PS 2.0)\n\n**Use instead:** Get-WmiObject Win32_UserAccount, Get-WmiObject Win32_Group\n\n**Why:** Server 2008 predates modern PowerShell user management cmdlets",
  "examples": [
    "Get-WmiObject Win32_UserAccount",
    "Get-WmiObject Win32_Group",
    "Get-WmiObject Win32_UserAccount -Filter \"Name='username'\""
  ],
  "affected_operations": ["user_management", "group_management"],
  "confidence_level": "confirmed",
  "verification_count": 5,
  "priority": 8
}
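
An agent could apply the two PowerShell insights above when choosing a command. The sketch below is illustrative only: the function names are hypothetical, and the 5.1 cutoff is taken from the examples in this document rather than from a version probe of a real host.

```python
def parse_ps_version(text: str) -> tuple:
    """Turn a version string such as '5.1.20348.2227' into (5, 1)."""
    # Pad with "0" so a bare major version like "7" still yields a minor part.
    major, minor, *_ = text.strip().split(".") + ["0"]
    return int(major), int(minor)


def local_user_listing_command(ps_version: tuple) -> str:
    """Pick a user-listing command that the given PowerShell version supports."""
    # Assumption from the insights above: Get-LocalUser needs PowerShell 5.1+.
    if ps_version >= (5, 1):
        return "Get-LocalUser"
    return "Get-WmiObject Win32_UserAccount"
```

For example, `local_user_listing_command(parse_ps_version("2.0"))` selects the WMI form recommended for Server 2008.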

DOS Machines (TS-XX) - Batch Syntax Constraints:

{
  "infrastructure_id": "ts-27-uuid",
  "client_id": "dataforth-uuid",
  "insight_category": "command_constraints",
  "insight_title": "MS-DOS 6.22: Batch file syntax limitations",
  "insight_description": "**OS:** MS-DOS 6.22\n\n**No support for:**\n- `IF /I` (case insensitive) - added in Windows 2000\n- Long filenames (8.3 format only)\n- Unicode or special characters\n- Modern batch features\n\n**Workarounds:**\n- Use duplicate IF statements for upper/lowercase\n- Keep filenames to 8.3 format\n- Use basic batch syntax only",
  "examples": [
    "IF \"%1\"==\"STATUS\" GOTO STATUS",
    "IF \"%1\"==\"status\" GOTO STATUS",
    "COPY FILE.TXT BACKUP.TXT"
  ],
  "affected_operations": ["batch_scripting", "file_operations"],
  "confidence_level": "confirmed",
  "verification_count": 8,
  "priority": 10
}
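
The two DOS workarounds above (8.3 filenames and duplicated IF statements) are mechanical enough to generate or check in code. This is a rough Python sketch with hypothetical helper names. The character set accepted for filenames is a deliberately narrow assumption, and the IF expansion covers only the all-uppercase and all-lowercase spellings, as the workaround describes.

```python
import re

# 1-8 name characters, optionally a dot and 1-3 extension characters.
# Assumption: only letters, digits, '_' and '-' are accepted here; real DOS
# allows a few more punctuation characters.
_NAME_83 = re.compile(r"[A-Z0-9_-]{1,8}(\.[A-Z0-9_-]{1,3})?", re.IGNORECASE)


def is_83_filename(name: str) -> bool:
    """True if name fits the 8.3 shape described in the insight above."""
    return _NAME_83.fullmatch(name) is not None


def case_insensitive_if(param: str, value: str, label: str) -> list:
    """Emit one IF line per casing, standing in for the unsupported IF /I."""
    return [
        f'IF "{param}"=="{value.upper()}" GOTO {label}',
        f'IF "{param}"=="{value.lower()}" GOTO {label}',
    ]
```

Calling `case_insensitive_if("%1", "status", "STATUS")` reproduces the two IF lines listed in the `examples` array.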

D2TESTNAS - SMB Protocol Constraints:

{
  "infrastructure_id": "d2testnas-uuid",
  "insight_category": "network_constraints",
  "insight_title": "ReadyNAS: SMB1/CORE protocol for DOS compatibility",
  "insight_description": "**Protocol:** CORE/SMB1 only (for DOS machine compatibility)\n\n**Implications:**\n- Modern SMB2/3 clients may need configuration\n- Use NetBIOS name, not IP address for DOS machines\n- Security risk: SMB1 deprecated due to vulnerabilities\n\n**Configuration:**\n- Set in `/etc/frontview/samba/smb.conf.overrides`\n- `min protocol = CORE`",
  "examples": [
    "NET USE Z: \\\\D2TESTNAS\\SHARE (from DOS)",
    "smbclient -L //192.168.0.9 -m SMB1"
  ],
  "confidence_level": "confirmed",
  "priority": 7
}

Generated insights.md Example:

When the Failure Analysis Agent runs, it generates a markdown file for each client:

# Environmental Insights: Dataforth

Auto-generated from failure patterns and verified operations.

## D2TESTNAS (192.168.0.9)

### Custom Installations

**WINS Service: Manual Samba installation**
- Manually installed via Samba nmbd, not native ReadyNAS service
- No GUI service manager for WINS
- Configure via `/etc/frontview/samba/smb.conf.overrides`
- Check status: `ssh root@192.168.0.9 'ps aux | grep nmbd'`

### Network Constraints

**SMB Protocol: CORE/SMB1 only**
- For DOS compatibility
- Modern SMB2/3 clients may need configuration
- Use NetBIOS name from DOS machines

## AD2 (192.168.0.6 - Server 2022)

### PowerShell Version

**Version:** PowerShell 5.1 (default)
- **Compatible:** Modern cmdlets work
- **Not available:** PowerShell 7 specific features

## TS-XX Machines (DOS 6.22)

### Command Constraints

**No support for:**
- `IF /I` (case insensitive) - use duplicate IF statements
- Long filenames (8.3 format only)
- Unicode or special characters
- Modern batch features

**Examples:**
```batch
REM Correct (DOS 6.22)
IF "%1"=="STATUS" GOTO STATUS
IF "%1"=="status" GOTO STATUS

REM Incorrect (requires Windows 2000+)
IF /I "%1"=="STATUS" GOTO STATUS
```

`problem_solutions`

Issue tracking with root cause analysis and resolution documentation. Searchable historical knowledge base.

CREATE TABLE problem_solutions (
    id UUID PRIMARY KEY DEFAULT UUID(), -- UUID column type requires MariaDB 10.7+
    work_item_id UUID NOT NULL REFERENCES work_items(id) ON DELETE CASCADE,
    session_id UUID NOT NULL REFERENCES sessions(id) ON DELETE CASCADE,
    client_id UUID REFERENCES clients(id) ON DELETE SET NULL,
    infrastructure_id UUID REFERENCES infrastructure(id) ON DELETE SET NULL,

    -- Problem description
    problem_title VARCHAR(500) NOT NULL,
    problem_description TEXT NOT NULL,
    symptom TEXT, -- what user/system exhibited
    error_message TEXT, -- exact error code/message
    error_code VARCHAR(100), -- structured error code

    -- Investigation
    investigation_steps TEXT, -- JSON array of diagnostic commands/actions
    diagnostic_output TEXT, -- key outputs that led to root cause
    investigation_duration_minutes INTEGER,

    -- Root cause
    root_cause TEXT NOT NULL,
    root_cause_category VARCHAR(100), -- "configuration", "hardware", "software", "network"

    -- Solution
    solution_applied TEXT NOT NULL,
    solution_category VARCHAR(100), -- "config_change", "restart", "replacement", "patch"
    commands_run TEXT, -- JSON array of commands used to fix
    files_modified TEXT, -- JSON array of config files changed

    -- Verification
    verification_method TEXT,
    verification_successful BOOLEAN DEFAULT true,
    verification_notes TEXT,

    -- Prevention and rollback
    rollback_plan TEXT,
    prevention_measures TEXT, -- what was done to prevent recurrence

    -- Pattern tracking
    recurrence_count INTEGER DEFAULT 1, -- if same problem reoccurs
    similar_problems TEXT, -- JSON array of related problem_solution IDs
    tags TEXT, -- JSON array: ["ssl", "apache", "certificate"]

    -- Resolution
    resolved_at TIMESTAMP,
    time_to_resolution_minutes INTEGER,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_problems_work_item (work_item_id),
    INDEX idx_problems_session (session_id),
    INDEX idx_problems_client (client_id),
    INDEX idx_problems_infrastructure (infrastructure_id),
    INDEX idx_problems_category (root_cause_category),
    FULLTEXT idx_problems_search (problem_description, symptom, error_message, root_cause)
);

Example Problem Solutions:

Apache SSL Certificate Expiration:

{
  "problem_title": "Apache SSL certificate expiration causing ERR_SSL_PROTOCOL_ERROR",
  "problem_description": "Website inaccessible via HTTPS. Browser shows ERR_SSL_PROTOCOL_ERROR.",
  "symptom": "Users unable to access website. SSL handshake failure.",
  "error_message": "ERR_SSL_PROTOCOL_ERROR",
  "investigation_steps": [
    "curl -I https://example.com",
    "openssl s_client -connect example.com:443",
    "systemctl status apache2",
    "openssl x509 -in /etc/ssl/certs/example.com.crt -text -noout"
  ],
  "diagnostic_output": "Certificate expiration: 2026-01-10 (3 days ago)",
  "root_cause": "SSL certificate expired on 2026-01-10. Certbot auto-renewal failed due to DNS validation issue.",
  "root_cause_category": "configuration",
  "solution_applied": "1. Fixed DNS TXT record for Let's Encrypt validation\n2. Ran: certbot renew --force-renewal\n3. Restarted Apache: systemctl restart apache2",
  "solution_category": "config_change",
  "commands_run": [
    "certbot renew --force-renewal",
    "systemctl restart apache2"
  ],
  "files_modified": [
    "/etc/apache2/sites-enabled/example.com.conf"
  ],
  "verification_method": "curl test successful. Browser loads HTTPS site without error.",
  "verification_successful": true,
  "prevention_measures": "Set up monitoring for certificate expiration (30 days warning). Fixed DNS automation for certbot.",
  "tags": ["ssl", "apache", "certificate", "certbot"],
  "time_to_resolution_minutes": 25
}

PowerShell Compatibility Issue:

{
  "problem_title": "Get-LocalUser fails on Server 2008 (PowerShell 2.0)",
  "problem_description": "Attempting to list local users on Server 2008 using Get-LocalUser cmdlet",
  "symptom": "Command not recognized error",
  "error_message": "Get-LocalUser : The term 'Get-LocalUser' is not recognized as the name of a cmdlet",
  "error_code": "CommandNotFoundException",
  "investigation_steps": [
    "$PSVersionTable",
    "Get-Command Get-LocalUser",
    "Get-WmiObject Win32_OperatingSystem | Select Caption, Version"
  ],
  "root_cause": "Server 2008 has PowerShell 2.0 only. Get-LocalUser introduced in PowerShell 5.1 (Windows 10/Server 2016).",
  "root_cause_category": "software",
  "solution_applied": "Use WMI instead: Get-WmiObject Win32_UserAccount",
  "solution_category": "alternative_approach",
  "commands_run": [
    "Get-WmiObject Win32_UserAccount | Select Name, Disabled, LocalAccount"
  ],
  "verification_method": "Successfully retrieved local user list",
  "verification_successful": true,
  "prevention_measures": "Created environmental insight for all Server 2008 machines. Environment Context Agent now checks PowerShell version before suggesting cmdlets.",
  "tags": ["powershell", "server_2008", "compatibility", "user_management"],
  "recurrence_count": 5
}

Queries:

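-- Find similar problems by error text
-- (MATCH must list the same columns as the FULLTEXT index idx_problems_search)
SELECT problem_title, solution_applied, created_at
FROM problem_solutions
WHERE MATCH(problem_description, symptom, error_message, root_cause) AGAINST('SSL_PROTOCOL_ERROR' IN BOOLEAN MODE)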
ORDER BY created_at DESC;

-- Most common problems (by recurrence)
SELECT problem_title, recurrence_count, root_cause_category
FROM problem_solutions
WHERE recurrence_count > 1
ORDER BY recurrence_count DESC;

-- Recent solutions for client
SELECT problem_title, solution_applied, resolved_at
FROM problem_solutions
WHERE client_id = 'dataforth-uuid'
ORDER BY resolved_at DESC
LIMIT 10;
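
Full-text search covers the free-text columns; the `tags` column (a JSON array stored as TEXT) offers a second signal for finding related problems. Below is a small Python sketch that ranks rows by tag overlap. Jaccard similarity and the `rank_by_tags` helper are illustrative assumptions, not something the schema prescribes.

```python
import json


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_tags(query_tags, rows, min_score=0.2):
    """Rank problem_solutions rows by tag overlap with query_tags.

    rows: iterable of dicts with 'problem_title' and 'tags' (JSON array as text).
    Returns (score, problem_title) pairs, best match first.
    """
    wanted = {t.lower() for t in query_tags}
    scored = []
    for row in rows:
        tags = {t.lower() for t in json.loads(row.get("tags") or "[]")}
        score = jaccard(wanted, tags)
        if score >= min_score:
            scored.append((score, row["problem_title"]))
    return sorted(scored, reverse=True)
```

The `min_score` cutoff of 0.2 is arbitrary and would need tuning against real data.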

`failure_patterns`

Aggregated failure insights learned from command/operation failures. Auto-generated by Failure Analysis Agent.

CREATE TABLE failure_patterns (
    id UUID PRIMARY KEY DEFAULT UUID(), -- UUID column type requires MariaDB 10.7+
    infrastructure_id UUID REFERENCES infrastructure(id) ON DELETE CASCADE,
    client_id UUID REFERENCES clients(id) ON DELETE CASCADE,

    -- Pattern identification
    pattern_type VARCHAR(100) NOT NULL CHECK(pattern_type IN (
        'command_compatibility', 'version_mismatch', 'permission_denied',
        'service_unavailable', 'configuration_error', 'environmental_limitation',
        'network_connectivity', 'authentication_failure', 'syntax_error'
    )),
    pattern_signature VARCHAR(500) NOT NULL, -- "Modern PowerShell cmdlets on Server 2008"
    error_pattern TEXT, -- regex or keywords: "Get-LocalUser.*not recognized"

    -- Context
    affected_systems TEXT, -- JSON array: ["all_server_2008", "D2TESTNAS"]
    affected_os_versions TEXT, -- JSON array: ["Server 2008", "DOS 6.22"]
    triggering_commands TEXT, -- JSON array of command patterns
    triggering_operations TEXT, -- JSON array of operation types

    -- Failure details
    failure_description TEXT NOT NULL,
    typical_error_messages TEXT, -- JSON array of common error texts

    -- Resolution
    root_cause TEXT NOT NULL, -- "Server 2008 only has PowerShell 2.0"
    recommended_solution TEXT NOT NULL, -- "Use Get-WmiObject instead of Get-LocalUser"
    alternative_approaches TEXT, -- JSON array of alternatives
    workaround_commands TEXT, -- JSON array of working commands

    -- Metadata
    occurrence_count INTEGER DEFAULT 1, -- how many times seen
    first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    severity VARCHAR(20) CHECK(severity IN ('blocking', 'major', 'minor', 'info')),

    -- Status
    is_active BOOLEAN DEFAULT true, -- false if pattern no longer applies (e.g., server upgraded)
    added_to_insights BOOLEAN DEFAULT false, -- environmental_insight generated

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_failure_infrastructure (infrastructure_id),
    INDEX idx_failure_client (client_id),
    INDEX idx_failure_pattern_type (pattern_type),
    INDEX idx_failure_signature (pattern_signature),
    INDEX idx_failure_active (is_active),
    INDEX idx_failure_severity (severity)
);

Example Failure Patterns:

PowerShell Version Incompatibility:

{
  "pattern_type": "command_compatibility",
  "pattern_signature": "Modern PowerShell cmdlets on Server 2008",
  "error_pattern": "(Get-LocalUser|Get-LocalGroup|New-LocalUser).*not recognized",
  "affected_systems": ["all_server_2008_machines"],
  "affected_os_versions": ["Server 2008", "Server 2008 R2"],
  "triggering_commands": [
    "Get-LocalUser",
    "Get-LocalGroup",
    "New-LocalUser",
    "Remove-LocalUser"
  ],
  "failure_description": "Modern PowerShell user management cmdlets fail on Server 2008 with 'not recognized' error",
  "typical_error_messages": [
    "Get-LocalUser : The term 'Get-LocalUser' is not recognized",
    "Get-LocalGroup : The term 'Get-LocalGroup' is not recognized"
  ],
  "root_cause": "Server 2008 has PowerShell 2.0 only. Modern user management cmdlets (Get-LocalUser, etc.) were introduced in PowerShell 5.1 (Windows 10/Server 2016).",
  "recommended_solution": "Use WMI for user/group management: Get-WmiObject Win32_UserAccount, Get-WmiObject Win32_Group",
  "alternative_approaches": [
    "Use Get-WmiObject Win32_UserAccount",
    "Use net user command",
    "Upgrade to PowerShell 5.1 (if possible on Server 2008 R2)"
  ],
  "workaround_commands": [
    "Get-WmiObject Win32_UserAccount",
    "Get-WmiObject Win32_Group",
    "net user"
  ],
  "occurrence_count": 5,
  "severity": "major",
  "added_to_insights": true
}

DOS Batch Syntax Limitation:

{
  "pattern_type": "environmental_limitation",
  "pattern_signature": "Modern batch syntax on MS-DOS 6.22",
  "error_pattern": "IF /I.*Invalid switch",
  "affected_systems": ["all_dos_machines"],
  "affected_os_versions": ["MS-DOS 6.22"],
  "triggering_commands": [
    "IF /I \"%1\"==\"value\" ...",
    "Long filenames with spaces"
  ],
  "failure_description": "Modern batch file syntax not supported in MS-DOS 6.22",
  "typical_error_messages": [
    "Invalid switch - /I",
    "File not found (long filename)",
    "Bad command or file name"
  ],
  "root_cause": "DOS 6.22 does not support /I flag (added in Windows 2000), long filenames, or many modern batch features",
  "recommended_solution": "Use duplicate IF statements for upper/lowercase. Keep filenames to 8.3 format. Use basic batch syntax only.",
  "alternative_approaches": [
    "Duplicate IF for case-insensitive: IF \"%1\"==\"VALUE\" ... + IF \"%1\"==\"value\" ...",
    "Use 8.3 filenames only",
    "Avoid advanced batch features"
  ],
  "workaround_commands": [
    "IF \"%1\"==\"STATUS\" GOTO STATUS",
    "IF \"%1\"==\"status\" GOTO STATUS"
  ],
  "occurrence_count": 8,
  "severity": "blocking",
  "added_to_insights": true
}

ReadyNAS Service Management:

{
  "pattern_type": "service_unavailable",
  "pattern_signature": "systemd commands on ReadyNAS",
  "error_pattern": "systemctl.*command not found",
  "affected_systems": ["D2TESTNAS"],
  "triggering_commands": [
    "systemctl status nmbd",
    "systemctl restart samba"
  ],
  "failure_description": "ReadyNAS does not use systemd for service management",
  "typical_error_messages": [
    "systemctl: command not found",
    "-ash: systemctl: not found"
  ],
  "root_cause": "ReadyNAS OS is based on older Linux without systemd. Uses traditional init scripts.",
  "recommended_solution": "Use 'service' command or direct process management: service nmbd status, ps aux | grep nmbd",
  "alternative_approaches": [
    "service nmbd status",
    "ps aux | grep nmbd",
    "/etc/init.d/nmbd status"
  ],
  "occurrence_count": 3,
  "severity": "major",
  "added_to_insights": true
}
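
Because `error_pattern` holds a regex or keywords, matching a new error against the stored patterns is a short loop. This is a minimal sketch, assuming patterns have already been loaded as dictionaries; `match_failure_patterns` is a hypothetical name.

```python
import re


def match_failure_patterns(error_text, patterns):
    """Return the signatures of active patterns whose error_pattern matches.

    patterns: iterable of dicts with 'pattern_signature', 'error_pattern'
    and optionally 'is_active' (treated as True when absent, as in the schema).
    """
    matches = []
    for pattern in patterns:
        if not pattern.get("is_active", True):
            continue
        regex = pattern.get("error_pattern")
        if not regex:
            continue
        try:
            found = re.search(regex, error_text, flags=re.IGNORECASE)
        except re.error:
            # The column may hold plain keywords that are not valid regex;
            # fall back to a case-insensitive substring test.
            found = regex.lower() in error_text.lower()
        if found:
            matches.append(pattern["pattern_signature"])
    return matches
```

Run against the examples above, the stored regex `systemctl.*command not found` matches `systemctl: command not found` but not the second sample message, `-ash: systemctl: not found`, so a pattern may need more than one expression to cover all of its typical messages.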

`operation_failures`

Non-command failures (API calls, integrations, file operations, network requests). Complements commands_run failure tracking.

CREATE TABLE operation_failures (
    id UUID PRIMARY KEY DEFAULT UUID(), -- UUID column type requires MariaDB 10.7+
    session_id UUID REFERENCES sessions(id) ON DELETE CASCADE,
    work_item_id UUID REFERENCES work_items(id) ON DELETE CASCADE,
    client_id UUID REFERENCES clients(id) ON DELETE SET NULL,

    -- Operation details
    operation_type VARCHAR(100) NOT NULL CHECK(operation_type IN (
        'api_call', 'file_operation', 'network_request',
        'database_query', 'external_integration', 'service_restart',
        'backup_operation', 'restore_operation', 'migration'
    )),
    operation_description TEXT NOT NULL,
    target_system VARCHAR(255), -- host, URL, service name

    -- Failure details
    error_message TEXT NOT NULL,
    error_code VARCHAR(50), -- HTTP status, exit code, error number
    failure_category VARCHAR(100), -- "timeout", "authentication", "not_found", etc.
    stack_trace TEXT,

    -- Context
    request_data TEXT, -- JSON: what was attempted
    response_data TEXT, -- JSON: error response
    environment_snapshot TEXT, -- JSON: relevant env vars, versions

    -- Resolution
    resolution_applied TEXT,
    resolved BOOLEAN DEFAULT false,
    resolved_at TIMESTAMP,
    time_to_resolution_minutes INTEGER,

    -- Pattern linkage
    related_pattern_id UUID REFERENCES failure_patterns(id),

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    INDEX idx_op_failure_session (session_id),
    INDEX idx_op_failure_type (operation_type),
    INDEX idx_op_failure_category (failure_category),
    INDEX idx_op_failure_resolved (resolved),
    INDEX idx_op_failure_client (client_id)
);

Example Operation Failures:

SyncroMSP API Timeout:

{
  "operation_type": "api_call",
  "operation_description": "Search SyncroMSP tickets for Dataforth",
  "target_system": "https://azcomputerguru.syncromsp.com/api/v1",
  "error_message": "Request timeout after 30 seconds",
  "error_code": "ETIMEDOUT",
  "failure_category": "timeout",
  "request_data": {
    "endpoint": "/api/v1/tickets",
    "params": {"customer_id": 12345, "status": "open"}
  },
  "response_data": null,
  "resolution_applied": "Increased timeout to 60 seconds. Added retry logic with exponential backoff.",
  "resolved": true,
  "time_to_resolution_minutes": 15
}

File Upload Permission Denied:

{
  "operation_type": "file_operation",
  "operation_description": "Upload backup file to NAS",
  "target_system": "D2TESTNAS:/mnt/backups",
  "error_message": "Permission denied: /mnt/backups/db_backup_2026-01-15.sql",
  "error_code": "EACCES",
  "failure_category": "permission",
  "environment_snapshot": {
    "user": "backupuser",
    "directory_perms": "drwxr-xr-x root root"
  },
  "resolution_applied": "Changed directory ownership: chown -R backupuser:backupgroup /mnt/backups",
  "resolved": true
}

Database Query Performance:

{
  "operation_type": "database_query",
  "operation_description": "Query sessions table for large date range",
  "target_system": "MariaDB msp_tracking",
  "error_message": "Query execution time: 45 seconds (threshold: 5 seconds)",
  "failure_category": "performance",
  "request_data": {
    "query": "SELECT * FROM sessions WHERE session_date BETWEEN '2020-01-01' AND '2026-01-15'"
  },
  "resolution_applied": "Added index on session_date column. Query now runs in 0.3 seconds.",
  "resolved": true
}
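
The examples above show `request_data`, `response_data` and `environment_snapshot` as nested JSON, while the table stores them in TEXT columns. One way to bridge the two is to serialize those fields when building the row. The helper below is a hypothetical sketch, not the project's actual data layer.

```python
import json

# Values accepted by the CHECK constraint on operation_type.
ALLOWED_OPERATION_TYPES = {
    "api_call", "file_operation", "network_request",
    "database_query", "external_integration", "service_restart",
    "backup_operation", "restore_operation", "migration",
}
# Columns documented as holding JSON text.
JSON_TEXT_FIELDS = ("request_data", "response_data", "environment_snapshot")


def build_operation_failure(operation_type, description, error_message, **extra):
    """Build a row dict for operation_failures, JSON-encoding structured fields."""
    if operation_type not in ALLOWED_OPERATION_TYPES:
        raise ValueError(f"unknown operation_type: {operation_type!r}")
    row = {
        "operation_type": operation_type,
        "operation_description": description,
        "error_message": error_message,
        "resolved": False,  # matches the column default
    }
    for key, value in extra.items():
        if key in JSON_TEXT_FIELDS and value is not None:
            value = json.dumps(value, sort_keys=True)
        row[key] = value
    return row
```

`id` and `created_at` are left out on purpose so that the column defaults apply.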

Self-Learning Workflow

1. Failure Detection and Logging

Command Execution with Failure Tracking:

User: "Check WINS status on D2TESTNAS"

Main Claude → Environment Context Agent:
  - Queries infrastructure table for D2TESTNAS
  - Reads environmental_notes: "Manual WINS install, no native service"
  - Reads environmental_insights for D2TESTNAS
  - Returns: "D2TESTNAS has manually installed WINS (not native ReadyNAS service)"

Main Claude suggests command based on environmental context:
  - Executes: ssh root@192.168.0.9 'systemctl status nmbd'

Command fails:
  - success = false
  - exit_code = 127
  - error_message = "systemctl: command not found"
  - failure_category = "command_compatibility"

Trigger Failure Analysis Agent:
  - Analyzes error: ReadyNAS doesn't use systemd
  - Identifies correct approach: "service nmbd status" or "ps aux | grep nmbd"
  - Creates failure_pattern entry
  - Updates environmental_insights with correction
  - Returns resolution to Main Claude

Main Claude tries corrected command:
  - Executes: ssh root@192.168.0.9 'ps aux | grep nmbd'
  - Success = true
  - Updates original failure record with resolution
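
The "Command fails" step in the trace above amounts to turning an exit code and error output into a failure record. Here is a minimal Python sketch of that mapping. The function name and the `"unknown"` fallback category are assumptions; exit codes 127 and 126 follow the usual POSIX shell convention (command not found, and found but not executable).

```python
def classify_command_failure(exit_code: int, stderr: str) -> dict:
    """Map a finished command to the failure fields shown in the trace above."""
    if exit_code == 0:
        return {"success": True, "exit_code": 0}
    text = stderr.lower()
    if exit_code == 127 or "command not found" in text or "not recognized" in text:
        category = "command_compatibility"
    elif exit_code == 126 or "permission denied" in text:
        category = "permission_denied"
    else:
        category = "unknown"  # assumption: left for the Failure Analysis Agent
    return {
        "success": False,
        "exit_code": exit_code,
        "error_message": stderr.strip(),
        "failure_category": category,
    }
```

For the `systemctl` call in the trace, this yields `failure_category = "command_compatibility"` with `exit_code = 127`.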

2. Pattern Analysis (Periodic Agent Run)

Failure Analysis Agent runs periodically:

Agent Task: "Analyze recent failures and update environmental insights"

Query failures:

SELECT * FROM commands_run
WHERE success = false AND resolved = false
ORDER BY created_at DESC;

SELECT * FROM operation_failures
WHERE resolved = false
ORDER BY created_at DESC;

Group by pattern:
- Group by infrastructure_id, error_pattern, failure_category
- Identify recurring patterns

Create/update failure_patterns:
- If pattern seen 3+ times → Create failure_pattern
- Increment occurrence_count for existing patterns
- Update last_seen timestamp

Generate environmental_insights:
- Transform failure_patterns into actionable insights
- Create markdown-formatted descriptions
- Add command examples
- Set priority based on severity and frequency

Update infrastructure environmental_notes:
- Add constraints to infrastructure.environmental_notes
- Set powershell_version, shell_type, limitations

Generate insights.md file:
- Query all environmental_insights for client
- Format as markdown
- Save to D:\ClaudeTools\insights[client-name].md
- Agents read this file before making suggestions
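
The grouping and "3+ times" steps above can be sketched in a few lines of Python. This assumes each failure row has already been reduced to an `error_pattern` key by an earlier normalization step, which this document does not specify; `find_recurring_failures` is a hypothetical name.

```python
from collections import Counter

PATTERN_THRESHOLD = 3  # "seen 3+ times" from the steps above


def find_recurring_failures(failures, threshold=PATTERN_THRESHOLD):
    """Group failure rows and return the groups seen at least `threshold` times.

    failures: iterable of dicts with 'infrastructure_id', 'error_pattern'
    and 'failure_category'. Returns {group_key: occurrence_count}.
    """
    counts = Counter(
        (f["infrastructure_id"], f["error_pattern"], f["failure_category"])
        for f in failures
    )
    return {key: n for key, n in counts.items() if n >= threshold}
```

Each returned key would then either create a new `failure_patterns` row or increment `occurrence_count` on an existing one.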

3. Pre-Operation Environment Check

Environment Context Agent runs before operations:

Agent Task: "Check environmental constraints for D2TESTNAS before command suggestion"

Query infrastructure:

SELECT environmental_notes, powershell_version, shell_type, limitations
FROM infrastructure
WHERE id = 'd2testnas-uuid';

Query environmental_insights:

SELECT insight_title, insight_description, examples, priority
FROM environmental_insights
WHERE infrastructure_id = 'd2testnas-uuid'
  AND is_active = true
ORDER BY priority DESC;

Query failure_patterns:

SELECT pattern_signature, recommended_solution, workaround_commands
FROM failure_patterns
WHERE infrastructure_id = 'd2testnas-uuid'
  AND is_active = true;

Check proposed command compatibility:
- Proposed: "systemctl status nmbd"
- Pattern match: "systemctl.*command not found"
- Result: INCOMPATIBLE
- Recommended: "ps aux | grep nmbd"

Return environmental context:

Environmental Context for D2TESTNAS:
- ReadyNAS OS (Linux-based)
- Manual WINS installation (Samba nmbd)
- No systemd (use 'service' or ps commands)
- SMB1/CORE protocol for DOS compatibility

Recommended commands:
✓ ps aux | grep nmbd
✓ service nmbd status
✗ systemctl status nmbd (not available)

Main Claude uses this context to suggest correct approach.
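
The compatibility check above could be approximated as follows. This Python sketch compares only the program name of the proposed command with each pattern's `triggering_commands`. That is a deliberate simplification which would over-match shell built-ins such as `IF`, and `check_command` is a hypothetical helper rather than the agent's real logic.

```python
def check_command(proposed: str, patterns) -> dict:
    """Compare a proposed command with known failure patterns for one host.

    patterns: iterable of dicts with 'pattern_signature', 'triggering_commands'
    and 'workaround_commands' (JSON columns already decoded to lists).
    """
    words = proposed.split()
    program = words[0] if words else ""
    for pattern in patterns:
        for trigger in pattern.get("triggering_commands", []):
            trigger_words = trigger.split()
            # Simplification: a match on the program name alone flags the command.
            if trigger_words and trigger_words[0] == program:
                return {
                    "compatible": False,
                    "matched": pattern.get("pattern_signature"),
                    "recommended": pattern.get("workaround_commands", []),
                }
    return {"compatible": True, "matched": None, "recommended": []}
```

With the ReadyNAS pattern loaded, `check_command("systemctl status nmbd", patterns)` reports the command as incompatible and returns the stored workarounds.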

Benefits

1. Self-Improving System

Each failure makes the system smarter
Patterns identified automatically
Insights generated without manual documentation
Knowledge accumulates over time

2. Reduced User Friction

User doesn't have to keep correcting same mistakes
Claude learns environmental constraints once
Suggestions are environmentally aware from start
Proactive problem prevention

3. Institutional Knowledge Capture

All environmental quirks documented in database
Survives across sessions and Claude instances
Queryable: "What are known issues with D2TESTNAS?"
Transferable to new team members

4. Proactive Problem Prevention

Environment Context Agent prevents failures before they happen
Suggests compatible alternatives automatically
Warns about known limitations
Avoids wasting time on incompatible approaches

5. Audit Trail

Every failure tracked with full context
Resolution history for troubleshooting
Pattern analysis for infrastructure planning
ROI tracking: time saved by avoiding repeat failures

Integration with Other Schemas

Sources data from:

commands_run - Command execution failures
infrastructure - System capabilities and limitations
work_items - Context for failures
sessions - Session context for operations

Provides data to:

Environment Context Agent (pre-operation checks)
Problem Pattern Matching Agent (solution lookup)
MSP Mode (intelligent suggestions)
Reporting (failure analysis, improvement metrics)

Example Queries

Find all insights for a client

SELECT ei.insight_title, ei.insight_description, i.hostname
FROM environmental_insights ei
JOIN infrastructure i ON ei.infrastructure_id = i.id
WHERE ei.client_id = 'dataforth-uuid'
  AND ei.is_active = true
ORDER BY ei.priority DESC;
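
Rows returned by this query are enough to produce a per-client insights file. The following is a rough Python sketch, assuming rows arrive as dictionaries already ordered by priority; `render_insights` is a hypothetical function, and the layout it emits is a simplified version of the generated insights.md example.

```python
def render_insights(client_name, rows):
    """Render insight rows as markdown, grouped by host in input order.

    rows: iterable of dicts with 'hostname', 'insight_title' and
    'insight_description', as selected by the query above.
    """
    by_host = {}
    for row in rows:
        # dicts preserve insertion order, so the query's ordering is kept.
        by_host.setdefault(row["hostname"], []).append(row)

    lines = [f"# Environmental Insights: {client_name}", ""]
    for host, insights in by_host.items():
        lines += [f"## {host}", ""]
        for row in insights:
            lines += [f"**{row['insight_title']}**", "", row["insight_description"], ""]
    return "\n".join(lines)
```

Writing the result to disk is left to the caller.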

Search for similar problems

SELECT ps.problem_title, ps.solution_applied, ps.created_at
FROM problem_solutions ps
WHERE MATCH(ps.problem_description, ps.symptom, ps.error_message, ps.root_cause)
      AGAINST('SSL certificate' IN BOOLEAN MODE)
ORDER BY ps.created_at DESC
LIMIT 10;

Active failure patterns

SELECT fp.pattern_signature, fp.occurrence_count, fp.recommended_solution
FROM failure_patterns fp
WHERE fp.is_active = true
  AND fp.severity IN ('blocking', 'major')
ORDER BY fp.occurrence_count DESC;

Unresolved operation failures

SELECT opf.operation_type, opf.target_system, opf.error_message, opf.created_at
FROM operation_failures opf
WHERE opf.resolved = false
ORDER BY opf.created_at DESC;

Document Version: 1.0
Last Updated: 2026-01-15
Author: MSP Mode Schema Design Team
