# Conversation Parser Usage Guide

Complete guide for using the ClaudeTools conversation transcript parser and intelligent categorizer.

## Overview

The conversation parser extracts, analyzes, and categorizes conversation data from Claude Desktop/Code sessions. It intelligently classifies conversations as **MSP Work**, **Development**, or **General** and compresses them for efficient database storage.

## Main Functions

### 1. `parse_jsonl_conversation(file_path: str)`

Parse conversation files (`.jsonl` or `.json`) and extract structured data.

**Returns:**

```python
{
    "messages": [{"role": str, "content": str, "timestamp": str}, ...],
    "metadata": {"title": str, "model": str, "created_at": str, ...},
    "file_paths": [str, ...],    # Auto-extracted from content
    "tool_calls": [{"tool": str, "count": int}, ...],
    "duration_seconds": int,
    "message_count": int
}
```

**Example:**

```python
from api.utils.conversation_parser import parse_jsonl_conversation

conversation = parse_jsonl_conversation("/path/to/conversation.jsonl")
print(f"Found {conversation['message_count']} messages")
print(f"Duration: {conversation['duration_seconds']} seconds")
```

---
### 2. `categorize_conversation(messages: List[Dict])`

Intelligently categorize conversation content using weighted keyword analysis.

**Returns:** `"msp"`, `"development"`, or `"general"`

**Categorization Logic:**

**MSP Keywords (higher weight = stronger signal):**

- Client/Infrastructure: client, customer, site, firewall, network, server
- Services: support, ticket, incident, billable, invoice
- Microsoft 365: office365, azure, exchange, sharepoint, teams
- MSP-specific: managed service, service desk, RDS, terminal server

**Development Keywords:**

- API/Backend: api, endpoint, fastapi, flask, rest, webhook
- Database: database, migration, alembic, sqlalchemy, postgresql
- Code: implement, refactor, debug, test, pytest, function, class
- Tools: docker, kubernetes, ci/cd, deployment

The category with the higher total keyword score wins; when neither side scores strongly, the conversation falls back to `"general"`, as in the sketch below.
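The weights and keyword lists here are assumptions for illustration, not the module's actual values:

```python
# Illustrative sketch of weighted keyword categorization. The weights
# below are assumptions for demonstration, not the module's real table.
from typing import Dict, List

MSP_WEIGHTS = {"client": 3, "firewall": 3, "site": 2, "ticket": 2, "office365": 3}
DEV_WEIGHTS = {"fastapi": 4, "api": 3, "sqlalchemy": 3, "postgresql": 3, "pytest": 2}

def categorize_sketch(messages: List[Dict]) -> str:
    # Pool all message content into one lowercase string
    text = " ".join(m.get("content", "") for m in messages).lower()
    msp_score = sum(w for kw, w in MSP_WEIGHTS.items() if kw in text)
    dev_score = sum(w for kw, w in DEV_WEIGHTS.items() if kw in text)
    if msp_score == 0 and dev_score == 0:
        return "general"  # no strong signal on either side
    return "msp" if msp_score >= dev_score else "development"
```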
**Example:**

```python
from api.utils.conversation_parser import categorize_conversation

# MSP conversation
messages = [
    {"role": "user", "content": "Client firewall blocking Office365"},
    {"role": "assistant", "content": "Checking client site configuration"}
]
category = categorize_conversation(messages)  # Returns "msp"

# Development conversation
messages = [
    {"role": "user", "content": "Build FastAPI endpoint with PostgreSQL"},
    {"role": "assistant", "content": "Creating API using SQLAlchemy"}
]
category = categorize_conversation(messages)  # Returns "development"
```

---
### 3. `extract_context_from_conversation(conversation: Dict)`

Extract dense, compressed context suitable for database storage.

**Returns:**

```python
{
    "category": str,           # "msp", "development", or "general"
    "summary": Dict,           # From compress_conversation_summary()
    "tags": List[str],         # Auto-extracted technology/topic tags
    "decisions": List[Dict],   # Key decisions with rationale
    "key_files": List[str],    # Top 20 file paths mentioned
    "key_tools": List[str],    # Top 10 tools used
    "metrics": {
        "message_count": int,
        "duration_seconds": int,
        "file_count": int,
        "tool_count": int,
        "decision_count": int,
        "quality_score": float  # 0-10 quality rating
    },
    "raw_metadata": Dict       # Original metadata
}
```

**Quality Score Calculation:**

- More messages = higher quality (up to 5 points)
- Decisions indicate depth (up to 2 points)
- File mentions indicate concrete work (up to 2 points)
- Sessions longer than 5 minutes earn +1 point

These rules can be combined as in the sketch below.
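Only the caps and the 5-minute bonus come from the rules above; the per-item increments are assumptions for illustration:

```python
# Illustrative sketch of the 0-10 quality score. The caps and the
# 5-minute bonus follow the rules above; the per-item increments
# (0.5 per message, and so on) are assumptions for demonstration.
def quality_score_sketch(message_count: int, decision_count: int,
                         file_count: int, duration_seconds: int) -> float:
    score = min(5.0, message_count * 0.5)    # more messages, capped at 5 points
    score += min(2.0, decision_count * 0.5)  # decisions indicate depth, capped at 2
    score += min(2.0, file_count * 0.25)     # file mentions, capped at 2
    if duration_seconds > 300:               # sessions longer than 5 minutes
        score += 1.0
    return round(score, 1)
```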
**Example:**

```python
from api.utils.conversation_parser import (
    parse_jsonl_conversation,
    extract_context_from_conversation
)

# Parse and extract context
conversation = parse_jsonl_conversation("/path/to/file.jsonl")
context = extract_context_from_conversation(conversation)

print(f"Category: {context['category']}")
print(f"Tags: {context['tags']}")
print(f"Quality: {context['metrics']['quality_score']}/10")
print(f"Decisions: {len(context['decisions'])}")
```

---
### 4. `scan_folder_for_conversations(base_path: str)`

Recursively find all conversation files in a directory.

**Features:**

- Finds both `.jsonl` and `.json` files
- Automatically skips config files (`config.json`, `settings.json`)
- Skips common non-conversation files (`package.json`, `tsconfig.json`)
- Cross-platform path handling

**Returns:** List of absolute file paths

The scanning and skip behavior can be approximated as in the sketch below.
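The skip set here is an assumption based on the file names listed above:

```python
# Illustrative sketch of recursive scanning with a skip list. The skip
# set is an assumption based on the file names mentioned above.
from pathlib import Path
from typing import List

SKIP_NAMES = {"config.json", "settings.json", "package.json", "tsconfig.json"}

def scan_sketch(base_path: str) -> List[str]:
    results = []
    for pattern in ("*.jsonl", "*.json"):
        for path in Path(base_path).rglob(pattern):
            if path.name.lower() not in SKIP_NAMES:
                results.append(str(path.resolve()))  # absolute paths
    return results
```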
**Example:**

```python
from api.utils.conversation_parser import scan_folder_for_conversations

# Scan Claude Code sessions
files = scan_folder_for_conversations(
    r"C:\Users\MikeSwanson\claude-projects"
)

print(f"Found {len(files)} conversation files")
for file in files[:5]:
    print(f"  - {file}")
```

---
## Complete Workflow Example

### Batch Process a Conversation Folder

```python
from api.utils.conversation_parser import (
    scan_folder_for_conversations,
    parse_jsonl_conversation,
    extract_context_from_conversation
)

# 1. Scan for conversation files
base_path = r"C:\Users\MikeSwanson\claude-projects"
files = scan_folder_for_conversations(base_path)

# 2. Process each conversation
contexts = []
for file_path in files:
    try:
        # Parse conversation
        conversation = parse_jsonl_conversation(file_path)

        # Extract context
        context = extract_context_from_conversation(conversation)

        # Add source file for traceability
        context["source_file"] = file_path

        contexts.append(context)

        print(f"Processed: {file_path}")
        print(f"  Category: {context['category']}")
        print(f"  Messages: {context['metrics']['message_count']}")
        print(f"  Quality: {context['metrics']['quality_score']}/10")

    except Exception as e:
        print(f"Error processing {file_path}: {e}")

# 3. Categorize by type
msp_contexts = [c for c in contexts if c['category'] == 'msp']
dev_contexts = [c for c in contexts if c['category'] == 'development']

print("\nSummary:")
print(f"  MSP conversations: {len(msp_contexts)}")
print(f"  Development conversations: {len(dev_contexts)}")
```

### Using the Batch Helper Function

```python
from api.utils.conversation_parser import batch_process_conversations

def progress_callback(file_path, context):
    """Called for each processed file."""
    print(f"Processed: {context['category']} - {context['metrics']['quality_score']}/10")

# Process all conversations with a progress callback
contexts = batch_process_conversations(
    r"C:\Users\MikeSwanson\claude-projects",
    output_callback=progress_callback
)

print(f"Total processed: {len(contexts)}")
```

---
## Integration with Database

### Insert Context into the Database

```python
from sqlalchemy.orm import Session

from api.models import ContextSnippet
from api.utils.conversation_parser import (
    parse_jsonl_conversation,
    extract_context_from_conversation
)

def import_conversation_to_db(db: Session, file_path: str):
    """Import a conversation file into the database."""

    # 1. Parse and extract context
    conversation = parse_jsonl_conversation(file_path)
    context = extract_context_from_conversation(conversation)

    # 2. Create a context snippet for the summary
    summary_snippet = ContextSnippet(
        content=str(context['summary']),
        snippet_type="session_summary",
        tags=context['tags'],
        importance=min(10, int(context['metrics']['quality_score'])),
        metadata={
            "category": context['category'],
            "source_file": file_path,
            "message_count": context['metrics']['message_count'],
            "duration_seconds": context['metrics']['duration_seconds']
        }
    )
    db.add(summary_snippet)

    # 3. Create one snippet per extracted decision
    for decision in context['decisions']:
        decision_snippet = ContextSnippet(
            content=f"{decision['decision']} - {decision['rationale']}",
            snippet_type="decision",
            tags=context['tags'][:5],
            importance=7 if decision['impact'] == 'high' else 5,
            metadata={
                "category": context['category'],
                "impact": decision['impact'],
                "source_file": file_path
            }
        )
        db.add(decision_snippet)

    db.commit()
    print(f"Imported conversation from {file_path}")
```

---
## CLI Quick Test

The module includes a standalone CLI for quick testing:

```bash
# Test a specific conversation file
python api/utils/conversation_parser.py /path/to/conversation.jsonl

# Output:
# Conversation: Build authentication system
# Category: development
# Messages: 15
# Duration: 1200s (20m)
# Tags: development, fastapi, postgresql, auth, api
# Quality: 7.5/10
```

---
## Categorization Examples

### MSP Conversation

```
User: Client at BGBuilders site reported VPN connection issues
Assistant: I'll check the firewall configuration and VPN settings for the client
```

**Category:** `msp`

**Score Logic:** client (3) + site (2) + vpn (2) + firewall (3) = 10 points

### Development Conversation

```
User: Build a FastAPI REST API with PostgreSQL and implement JWT authentication
Assistant: I'll create the API endpoints using SQLAlchemy ORM and add JWT token support
```

**Category:** `development`

**Score Logic:** fastapi (4) + api (3) + postgresql (3) + sqlalchemy (3) + jwt (auth tag) = 13+ points

### General Conversation

```
User: What's the best way to organize my project files?
Assistant: I recommend organizing by feature rather than by file type
```

**Category:** `general`

**Score Logic:** no strong MSP or development keywords; both scores stay low

---
## Advanced Features

### File Path Extraction

Automatically extracts file paths from conversation content:

```python
conversation = parse_jsonl_conversation("/path/to/file.jsonl")
print(conversation['file_paths'])
# ['api/auth.py', 'api/models.py', 'tests/test_auth.py']
```

Supported path styles:

- Windows absolute paths: `C:\Users\...\file.py`
- Unix absolute paths: `/home/user/file.py`
- Relative paths: `./api/file.py`, `../utils/helper.py`
- Code paths: `api/auth.py`, `src/models.py`

Extraction along these lines can be sketched with a few regular expressions, as shown below.
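The patterns here are assumptions for illustration; the module's actual regexes may differ:

```python
# Illustrative sketch of path extraction. These regexes are assumptions
# for demonstration; the module's actual patterns may differ.
import re
from typing import List

PATH_PATTERNS = [
    r"[A-Za-z]:\\(?:[\w.-]+\\)*[\w.-]+\.\w+",  # Windows absolute: C:\Users\...\file.py
    r"/(?:[\w.-]+/)*[\w.-]+\.\w+",             # Unix absolute: /home/user/file.py
    r"\.{1,2}/(?:[\w.-]+/)*[\w.-]+\.\w+",      # Relative: ./api/file.py
    r"\b[\w.-]+(?:/[\w.-]+)+\.\w+",            # Code paths: api/auth.py
]

def extract_paths_sketch(text: str) -> List[str]:
    found = []
    for pattern in PATH_PATTERNS:
        found.extend(re.findall(pattern, text))
    return sorted(set(found))  # de-duplicate overlapping matches
```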
### Tool Call Tracking

Automatically tracks which tools were used:

```python
conversation = parse_jsonl_conversation("/path/to/file.jsonl")
print(conversation['tool_calls'])
# [
#     {"tool": "write", "count": 5},
#     {"tool": "read", "count": 3},
#     {"tool": "bash", "count": 2}
# ]
```

---
## Best Practices

1. **Use quality scores to filter**: Only import high-quality conversations (score > 5.0)
2. **Batch process in chunks**: Process large folders in batches to manage memory
3. **Add source file tracking**: Always include `source_file` in context for traceability
4. **Validate before import**: Check `message_count > 0` before importing to database
5. **Use callbacks for progress**: Implement progress callbacks for long-running batch jobs

A short sketch combining practices 1 and 4 follows.
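The 5.0 threshold comes from practice 1; the stand-in data is illustrative:

```python
# Illustrative sketch of practices 1 and 4: validate message_count and
# filter on quality_score before importing. `contexts` would normally
# come from the batch workflow shown earlier; stand-in data is used here.
def should_import(context: dict) -> bool:
    metrics = context['metrics']
    if metrics['message_count'] == 0:       # Practice 4: skip empty conversations
        return False
    return metrics['quality_score'] > 5.0   # Practice 1: quality filter

# Stand-in data shaped like extract_context_from_conversation() output
contexts = [
    {"metrics": {"message_count": 15, "quality_score": 7.5}},
    {"metrics": {"message_count": 0, "quality_score": 0.0}},
]
high_quality = [c for c in contexts if should_import(c)]
print(f"Kept {len(high_quality)} of {len(contexts)} conversations")  # Kept 1 of 2
```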
---
## Error Handling

```python
from api.utils.conversation_parser import parse_jsonl_conversation

def process_file(file_path: str):
    try:
        conversation = parse_jsonl_conversation(file_path)

        if conversation['message_count'] == 0:
            print("Warning: Empty conversation, skipping")
            return

        # Process conversation...

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except ValueError as e:
        print(f"Invalid file format: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
```
---

## Related Files

- **`context_compression.py`**: Provides compression utilities used by the parser
- **`test_conversation_parser.py`**: Comprehensive test suite with examples
- **Database Models**: `api/models.py` defines the `ContextSnippet` model used for storage

---
## Future Enhancements

Potential improvements for future versions:

1. **Multi-language detection**: Identify primary programming language
2. **Sentiment analysis**: Detect problem-solving vs. exploratory conversations
3. **Entity extraction**: Extract specific client names, project names, technologies
4. **Time-based patterns**: Identify working hours, session patterns
5. **Conversation linking**: Link related conversations by topic/project