# Hybrid LLM Bridge - Local + Cloud Routing

**Created:** 2026-02-09
**Purpose:** Route queries intelligently between local Ollama, Claude API, and Grok API

---

## Architecture

```
User Query (voice, chat, HA automation)
              |
      [LiteLLM Proxy]
       localhost:4000
              |
     Routing Decision
     /       |        \
[Ollama]  [Claude]   [Grok]
 Local    Anthropic    xAI
 Free     Reasoning   Search
 Private  $3/$15/1M   $3/$15/1M
```

---

## Recommended: LiteLLM Proxy

Unified API gateway that presents a single OpenAI-compatible endpoint. Everything talks to `localhost:4000` and LiteLLM routes to the right backend.

### Installation

```bash
pip install litellm[proxy]
```

### Configuration (`config.yaml`)

```yaml
model_list:
  # Local models (free, private)
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434

  - model_name: local-reasoning
    litellm_params:
      model: ollama/llama3.1:70b-q4
      api_base: http://localhost:11434

  # Cloud: Claude (complex reasoning)
  - model_name: cloud-reasoning
    litellm_params:
      model: claude-sonnet-4-5-20250929
      api_key: sk-ant-XXXXX

  - model_name: cloud-reasoning-cheap
    litellm_params:
      model: claude-haiku-4-5-20251001
      api_key: sk-ant-XXXXX

  # Cloud: Grok (internet search)
  - model_name: cloud-search
    litellm_params:
      model: grok-4
      api_key: xai-XXXXX
      api_base: https://api.x.ai/v1

router_settings:
  routing_strategy: simple-shuffle
  allowed_fails: 2
  num_retries: 3

  budget_policy:
    local-fast: unlimited
    local-reasoning: unlimited
    cloud-reasoning: $50/month
    cloud-reasoning-cheap: $25/month
    cloud-search: $25/month
```

### Start the Proxy

```bash
litellm --config config.yaml --port 4000
```

### Usage

Everything talks to `http://localhost:4000` with OpenAI-compatible format:

```python
import openai

client = openai.OpenAI(
    api_key="anything",  # LiteLLM doesn't need this for local
    base_url="http://localhost:4000"
)

# Route to local
response = client.chat.completions.create(
    model="local-fast",
    messages=[{"role": "user", "content": "Turn on the lights"}]
)

# Route to Claude
response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[{"role": "user", "content": "Analyze my energy usage patterns"}]
)

# Route to Grok
response = client.chat.completions.create(
    model="cloud-search",
    messages=[{"role": "user", "content": "What's the current electricity rate in PA?"}]
)
```

---

## Routing Strategy

### What Goes Where

**Local (Ollama) -- Default for everything private:**
- Home automation commands ("turn on lights", "set thermostat to 72")
- Sensor data queries ("what's the temperature in the garage?")
- Camera-related queries (never send video to cloud)
- Personal information queries
- Simple Q&A
- Quick lookups from local knowledge

**Claude API -- Complex reasoning tasks:**
- Detailed analysis ("analyze my energy trends this month")
- Code generation ("write an HA automation for...")
- Long-form content creation
- Multi-step reasoning problems
- Function calling for HA service control

**Grok API -- Internet/real-time data:**
- Current events ("latest news on solar tariffs")
- Real-time pricing ("current electricity rates")
- Weather data (if not using local integration)
- Web searches
- Anything requiring information the local model doesn't have

### Manual vs Automatic Routing

**Phase 1 (Start here):** Manual model selection
- User picks "local-fast", "cloud-reasoning", or "cloud-search" in Open WebUI
- Simple, no mistakes, full control
- Good for learning which queries work best where

**Phase 2 (Later):** Keyword-based routing in LiteLLM
- Route based on keywords in the query
- "search", "latest", "current" --> Grok
- "analyze", "explain in detail", "write code" --> Claude
- Everything else --> local

**Phase 3 (Advanced):** Semantic routing
- Use sentence embeddings to classify query intent
- Small local model (all-MiniLM-L6-v2) classifies in 50-200ms
- Most intelligent routing, but requires Python development

---

## Cloud API Details

### Claude (Anthropic)

**Endpoint:** `https://api.anthropic.com/v1/messages`
**Get API key:** https://console.anthropic.com/

**Pricing (2025-2026):**

| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Claude Haiku 4.5 | $0.50 | $2.50 | Fast, cheap tasks |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Best balance |
| Claude Opus 4.5 | $5.00 | $25.00 | Top quality |

**Cost optimization:**
- Prompt caching: 90% savings on repeated system prompts
- Use Haiku for simple tasks, Sonnet for complex ones
- Batch processing available for non-urgent tasks

**Features:**
- 200k context window
- Extended thinking mode
- Function calling (perfect for HA control)
- Vision support (could analyze charts, screenshots)

### Grok (xAI)

**Endpoint:** `https://api.x.ai/v1/chat/completions`
**Get API key:** https://console.x.ai/
**Format:** OpenAI SDK compatible

**Pricing:**

| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Grok 4.1 Fast | $0.20 | $1.00 | Budget queries |
| Grok 4 | $3.00 | $15.00 | Full capability |

**Free credits:** $25 new user + $150/month if opting into data sharing program

**Features:**
- 2 million token context window (industry-leading)
- Real-time X (Twitter) integration
- Internet search capability
- OpenAI SDK compatibility

---

## Monthly Cost Estimates

### Conservative Use (80/15/5 Split, 1000 queries/month)

| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (80%) | 800 | Ollama | $0 |
| Claude (15%) | 150 | Haiku 4.5 | ~$0.45 |
| Grok (5%) | 50 | Grok 4.1 Fast | ~$0.07 |
| **Total** | **1000** | | **~$0.52/month** |

### Heavy Use (60/25/15 Split, 3000 queries/month)

| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (60%) | 1800 | Ollama | $0 |
| Claude (25%) | 750 | Sonnet 4.5 | ~$15 |
| Grok (15%) | 450 | Grok 4 | ~$9 |
| **Total** | **3000** | | **~$24/month** |

**Add electricity for LLM server:** ~$15-30/month (RTX 4090 build)

---

## Home Assistant Integration

### Connect HA to LiteLLM Proxy

**Option 1: Extended OpenAI Conversation (Recommended)**

Install via HACS, then configure:
- API Base URL: `http://<llm-server-ip>:4000/v1`
- API Key: (any string, LiteLLM doesn't validate for local)
- Model: `local-fast` (or any model name from your config)

This gives HA natural language control:
- "Turn off all lights downstairs" --> local LLM understands --> calls HA service
- "What's my battery charge level?" --> queries HA entities --> responds

**Option 2: Native Ollama Integration**

Settings > Integrations > Ollama:
- URL: `http://<llm-server-ip>:11434`
- Simpler but bypasses the routing layer

### Voice Assistant Pipeline

```
Wake word detected ("Hey Jarvis")
         |
   Whisper (speech-to-text, local)
         |
   Query text
         |
   Extended OpenAI Conversation
         |
   LiteLLM Proxy (routing)
         |
   Response text
         |
   Piper (text-to-speech, local)
         |
   Speaker output
```

---

## Sources

- https://docs.litellm.ai/
- https://github.com/open-webui/open-webui
- https://console.anthropic.com/
- https://docs.x.ai/developers/models
- https://github.com/jekalmin/extended_openai_conversation
- https://github.com/aurelio-labs/semantic-router