# Hybrid LLM Bridge - Local + Cloud Routing

**Created:** 2026-02-09
**Purpose:** Route queries intelligently between local Ollama, Claude API, and Grok API

---

## Architecture

```
    User Query (voice, chat, HA automation)
                  |
           [LiteLLM Proxy]
            localhost:4000
                  |
           Routing Decision
          /       |        \
   [Ollama]   [Claude]    [Grok]
    Local     Anthropic    xAI
    Free      Reasoning    Search
    Private   $3/$15/1M    $3/$15/1M
```

---

## Recommended: LiteLLM Proxy

A unified API gateway that presents a single OpenAI-compatible endpoint. Every client talks to `localhost:4000`, and LiteLLM routes each request to the right backend.

### Installation

```bash
# Quotes keep shells such as zsh from treating [proxy] as a glob pattern
pip install 'litellm[proxy]'
```

### Configuration (`config.yaml`)

```yaml
model_list:
  # Local models (free, private)
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434

  - model_name: local-reasoning
    litellm_params:
      model: ollama/llama3.1:70b-q4
      api_base: http://localhost:11434

  # Cloud: Claude (complex reasoning)
  # The anthropic/ prefix tells LiteLLM which provider to call.
  - model_name: cloud-reasoning
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: sk-ant-XXXXX

  - model_name: cloud-reasoning-cheap
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: sk-ant-XXXXX

  # Cloud: Grok (internet search)
  # The xai/ prefix tells LiteLLM which provider to call.
  - model_name: cloud-search
    litellm_params:
      model: xai/grok-4
      api_key: xai-XXXXX
      api_base: https://api.x.ai/v1

router_settings:
  routing_strategy: simple-shuffle
  allowed_fails: 2
  num_retries: 3

# Intended monthly spending caps per model alias.
# The key name and value format here are illustrative only; check the
# LiteLLM budget documentation for the settings your version supports.
budget_policy:
  local-fast: unlimited
  local-reasoning: unlimited
  cloud-reasoning: $50/month
  cloud-reasoning-cheap: $25/month
  cloud-search: $25/month
```

### Start the Proxy

```bash
litellm --config config.yaml --port 4000
```

### Usage

Everything talks to `http://localhost:4000` with OpenAI-compatible format:

```python
import openai

client = openai.OpenAI(
    api_key="anything",  # LiteLLM doesn't need this for local
    base_url="http://localhost:4000"
)

# Route to local
response = client.chat.completions.create(
    model="local-fast",
    messages=[{"role": "user", "content": "Turn on the lights"}]
)

# Route to Claude
response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[{"role": "user", "content": "Analyze my energy usage patterns"}]
)

# Route to Grok
response = client.chat.completions.create(
    model="cloud-search",
    messages=[{"role": "user", "content": "What's the current electricity rate in PA?"}]
)
```
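
The three calls above differ only in the model alias, so a small wrapper can cut the repetition and add a client-side fallback for when a cloud backend is unreachable. The following is a minimal sketch, not part of LiteLLM or the OpenAI SDK: `ask` and `ask_with_fallback` are hypothetical helper names, and `client` is assumed to be an `openai.OpenAI` object pointed at the proxy.

```python
from typing import Any, Optional, Sequence, Tuple


def ask(client: Any, model: str, prompt: str) -> str:
    """Send one user message through the proxy and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def ask_with_fallback(
    client: Any, models: Sequence[str], prompt: str
) -> Tuple[str, str]:
    """Try each model alias in order; return (alias_used, reply_text)."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, ask(client, model, prompt)
        except Exception as exc:  # broad on purpose: any backend failure triggers fallback
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

For example, `ask_with_fallback(client, ["cloud-reasoning", "local-reasoning"], "Analyze my energy usage patterns")` would try Claude first and fall back to the local 70B model if the call raises. Falling back toward local models keeps private data on the LAN; falling back from local to cloud would not.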

---

## Routing Strategy

### What Goes Where

**Local (Ollama) -- Default for everything private:**
- Home automation commands ("turn on lights", "set thermostat to 72")
- Sensor data queries ("what's the temperature in the garage?")
- Camera-related queries (never send video to cloud)
- Personal information queries
- Simple Q&A
- Quick lookups from local knowledge

**Claude API -- Complex reasoning tasks:**
- Detailed analysis ("analyze my energy trends this month")
- Code generation ("write an HA automation for...")
- Long-form content creation
- Multi-step reasoning problems
- Function calling for HA service control

**Grok API -- Internet/real-time data:**
- Current events ("latest news on solar tariffs")
- Real-time pricing ("current electricity rates")
- Weather data (if not using local integration)
- Web searches
- Anything requiring information the local model doesn't have

### Manual vs Automatic Routing

**Phase 1 (Start here):** Manual model selection
- User picks "local-fast", "cloud-reasoning", or "cloud-search" in Open WebUI
- Simple, no mistakes, full control
- Good for learning which queries work best where

**Phase 2 (Later):** Keyword-based routing in LiteLLM
- Route based on keywords in the query
- "search", "latest", "current" --> Grok
- "analyze", "explain in detail", "write code" --> Claude
- Everything else --> local
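
The Phase 2 rules can be sketched as a small function that picks a model alias before the request is sent. This is an illustration under assumptions: it runs on the client side rather than inside LiteLLM, `pick_model` and `_mentions` are hypothetical names, and the keyword lists are copied from the bullets above.

```python
import re
from typing import Iterable

# Keyword lists taken from the Phase 2 notes; extend them as you learn what works.
SEARCH_KEYWORDS = ("search", "latest", "current")
REASONING_KEYWORDS = ("analyze", "explain in detail", "write code")


def _mentions(query: str, keywords: Iterable[str]) -> bool:
    """True if any keyword appears as a whole word or phrase, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(kw) + r"\b", query, flags=re.IGNORECASE)
        for kw in keywords
    )


def pick_model(query: str) -> str:
    """Return the LiteLLM model alias for a query using simple keyword rules."""
    if _mentions(query, SEARCH_KEYWORDS):
        return "cloud-search"
    if _mentions(query, REASONING_KEYWORDS):
        return "cloud-reasoning"
    return "local-fast"
```

One design caveat: a word like "current" also shows up in private sensor questions ("what's the current garage temperature?"), which these rules would send to Grok. A privacy check that runs before the keyword rules and forces `local-fast` for sensor, camera, and personal queries would be a sensible addition.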

**Phase 3 (Advanced):** Semantic routing
- Use sentence embeddings to classify query intent
- Small local model (all-MiniLM-L6-v2) classifies in 50-200ms
- Most intelligent routing, but requires Python development
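
The idea behind Phase 3 is nearest-example classification: embed the query, compare it with example utterances for each route by cosine similarity, and pick the closest route. The sketch below shows only that control flow. To stay dependency-free it uses a toy bag-of-words vector as a stand-in for a real sentence-embedding model such as all-MiniLM-L6-v2, so it matches shared words rather than meaning. `ROUTE_EXAMPLES`, `embed`, `cosine`, and `classify` are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical example utterances per route; a real setup would use many more.
ROUTE_EXAMPLES = {
    "local-fast": ["turn on the lights", "set thermostat to 72", "garage temperature"],
    "cloud-reasoning": ["analyze my energy trends", "write an automation script"],
    "cloud-search": ["latest news on solar tariffs", "current electricity rates"],
}


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(query: str, threshold: float = 0.2) -> str:
    """Pick the route whose closest example is most similar; default to local."""
    q = embed(query)
    best_route, best_score = "local-fast", 0.0
    for route, examples in ROUTE_EXAMPLES.items():
        score = max(cosine(q, embed(example)) for example in examples)
        if score > best_score:
            best_route, best_score = route, score
    return best_route if best_score >= threshold else "local-fast"
```

Defaulting to `local-fast` when nothing clears the threshold keeps unrecognized queries private. Swapping `embed` for a real embedding model should leave `classify` unchanged apart from the vector type.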

---

## Cloud API Details

### Claude (Anthropic)

**Endpoint:** `https://api.anthropic.com/v1/messages`
**Get API key:** https://console.anthropic.com/

**Pricing (2025-2026):**

| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | Fast, cheap tasks |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Best balance |
| Claude Opus 4.5 | $5.00 | $25.00 | Top quality |

**Cost optimization:**
- Prompt caching: up to 90% savings on the cached part of repeated system prompts
- Use Haiku for simple tasks, Sonnet for complex ones
- Batch processing available for non-urgent tasks
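
Prompt caching is opt-in per request: the reusable part of the prompt is marked with a `cache_control` block. The sketch below builds a request body for the Messages endpoint listed above, with the system prompt marked as cacheable. `build_cached_request` is a hypothetical helper name and the `max_tokens` value is arbitrary. Prompts shorter than a model-specific minimum are not cached, so a short system prompt would see no savings.

```python
def build_cached_request(
    system_prompt: str,
    user_text: str,
    model: str = "claude-sonnet-4-5-20250929",
) -> dict:
    """Build a Messages API body whose system prompt is marked cacheable."""
    return {
        "model": model,
        "max_tokens": 1024,  # arbitrary cap chosen for this example
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks everything up to this block as a cache breakpoint.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```

A long, stable system prompt (for example, the full list of HA entities) is the natural candidate for the cached block, while the per-query text stays in `messages`.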

**Features:**
- 200k context window
- Extended thinking mode
- Function calling (perfect for HA control)
- Vision support (could analyze charts, screenshots)
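
To make the function-calling bullet concrete, here is a hedged sketch of a tool definition in the OpenAI-compatible format the proxy accepts, plus a translator from the model's tool call to a Home Assistant service call. The tool name `set_light`, its parameters, and `tool_call_to_service` are hypothetical; `light.turn_on` and `light.turn_off` are the standard HA light services.

```python
import json

# Hypothetical tool schema, passed as tools=[LIGHT_TOOL] in a chat completion call.
LIGHT_TOOL = {
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a Home Assistant light on or off.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity_id": {"type": "string", "description": "e.g. light.kitchen"},
                "state": {"type": "string", "enum": ["on", "off"]},
            },
            "required": ["entity_id", "state"],
        },
    },
}


def tool_call_to_service(name: str, arguments_json: str) -> dict:
    """Translate a model tool call into an HA service call description."""
    if name != "set_light":
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    if args["state"] not in ("on", "off"):
        raise ValueError(f"invalid state: {args['state']}")
    return {
        "domain": "light",
        "service": f"turn_{args['state']}",
        "target": {"entity_id": args["entity_id"]},
    }
```

Validating the arguments before acting matters here because the model's output is untrusted input: a malformed or unexpected tool call should fail loudly rather than reach a device.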

### Grok (xAI)

**Endpoint:** `https://api.x.ai/v1/chat/completions`
**Get API key:** https://console.x.ai/
**Format:** OpenAI SDK compatible

**Pricing:**

| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Grok 4.1 Fast | $0.20 | $1.00 | Budget queries |
| Grok 4 | $3.00 | $15.00 | Full capability |

**Free credits:** $25 new user + $150/month if opting into data sharing program

**Features:**
- 2 million token context window (industry-leading)
- Real-time X (Twitter) integration
- Internet search capability
- OpenAI SDK compatibility

---

## Monthly Cost Estimates

### Conservative Use (80/15/5 Split, 1000 queries/month)

| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (80%) | 800 | Ollama | $0 |
| Claude (15%) | 150 | Haiku 4.5 | ~$0.45 |
| Grok (5%) | 50 | Grok 4.1 Fast | ~$0.07 |
| **Total** | **1000** | | **~$0.52/month** |

### Heavy Use (60/25/15 Split, 3000 queries/month)

| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (60%) | 1800 | Ollama | $0 |
| Claude (25%) | 750 | Sonnet 4.5 | ~$15 |
| Grok (15%) | 450 | Grok 4 | ~$9 |
| **Total** | **3000** | | **~$24/month** |

**Add electricity for LLM server:** ~$15-30/month (RTX 4090 build)
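
The tables above do not state how many tokens a query uses, so the arithmetic behind them can only be sketched under an assumption. The example below assumes about 1,000 input and 1,000 output tokens per query (my assumption, not a figure from these notes) and recomputes the heavy-use cloud rows from the $3/$15 per 1M token prices. `monthly_cost` is a hypothetical helper name.

```python
# USD per 1M tokens as (input, output), taken from the pricing tables in this document.
PRICES = {
    "Sonnet 4.5": (3.00, 15.00),
    "Grok 4": (3.00, 15.00),
}


def monthly_cost(queries, price, tokens_in=1000, tokens_out=1000):
    """Estimated monthly spend in USD for a number of queries on one model.

    tokens_in and tokens_out are per-query averages and are assumptions.
    """
    in_price, out_price = price
    return queries * (tokens_in * in_price + tokens_out * out_price) / 1_000_000


claude = monthly_cost(750, PRICES["Sonnet 4.5"])  # 25% of 3000 queries
grok = monthly_cost(450, PRICES["Grok 4"])        # 15% of 3000 queries
print(f"Claude ${claude:.2f} + Grok ${grok:.2f} = ${claude + grok:.2f}")
```

Under that assumption the heavy-use cloud total comes to $21.60, in the same range as the ~$24 in the table. Real costs scale with prompt length, so a long system prompt or conversation history would push the input side up.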

---

## Home Assistant Integration

### Connect HA to LiteLLM Proxy

**Option 1: Extended OpenAI Conversation (Recommended)**

Install via HACS, then configure:
- API Base URL: `http://<llm-server-ip>:4000/v1`
- API Key: (any string, LiteLLM doesn't validate for local)
- Model: `local-fast` (or any model name from your config)

This gives HA natural language control:
- "Turn off all lights downstairs" --> local LLM understands --> calls HA service
- "What's my battery charge level?" --> queries HA entities --> responds
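
Before pointing HA at the proxy, it can help to confirm from another machine that the base URL and model alias respond. The following standard-library sketch builds the same kind of request the integration would send. `build_chat_request` and `send` are hypothetical helper names, and the IP address in the usage note is a placeholder for your own `<llm-server-ip>`.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, api_key: str = "anything"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the proxy's chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

For example, `send(build_chat_request("http://192.168.1.50:4000/v1", "local-fast", "ping"))` should return a short reply if the proxy and Ollama are both up. A connection error points at the network or proxy, while an HTTP error usually points at the model alias or backend.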

**Option 2: Native Ollama Integration**

Settings > Integrations > Ollama:
- URL: `http://<llm-server-ip>:11434`
- Simpler but bypasses the routing layer

### Voice Assistant Pipeline

```
Wake word detected ("Hey Jarvis")
    |
Whisper (speech-to-text, local)
    |
Query text
    |
Extended OpenAI Conversation
    |
LiteLLM Proxy (routing)
    |
Response text
    |
Piper (text-to-speech, local)
    |
Speaker output
```
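
Viewed as code, the pipeline above is three stages composed in order. The sketch below models each stage as a plain callable so the flow can be tested with stand-ins. `run_voice_turn` and its parameter names are hypothetical, and in Home Assistant the Assist pipeline does this wiring for you.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one turn after the wake word: audio in, synthesized reply audio out."""
    query_text = speech_to_text(audio)    # Whisper stage
    response_text = ask_llm(query_text)   # Extended OpenAI Conversation -> LiteLLM
    return text_to_speech(response_text)  # Piper stage
```

Only the middle stage can leave the local network, and only when the proxy routes to a cloud model; speech recognition and synthesis stay local in this design.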

---

## Sources

- https://docs.litellm.ai/
- https://github.com/open-webui/open-webui
- https://console.anthropic.com/
- https://docs.x.ai/developers/models
- https://github.com/jekalmin/extended_openai_conversation
- https://github.com/aurelio-labs/semantic-router