Hybrid LLM Bridge - Local + Cloud Routing
Created: 2026-02-09
Purpose: Route queries intelligently between local Ollama, Claude API, and Grok API
Architecture
```
User Query (voice, chat, HA automation)
                 |
          [LiteLLM Proxy]
          localhost:4000
                 |
         Routing Decision
         /       |        \
   [Ollama]  [Claude]    [Grok]
    Local    Anthropic    xAI
    Free     Reasoning    Search
    Private  $3/$15/1M    $3/$15/1M
```
Recommended: LiteLLM Proxy
Unified API gateway that presents a single OpenAI-compatible endpoint. Everything talks to localhost:4000 and LiteLLM routes to the right backend.
Installation
```bash
pip install 'litellm[proxy]'
```
Configuration (config.yaml)
```yaml
model_list:
  # Local models (free, private)
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434
  - model_name: local-reasoning
    litellm_params:
      model: ollama/llama3.1:70b-q4
      api_base: http://localhost:11434

  # Cloud: Claude (complex reasoning)
  - model_name: cloud-reasoning
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: sk-ant-XXXXX   # better: api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: cloud-reasoning-cheap
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: sk-ant-XXXXX

  # Cloud: Grok (internet search)
  - model_name: cloud-search
    litellm_params:
      model: xai/grok-4
      api_key: xai-XXXXX
      api_base: https://api.x.ai/v1

router_settings:
  routing_strategy: simple-shuffle
  allowed_fails: 2
  num_retries: 3
  # Monthly spend caps per model (illustrative; check the LiteLLM budget /
  # spend-tracking docs for the exact keys your version supports)
  budget_policy:
    local-fast: unlimited
    local-reasoning: unlimited
    cloud-reasoning: $50/month
    cloud-reasoning-cheap: $25/month
    cloud-search: $25/month
```
Start the Proxy
```bash
litellm --config config.yaml --port 4000
```
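Before wiring anything else up, a quick sanity check (a sketch using the OpenAI SDK, assuming the proxy is running on the same machine) confirms the proxy is up and serving the model names from config.yaml:
```python
import openai

# Point the stock OpenAI client at the LiteLLM proxy.
client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

# The proxy exposes /v1/models, so the configured model_name entries should appear here.
print([m.id for m in client.models.list().data])
# Expected to include: local-fast, local-reasoning, cloud-reasoning, cloud-reasoning-cheap, cloud-search
```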
Usage
Everything talks to http://localhost:4000 with OpenAI-compatible format:
```python
import openai

client = openai.OpenAI(
    api_key="anything",  # LiteLLM doesn't need this for local
    base_url="http://localhost:4000",
)

# Route to local
response = client.chat.completions.create(
    model="local-fast",
    messages=[{"role": "user", "content": "Turn on the lights"}],
)

# Route to Claude
response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[{"role": "user", "content": "Analyze my energy usage patterns"}],
)

# Route to Grok
response = client.chat.completions.create(
    model="cloud-search",
    messages=[{"role": "user", "content": "What's the current electricity rate in PA?"}],
)
```
Routing Strategy
What Goes Where
Local (Ollama) -- Default for everything private:
- Home automation commands ("turn on lights", "set thermostat to 72")
- Sensor data queries ("what's the temperature in the garage?")
- Camera-related queries (never send video to cloud)
- Personal information queries
- Simple Q&A
- Quick lookups from local knowledge
Claude API -- Complex reasoning tasks:
- Detailed analysis ("analyze my energy trends this month")
- Code generation ("write an HA automation for...")
- Long-form content creation
- Multi-step reasoning problems
- Function calling for HA service control
Grok API -- Internet/real-time data:
- Current events ("latest news on solar tariffs")
- Real-time pricing ("current electricity rates")
- Weather data (if not using local integration)
- Web searches
- Anything requiring information the local model doesn't have
Manual vs Automatic Routing
Phase 1 (Start here): Manual model selection
- User picks "local-fast", "cloud-reasoning", or "cloud-search" in Open WebUI
- Simple, no mistakes, full control
- Good for learning which queries work best where
Phase 2 (Later): Keyword-based routing in front of the LiteLLM call (see the sketch after this list)
- Route based on keywords in the query
- "search", "latest", "current" --> Grok
- "analyze", "explain in detail", "write code" --> Claude
- Everything else --> local
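A minimal sketch of that keyword pre-router, assuming queries arrive as plain strings and the OpenAI client is pointed at the proxy as in the Usage section (the keyword lists are illustrative, not exhaustive):
```python
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

# Illustrative keyword lists; tune these as you learn which queries belong where.
SEARCH_KEYWORDS = ("search", "latest", "current", "news", "price", "rate")
REASONING_KEYWORDS = ("analyze", "explain in detail", "write code", "automation for")

def pick_model(query: str) -> str:
    """Choose a LiteLLM model_name based on simple keyword matching."""
    q = query.lower()
    if any(k in q for k in SEARCH_KEYWORDS):
        return "cloud-search"
    if any(k in q for k in REASONING_KEYWORDS):
        return "cloud-reasoning"
    return "local-fast"  # default: stay local, free, private

query = "What's the current electricity rate in PA?"
response = client.chat.completions.create(
    model=pick_model(query),  # -> cloud-search
    messages=[{"role": "user", "content": query}],
)
```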
Phase 3 (Advanced): Semantic routing (see the sketch after this list)
- Use sentence embeddings to classify query intent
- Small local model (all-MiniLM-L6-v2) classifies in 50-200ms
- Most intelligent routing, but requires Python development
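A sketch of the semantic approach using the sentence-transformers library; the seed queries per route are hypothetical and should be replaced with real examples from your own usage:
```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical seed queries per route; expand with real queries over time.
ROUTES = {
    "local-fast": ["turn on the kitchen lights", "what's the garage temperature"],
    "cloud-reasoning": ["analyze my energy usage this month", "write an automation that runs at sunset"],
    "cloud-search": ["latest news on solar tariffs", "current electricity rates in PA"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, runs locally
route_names = list(ROUTES)
route_vectors = [encoder.encode(examples, convert_to_tensor=True) for examples in ROUTES.values()]

def pick_route(query: str) -> str:
    """Return the route whose seed queries are most similar to the incoming query."""
    q = encoder.encode(query, convert_to_tensor=True)
    scores = [util.cos_sim(q, vecs).max().item() for vecs in route_vectors]
    return route_names[scores.index(max(scores))]

print(pick_route("what are wholesale electricity prices doing today?"))  # likely "cloud-search"
```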
Cloud API Details
Claude (Anthropic)
Endpoint: https://api.anthropic.com/v1/messages
Get API key: https://console.anthropic.com/
Pricing (2025-2026):
| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Claude Haiku 4.5 | $0.50 | $2.50 | Fast, cheap tasks |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Best balance |
| Claude Opus 4.5 | $5.00 | $25.00 | Top quality |
Cost optimization:
- Prompt caching: 90% savings on repeated system prompts (see the sketch after this list)
- Use Haiku for simple tasks, Sonnet for complex ones
- Batch processing available for non-urgent tasks
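A hedged sketch of prompt caching through the proxy: it assumes LiteLLM forwards Anthropic's cache_control content-block marker from an OpenAI-style request (verify against the LiteLLM prompt-caching docs for your version), and the system prompt text is a placeholder:
```python
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

HOUSE_CONTEXT = "You are the Wrightstown home assistant. <several KB of house/entity context>"

response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": HOUSE_CONTEXT,
                    # Marks this block for Anthropic prompt caching; repeated calls
                    # reread it at the cached-input rate instead of full price.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize today's energy usage."},
    ],
)
```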
Features:
- 200k context window
- Extended thinking mode
- Function calling (perfect for HA control; see the sketch after this list)
- Vision support (could analyze charts, screenshots)
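A sketch of function calling through the proxy using the OpenAI tools format, which LiteLLM translates to Anthropic tool use. The light_turn_on tool is hypothetical; your own glue code would execute the returned call against Home Assistant's API:
```python
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

# Hypothetical tool describing a Home Assistant light service call.
tools = [{
    "type": "function",
    "function": {
        "name": "light_turn_on",
        "description": "Turn on a Home Assistant light entity, optionally at a brightness.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity_id": {"type": "string", "description": "e.g. light.kitchen"},
                "brightness_pct": {"type": "integer", "minimum": 1, "maximum": 100},
            },
            "required": ["entity_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[{"role": "user", "content": "Dim the kitchen lights to 30%"}],
    tools=tools,
)

# Assumes the model chose to call the tool; the caller executes it against HA.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # light_turn_on {"entity_id": "light.kitchen", ...}
```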
Grok (xAI)
Endpoint: https://api.x.ai/v1/chat/completions
Get API key: https://console.x.ai/
Format: OpenAI SDK compatible
Pricing:
| Model | Input/1M tokens | Output/1M tokens | Best For |
|---|---|---|---|
| Grok 4.1 Fast | $0.20 | $1.00 | Budget queries |
| Grok 4 | $3.00 | $15.00 | Full capability |
Free credits: $25 new user + $150/month if opting into data sharing program
Features:
- 2 million token context window (industry-leading)
- Real-time X (Twitter) integration
- Internet search capability
- OpenAI SDK compatibility
Monthly Cost Estimates
Conservative Use (80/15/5 Split, 1000 queries/month)
| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (80%) | 800 | Ollama | $0 |
| Claude (15%) | 150 | Haiku 4.5 | ~$0.45 |
| Grok (5%) | 50 | Grok 4.1 Fast | ~$0.07 |
| Total | 1000 | | ~$0.52/month |
Heavy Use (60/25/15 Split, 3000 queries/month)
| Route | Queries | Model | Cost |
|---|---|---|---|
| Local (60%) | 1800 | Ollama | $0 |
| Claude (25%) | 750 | Sonnet 4.5 | ~$15 |
| Grok (15%) | 450 | Grok 4 | ~$9 |
| Total | 3000 | | ~$24/month |
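If you assume roughly 1,000 input and 1,000 output tokens per cloud query (an illustrative assumption, not a measurement), the table figures above reproduce almost exactly; a quick sketch for redoing the arithmetic with your own averages and the pricing above:
```python
def monthly_cost(queries, input_price, output_price, in_tokens=1000, out_tokens=1000):
    """Dollars per month; prices are per 1M tokens."""
    return queries * (in_tokens * input_price + out_tokens * output_price) / 1_000_000

print(monthly_cost(150, 0.50, 2.50))   # Haiku 4.5, conservative split: ~$0.45
print(monthly_cost(50, 0.20, 1.00))    # Grok 4.1 Fast, conservative split: ~$0.06 (table rounds to ~$0.07)
print(monthly_cost(750, 3.00, 15.00))  # Sonnet 4.5, heavy split: ~$13.50 (table rounds to ~$15)
print(monthly_cost(450, 3.00, 15.00))  # Grok 4, heavy split: ~$8.10 (table rounds to ~$9)
```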
Add electricity for LLM server: ~$15-30/month (RTX 4090 build)
Home Assistant Integration
Connect HA to LiteLLM Proxy
Option 1: Extended OpenAI Conversation (Recommended)
Install via HACS, then configure:
- API Base URL: http://<llm-server-ip>:4000/v1
- API Key: any string (LiteLLM doesn't validate it for local routing)
- Model: local-fast (or any model name from your config)
This gives HA natural language control:
- "Turn off all lights downstairs" --> local LLM understands --> calls HA service
- "What's my battery charge level?" --> queries HA entities --> responds
Option 2: Native Ollama Integration
Settings > Integrations > Ollama:
- URL: http://<llm-server-ip>:11434
- Simpler, but bypasses the routing layer
Voice Assistant Pipeline
```
Wake word detected ("Hey Jarvis")
        |
Whisper (speech-to-text, local)
        |
Query text
        |
Extended OpenAI Conversation
        |
LiteLLM Proxy (routing)
        |
Response text
        |
Piper (text-to-speech, local)
        |
Speaker output
```