# Hybrid LLM Bridge - Local + Cloud Routing

**Created:** 2026-02-09
**Purpose:** Route queries intelligently between local Ollama, Claude API, and Grok API

---

## Architecture

```
User Query (voice, chat, HA automation)
                 |
     [LiteLLM Proxy] localhost:4000
                 |
          Routing Decision
         /       |        \
   [Ollama]  [Claude]    [Grok]
    Local    Anthropic    xAI
    Free     Reasoning    Search
    Private  $3/$15/1M    $3/$15/1M
```

---

## Recommended: LiteLLM Proxy

Unified API gateway that presents a single OpenAI-compatible endpoint. Everything talks to `localhost:4000` and LiteLLM routes to the right backend.

### Installation

```bash
pip install litellm[proxy]
```

### Configuration (`config.yaml`)

```yaml
model_list:
  # Local models (free, private)
  - model_name: local-fast
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434

  - model_name: local-reasoning
    litellm_params:
      model: ollama/llama3.1:70b-q4
      api_base: http://localhost:11434

  # Cloud: Claude (complex reasoning)
  - model_name: cloud-reasoning
    litellm_params:
      model: claude-sonnet-4-5-20250929
      api_key: sk-ant-XXXXX

  - model_name: cloud-reasoning-cheap
    litellm_params:
      model: claude-haiku-4-5-20251001
      api_key: sk-ant-XXXXX

  # Cloud: Grok (internet search)
  - model_name: cloud-search
    litellm_params:
      model: grok-4
      api_key: xai-XXXXX
      api_base: https://api.x.ai/v1

router_settings:
  routing_strategy: simple-shuffle
  allowed_fails: 2
  num_retries: 3

budget_policy:
  local-fast: unlimited
  local-reasoning: unlimited
  cloud-reasoning: $50/month
  cloud-reasoning-cheap: $25/month
  cloud-search: $25/month
```

### Start the Proxy

```bash
litellm --config config.yaml --port 4000
```

### Usage

Everything talks to `http://localhost:4000` with OpenAI-compatible format:

```python
import openai

client = openai.OpenAI(
    api_key="anything",  # LiteLLM doesn't need this for local
    base_url="http://localhost:4000"
)

# Route to local
response = client.chat.completions.create(
    model="local-fast",
    messages=[{"role": "user", "content": "Turn on the lights"}]
)

# Route to Claude
response = client.chat.completions.create(
    model="cloud-reasoning",
    messages=[{"role": "user", "content": "Analyze my energy usage patterns"}]
)

# Route to Grok
response = client.chat.completions.create(
    model="cloud-search",
    messages=[{"role": "user", "content": "What's the current electricity rate in PA?"}]
)
```

---

## Routing Strategy

### What Goes Where

**Local (Ollama) -- Default for everything private:**
- Home automation commands ("turn on lights", "set thermostat to 72")
- Sensor data queries ("what's the temperature in the garage?")
- Camera-related queries (never send video to cloud)
- Personal information queries
- Simple Q&A
- Quick lookups from local knowledge

**Claude API -- Complex reasoning tasks:**
- Detailed analysis ("analyze my energy trends this month")
- Code generation ("write an HA automation for...")
- Long-form content creation
- Multi-step reasoning problems
- Function calling for HA service control

**Grok API -- Internet/real-time data:**
- Current events ("latest news on solar tariffs")
- Real-time pricing ("current electricity rates")
- Weather data (if not using local integration)
- Web searches
- Anything requiring information the local model doesn't have

### Manual vs Automatic Routing

**Phase 1 (Start here):** Manual model selection
- User picks "local-fast", "cloud-reasoning", or "cloud-search" in Open WebUI
- Simple, no mistakes, full control
- Good for learning which queries work best where

**Phase 2 (Later):** Keyword-based routing in LiteLLM
- Route based on keywords in the query
- "search", "latest", "current" --> Grok
- "analyze", "explain in detail", "write code" --> Claude
- Everything else --> local
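For Phase 2, the routing logic can live either client-side or inside the proxy; a minimal client-side sketch of the idea is below, assuming the model aliases from the config above. The keyword lists are illustrative, not exhaustive.

```python
import openai

# Client-side keyword router: pick a model alias from the LiteLLM config
# based on simple substring matches, then send the query to the proxy.
SEARCH_KEYWORDS = ("search", "latest", "current", "news", "rate", "price")
REASONING_KEYWORDS = ("analyze", "explain in detail", "write code", "automation for")

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

def pick_model(query: str) -> str:
    q = query.lower()
    if any(k in q for k in SEARCH_KEYWORDS):
        return "cloud-search"       # Grok: internet / real-time data
    if any(k in q for k in REASONING_KEYWORDS):
        return "cloud-reasoning"    # Claude: complex reasoning
    return "local-fast"             # default: stay local and private

def ask(query: str) -> str:
    response = client.chat.completions.create(
        model=pick_model(query),
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

print(ask("What's the current electricity rate in PA?"))  # routed to cloud-search
```

Keyword matching is brittle (e.g. "what's the current temperature in the garage?" would leak a private query to Grok), which is the motivation for the semantic routing in Phase 3.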
"current" --> Grok - "analyze", "explain in detail", "write code" --> Claude - Everything else --> local **Phase 3 (Advanced):** Semantic routing - Use sentence embeddings to classify query intent - Small local model (all-MiniLM-L6-v2) classifies in 50-200ms - Most intelligent routing, but requires Python development --- ## Cloud API Details ### Claude (Anthropic) **Endpoint:** `https://api.anthropic.com/v1/messages` **Get API key:** https://console.anthropic.com/ **Pricing (2025-2026):** | Model | Input/1M tokens | Output/1M tokens | Best For | |---|---|---|---| | Claude Haiku 4.5 | $0.50 | $2.50 | Fast, cheap tasks | | Claude Sonnet 4.5 | $3.00 | $15.00 | Best balance | | Claude Opus 4.5 | $5.00 | $25.00 | Top quality | **Cost optimization:** - Prompt caching: 90% savings on repeated system prompts - Use Haiku for simple tasks, Sonnet for complex ones - Batch processing available for non-urgent tasks **Features:** - 200k context window - Extended thinking mode - Function calling (perfect for HA control) - Vision support (could analyze charts, screenshots) ### Grok (xAI) **Endpoint:** `https://api.x.ai/v1/chat/completions` **Get API key:** https://console.x.ai/ **Format:** OpenAI SDK compatible **Pricing:** | Model | Input/1M tokens | Output/1M tokens | Best For | |---|---|---|---| | Grok 4.1 Fast | $0.20 | $1.00 | Budget queries | | Grok 4 | $3.00 | $15.00 | Full capability | **Free credits:** $25 new user + $150/month if opting into data sharing program **Features:** - 2 million token context window (industry-leading) - Real-time X (Twitter) integration - Internet search capability - OpenAI SDK compatibility --- ## Monthly Cost Estimates ### Conservative Use (80/15/5 Split, 1000 queries/month) | Route | Queries | Model | Cost | |---|---|---|---| | Local (80%) | 800 | Ollama | $0 | | Claude (15%) | 150 | Haiku 4.5 | ~$0.45 | | Grok (5%) | 50 | Grok 4.1 Fast | ~$0.07 | | **Total** | **1000** | | **~$0.52/month** | ### Heavy Use (60/25/15 Split, 3000 queries/month) | Route | Queries | Model | Cost | |---|---|---|---| | Local (60%) | 1800 | Ollama | $0 | | Claude (25%) | 750 | Sonnet 4.5 | ~$15 | | Grok (15%) | 450 | Grok 4 | ~$9 | | **Total** | **3000** | | **~$24/month** | **Add electricity for LLM server:** ~$15-30/month (RTX 4090 build) --- ## Home Assistant Integration ### Connect HA to LiteLLM Proxy **Option 1: Extended OpenAI Conversation (Recommended)** Install via HACS, then configure: - API Base URL: `http://:4000/v1` - API Key: (any string, LiteLLM doesn't validate for local) - Model: `local-fast` (or any model name from your config) This gives HA natural language control: - "Turn off all lights downstairs" --> local LLM understands --> calls HA service - "What's my battery charge level?" --> queries HA entities --> responds **Option 2: Native Ollama Integration** Settings > Integrations > Ollama: - URL: `http://:11434` - Simpler but bypasses the routing layer ### Voice Assistant Pipeline ``` Wake word detected ("Hey Jarvis") | Whisper (speech-to-text, local) | Query text | Extended OpenAI Conversation | LiteLLM Proxy (routing) | Response text | Piper (text-to-speech, local) | Speaker output ``` --- ## Sources - https://docs.litellm.ai/ - https://github.com/open-webui/open-webui - https://console.anthropic.com/ - https://docs.x.ai/developers/models - https://github.com/jekalmin/extended_openai_conversation - https://github.com/aurelio-labs/semantic-router