# Local LLM Server - Build Guide

**Created:** 2026-02-09
**Purpose:** Dedicated local LLM inference server for smart home + general use

---

## Recommended Build: RTX 4090 (Sweet Spot)

| Component | Spec | Est. Cost |
|---|---|---|
| GPU | NVIDIA RTX 4090 24GB | $1,200-1,500 |
| CPU | AMD Ryzen 7 5800X3D | $250 |
| Motherboard | B550 | $120 |
| RAM | 64GB DDR4 (2x32GB) | $120 |
| Storage | 1TB NVMe Gen4 | $80 |
| PSU | 850W Gold | $100 |
| Case | Mid-tower, good airflow | $70 |
| **Total** | | **$1,940-2,240** |

### Why This Build

- 24GB VRAM runs 70B parameter models at 4-bit quantization with partial CPU offload (7-9 tok/s)
- 30B models at 8-bit quality (20+ tok/s)
- 7-8B models at full speed (30+ tok/s)
- Handles everything from quick voice commands to serious reasoning
- Single GPU keeps complexity low

### Alternative Builds

**Budget (~$580):**
- RTX 3060 12GB (used, $250)
- Ryzen 5 5600X + B450 + 32GB DDR4
- Runs 7B models great, 13B quantized
- Fine for voice commands and basic queries

**Flagship (~$4,000-5,400):**
- RTX 5090 32GB ($2,500-3,800)
- Ryzen 9 7950X + X670E + 128GB DDR5
- 70B models at high quality (20+ tok/s)
- Roughly 67% faster than the 4090
- Future-proof for larger models

**Efficiency (Mac Mini M4 Pro, $1,399+):**
- 24-64GB unified memory
- ~20W power draw vs 450W for a 4090
- Great for an always-on server
- Qwen 2.5 32B at 11-12 tok/s on the 64GB config

---

## Software Stack

### Primary: Ollama

```bash
# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Install (Windows)
# Download from https://ollama.com/download

# Pull models
ollama pull qwen2.5:7b    # Fast voice commands
ollama pull llama3.1:8b   # General chat
ollama pull qwen2.5:32b   # Complex reasoning (needs 24GB+ VRAM)
ollama pull llama3.1:70b  # Near-cloud quality; default tag is 4-bit quantized (24GB VRAM + CPU offload)

# API runs on http://localhost:11434
```

**Why Ollama:**
- One-command install
- Built-in model library
- OpenAI-compatible API (works with LiteLLM, Open WebUI, HA)
- Auto-optimized for your hardware
- Dead simple

### Web Interface: Open WebUI

```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
```

- ChatGPT-like interface at `http://<server-ip>:3000`
- Model selection dropdown
- Conversation history
- RAG (Retrieval Augmented Generation) support
- User accounts

---

## Model Recommendations

### For Home Assistant Voice Commands (Speed Priority)

| Model | VRAM | Speed (4090) | Quality | Use Case |
|---|---|---|---|---|
| Qwen 2.5 7B | 4GB | 30+ tok/s | Good | Voice commands, HA control |
| Phi-3 Mini 3.8B | 2GB | 40+ tok/s | Decent | Ultra-fast simple queries |
| Llama 3.1 8B | 5GB | 30+ tok/s | Good | General chat |
| Gemma 2 9B | 6GB | 25+ tok/s | Good | Efficient general use |

### For Complex Reasoning (Quality Priority)

| Model | VRAM (Q4) | Speed (4090) | Quality | Use Case |
|---|---|---|---|---|
| Qwen 2.5 32B | 18GB | 15-20 tok/s | Very Good | Detailed analysis |
| Llama 3.1 70B | 35GB | 7-9 tok/s | Excellent | Near-cloud reasoning |
| DeepSeek R1 | Varies | 10-15 tok/s | Excellent | Step-by-step reasoning |
| Mistral 22B | 12GB | 20+ tok/s | Good | Balanced speed/quality |

### For Code

| Model | Best For |
|---|---|
| DeepSeek Coder V2 | Debugging, refactoring |
| Qwen3 Coder | Agentic coding tasks |
| Llama 3.1 70B | General code + reasoning |
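The tok/s figures above are ballpark numbers; you can measure your own hardware through Ollama's local API, which reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds) for each non-streamed response. A minimal sketch, assuming Ollama is running on the default port, `qwen2.5:7b` has already been pulled, and `jq` is installed:

```bash
# Rough tokens-per-second check against the local Ollama API.
# With "stream": false, /api/generate returns eval_count and eval_duration;
# divide them (duration is in nanoseconds) to get tok/s.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "List three uses for a local LLM server, one sentence each.",
  "stream": false
}' | jq '{model, tokens: .eval_count, tok_per_s: (.eval_count / (.eval_duration / 1e9))}'
```

If a model spills out of VRAM (for example 70B at Q4 on a 24GB card), expect the number to drop sharply, which is where the 7-9 tok/s figure above comes from.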
### Quantization Guide

4-bit quantization maintains 95-98% quality at 75% less VRAM. Most users can't tell the difference.

| Quantization | VRAM Use | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | 25% of full | 95-98% | Default for most models |
| Q5_K_M | 31% of full | 97-99% | When you have spare VRAM |
| Q8_0 | 50% of full | 99%+ | When quality matters most |
| FP16 (full) | 100% | 100% | Only for small models (7B) |

---

## RAM Requirements

| Model Size | Minimum System RAM | Recommended | GPU VRAM (Q4) |
|---|---|---|---|
| 7B | 16GB | 32GB | 4-5GB |
| 13B | 32GB | 32GB | 8-9GB |
| 30B | 32GB | 64GB | 16-18GB |
| 70B | 64GB | 128GB | 35-40GB |

---

## Power and Noise Considerations

This is an always-on server in your home:

| GPU | TDP | Idle Power | Annual Electricity (24/7, mostly idle) |
|---|---|---|---|
| RTX 3060 12GB | 170W | ~15W | ~$50-80/yr |
| RTX 4060 Ti 16GB | 165W | ~12W | ~$45-75/yr |
| RTX 4090 24GB | 450W | ~20W | ~$60-100/yr |
| RTX 5090 32GB | 575W | ~25W | ~$75-120/yr |

**Tips:**
- Configure the GPU to idle at low power when no inference is running (see the configuration sketch at the end of this guide)
- Use Ollama's auto-unload: models unload from VRAM after an idle timeout
- Consider noise: a 4090 under load is not quiet. Use a case with good fans and put the server in a utility room or closet

---

## Server OS Recommendations

**Ubuntu Server 24.04 LTS** (recommended)
- Best NVIDIA driver support
- Docker native
- Easy Ollama install
- Headless (no GUI overhead)

**Windows 11** (if you want dual-use)
- Ollama has native Windows support
- Docker Desktop for Open WebUI
- More overhead than Linux

**Proxmox** (if you want to run multiple VMs/containers)
- GPU passthrough to the LLM VM
- Run other services alongside
- More complex setup

---

## Sources

- https://localllm.in/blog/best-gpus-llm-inference-2025
- https://sanj.dev/post/affordable-ai-hardware-local-llms
- https://introl.com/blog/local-llm-hardware-pricing-guide-2025
- https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/
- https://huggingface.co/blog/daya-shankar/open-source-llms
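To make the power tips above concrete, here is a minimal sketch (not a tuned config): `nvidia-smi` can cap the board power limit, and Ollama's `OLLAMA_KEEP_ALIVE` setting controls how long a model stays in VRAM after the last request. The 300W cap and 10m timeout are example values I chose for illustration, not figures from the sources.

```bash
# Enable persistence mode so the driver stays loaded on a headless box and the
# card can settle into its low-power idle states, then cap board power
# (300W is an example for a 4090; pick a value in your card's supported range).
# The cap resets on reboot, so re-apply it from a startup script if you want it permanent.
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 300

# Adjust Ollama's idle unload window via a systemd override (Linux install).
# OLLAMA_KEEP_ALIVE accepts durations like "10m", or "0" to unload immediately.
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=10m"
sudo systemctl restart ollama
```

With a power cap plus auto-unload, the card should spend most of the day near the idle figures in the table above.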