Local LLM Server - Build Guide

Created: 2026-02-09
Purpose: Dedicated local LLM inference server for smart home + general use


Component     Spec                       Est. Cost
GPU           NVIDIA RTX 4090 24GB       $1,200-1,500
CPU           AMD Ryzen 7 5800X3D        $250
Motherboard   B550                       $120
RAM           64GB DDR4 (2x32GB)         $120
Storage       1TB NVMe Gen4              $80
PSU           850W Gold                  $100
Case          Mid-tower, good airflow    $70
Total                                    $1,940-2,240

Why This Build

  • 24GB VRAM runs 70B-parameter models at 4-bit quantization (7-9 tok/s, with some layers offloaded to system RAM; see the RAM table below)
  • 30B models at near-8-bit quality (Q4/Q5 fits fully in VRAM, 20+ tok/s)
  • 7-8B models at full speed (30+ tok/s)
  • Handles everything from quick voice commands to serious reasoning
  • Single GPU keeps complexity low

Alternative Builds

Budget (~$580):

  • RTX 3060 12GB (used, $250)
  • Ryzen 5 5600X + B450 + 32GB DDR4
  • Runs 7B models great, 13B quantized
  • Fine for voice commands and basic queries

Flagship (~$4,000-5,400):

  • RTX 5090 32GB ($2,500-3,800)
  • Ryzen 9 7950X + X670E + 128GB DDR5
  • 70B models at high quality (20+ tok/s)
  • 67% faster than 4090
  • Future-proof for larger models

Efficiency (Mac Mini M4 Pro, $1,399+):

  • 24-64GB unified memory
  • 20W power draw vs 450W for 4090
  • Great for always-on server
  • Qwen 2.5 32B at 11-12 tok/s on 64GB config

Software Stack

Primary: Ollama

# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Install (Windows)
# Download from https://ollama.com/download

# Pull models
ollama pull qwen2.5:7b         # Fast voice commands
ollama pull llama3.1:8b        # General chat
ollama pull qwen2.5:32b        # Complex reasoning (Q4 needs ~18GB VRAM)
ollama pull llama3.1:70b       # Near-cloud quality (4-bit by default; spills past 24GB into system RAM)

# API runs on http://localhost:11434

Why Ollama:

  • One-command install
  • Built-in model library
  • OpenAI-compatible API (works with LiteLLM, Open WebUI, HA; quick curl check after this list)
  • Auto-optimized for your hardware
  • Dead simple

Web Interface: Open WebUI

docker run -d -p 3000:8080 \
  --name open-webui \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

Note: inside the container, localhost refers to the container itself, so the Ollama URL points at host.docker.internal (mapped to the host by --add-host).

  • ChatGPT-like interface at http://<server-ip>:3000
  • Model selection dropdown
  • Conversation history
  • RAG (Retrieval Augmented Generation) support
  • User accounts
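
A couple of quick checks after starting it (the container name matches the --name flag above; /api/tags is Ollama's endpoint for listing local models):

# Confirm the UI container came up
docker logs open-webui --tail 20
# Confirm Ollama is answering on the host (lists pulled models)
curl -s http://localhost:11434/api/tags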

Model Recommendations

For Home Assistant Voice Commands (Speed Priority)

Model             VRAM   Speed (4090)   Quality   Use Case
Qwen 2.5 7B       4GB    30+ tok/s      Good      Voice commands, HA control
Phi-3 Mini 3.8B   2GB    40+ tok/s      Decent    Ultra-fast simple queries
Llama 3.1 8B      5GB    30+ tok/s      Good      General chat
Gemma 2 9B        6GB    25+ tok/s      Good      Efficient general use
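
For voice, the first request after an idle period otherwise pays a model-load penalty. One option is to warm the small model and pin it in VRAM for a while; Ollama documents that an /api/generate request with no prompt just loads the model, and keep_alive accepts durations like "30m" or -1 for indefinite:

# Preload the voice model so the first command isn't waiting on a cold load
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "keep_alive": "30m"}'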

For Complex Reasoning (Quality Priority)

Model           VRAM (Q4)   Speed (4090)   Quality     Use Case
Qwen 2.5 32B    18GB        15-20 tok/s    Very Good   Detailed analysis
Llama 3.1 70B   35GB        7-9 tok/s      Excellent   Near-cloud reasoning
DeepSeek R1     Varies      10-15 tok/s    Excellent   Step-by-step reasoning
Mistral 22B     12GB        20+ tok/s      Good        Balanced speed/quality

For Code

Model               Best For
DeepSeek Coder V2   Debugging, refactoring
Qwen3 Coder         Agentic coding tasks
Llama 3.1 70B       General code + reasoning

Quantization Guide

4-bit quantization typically retains 95-98% of full-precision quality while using about 75% less VRAM. Most users can't tell the difference.

Quantization   VRAM Use      Quality   When to Use
Q4_K_M         25% of full   95-98%    Default for most models
Q5_K_M         31% of full   97-99%    When you have spare VRAM
Q8_0           50% of full   99%+      When quality matters most
FP16 (full)    100%          100%      Only for small models (7B)
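
In Ollama, picking a quantization is just a matter of pulling a differently tagged build. The tags below are examples for Llama 3.1 8B; exact tag names vary by model, so check the model's page in the Ollama library:

# Same model, different quantizations (verify tags on the model's library page)
ollama pull llama3.1:8b-instruct-q4_K_M   # the usual default trade-off
ollama pull llama3.1:8b-instruct-q8_0     # higher quality, roughly 2x the VRAM
ollama pull llama3.1:8b-instruct-fp16     # full precision, small models only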

RAM Requirements

Model Size   Minimum System RAM   Recommended System RAM   GPU VRAM (Q4)
7B           16GB                 32GB                     4-5GB
13B          32GB                 32GB                     8-9GB
30B          32GB                 64GB                     16-18GB
70B          64GB                 128GB                    35-40GB
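
Once a model is loaded, it is worth checking what it actually occupies rather than relying on the estimates above:

# What is loaded right now, and how much memory it is using
ollama ps
# GPU-side view of VRAM use
nvidia-smi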

Power and Noise Considerations

This is an always-on server in your home:

GPU                TDP    Idle Power   Annual Electricity (24/7 at idle)
RTX 3060 12GB      170W   ~15W         ~$50-80/yr
RTX 4060 Ti 16GB   165W   ~12W         ~$45-75/yr
RTX 4090 24GB      450W   ~20W         ~$60-100/yr
RTX 5090 32GB      575W   ~25W         ~$75-120/yr

Tips:

  • Configure the GPU to idle at low power when no inference is running (power-limit sketch after this list)
  • Use Ollama's auto-unload (models unload from VRAM after an idle timeout; OLLAMA_KEEP_ALIVE below controls it)
  • Consider noise: a 4090 under load is not quiet. Use a case with good fans and put the server in a utility room/closet
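
A minimal sketch of both ideas, assuming the Linux install script's systemd service and a working nvidia-smi (the 300W figure is illustrative, not a tuned value):

# Cap board power so the card never draws its full 450W
sudo nvidia-smi -pm 1        # persistence mode
sudo nvidia-smi -pl 300      # power limit in watts (resets on reboot)

# Set a server-wide idle timeout for loaded models
sudo systemctl edit ollama
# then add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=10m"   # or -1 to keep models resident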

Server OS Recommendations

Ubuntu Server 24.04 LTS (recommended)

  • Best NVIDIA driver support (install sketch after this list)
  • Docker native
  • Easy Ollama install
  • Headless (no GUI overhead)
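
Driver install on a fresh Ubuntu Server is usually just the ubuntu-drivers helper (a sketch; package selection varies by release, so check what it proposes before rebooting):

# Install the recommended NVIDIA driver, then verify the GPU is visible
sudo apt update
sudo ubuntu-drivers install      # or: sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi                       # should list the RTX 4090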

Windows 11 (if you want dual-use)

  • Ollama has native Windows support
  • Docker Desktop for Open WebUI
  • More overhead than Linux

Proxmox (if you want to run multiple VMs/containers)

  • GPU passthrough to LLM VM (rough prep sketch after this list)
  • Run other services alongside
  • More complex setup
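
Passthrough prep is too board-specific for a recipe, but the broad strokes look like this (a rough, hypothetical sketch run as root; consult the Proxmox PCI passthrough docs for your hardware):

# 1. Enable IOMMU/VT-d in the BIOS; on Intel CPUs also add "intel_iommu=on"
#    to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub.
# 2. Load the VFIO modules at boot:
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" >> /etc/modules
# 3. Reboot, then attach the GPU to the LLM VM in the web UI
#    (VM -> Hardware -> Add -> PCI Device) and install the NVIDIA driver inside the VM.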

Sources