Local LLM Server - Build Guide

Created: 2026-02-09
Purpose: Dedicated local LLM inference server for smart home + general use


Component     Spec                       Est. Cost
GPU           NVIDIA RTX 4090 24GB       $1,200-1,500
CPU           AMD Ryzen 7 5800X3D        $250
Motherboard   B550                       $120
RAM           64GB DDR4 (2x32GB)         $120
Storage       1TB NVMe Gen4              $80
PSU           850W Gold                  $100
Case          Mid-tower, good airflow    $70
Total                                    $1,940-2,240

Why This Build

  • 24GB VRAM runs 70B-parameter models at 4-bit quantization (7-9 tok/s, with some layers offloaded to system RAM; see the RAM table below)
  • 30B models at near-8-bit quality (Q4/Q5 fits fully in VRAM, 20+ tok/s)
  • 7-8B models at full speed (30+ tok/s)
  • Handles everything from quick voice commands to serious reasoning
  • Single GPU keeps complexity low

Alternative Builds

Budget (~$580):

  • RTX 3060 12GB (used, $250)
  • Ryzen 5 5600X + B450 + 32GB DDR4
  • Runs 7B models great, 13B quantized
  • Fine for voice commands and basic queries

Flagship (~$4,000-5,400):

  • RTX 5090 32GB ($2,500-3,800)
  • Ryzen 9 7950X + X670E + 128GB DDR5
  • 70B models at high quality (20+ tok/s)
  • 67% faster than 4090
  • Future-proof for larger models

Efficiency (Mac Mini M4 Pro, $1,399+):

  • 24-64GB unified memory
  • 20W power draw vs 450W for 4090
  • Great for always-on server
  • Qwen 2.5 32B at 11-12 tok/s on 64GB config

Software Stack

Primary: Ollama

# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Install (Windows)
# Download from https://ollama.com/download

# Pull models
ollama pull qwen2.5:7b         # Fast voice commands
ollama pull llama3.1:8b        # General chat
ollama pull qwen2.5:32b        # Complex reasoning (Q4 needs ~18GB VRAM)
ollama pull llama3.1:70b       # Near-cloud quality (4-bit by default; spills past 24GB into system RAM)

# API runs on http://localhost:11434

Why Ollama:

  • One-command install
  • Built-in model library
  • OpenAI-compatible API (works with LiteLLM, Open WebUI, HA; quick curl check after this list)
  • Auto-optimized for your hardware
  • Dead simple

Web Interface: Open WebUI

docker run -d -p 3000:8080 \
  --name open-webui \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

Note: inside the container, localhost refers to the container itself, so the Ollama URL points at host.docker.internal (mapped to the host by --add-host).

  • ChatGPT-like interface at http://<server-ip>:3000
  • Model selection dropdown
  • Conversation history
  • RAG (Retrieval Augmented Generation) support
  • User accounts
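
A couple of quick checks after starting it (the container name matches the --name flag above; /api/tags is Ollama's endpoint for listing local models):

# Confirm the UI container came up
docker logs open-webui --tail 20
# Confirm Ollama is answering on the host (lists pulled models)
curl -s http://localhost:11434/api/tags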

Model Recommendations

For Home Assistant Voice Commands (Speed Priority)

Model             VRAM   Speed (4090)   Quality   Use Case
Qwen 2.5 7B       4GB    30+ tok/s      Good      Voice commands, HA control
Phi-3 Mini 3.8B   2GB    40+ tok/s      Decent    Ultra-fast simple queries
Llama 3.1 8B      5GB    30+ tok/s      Good      General chat
Gemma 2 9B        6GB    25+ tok/s      Good      Efficient general use
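
For voice, the first request after an idle period otherwise pays a model-load penalty. One option is to warm the small model and pin it in VRAM for a while; Ollama documents that an /api/generate request with no prompt just loads the model, and keep_alive accepts durations like "30m" or -1 for indefinite:

# Preload the voice model so the first command isn't waiting on a cold load
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "keep_alive": "30m"}'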

For Complex Reasoning (Quality Priority)

Model           VRAM (Q4)   Speed (4090)   Quality     Use Case
Qwen 2.5 32B    18GB        15-20 tok/s    Very Good   Detailed analysis
Llama 3.1 70B   35GB        7-9 tok/s      Excellent   Near-cloud reasoning
DeepSeek R1     Varies      10-15 tok/s    Excellent   Step-by-step reasoning
Mistral 22B     12GB        20+ tok/s      Good        Balanced speed/quality

For Code

Model               Best For
DeepSeek Coder V2   Debugging, refactoring
Qwen3 Coder         Agentic coding tasks
Llama 3.1 70B       General code + reasoning

Quantization Guide

4-bit quantization typically retains 95-98% of full-precision quality while using about 75% less VRAM. Most users can't tell the difference.

Quantization   VRAM Use      Quality   When to Use
Q4_K_M         25% of full   95-98%    Default for most models
Q5_K_M         31% of full   97-99%    When you have spare VRAM
Q8_0           50% of full   99%+      When quality matters most
FP16 (full)    100%          100%      Only for small models (7B)
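
In Ollama, picking a quantization is just a matter of pulling a differently tagged build. The tags below are examples for Llama 3.1 8B; exact tag names vary by model, so check the model's page in the Ollama library:

# Same model, different quantizations (verify tags on the model's library page)
ollama pull llama3.1:8b-instruct-q4_K_M   # the usual default trade-off
ollama pull llama3.1:8b-instruct-q8_0     # higher quality, roughly 2x the VRAM
ollama pull llama3.1:8b-instruct-fp16     # full precision, small models only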

RAM Requirements

Model Size   Minimum System RAM   Recommended System RAM   GPU VRAM (Q4)
7B           16GB                 32GB                     4-5GB
13B          32GB                 32GB                     8-9GB
30B          32GB                 64GB                     16-18GB
70B          64GB                 128GB                    35-40GB
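
Once a model is loaded, it is worth checking what it actually occupies rather than relying on the estimates above:

# What is loaded right now, and how much memory it is using
ollama ps
# GPU-side view of VRAM use
nvidia-smi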

Power and Noise Considerations

This is an always-on server in your home:

GPU                TDP    Idle Power   Annual Electricity (24/7 at idle)
RTX 3060 12GB      170W   ~15W         ~$50-80/yr
RTX 4060 Ti 16GB   165W   ~12W         ~$45-75/yr
RTX 4090 24GB      450W   ~20W         ~$60-100/yr
RTX 5090 32GB      575W   ~25W         ~$75-120/yr

Tips:

  • Configure the GPU to idle at low power when no inference is running (power-limit sketch after this list)
  • Use Ollama's auto-unload (models unload from VRAM after an idle timeout; OLLAMA_KEEP_ALIVE below controls it)
  • Consider noise: a 4090 under load is not quiet. Use a case with good fans and put the server in a utility room/closet
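
A minimal sketch of both ideas, assuming the Linux install script's systemd service and a working nvidia-smi (the 300W figure is illustrative, not a tuned value):

# Cap board power so the card never draws its full 450W
sudo nvidia-smi -pm 1        # persistence mode
sudo nvidia-smi -pl 300      # power limit in watts (resets on reboot)

# Set a server-wide idle timeout for loaded models
sudo systemctl edit ollama
# then add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=10m"   # or -1 to keep models resident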

Server OS Recommendations

Ubuntu Server 24.04 LTS (recommended)

  • Best NVIDIA driver support (install sketch after this list)
  • Docker native
  • Easy Ollama install
  • Headless (no GUI overhead)
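
Driver install on a fresh Ubuntu Server is usually just the ubuntu-drivers helper (a sketch; package selection varies by release, so check what it proposes before rebooting):

# Install the recommended NVIDIA driver, then verify the GPU is visible
sudo apt update
sudo ubuntu-drivers install      # or: sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi                       # should list the RTX 4090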

Windows 11 (if you want dual-use)

  • Ollama has native Windows support
  • Docker Desktop for Open WebUI
  • More overhead than Linux

Proxmox (if you want to run multiple VMs/containers)

  • GPU passthrough to LLM VM (rough prep sketch after this list)
  • Run other services alongside
  • More complex setup
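
Passthrough prep is too board-specific for a recipe, but the broad strokes look like this (a rough, hypothetical sketch run as root; consult the Proxmox PCI passthrough docs for your hardware):

# 1. Enable IOMMU/VT-d in the BIOS; on Intel CPUs also add "intel_iommu=on"
#    to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub.
# 2. Load the VFIO modules at boot:
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" >> /etc/modules
# 3. Reboot, then attach the GPU to the LLM VM in the web UI
#    (VM -> Hardware -> Add -> PCI Device) and install the NVIDIA driver inside the VM.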

Sources