sync: Add Wrightstown Solar and Smart Home projects
New projects from 2026-02-09 research session:

Wrightstown Solar:
- DIY 48V LiFePO4 battery storage (EVE C40 cells)
- Victron MultiPlus II whole-house UPS design
- BMS comparison (Victron CAN bus compatible)
- EV salvage analysis (new cells won)
- Full parts list and budget

Wrightstown Smart Home:
- Home Assistant Yellow setup (local voice, no cloud)
- Local LLM server build guide (Ollama + RTX 4090)
- Hybrid LLM bridge (LiteLLM + Claude API + Grok API)
- Network security (VLAN architecture, PII sanitization)

Machine: ACG-M-L5090
Timestamp: 2026-02-09

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
projects/wrightstown-smarthome/documentation/llm-server-build.md (new file)
@@ -0,0 +1,192 @@
# Local LLM Server - Build Guide

**Created:** 2026-02-09

**Purpose:** Dedicated local LLM inference server for smart home + general use

---

## Recommended Build: RTX 4090 (Sweet Spot)

| Component | Spec | Est. Cost |
|---|---|---|
| GPU | NVIDIA RTX 4090 24GB | $1,200-1,500 |
| CPU | AMD Ryzen 7 5800X3D | $250 |
| Motherboard | B550 | $120 |
| RAM | 64GB DDR4 (2x32GB) | $120 |
| Storage | 1TB NVMe Gen4 | $80 |
| PSU | 850W Gold | $100 |
| Case | Mid-tower, good airflow | $70 |
| **Total** | | **$1,940-2,240** |

### Why This Build

- 24GB VRAM runs 70B-parameter models at 4-bit quantization (7-9 tok/s)
- 30B-class models at 4/5-bit quantization (15-20 tok/s)
- 7-8B models at full speed (30+ tok/s)
- Handles everything from quick voice commands to serious reasoning
- Single GPU keeps complexity low

### Alternative Builds

**Budget (~$580):**
- RTX 3060 12GB (used, $250)
- Ryzen 5 5600X + B450 + 32GB DDR4
- Runs 7B models great, 13B quantized
- Fine for voice commands and basic queries

**Flagship (~$4,000-5,400):**
- RTX 5090 32GB ($2,500-3,800)
- Ryzen 9 7950X + X670E + 128GB DDR5
- 70B models at high quality (20+ tok/s)
- 67% faster than the 4090
- Future-proof for larger models

**Efficiency (Mac Mini M4 Pro, $1,399+):**
- 24-64GB unified memory
- 20W power draw vs 450W for the 4090
- Great for an always-on server
- Qwen 2.5 32B at 11-12 tok/s on the 64GB config

---

## Software Stack

### Primary: Ollama

```bash
# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Install (Windows)
# Download from https://ollama.com/download

# Pull models
ollama pull qwen2.5:7b    # Fast voice commands
ollama pull llama3.1:8b   # General chat
ollama pull qwen2.5:32b   # Complex reasoning (~18GB VRAM at Q4)
ollama pull llama3.1:70b  # Near-cloud quality; 4-bit by default, ~35-40GB, so it
                          # partially offloads to system RAM on a 24GB card

# API runs on http://localhost:11434
```

**Why Ollama:**
- One-command install
- Built-in model library
- OpenAI-compatible API (works with LiteLLM, Open WebUI, HA) - see the example below
- Auto-optimized for your hardware
- Dead simple

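Because everything is served on port 11434 (Ollama's native endpoint plus an OpenAI-compatible one), anything that speaks HTTP can use the server. A quick smoke test from another machine, assuming the qwen2.5:7b model pulled above and substituting your server's IP; Ollama binds to localhost by default, so set `OLLAMA_HOST=0.0.0.0` in its service environment if you call it remotely:

```bash
# Native Ollama endpoint
curl http://<server-ip>:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "In one sentence, what is a VLAN?",
  "stream": false
}'

# OpenAI-compatible endpoint (what LiteLLM, Open WebUI, and HA integrations typically target)
curl http://<server-ip>:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
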
### Web Interface: Open WebUI

```bash
# --add-host gives the container a route back to Ollama running on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

- ChatGPT-like interface at `http://<server-ip>:3000`
- Model selection dropdown
- Conversation history
- RAG (Retrieval Augmented Generation) support
- User accounts

---

## Model Recommendations

### For Home Assistant Voice Commands (Speed Priority)

| Model | VRAM | Speed (4090) | Quality | Use Case |
|---|---|---|---|---|
| Qwen 2.5 7B | 4GB | 30+ tok/s | Good | Voice commands, HA control |
| Phi-3 Mini 3.8B | 2GB | 40+ tok/s | Decent | Ultra-fast simple queries |
| Llama 3.1 8B | 5GB | 30+ tok/s | Good | General chat |
| Gemma 2 9B | 6GB | 25+ tok/s | Good | Efficient general use |

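To check these speed figures against your own hardware, `ollama run --verbose` prints eval timing (tokens/s) after each response; a short, voice-command-style prompt is a reasonable test (assumes qwen2.5:7b is already pulled):

```bash
# Prints prompt/eval stats, including tokens per second, after the reply
ollama run qwen2.5:7b --verbose "Turn off the kitchen lights and set the thermostat to 68."
```
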
### For Complex Reasoning (Quality Priority)

| Model | VRAM (Q4) | Speed (4090) | Quality | Use Case |
|---|---|---|---|---|
| Qwen 2.5 32B | 18GB | 15-20 tok/s | Very Good | Detailed analysis |
| Llama 3.1 70B | 35GB | 7-9 tok/s | Excellent | Near-cloud reasoning |
| DeepSeek R1 | Varies | 10-15 tok/s | Excellent | Step-by-step reasoning |
| Mistral 22B | 12GB | 20+ tok/s | Good | Balanced speed/quality |

### For Code

| Model | Best For |
|---|---|
| DeepSeek Coder V2 | Debugging, refactoring |
| Qwen3 Coder | Agentic coding tasks |
| Llama 3.1 70B | General code + reasoning |

### Quantization Guide

4-bit quantization maintains 95-98% quality at 75% less VRAM. Most users can't tell the difference.

| Quantization | VRAM Use | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | 25% of full | 95-98% | Default for most models |
| Q5_K_M | 31% of full | 97-99% | When you have spare VRAM |
| Q8_0 | 50% of full | 99%+ | When quality matters most |
| FP16 (full) | 100% | 100% | Only for small models (7B) |

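Ollama exposes specific quantizations as model tags, so you can pick the tradeoff explicitly instead of taking the default. The exact tag names below are illustrative; check the model's page at https://ollama.com/library for what's actually published:

```bash
# Default tags are usually a 4-bit quant; pull an explicit tag when you want something else
ollama pull llama3.1:8b-instruct-q8_0      # Q8_0: higher quality, roughly double the VRAM of Q4
ollama pull qwen2.5:32b-instruct-q4_K_M    # Q4_K_M spelled out explicitly
```
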
---

## RAM Requirements

| Model Size | Minimum System RAM | Recommended | GPU VRAM (Q4) |
|---|---|---|---|
| 7B | 16GB | 32GB | 4-5GB |
| 13B | 32GB | 32GB | 8-9GB |
| 30B | 32GB | 64GB | 16-18GB |
| 70B | 64GB | 128GB | 35-40GB |

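Rough rule of thumb behind the VRAM column: weights take about (parameters x bits per weight) / 8 bytes, plus roughly 10-20% for KV cache and activations. For example, 70B x 4 bits is about 35GB, which is where the 35-40GB figure comes from.
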
---

## Power and Noise Considerations

This is an always-on server in your home:

| GPU | TDP | Idle Power | Est. Annual Electricity (24/7, mostly idle) |
|---|---|---|---|
| RTX 3060 12GB | 170W | ~15W | ~$50-80/yr |
| RTX 4060 Ti 16GB | 165W | ~12W | ~$45-75/yr |
| RTX 4090 24GB | 450W | ~20W | ~$60-100/yr |
| RTX 5090 32GB | 575W | ~25W | ~$75-120/yr |

**Tips:**
- Configure the GPU to idle at low power when no inference is running (see the sketch below)
- Use Ollama's auto-unload (models unload from VRAM after an idle timeout)
- Consider noise: a 4090 under load is not quiet. Use a case with good fans and put the server in a utility room or closet

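A minimal sketch of the first two tips, assuming an NVIDIA card with the standard `nvidia-smi` tool; the power-limit value is illustrative, so pick one your card actually supports:

```bash
# Persistence mode keeps the driver loaded so the GPU can sit in its low-power idle state
sudo nvidia-smi -pm 1

# Cap board power (watts); a 4090 reportedly loses little inference speed around 300-350W
sudo nvidia-smi -pl 320

# Ollama unloads idle models after 5 minutes by default; shorten that if power matters more
# than first-token latency (set this in the service environment, e.g. a systemd override)
export OLLAMA_KEEP_ALIVE=2m
```
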
---

## Server OS Recommendations

**Ubuntu Server 24.04 LTS** (recommended)
- Best NVIDIA driver support
- Docker native
- Easy Ollama install (quick-start sketch below)
- Headless (no GUI overhead)

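A hedged quick-start for the Ubuntu path, assuming a fresh 24.04 install and an NVIDIA card; driver and package details vary per system, so treat this as a sketch:

```bash
# NVIDIA driver - ubuntu-drivers picks the recommended version for the detected GPU
sudo ubuntu-drivers install
sudo reboot
nvidia-smi                                 # after reboot, verify the driver sees the card

# Docker (for Open WebUI), via Docker's convenience script
curl -fsSL https://get.docker.com | sh
# Add the NVIDIA Container Toolkit separately if containers need GPU access

# Ollama
curl -fsSL https://ollama.com/install.sh | sh
```
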
**Windows 11** (if you want dual-use)
- Ollama has native Windows support
- Docker Desktop for Open WebUI
- More overhead than Linux

**Proxmox** (if you want to run multiple VMs/containers)
- GPU passthrough to the LLM VM
- Run other services alongside
- More complex setup

---

## Sources

- https://localllm.in/blog/best-gpus-llm-inference-2025
- https://sanj.dev/post/affordable-ai-hardware-local-llms
- https://introl.com/blog/local-llm-hardware-pricing-guide-2025
- https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/
- https://huggingface.co/blog/daya-shankar/open-source-llms