
🧠 Local LLM Inference Setup

Why Local Inference?

20+ Hardware Tiers with Model Recommendations

Complete breakdown from hobbyist ($15) to enterprise ($40k+):

| Tier | Hardware | RAM | Storage | Speed | Cost | Models |
|------|----------|-----|---------|-------|------|--------|
| Minimal | Raspberry Pi Zero 2W | 512MB | 64GB SD | ~0.5 tok/s | $15 | tinyllama (1B) |
| Minimal | Raspberry Pi 4 (2GB) | 2GB | 32GB SD | ~2 tok/s | $35 | phi3:mini (2B) |
| Entry | Raspberry Pi 5 (4GB) | 4GB | 128GB SSD | ~3 tok/s | $65 | qwen2.5:3b-q4 |
| Entry | Orange Pi 5 Plus (16GB) | 16GB | 256GB NVMe | ~5 tok/s | $120 | qwen2.5:3b-q4, mistral:7b-q4 |
| Entry | Jetson Orin Nano (8GB) | 8GB | 128GB NVMe | ~8 tok/s | $199 | qwen2.5:3b, phi3 |
| Mid | Intel NUC (i5-1240P, 32GB) | 32GB | 512GB SSD | ~15 tok/s | $600 | qwen2.5:7b-q4, mistral:7b |
| Mid | Desktop (Ryzen 5 5600X, 32GB) | 32GB | 1TB SSD | ~20 tok/s | $800 | qwen2.5:7b-q4, llama3.1:8b |
| Mid | Jetson Orin AGX (64GB) | 64GB | 512GB NVMe | ~25 tok/s | $999 | qwen2.5:7b, llama3.1:8b |
| High | Desktop (RTX 3090, 64GB) | 64GB | 2TB SSD | ~40 tok/s | $2,500 | qwen2.5:7b, qwen3.5:27b-q4 |
| High | Desktop (RTX 4070 Super, 32GB) | 32GB | 1TB SSD | ~35 tok/s | $2,000 | qwen2.5:14b-q4, mistral:12b |
| High | Desktop (RTX 4080, 48GB) | 48GB | 2TB SSD | ~50 tok/s | $3,200 | qwen3.5:27b-q4, llama3.1:70b-q4 |
| High | Desktop (RTX 4090, 128GB) | 128GB | 4TB SSD | ~80 tok/s | $5,000 | qwen3.5:27b (fp16), llama3.1:70b |
| High | Desktop (RTX 5090, 256GB) | 256GB | 8TB SSD | ~120+ tok/s | $8,000 | qwen3.5:32b, llama3.1:405b-q4 |
| High | Desktop (AMD R9 7950X, 192GB) | 192GB | 4TB SSD | ~60 tok/s | $4,500 | qwen3.5:27b, llama3.1:70b-q4 |
| Enterprise | Mac Studio (M2 Ultra, 128GB) | 128GB | 2TB SSD | ~45 tok/s | $4,000 | qwen3.5:27b, llama3.1:8b-13b |
| Enterprise | Mac Studio (M2 Max, 96GB) | 96GB | 2TB SSD | ~35 tok/s | $3,500 | qwen2.5:7b, llama3.1:8b |
| Enterprise | Server (Dual Xeon, RTX 5090, 768GB) | 768GB | 8TB SSD/NVMe | ~200+ tok/s | $15,000+ | qwen3.5:32b, llama3.1:405b-q4 |
| Enterprise | Server (Dual Xeon, RTX 6000 Ada, 512GB) | 512GB | 8TB SSD/NVMe | ~150 tok/s | $12,000 | qwen3.5:27b, llama3.1:70b |
| Enterprise | H100 GPU (40GB) + Server | 512GB | 8TB SSD | ~300+ tok/s | $40,000+ | Any model (full precision) |
| Enterprise | Cloud (AWS g4dn.12xlarge) | 192GB | 4x 550GB | ~100 tok/s | $5/hour | Any model (on-demand) |
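The Speed column is generation throughput (output tokens per second), and you can measure it on your own hardware: with `"stream": false`, Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), and their ratio gives tok/s. A sketch with example numbers plugged in (on a live server, read the two fields from the JSON response instead):

```shell
# tok/s = eval_count / eval_duration * 1e9
eval_count=128            # tokens generated (example value)
eval_duration=6400000000  # nanoseconds spent generating (example value)

awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f tok/s\n", c / d * 1e9 }'
```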

Installation & Running Ollama

Install Ollama

# Linux (macOS and Windows users: download the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
ollama serve

# In another terminal:
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "What is CKB?"
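`ollama serve` also exposes a REST API on port 11434, which is what the CLI itself talks to. A minimal sketch of a direct call, assuming the server is running locally and the model is already pulled (the `||` fallback just keeps the example from aborting when no server is up):

```shell
# stream:false returns one JSON object (response text plus timing
# stats) instead of a stream of token chunks.
payload='{"model": "qwen2.5:7b", "prompt": "What is CKB?", "stream": false}'

curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama server not reachable on :11434"
```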

Configure for Always-On Service

# Linux: systemd user service
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network.target

[Service]
ExecStart=/usr/bin/ollama serve
Restart=on-failure
RestartSec=10
# %h is systemd's specifier for the home directory;
# $USER is not expanded inside Environment= lines
Environment="OLLAMA_MODELS=%h/.ollama/models"

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable ollama
systemctl --user start ollama

# User services stop at logout unless lingering is enabled
loginctl enable-linger "$USER"

# Check status
systemctl --user status ollama
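By default the server listens on 127.0.0.1 only. To reach it from other machines on your LAN, a drop-in override can set `OLLAMA_HOST` (Ollama's standard listen-address variable); a sketch:

```shell
# Drop-in override: listen on all interfaces instead of loopback.
mkdir -p ~/.config/systemd/user/ollama.service.d
cat > ~/.config/systemd/user/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
```

Apply it with `systemctl --user daemon-reload && systemctl --user restart ollama`.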

Quantization: Speed vs Quality

Best practice: Start with Q4 variants. Use Q8 if quality is critical. Use Q2 only for classification/simple tasks.
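The memory cost of each quantization level follows directly from bits per weight: the weights alone need roughly params × bits / 8 bytes, and the runtime adds KV cache and overhead on top (often another 1-2 GB). A quick back-of-the-envelope helper:

```shell
# Rough memory for model weights: billions of params * bits / 8 -> GB.
# (KV cache and runtime overhead come on top of this.)
quant_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b / 8 }'
}

quant_gb 7 4    # 7B model at Q4
quant_gb 7 8    # 7B model at Q8
quant_gb 70 4   # 70B model at Q4
```

This is why a 7B Q4 model fits in 8GB machines while the same model at Q8 wants 16GB.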

Cost Analysis (1M tokens/month)
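For local inference, the marginal cost of a token is electricity: time to generate × power draw × price per kWh. A sketch of the arithmetic — every number below is an illustrative assumption (a 200W desktop generating at 20 tok/s, $0.15/kWh), not a measured or quoted price:

```shell
tokens=1000000   # monthly volume
watts=200        # assumed system power draw while generating
toks=20          # assumed generation speed, tok/s
kwh_price=0.15   # assumed electricity price, $/kWh

awk -v t="$tokens" -v w="$watts" -v s="$toks" -v p="$kwh_price" 'BEGIN {
  hours = t / s / 3600        # generation time in hours
  kwh   = hours * w / 1000    # energy consumed
  printf "local electricity: $%.2f per %d tokens\n", kwh * p, t
}'
```

Hardware amortization dominates the real total; plug in your own tier's cost and expected lifetime to compare against hosted API pricing.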

Optimization Tips
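A few server-side knobs worth trying, set in the environment before `ollama serve` (these are Ollama environment variables; defaults and availability vary by version, so check your release's docs):

```shell
# Keep the model resident between requests instead of unloading it
# (avoids a cold-load penalty on the next prompt).
export OLLAMA_KEEP_ALIVE=30m

# Number of requests served concurrently per loaded model.
export OLLAMA_NUM_PARALLEL=2

# Flash attention: faster prompt processing on supported GPUs.
export OLLAMA_FLASH_ATTENTION=1
```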
