
🧠 Local LLM Inference Setup

Why Local Inference?

20+ Hardware Tiers with Model Recommendations

Complete breakdown from hobbyist ($15) to enterprise ($40k+):

| Tier | Hardware | RAM | Storage | Speed | Cost | Models |
|------|----------|-----|---------|-------|------|--------|
| Minimal | Raspberry Pi Zero 2W | 512MB | 64GB SD | ~0.5 tok/s | $15 | tinyllama (1B) |
| Minimal | Raspberry Pi 4 (2GB) | 2GB | 32GB SD | ~2 tok/s | $35 | phi3:mini (2B) |
| Entry | Raspberry Pi 5 (4GB) | 4GB | 128GB SSD | ~3 tok/s | $65 | qwen2.5:3b-q4 |
| Entry | Orange Pi 5 Plus (16GB) | 16GB | 256GB NVMe | ~5 tok/s | $120 | qwen2.5:3b-q4, mistral:7b-q4 |
| Entry | Jetson Orin Nano (8GB) | 8GB | 128GB NVMe | ~8 tok/s | $199 | qwen2.5:3b, phi3 |
| Mid | Intel NUC (i5-1240P, 32GB) | 32GB | 512GB SSD | ~15 tok/s | $600 | qwen2.5:7b-q4, mistral:7b |
| Mid | Desktop (Ryzen 5 5600X, 32GB) | 32GB | 1TB SSD | ~20 tok/s | $800 | qwen2.5:7b-q4, llama3.1:8b |
| Mid | Jetson Orin AGX (64GB) | 64GB | 512GB NVMe | ~25 tok/s | $999 | qwen2.5:7b, llama3.1:8b |
| High | Desktop (RTX 3090, 64GB) | 64GB | 2TB SSD | ~40 tok/s | $2,500 | qwen2.5:7b, qwen3.5:27b-q4 |
| High | Desktop (RTX 4070 Super, 32GB) | 32GB | 1TB SSD | ~35 tok/s | $2,000 | qwen2.5:14b-q4, mistral:12b |
| High | Desktop (RTX 4080, 48GB) | 48GB | 2TB SSD | ~50 tok/s | $3,200 | qwen3.5:27b-q4, llama3.1:70b-q4 |
| High | Desktop (RTX 4090, 128GB) | 128GB | 4TB SSD | ~80 tok/s | $5,000 | qwen3.5:27b (fp16), llama3.1:70b |
| High | Desktop (RTX 5090, 256GB) | 256GB | 8TB SSD | ~120+ tok/s | $8,000 | qwen3.5:32b, llama3.1:405b-q4 |
| High | Desktop (AMD R9 7950X, 192GB) | 192GB | 4TB SSD | ~60 tok/s | $4,500 | qwen3.5:27b, llama3.1:70b-q4 |
| Enterprise | Mac Studio (M2 Ultra, 128GB) | 128GB | 2TB SSD | ~45 tok/s | $4,000 | qwen3.5:27b, llama3.1:8b-13b |
| Enterprise | Mac Studio (M2 Max, 96GB) | 96GB | 2TB SSD | ~35 tok/s | $3,500 | qwen2.5:7b, llama3.1:8b |
| Enterprise | Server (Dual Xeon, RTX 5090, 768GB) | 768GB | 8TB SSD/NVMe | ~200+ tok/s | $15,000+ | qwen3.5:32b, llama3.1:405b-q4 |
| Enterprise | Server (Dual Xeon, RTX 6000 Ada, 512GB) | 512GB | 8TB SSD/NVMe | ~150 tok/s | $12,000 | qwen3.5:27b, llama3.1:70b |
| Enterprise | H100 GPU (40GB) + Server | 512GB | 8TB SSD | ~300+ tok/s | $40,000+ | Any model (full precision) |
| Enterprise | Cloud (AWS g4dn.12xlarge) | 192GB | 4x 550GB | ~100 tok/s | $5/hour | Any model (on-demand) |
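The Speed column is generation throughput (output tokens per second), and you can measure it on your own hardware: with `"stream": false`, Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), and their ratio gives tok/s. A sketch with example numbers plugged in (on a live server, read the two fields from the JSON response instead):

```shell
# tok/s = eval_count / eval_duration * 1e9
eval_count=128            # tokens generated (example value)
eval_duration=6400000000  # nanoseconds spent generating (example value)

awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f tok/s\n", c / d * 1e9 }'
```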

Installation & Running Ollama

Install Ollama

# Linux (macOS and Windows users: download the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
ollama serve

# In another terminal:
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "What is CKB?"
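`ollama serve` also exposes a REST API on port 11434, which is what the CLI itself talks to. A minimal sketch of a direct call, assuming the server is running locally and the model is already pulled (the `||` fallback just keeps the example from aborting when no server is up):

```shell
# stream:false returns one JSON object (response text plus timing
# stats) instead of a stream of token chunks.
payload='{"model": "qwen2.5:7b", "prompt": "What is CKB?", "stream": false}'

curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama server not reachable on :11434"
```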

Configure for Always-On Service

# Linux: systemd user service
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network.target

[Service]
ExecStart=/usr/bin/ollama serve
Restart=on-failure
RestartSec=10
# %h is systemd's specifier for the home directory;
# $USER is not expanded inside Environment= lines
Environment="OLLAMA_MODELS=%h/.ollama/models"

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable ollama
systemctl --user start ollama

# User services stop at logout unless lingering is enabled
loginctl enable-linger "$USER"

# Check status
systemctl --user status ollama
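By default the server listens on 127.0.0.1 only. To reach it from other machines on your LAN, a drop-in override can set `OLLAMA_HOST` (Ollama's standard listen-address variable); a sketch:

```shell
# Drop-in override: listen on all interfaces instead of loopback.
mkdir -p ~/.config/systemd/user/ollama.service.d
cat > ~/.config/systemd/user/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
```

Apply it with `systemctl --user daemon-reload && systemctl --user restart ollama`.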

Quantization: Speed vs Quality

Best practice: Start with Q4 variants. Use Q8 if quality is critical. Use Q2 only for classification/simple tasks.
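The memory cost of each quantization level follows directly from bits per weight: the weights alone need roughly params × bits / 8 bytes, and the runtime adds KV cache and overhead on top (often another 1-2 GB). A quick back-of-the-envelope helper:

```shell
# Rough memory for model weights: billions of params * bits / 8 -> GB.
# (KV cache and runtime overhead come on top of this.)
quant_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b / 8 }'
}

quant_gb 7 4    # 7B model at Q4
quant_gb 7 8    # 7B model at Q8
quant_gb 70 4   # 70B model at Q4
```

This is why a 7B Q4 model fits in 8GB machines while the same model at Q8 wants 16GB.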

Cost Analysis (1M tokens/month)
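For local inference, the marginal cost of a token is electricity: time to generate × power draw × price per kWh. A sketch of the arithmetic — every number below is an illustrative assumption (a 200W desktop generating at 20 tok/s, $0.15/kWh), not a measured or quoted price:

```shell
tokens=1000000   # monthly volume
watts=200        # assumed system power draw while generating
toks=20          # assumed generation speed, tok/s
kwh_price=0.15   # assumed electricity price, $/kWh

awk -v t="$tokens" -v w="$watts" -v s="$toks" -v p="$kwh_price" 'BEGIN {
  hours = t / s / 3600        # generation time in hours
  kwh   = hours * w / 1000    # energy consumed
  printf "local electricity: $%.2f per %d tokens\n", kwh * p, t
}'
```

Hardware amortization dominates the real total; plug in your own tier's cost and expected lifetime to compare against hosted API pricing.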

Optimization Tips
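A few server-side knobs worth trying, set in the environment before `ollama serve` (these are Ollama environment variables; defaults and availability vary by version, so check your release's docs):

```shell
# Keep the model resident between requests instead of unloading it
# (avoids a cold-load penalty on the next prompt).
export OLLAMA_KEEP_ALIVE=30m

# Number of requests served concurrently per loaded model.
export OLLAMA_NUM_PARALLEL=2

# Flash attention: faster prompt processing on supported GPUs.
export OLLAMA_FLASH_ATTENTION=1
```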
