RAG (retrieval-augmented generation) combines knowledge retrieval with LLM generation:

```
Documents (research, blog posts, internal docs)
        ↓
Chunking (300-token chunks with overlap)
        ↓
Embedding (sentence-transformers: all-MiniLM-L6-v2, 384-dim)
        ↓
Vector Store (FAISS: fast similarity search + SQLite metadata)
        ↓
Retrieval (top-K similar chunks + metadata filtering)
        ↓
LLM Generation (context-aware answering with temp/top-p control)
        ↓
Cited Response (with source attribution)
```
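The chunking step above can be sketched in a few lines of Python. This is a minimal sketch of fixed-size chunking with overlap, matching the 300-token figure; whitespace splitting stands in for a real tokenizer (an assumption — the service may tokenize with the embedding model's tokenizer instead):

```python
def chunk_tokens(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Whitespace splitting stands in for a real tokenizer here.
    The overlap keeps sentences that straddle a boundary retrievable
    from either side.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 700 tokens with chunk_size=300 and overlap=50 yields 3 chunks.
chunks = chunk_tokens("word " * 700, chunk_size=300, overlap=50)
```

Each chunk shares its first 50 tokens with the tail of the previous chunk, which is what makes the overlap useful for retrieval.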
```bash
# Clone from GitHub
git clone https://github.com/toastmanAu/rag-system.git ~/rag-system

# Set up a virtual environment and install dependencies
cd ~/rag-system
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run on port 9990
python3 app.py

# Test health
curl http://localhost:9990/health
# → {"status": "ok", "service": "rag-system"}
```
```bash
# Ingest a document
curl -X POST http://localhost:9990/rag/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/doc",
    "html": "...",
    "tags": ["research", "fiber", "payments"]
  }'
```
```bash
# Retrieve context only (no LLM call)
curl -X POST http://localhost:9990/rag/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I set up Fiber payments?",
    "agent_id": "kernel",
    "k": 5
  }'
```
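Under the hood, `/rag/retrieve` amounts to a nearest-neighbour search over the stored embeddings. Here is a dependency-free sketch of top-K cosine retrieval — FAISS does the same thing far faster over the 384-dim MiniLM vectors, but the logic is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """Return the k chunk ids most similar to query_vec.

    index: list of (chunk_id, embedding) pairs.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy 2-dim index; real embeddings are 384-dim.
index = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.1], index, k=2))  # → ['a', 'b']
```

Metadata filtering (tags, `agent_id`) would be applied to `index` before or after the similarity pass.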
```bash
# Ask with LLM (retrieve + generate)
curl -X POST http://localhost:9990/rag/ask \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I set up Fiber payments?",
    "agent_id": "kernel",
    "temperature": 0.2
  }'
```
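`/rag/ask` is essentially `/rag/retrieve` followed by prompt assembly and an LLM call. A sketch of the assembly step that produces the cited response — the template wording and `[n]` citation markers are assumptions, not the service's actual template:

```python
def build_prompt(query, chunks):
    """Assemble a context-grounded prompt with numbered sources for citation."""
    context = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    {"source": "docs/fiber.md", "text": "Fiber payments require an API key."},
    {"source": "docs/setup.md", "text": "Keys are created in the dashboard."},
]
prompt = build_prompt("How do I set up Fiber payments?", chunks)
```

Because each chunk carries a numbered source label, the model's `[n]` citations can be mapped back to URLs for the source attribution in the response.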
Configure different agents to interpret the same knowledge differently:
`agents.yaml`:

```yaml
agents:
  kernel:
    agent_name: "Kernel"
    temperature: 0.2
    retrieval_config:
      k_retrieve: 20
      k_rerank: 5
      source_weights:
        docs: 1.0
        research-findings: 0.8
  shannon:
    agent_name: "Shannon"
    temperature: 0.7
    retrieval_config:
      k_retrieve: 20
      k_rerank: 5
      source_weights:
        blog: 1.0
        research-findings: 0.6
```
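One plausible reading of `retrieval_config` is: fetch `k_retrieve` candidates by raw similarity, multiply each score by its source tag's `source_weights` entry, and keep the top `k_rerank`. A sketch under that assumption (the service's actual reranking logic, and the 0.5 default for unlisted tags, are guesses):

```python
def rerank(candidates, source_weights, k_rerank=5, default_weight=0.5):
    """Weight similarity scores by source tag, keep the top k_rerank.

    candidates: list of (similarity, source_tag, chunk_id).
    """
    weighted = [
        (sim * source_weights.get(tag, default_weight), cid)
        for sim, tag, cid in candidates
    ]
    weighted.sort(reverse=True)
    return [cid for _, cid in weighted[:k_rerank]]

# Kernel's weights favour docs over research-findings, so a slightly
# less similar docs chunk can outrank a research-findings chunk.
weights = {"docs": 1.0, "research-findings": 0.8}
candidates = [
    (0.90, "research-findings", "r1"),
    (0.85, "docs", "d1"),
    (0.40, "blog", "b1"),
]
print(rerank(candidates, weights, k_rerank=2))  # → ['d1', 'r1']
```

This is how Kernel and Shannon "interpret the same knowledge differently": identical index, different weighting of the candidate pool.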
Choose based on knowledge base size, query volume, and latency requirements:
| Tier | Hardware | RAM | Storage | Throughput | Cost (USD) | Use Case |
|---|---|---|---|---|---|---|
| Minimal | Raspberry Pi Zero 2 W | 512MB | 64GB SD | ~0.5 req/s | $15 | Hobby, embedded |
| Minimal | Raspberry Pi 4 (2GB) | 2GB | 32GB SD | ~2 req/s | $35 | Personal RAG, edge device |
| Entry | Raspberry Pi 5 (4GB) | 4GB | 128GB SSD | ~3 req/s | $65 | Small team RAG |
| Entry | Orange Pi 5 Plus (16GB) | 16GB | 256GB NVMe | ~5 req/s | $120 | Personal RAG server |
| Entry | Jetson Orin Nano (8GB) | 8GB | 128GB NVMe | ~8 req/s | $199 | Edge AI RAG with GPU |
| Mid | Intel NUC 12 (i5-1240P, 32GB) | 32GB | 512GB SSD | ~15 req/s | $600 | Home/office RAG server |
| Mid | Desktop (Ryzen 5 5600X, 32GB) | 32GB | 1TB SSD | ~20 req/s | $800 | Single workstation |
| Mid | Jetson Orin AGX (64GB) | 64GB | 512GB NVMe | ~25 req/s | $999 | AI research, embedded RAG |
| High | Desktop (RTX 3090, 64GB) | 64GB | 2TB SSD | ~40 req/s | $2,500 | Team RAG with GPU accel |
| High | Desktop (RTX 4070 Super, 32GB) | 32GB | 1TB SSD | ~35 req/s | $2,000 | Balanced GPU RAG |
| High | Desktop (RTX 4080, 48GB) | 48GB | 2TB SSD | ~50 req/s | $3,200 | Heavy-duty RAG + LLM |
| High | Desktop (RTX 4090, 128GB) | 128GB | 4TB SSD | ~80 req/s | $5,000 | Production RAG cluster node |
| High | Desktop (AMD R9 7950X, 192GB) | 192GB | 4TB SSD | ~60 req/s | $4,500 | Enterprise RAG (CPU-heavy) |
| Enterprise | Mac Studio (M2 Ultra, 128GB) | 128GB | 2TB SSD | ~45 req/s | $4,000 | Apple ecosystem RAG |
| Enterprise | Mac Studio (M2 Max, 96GB) | 96GB | 2TB SSD | ~35 req/s | $3,500 | Mac team RAG |
| Enterprise | Server (Dual Xeon, RTX 5090, 768GB) | 768GB | 8TB SSD/NVMe | ~200+ req/s | $15,000+ | Multi-tenant RAG service |
| Enterprise | Server (Dual Xeon, RTX 6000 Ada, 512GB) | 512GB | 8TB SSD/NVMe | ~150 req/s | $12,000 | Professional RAG (multi-GPU) |
| Enterprise | H100 GPU (80GB) + Server | 512GB | 8TB SSD | ~300+ req/s | $40,000+ | High-volume RAG with fine-tuning |
| Enterprise | Cloud (AWS g4dn.12xlarge) | 192GB | 4x 550GB | ~100 req/s | $5/hour | Scalable cloud RAG (on-demand) |
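A quick way to map expected query volume onto the throughput column: divide peak-hour load by a target utilisation. The 3× peak factor and 50% headroom below are assumed planning figures, not measurements:

```python
def required_throughput(queries_per_day, peak_factor=3.0, utilisation=0.5):
    """Estimate the rated req/s a tier must offer.

    peak_factor: peak-hour load relative to the daily average.
    utilisation: fraction of rated throughput to plan for (headroom).
    """
    avg_rps = queries_per_day / 86_400  # seconds per day
    return avg_rps * peak_factor / utilisation

# 100k queries/day needs a tier rated around 7 req/s,
# i.e. Jetson Orin Nano or above in the table.
print(round(required_throughput(100_000), 1))  # → 6.9
```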