RAG (retrieval-augmented generation) combines knowledge retrieval with LLM generation:

```
Documents (research, blog posts, internal docs)
        ↓
Chunking (300-token chunks with overlap)
        ↓
Embedding (sentence-transformers: all-MiniLM-L6-v2, 384-dim)
        ↓
Vector Store (FAISS: fast similarity search + SQLite metadata)
        ↓
Retrieval (top-K similar chunks + metadata filtering)
        ↓
LLM Generation (context-aware answering with temp/top-p control)
        ↓
Cited Response (with source attribution)
```
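The chunking step above can be sketched in a few lines of Python. This is a minimal sketch of fixed-size chunking with overlap, matching the 300-token figure; whitespace splitting stands in for a real tokenizer (an assumption — the service may tokenize with the embedding model's tokenizer instead):

```python
def chunk_tokens(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Whitespace splitting stands in for a real tokenizer here.
    The overlap keeps sentences that straddle a boundary retrievable
    from either side.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 700 tokens with chunk_size=300 and overlap=50 yields 3 chunks.
chunks = chunk_tokens("word " * 700, chunk_size=300, overlap=50)
```

Each chunk shares its first 50 tokens with the tail of the previous chunk, which is what makes the overlap useful for retrieval.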
```bash
# Clone from GitHub
git clone https://github.com/toastmanAu/rag-system.git ~/rag-system

# Set up a virtual environment and install dependencies
cd ~/rag-system
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run on port 9990
python3 app.py

# Test health
curl http://localhost:9990/health
# → {"status": "ok", "service": "rag-system"}
```
```bash
# Ingest a document
curl -X POST http://localhost:9990/rag/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/doc",
    "html": "...",
    "tags": ["research", "fiber", "payments"]
  }'
```
```bash
# Retrieve context only (no LLM call)
curl -X POST http://localhost:9990/rag/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I set up Fiber payments?",
    "agent_id": "kernel",
    "k": 5
  }'
```
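Under the hood, `/rag/retrieve` amounts to a nearest-neighbour search over the stored embeddings. Here is a dependency-free sketch of top-K cosine retrieval — FAISS does the same thing far faster over the 384-dim MiniLM vectors, but the logic is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """Return the k chunk ids most similar to query_vec.

    index: list of (chunk_id, embedding) pairs.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy 2-dim index; real embeddings are 384-dim.
index = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.1], index, k=2))  # → ['a', 'b']
```

Metadata filtering (tags, `agent_id`) would be applied to `index` before or after the similarity pass.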
```bash
# Ask with LLM (retrieve + generate)
curl -X POST http://localhost:9990/rag/ask \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I set up Fiber payments?",
    "agent_id": "kernel",
    "temperature": 0.2
  }'
```
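`/rag/ask` is essentially `/rag/retrieve` followed by prompt assembly and an LLM call. A sketch of the assembly step that produces the cited response — the template wording and `[n]` citation markers are assumptions, not the service's actual template:

```python
def build_prompt(query, chunks):
    """Assemble a context-grounded prompt with numbered sources for citation."""
    context = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    {"source": "docs/fiber.md", "text": "Fiber payments require an API key."},
    {"source": "docs/setup.md", "text": "Keys are created in the dashboard."},
]
prompt = build_prompt("How do I set up Fiber payments?", chunks)
```

Because each chunk carries a numbered source label, the model's `[n]` citations can be mapped back to URLs for the source attribution in the response.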
Configure different agents to interpret the same knowledge differently:
`agents.yaml`:

```yaml
agents:
  kernel:
    agent_name: "Kernel"
    temperature: 0.2
    retrieval_config:
      k_retrieve: 20
      k_rerank: 5
      source_weights:
        docs: 1.0
        research-findings: 0.8
  shannon:
    agent_name: "Shannon"
    temperature: 0.7
    retrieval_config:
      k_retrieve: 20
      k_rerank: 5
      source_weights:
        blog: 1.0
        research-findings: 0.6
```
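One plausible reading of `retrieval_config` is: fetch `k_retrieve` candidates by raw similarity, multiply each score by its source tag's `source_weights` entry, and keep the top `k_rerank`. A sketch under that assumption (the service's actual reranking logic, and the 0.5 default for unlisted tags, are guesses):

```python
def rerank(candidates, source_weights, k_rerank=5, default_weight=0.5):
    """Weight similarity scores by source tag, keep the top k_rerank.

    candidates: list of (similarity, source_tag, chunk_id).
    """
    weighted = [
        (sim * source_weights.get(tag, default_weight), cid)
        for sim, tag, cid in candidates
    ]
    weighted.sort(reverse=True)
    return [cid for _, cid in weighted[:k_rerank]]

# Kernel's weights favour docs over research-findings, so a slightly
# less similar docs chunk can outrank a research-findings chunk.
weights = {"docs": 1.0, "research-findings": 0.8}
candidates = [
    (0.90, "research-findings", "r1"),
    (0.85, "docs", "d1"),
    (0.40, "blog", "b1"),
]
print(rerank(candidates, weights, k_rerank=2))  # → ['d1', 'r1']
```

This is how Kernel and Shannon "interpret the same knowledge differently": identical index, different weighting of the candidate pool.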
Choose based on knowledge base size, query volume, and latency requirements:
| Tier | Hardware | RAM | Storage | Throughput | Cost (USD) | Use Case |
|---|---|---|---|---|---|---|
| Minimal | Raspberry Pi Zero 2 W | 512MB | 64GB SD | ~0.5 req/s | $15 | Hobby, embedded |
| Minimal | Raspberry Pi 4 (2GB) | 2GB | 32GB SD | ~2 req/s | $35 | Personal RAG, edge device |
| Entry | Raspberry Pi 5 (4GB) | 4GB | 128GB SSD | ~3 req/s | $65 | Small team RAG |
| Entry | Orange Pi 5 Plus (16GB) | 16GB | 256GB NVMe | ~5 req/s | $120 | Personal RAG server |
| Entry | Jetson Orin Nano (8GB) | 8GB | 128GB NVMe | ~8 req/s | $199 | Edge AI RAG with GPU |
| Mid | Intel NUC 12 (i5-1240P, 32GB) | 32GB | 512GB SSD | ~15 req/s | $600 | Home/office RAG server |
| Mid | Desktop (Ryzen 5 5600X, 32GB) | 32GB | 1TB SSD | ~20 req/s | $800 | Single workstation |
| Mid | Jetson Orin AGX (64GB) | 64GB | 512GB NVMe | ~25 req/s | $999 | AI research, embedded RAG |
| High | Desktop (RTX 3090, 64GB) | 64GB | 2TB SSD | ~40 req/s | $2,500 | Team RAG with GPU accel |
| High | Desktop (RTX 4070 Super, 32GB) | 32GB | 1TB SSD | ~35 req/s | $2,000 | Balanced GPU RAG |
| High | Desktop (RTX 4080, 48GB) | 48GB | 2TB SSD | ~50 req/s | $3,200 | Heavy-duty RAG + LLM |
| High | Desktop (RTX 4090, 128GB) | 128GB | 4TB SSD | ~80 req/s | $5,000 | Production RAG cluster node |
| High | Desktop (AMD R9 7950X, 192GB) | 192GB | 4TB SSD | ~60 req/s | $4,500 | Enterprise RAG (CPU-heavy) |
| Enterprise | Mac Studio (M2 Ultra, 128GB) | 128GB | 2TB SSD | ~45 req/s | $4,000 | Apple ecosystem RAG |
| Enterprise | Mac Studio (M2 Max, 96GB) | 96GB | 2TB SSD | ~35 req/s | $3,500 | Mac team RAG |
| Enterprise | Server (Dual Xeon, RTX 5090, 768GB) | 768GB | 8TB SSD/NVMe | ~200+ req/s | $15,000+ | Multi-tenant RAG service |
| Enterprise | Server (Dual Xeon, RTX 6000 Ada, 512GB) | 512GB | 8TB SSD/NVMe | ~150 req/s | $12,000 | Professional RAG (multi-GPU) |
| Enterprise | H100 GPU (80GB) + Server | 512GB | 8TB SSD | ~300+ req/s | $40,000+ | High-volume RAG with fine-tuning |
| Enterprise | Cloud (AWS g4dn.12xlarge) | 192GB | 4x 550GB | ~100 req/s | $5/hour | Scalable cloud RAG (on-demand) |
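A quick way to map expected query volume onto the throughput column: divide peak-hour load by a target utilisation. The 3× peak factor and 50% headroom below are assumed planning figures, not measurements:

```python
def required_throughput(queries_per_day, peak_factor=3.0, utilisation=0.5):
    """Estimate the rated req/s a tier must offer.

    peak_factor: peak-hour load relative to the daily average.
    utilisation: fraction of rated throughput to plan for (headroom).
    """
    avg_rps = queries_per_day / 86_400  # seconds per day
    return avg_rps * peak_factor / utilisation

# 100k queries/day needs a tier rated around 7 req/s,
# i.e. Jetson Orin Nano or above in the table.
print(round(required_throughput(100_000), 1))  # → 6.9
```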