Concepts

RAG Explained

Retrieval-Augmented Generation is how you make AI accurate instead of confidently wrong. Give it a cheat sheet, not a bigger brain.


1. The Problem: Confident and Wrong

Remember from the LLMs page — the model always picks the most probable next token. It has no way to say "I don't know." So when you ask about your company's PTO policy, last quarter's revenue, or today's news, it invents an answer that sounds right but isn't. This is called hallucination, and it's the #1 reason people don't trust AI at work.

LLMs are trained on the internet, not your data

Their training data has a cutoff date. They don't know your internal docs, your Slack history, your database, or anything that happened after training.

Bigger models don't fix this

GPT-5 won't know your company's expense policy. A smarter brain with the same information still can't answer questions about data it's never seen.

2. The Solution: Give It a Cheat Sheet

RAG stands for Retrieval-Augmented Generation. Instead of hoping the model memorized the answer, you retrieve relevant information and inject it into the prompt. Here's how it works.

1. User asks a question
"What is our company's parental leave policy?" The starting point is a natural-language query about something the LLM couldn't possibly know from its training data alone.

2. The question gets embedded
The question is converted into a list of numbers (an "embedding") that captures its meaning. Think of it as plotting the question on a map where similar meanings sit close together: "parental leave policy" lands near "maternity benefits" and "paternity time off."

3. Search the knowledge base
The system compares the question's embedding against pre-embedded chunks of your documents and finds the 3-5 most similar ones: maybe a section from the employee handbook, a recent HR update, and a benefits FAQ. This is "retrieval," the R in RAG.

4. Inject context into the prompt
The retrieved chunks are inserted into the prompt as context: "Based on the following documents: [chunks]. Answer the user's question: [question]." The LLM now has the exact information it needs, so no guessing is required.

5. The LLM generates a grounded answer
"Our parental leave policy provides 16 weeks for primary caregivers and 8 weeks for secondary caregivers, effective January 2025." A specific, accurate, citable answer based on real data rather than training memorization: the "augmented generation," the AG in RAG.

3. See the Difference: 12 Examples

Click through 12 real scenarios. Left = what a plain LLM says. Right = what a RAG-enhanced system says. The difference is night and day.


4. Chunking: How Documents Get Split

Before RAG can search your documents, they need to be split into chunks — small pieces that each cover one idea. Too big = noise. Too small = lost context. Try it yourself.

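In code, the simplest splitter is character-based with overlap, so an idea that straddles a boundary still appears whole in at least one chunk. This is a sketch, not a production chunker (those usually split on tokens or sentence boundaries), and the 150/30 sizes are arbitrary:

```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into fixed-size pieces, with each chunk repeating the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "Our parental leave policy provides 16 weeks for primary caregivers. " * 10
pieces = chunk_text(doc)
print(len(pieces), len(pieces[0]))
```

The overlap is what saves you when a sentence gets cut in half: the tail of one chunk reappears at the head of the next, so the retriever can still match it.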

5. Should You Use RAG?

RAG isn't always the answer. Answer 3 quick questions to find out the right approach for your use case.

Key Takeaways

1
RAG = give AI a cheat sheet

Instead of hoping the model memorized the answer, you retrieve the relevant info and inject it into the prompt. Simple concept, massive impact.

2
Embeddings are meaning-coordinates

Text gets converted to numbers where similar meanings are close together. That's how the system finds relevant chunks without keyword matching.
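To make "meaning-coordinates" concrete, here's the distance arithmetic on made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the numbers below are invented for illustration, but the comparison (cosine similarity) works the same way:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented vectors for illustration; a real embedding model produces these.
parental_leave = [0.9, 0.1, 0.2]
maternity      = [0.85, 0.15, 0.25]
gpu_pricing    = [0.1, 0.9, 0.6]

print(cosine(parental_leave, maternity))   # near 1.0: similar meaning
print(cosine(parental_leave, gpu_pricing)) # much lower: unrelated topic
```

Notice there's no keyword in common between "parental leave" and "maternity," yet their vectors sit close together. That's the property keyword search can't give you.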

3
Chunk size is a real engineering decision

Too small and you lose context. Too big and you get noise. Most production systems use 200-500 tokens per chunk with some overlap.

4
RAG beats fine-tuning for most use cases

Fine-tuning changes the model permanently and is expensive. RAG keeps the model general and just feeds it the right info at query time. Cheaper, faster, updatable.

5
Personal RAG is the new resume

Peter built his own pRAG (Personal RAG) — an AI that answers questions grounded in his actual knowledge base: blog posts, talks, investor memos, and 4 years of building with AI. It powers the Saarvis chatbot on this site. Read how to build yours →