Concepts

RAG Explained

Retrieval-Augmented Generation is how you make AI accurate instead of confidently wrong. Give it a cheat sheet, not a bigger brain.


1. The Problem: Confident and Wrong

Remember from the LLMs page — the model always picks the most probable next token. It has no way to say "I don't know." So when you ask about your company's PTO policy, last quarter's revenue, or today's news, it invents an answer that sounds right but isn't. This is called hallucination, and it's the #1 reason people don't trust AI at work.

LLMs are trained on the internet, not your data

Their training data has a cutoff date. They don't know your internal docs, your Slack history, your database, or anything that happened after training.

Bigger models don't fix this

GPT-5 won't know your company's expense policy. A smarter brain with the same information still can't answer questions about data it's never seen.

2. The Solution: Give It a Cheat Sheet

RAG stands for Retrieval-Augmented Generation. Instead of hoping the model memorized the answer, you retrieve relevant information and inject it into the prompt. Here's how it works.

1. User asks a question
"What is our company's parental leave policy?" The starting point is a natural-language query about something the LLM couldn't possibly know from its training data alone.

2. The question gets embedded
The question is converted into a list of numbers (an "embedding") that captures its meaning. Think of it as plotting the question on a map where similar meanings sit close together: "parental leave policy" lands near "maternity benefits" and "paternity time off."

3. Search the knowledge base
The system compares the question's embedding against pre-embedded chunks of your documents and finds the 3-5 most similar ones: maybe a section from the employee handbook, a recent HR update, and a benefits FAQ. This is "retrieval," the R in RAG.

4. Inject context into the prompt
The retrieved chunks are inserted into the prompt as context: "Based on the following documents: [chunks]. Answer the user's question: [question]." The LLM now has the exact information it needs, so no guessing is required.

5. The LLM generates a grounded answer
"Our parental leave policy provides 16 weeks for primary caregivers and 8 weeks for secondary caregivers, effective January 2025." A specific, accurate, citable answer based on real data rather than training memorization: the "augmented generation," the AG in RAG.

3. See the Difference: 12 Examples

Click through 12 real scenarios. Left = what a plain LLM says. Right = what a RAG-enhanced system says. The difference is night and day.


4. Chunking: How Documents Get Split

Before RAG can search your documents, they need to be split into chunks — small pieces that each cover one idea. Too big = noise. Too small = lost context. Try it yourself.

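In code, the simplest splitter is character-based with overlap, so an idea that straddles a boundary still appears whole in at least one chunk. This is a sketch, not a production chunker (those usually split on tokens or sentence boundaries), and the 150/30 sizes are arbitrary:

```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into fixed-size pieces, with each chunk repeating the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "Our parental leave policy provides 16 weeks for primary caregivers. " * 10
pieces = chunk_text(doc)
print(len(pieces), len(pieces[0]))
```

The overlap is what saves you when a sentence gets cut in half: the tail of one chunk reappears at the head of the next, so the retriever can still match it.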

5. Should You Use RAG?

RAG isn't always the answer. Answer 3 quick questions to find out the right approach for your use case.

Key Takeaways

1
RAG = give AI a cheat sheet

Instead of hoping the model memorized the answer, you retrieve the relevant info and inject it into the prompt. Simple concept, massive impact.

2
Embeddings are meaning-coordinates

Text gets converted to numbers where similar meanings are close together. That's how the system finds relevant chunks without keyword matching.
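To make "meaning-coordinates" concrete, here's the distance arithmetic on made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the numbers below are invented for illustration, but the comparison (cosine similarity) works the same way:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented vectors for illustration; a real embedding model produces these.
parental_leave = [0.9, 0.1, 0.2]
maternity      = [0.85, 0.15, 0.25]
gpu_pricing    = [0.1, 0.9, 0.6]

print(cosine(parental_leave, maternity))   # near 1.0: similar meaning
print(cosine(parental_leave, gpu_pricing)) # much lower: unrelated topic
```

Notice there's no keyword in common between "parental leave" and "maternity," yet their vectors sit close together. That's the property keyword search can't give you.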

3
Chunk size is a real engineering decision

Too small and you lose context. Too big and you get noise. Most production systems use 200-500 tokens per chunk with some overlap.

4
RAG beats fine-tuning for most use cases

Fine-tuning changes the model permanently and is expensive. RAG keeps the model general and just feeds it the right info at query time. Cheaper, faster, updatable.

5
Personal RAG is the new resume

Peter built his own pRAG (Personal RAG) — an AI that answers questions grounded in his actual knowledge base: blog posts, talks, investor memos, and 4 years of building with AI. It powers the Saarvis chatbot on this site. Read how to build yours →