RAG: How an LLM Answers From Documents It Never Saw
The last post ended with a model you can talk to. Three training phases turned a random stack of layers into an assistant. You send a prompt, it answers like an assistant instead of autocompleting.
That is where most people stop. It is also where the real limits show up. The model only knows what it saw in training. Ask it about your company's internal framework, or a paper from last week, and it has nothing. Worse, it often makes something up with total confidence.
This post is about the fix almost every real AI product uses: retrieval-augmented generation, or RAG. You have met the term already, probably many times. It is not new. What is worth doing is bolting it to the machinery from the last two posts, so you can see exactly where it plugs in and why. Before the model answers, you go find the right documents and hand them over. Same running theme as before: small examples and diagrams, not notation.
TLDR
- A chat model only knows its training data. It has a knowledge cutoff, no access to private data, and it hallucinates.
- RAG fixes this by putting an application layer between you and the model.
- That layer uses a second kind of model, an embedding model, to turn every document into a list of numbers (a vector).
- Documents about similar things get similar numbers, so they sit near each other. You save all the vectors in a database.
- When you ask something, your prompt gets turned into a vector too, and the layer grabs the nearest documents.
- The text of those documents gets pasted in alongside your prompt. The LLM answers from what it reads, not from memory.
We will take these in order. First, why plain chatting is not enough.
Where plain chatting runs out
Picture the setup from the last post: you and the model, talking directly. You send a prompt, it sends back an answer, and it answers like an assistant because it went through supervised fine-tuning. Over an API or through a chat window, this is fine for a huge number of tasks.
But it has real gaps.
Knowledge cutoff. The model was trained on a snapshot of the internet that ends on some date. It does not know what happened after. Documentation for a language version that shipped last month, news from this morning, none of it is in there.
No private knowledge. Say your company built an internal framework, or you invented your own small language. The model never saw any of it. It cannot help you write code against something that was not on the public internet during training.
Hallucination. In the worst case, a model states things that are just wrong, and states them confidently. There is a nickname for this borrowed from image generation: the too-many-fingers problem. Old image models were bad at hands. Ask why, and the answer is telling: what is the most likely thing to appear next to a finger in a photo? Another finger. Each guess looks right next to its neighbor, but stack them up and you get a hand with six fingers. Text models do the same thing with facts. The model is trained to produce the most plausible next token, not the true one, and when it has no grounded fact to lean on, plausible and false can be the same string. This is not just a sampling knob. Turn the randomness all the way down and a model with no access to the fact will still state a confident, fluent, wrong answer.
Weak multi-step reasoning. These models have gotten much better, but on their own they do not naturally second-guess themselves or grind through a hard problem in depth. Something like Claude Code is not just the model. It is a harness: deterministic code, written by people, that forces the model to look at the environment, check its own changes, and debug. The model does not do that on its own. It has to be pushed.
That last point connects to how you prompt.
A quick aside: zero-shot vs few-shot
Zero-shot is when you give the model a task with no examples. "Solve this math problem." Just the problem.
Few-shot is when you show it a couple of worked examples first. "Here is a math problem, and here is how similar problems were solved," then the real one. The examples steer it.
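Here is a minimal sketch of the difference, written as plain string building. The function names, the prompt wording, and the example problems are all made up for illustration, and no particular model API is assumed.

```python
def zero_shot_prompt(task: str) -> str:
    """Just the task, with no examples."""
    return f"Solve this problem.\n\nProblem: {task}\nAnswer:"


def few_shot_prompt(task: str, examples: list[tuple[str, str]]) -> str:
    """Worked examples first, then the real task in the same format."""
    parts = ["Solve this problem. Here is how similar problems were solved."]
    for problem, solution in examples:
        parts.append(f"Problem: {problem}\nAnswer: {solution}")
    parts.append(f"Problem: {task}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical worked examples.
EXAMPLES = [
    ("What is 12 * 3?", "12 * 3 = 36"),
    ("What is 7 * 8?", "7 * 8 = 56"),
]

print(zero_shot_prompt("What is 9 * 6?"))
print("---")
print(few_shot_prompt("What is 9 * 6?", EXAMPLES))
```

The only change between the two is what text comes before the real problem. The model and the call stay the same.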
Reasoning models attack the multi-step problem a different way. They spend extra hidden computation before the final answer, working through the problem in reasoning tokens you never see. OpenAI's o1 was the first to make this a headline feature. That helps, but it is not free. You are now paying for a pile of generated tokens the user never sees, just to get a better answer.
None of this closes the knowledge gap, though. Better prompting cannot tell the model a fact it never learned. For that, you need to bring the fact to it.
The core move: retrieve, then generate
So change the shape of the system. Instead of talking to the model directly, put an application layer in the middle.
The layer's whole job is retrieval. It looks at your prompt, finds documents that relate to it, and pastes those documents into the context it sends to the model. The model then answers using text it can see right in front of it, not just what is baked into its weights.
The interesting question is the retrieval itself. How does a program look at your prompt and a pile of documents and decide which ones are related? Keyword matching is brittle. Ask about "cars" and you miss the document that only ever says "vehicles." What you want is search by meaning. That needs a different kind of model.
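To see the brittleness concretely, here is a small sketch of exact-word matching. The `keyword_search` function and the three documents are invented for this example.

```python
def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return documents that share at least one exact word with the query."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words & set(doc.lower().split())]


# Made-up documents. The second is about cars but never uses the word.
DOCS = [
    "cars need an oil change every few thousand miles",
    "electric vehicles need far less routine maintenance",
    "scrambled eggs cook best over low heat",
]

hits = keyword_search("cars", DOCS)
print(hits)  # only the first document; the "vehicles" one is missed
```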
First, a different kind of model
Everything in the last two posts was about LLMs: models that predict the next token. The model RAG leans on does not do that. It is an embedding model, sometimes called a vectorizer.
Feed it a chunk of text and it does not generate anything. It hands back a fixed-length list of numbers. That list is a vector: a single-row tensor holding n numbers, one per dimension.
The neat part is how similar it is to the architecture you already know. Remember the end of the transformer: the last tensor comes out, goes through a language-modeling head, gets turned into a histogram of probabilities, and you sample the next token. As a first mental model, an embedding model is that same stack with the last part chopped off. The real thing has a few more moving parts, and I will come back to them, but hold that picture for now.
So it is trained differently and used differently, but it is close family. It might still use attention and transformer layers. What it does not have is that final step of scoring a vocabulary and picking a token. It just spits out the tensor.
Now the moving parts I promised. A real embedding model is more than a decapitated LLM. First, one chunk of text is many tokens, so there are many tensors, and the model pools them into a single vector, often by averaging or by reading off one special position. Second, and more important, it is trained to make that vector useful. The training objective, called contrastive learning, pulls related texts close together and pushes unrelated ones apart, so distance in the vector space actually means what we want it to. And plenty of the strongest embedding models are built as encoders from the start (BERT-style, reading the whole text at once) rather than as a next-token model with the head removed. The chopped-off-LLM picture is a good way in. It is not the whole story.
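The pooling step is easy to picture in code. This is a sketch of the averaging variant only, with made-up token vectors. A real model works on much larger tensors and may pool differently, and `mean_pool` is a name chosen here, not a library function.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors, dimension by dimension, into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    # zip(*rows) walks the columns, so each output number is one dimension's mean.
    return [sum(column) / count for column in zip(*token_vectors)]


# Made-up 4-dimensional vectors for a three-token chunk.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]

chunk_vector = mean_pool(tokens)
print(chunk_vector)  # [2.0, 2.0, 2.0, 2.0]
```

Three token vectors go in, one chunk vector comes out, and it has the same number of dimensions as each input.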
One term to get straight, because people mix them up. The numbers that come out are not weights. Weights are the matrices inside the model, the thing the model is. What comes out are activations: the model's response to your specific text. And the whole array has a friendlier name: an embedding.
Embeddings are just coordinates
The key idea is that those numbers are coordinates.
Say an embedding has three numbers: 2.0, 1.3, 3.4. Read them as coordinates. Go 2.0 along the x-axis, 1.3 up the y-axis, 3.4 back on the z-axis. Now you are standing at a single point floating in a room. That point is one document.
A document about JavaScript code lands at one point. A document about Python code lands nearby, because the two are related. A document about making scrambled eggs lands way over in another corner, because it has nothing to do with either.
That is the whole trick. A good embedding model places related text close together and unrelated text far apart. You cannot point at any single number and say what it means. You might hope one number tracks how formal the text is and another tracks whether it is about cooking, but nobody can read them that cleanly. These models project text into spaces with hundreds or thousands of dimensions, not three, and no human reasons well in a thousand dimensions.
That does not matter for building things. You do not care what each number means. You care that the distances are right, and that you can measure. Two documents about cats show up close together? Good. That is testable. A better embedding model is one that clusters related things more tightly, and you can score that.
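"You can measure" is meant literally. Here is a sketch with three invented points in a three-number space. The coordinates are made up to match the JavaScript, Python, and scrambled eggs picture, and real embeddings have far more dimensions.

```python
import math

# Made-up 3-number embeddings, one per document.
points = {
    "javascript": (2.0, 1.3, 3.4),
    "python": (2.2, 1.1, 3.0),
    "scrambled eggs": (-4.0, 5.0, -1.5),
}


def distance(a: str, b: str) -> float:
    """Straight-line (Euclidean) distance between two documents' points."""
    return math.dist(points[a], points[b])


print(distance("javascript", "python"))          # about 0.49
print(distance("javascript", "scrambled eggs"))  # about 8.58
```

A check like "related pairs should score closer than unrelated pairs" is the kind of test you can run against any embedding model.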
You do not build this yourself
Good news: you almost never train an embedding model. You pick one off the shelf. OpenAI, Cohere, and Google all sell embedding models. These are often the same companies that sell the chat models, because they already have the architectures and the data, so they ship both a chat family and an embeddings family.
Storing the vectors
One practical step comes first. Long documents get split into smaller chunks, maybe a few paragraphs each, and every chunk is embedded on its own. That way retrieval can pull the one passage that answers your question instead of a whole 40-page file. From here on, when I say "document" I mean one of these chunks.
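One simple way to chunk, sketched below, is to group whole paragraphs until a size limit is reached. The `chunk_paragraphs` name, the blank-line paragraph rule, and the character limit are assumptions for this example. Production chunkers often count tokens and overlap neighboring chunks.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars each.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = chunk_paragraphs(sample, max_chars=40)
print(chunks)
```

With a 40-character limit, the first two paragraphs fit together and the third becomes its own chunk.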
Now run every chunk through the embedding model and save the results. You need two things stored: the vectors, and the text they came from.
These can be two databases or one. Postgres supports a vector column type through the pgvector extension, so you keep the document text in one column and its vector in another. MongoDB and even Redis support vector search too. Or you use something built only for vectors: Pinecone is a common hosted option, where you store the vector plus a bit of metadata pointing to the real document. For something lightweight, FAISS (Facebook AI Similarity Search) runs in memory: tell it how many dimensions each vector has, and start adding.
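The "text in one column, vector in another" layout can be sketched with SQLite from the Python standard library. This is a stand-in for the shape of the data only: the table and function names are made up, the vector is stored as JSON text, and nothing here does nearest-neighbor search the way pgvector or a dedicated vector database would.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)"
)


def add_chunk(body: str, embedding: list[float]) -> int:
    """Store a chunk's text next to its vector and return the new row ID."""
    cursor = conn.execute(
        "INSERT INTO chunks (body, embedding) VALUES (?, ?)",
        (body, json.dumps(embedding)),
    )
    return cursor.lastrowid


def get_chunk(chunk_id: int) -> tuple[str, list[float]]:
    """Fetch the text and vector back by ID."""
    body, embedding = conn.execute(
        "SELECT body, embedding FROM chunks WHERE id = ?", (chunk_id,)
    ).fetchone()
    return body, json.loads(embedding)


# Made-up chunk and made-up 3-number vector.
chunk_id = add_chunk("Python lists are mutable.", [0.1, 0.9, 0.3])
print(get_chunk(chunk_id))
```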
Finding the nearest vectors fast, across millions of them, takes some tricks. You will bump into the names as API options. Hierarchical navigable small worlds (HNSW) is a graph-based index structure that makes approximate nearest-neighbor lookup fast. Cosine similarity measures how closely two vectors point in the same direction, ignoring their lengths. You rarely implement either one yourself.
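You rarely write it yourself, but cosine similarity is short enough to read in full. This is a plain-Python sketch for understanding. A vector database computes the same quantity with optimized code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b.

    1.0 means same direction, 0.0 perpendicular, -1.0 opposite.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0: same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0: perpendicular
```

Dividing by the two lengths is why only direction matters, so a long vector and a short one pointing the same way still score 1.0.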
Semantic search, step by step
Now the pieces connect. Here is the full loop when a user sends a prompt.
Walk it in order:
- The user sends a prompt. The application layer catches it before the LLM sees it.
- The prompt goes through the same embedding model the documents went through. That gives a prompt vector, a point in the same space.
- The layer asks the vector database: which document vectors are closest to this prompt vector? Close in that space means related in meaning.
- The database returns document IDs, usually a few, not one.
- Those IDs fetch the real text from the document store.
- The layer builds the context: the user's original prompt, plus the retrieved documents. All of it goes to the LLM.
- The model answers from that context and the answer goes back to the user.
This is why it is called semantic search. You are searching on meaning, not on matching words. And this is the "retrieval-augmented" part of the name: the model is not generating on its own from memory, it is generating from retrieved context.
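The loop above can be sketched end to end in a few lines. One loud assumption: `toy_embed` is a hand-built stand-in that counts words from two hand-picked topics, so it only imitates what a real embedding model does. The document IDs, texts, and function names are invented too. The final call to the LLM is left out because it depends on the provider.

```python
import math

# Toy stand-in for an embedding model: one dimension per hand-picked topic.
TOPICS = [
    {"python", "javascript", "code", "function", "loop"},
    {"eggs", "pan", "butter", "cook", "heat"},
]


def toy_embed(text: str) -> list[float]:
    """Count how many words fall into each topic. NOT a real embedding."""
    cleaned = text.lower().replace("?", " ").replace(".", " ").replace(",", " ")
    words = cleaned.split()
    return [float(sum(word in topic for word in words)) for topic in TOPICS]


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Document store (ID -> text) and vector index (ID -> vector).
DOCUMENTS = {
    "doc-1": "A Python function can loop over a list.",
    "doc-2": "Cook the eggs in butter over low heat.",
    "doc-3": "JavaScript code runs in the browser.",
}
INDEX = {doc_id: toy_embed(text) for doc_id, text in DOCUMENTS.items()}


def retrieve(prompt: str, k: int = 2) -> list[str]:
    """Embed the prompt and return the IDs of the k closest documents."""
    query = toy_embed(prompt)
    ranked = sorted(
        INDEX, key=lambda doc_id: cosine(query, INDEX[doc_id]), reverse=True
    )
    return ranked[:k]


def build_context(prompt: str, doc_ids: list[str]) -> str:
    """Fetch the real text by ID and paste it in alongside the prompt."""
    docs = "\n".join(f"[{doc_id}] {DOCUMENTS[doc_id]}" for doc_id in doc_ids)
    return f"Documents:\n{docs}\n\nQuestion: {prompt}"


question = "How do I write a loop in Python code?"
print(build_context(question, retrieve(question)))
```

The programming question pulls back the two programming documents and leaves the eggs behind. The string that `build_context` returns is what would be sent to the model.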
What this buys you
Step back and the payoffs are big.
Private and proprietary data. You never trained or fine-tuned the model. You just fed it documents at question time. So internal wikis, contracts, your own codebase, all of it becomes answerable without touching the model's weights.
Fresh data. Same reason. If the documents are current, the answers are current, cutoff be damned.
Retrieval does not have to be semantic search. This part surprises people. What matters is that something relevant gets retrieved, not how. It is often semantic search, but it could be a plain API call. Ask about the weather, and the layer can hit a weather API and paste the result in. The layer could even run a Google search, grab the top results, and hand them over. That last one is basically what Perplexity and other AI search engines do.
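One way to picture this: a retriever is any function that takes a prompt and returns text. In the sketch below both retrievers are fakes that return canned strings, and the routing rule is deliberately crude. Every name here is hypothetical.

```python
from typing import Callable

# A retriever is anything that maps a prompt to a list of text snippets.
Retriever = Callable[[str], list[str]]


def fake_weather_api(prompt: str) -> list[str]:
    """Hypothetical stand-in for a real weather API call."""
    return ["Forecast for today: 18 C, light rain."]


def notes_lookup(prompt: str) -> list[str]:
    """Hypothetical stand-in for semantic search over stored documents."""
    return ["Internal note: deploys happen on Tuesdays."]


def pick_retriever(prompt: str) -> Retriever:
    """Crude routing rule, for illustration only."""
    return fake_weather_api if "weather" in prompt.lower() else notes_lookup


def augment(prompt: str) -> str:
    """Retrieve with whichever source fits, then paste the result in."""
    retrieved = pick_retriever(prompt)(prompt)
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {prompt}"


print(augment("What is the weather like?"))
```

The augmenting step does not care where the text came from, which is why a search engine, an API, and a vector database are interchangeable here.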
Auditable answers. Since you handed the model specific documents, you can ask it to cite which one it used. That gives you a way to check whether the answer is grounded or invented. It is not a guarantee: a model can cite a document it did not really use, or answer past what the documents say. But it is far easier to verify than a bare claim from a black box.
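A sketch of the mechanical half of that check: label each source with an ID, ask for citations, then confirm every cited ID is one you actually supplied. The prompt wording, the ID format, and the sample answer are made up. As noted above, a valid ID does not prove the model really used that source.

```python
import re


def build_cited_prompt(question: str, sources: dict[str, str]) -> str:
    """Number the sources so the model has something concrete to cite."""
    listing = "\n".join(f"[{sid}] {text}" for sid, text in sources.items())
    return (
        "Answer using only the sources below. "
        "After each claim, cite the source ID in square brackets.\n\n"
        f"Sources:\n{listing}\n\nQuestion: {question}"
    )


def check_citations(answer: str, sources: dict[str, str]) -> tuple[set[str], set[str]]:
    """Split cited IDs into those that were provided and those that were not."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    provided = set(sources)
    return cited & provided, cited - provided


# Hypothetical sources and a hypothetical model answer.
SOURCES = {
    "wiki-1": "Deploys happen on Tuesdays.",
    "wiki-2": "Rollbacks need approval from the on-call lead.",
}
answer = "Deploys happen on Tuesdays [wiki-1]. Rollbacks take an hour [wiki-9]."

known, unknown = check_citations(answer, SOURCES)
print(known, unknown)
```

Here `wiki-9` was never provided, so the second claim gets flagged for a closer look.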
A workaround for limited context. Ever notice that a couple of years ago chat apps would warn you the conversation was getting too long, and now they do not? Part of that is just bigger context windows. But part of it is retrieval under the hood. The app can quietly save older parts of the chat off to the side and pull them back only when they become relevant again, sometimes alongside a running summary of what came before. One student project turned every single back-and-forth into a document in a vector database, so any past topic could be pulled back in on demand. Effectively infinite context, built out of retrieval.
The catch: context engineering
RAG is not free of hard problems. It just moves them.
The big one is how much context to send. Gather too few documents and the answer is missing. Gather too many and two things go wrong: you can overwhelm the model, and you pay for every token. Deciding the right amount is what people now call context engineering, and it is a real part of production work.
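The simplest version of this decision is a token budget. The sketch below keeps the best-ranked documents until the next one would not fit. The four-characters-per-token estimate is a rough assumption, not a tokenizer, and both function names are invented.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token. Not a tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_docs: list[str], max_tokens: int) -> list[str]:
    """Keep documents in ranked order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for doc in ranked_docs:
        cost = estimate_tokens(doc)
        if used + cost > max_tokens:
            break
        kept.append(doc)
        used += cost
    return kept


# Three placeholder documents, about 100 tokens each by the heuristic.
ranked = ["a" * 400, "b" * 400, "c" * 400]
print(len(fit_to_budget(ranked, max_tokens=250)))  # 2
```

Stopping at the first document that does not fit keeps the ranking intact. A variant could skip it and try smaller ones further down.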
How you handle it depends on the model's price. On a cheap model with a long context window, like Gemini, you might dump a small repo or a fat slice of a big one and let it sort things out. Google keeps those prices low, so it can be fine. Only up to a point: even a million-token window hits limits, gets slower, and starts missing things buried in the middle. On an expensive model like Claude Opus, you get selective from the start: pull only the specific files that matter, or the abstract syntax tree, not the whole repo. This is the exact problem the Cursor and Claude Code teams grind on: which files to grab so the model gets the right code without paying to read everything, every time.
For big document sets, a two-pass move helps. First hand the model short synopses of many documents and ask which look relevant. Then do a second retrieval to pull the full text of just those. Two rounds, less waste.
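A sketch of the two-pass shape follows. The first pass would normally be an LLM call. Here `pick_relevant` is a hypothetical stand-in that uses naive word overlap, so the example stays runnable. The library contents are invented.

```python
# Made-up library: each document has a short synopsis and a longer full text.
LIBRARY = {
    "doc-1": {
        "synopsis": "Deploy process and release schedule.",
        "full_text": "Deploys go out every Tuesday after the release review.",
    },
    "doc-2": {
        "synopsis": "Expense policy for travel.",
        "full_text": "Book travel through the portal and keep all receipts.",
    },
    "doc-3": {
        "synopsis": "Onboarding checklist for new hires.",
        "full_text": "New hires get laptop access on day one.",
    },
}


def pick_relevant(question: str, synopses: dict[str, str]) -> list[str]:
    """Hypothetical stand-in for the first LLM call. Here: naive word overlap."""
    question_words = set(question.lower().strip("?").split())
    return [
        doc_id
        for doc_id, synopsis in synopses.items()
        if question_words & set(synopsis.lower().strip(".").split())
    ]


def two_pass_context(question: str) -> list[str]:
    # Pass 1: show only the short synopses and ask which look relevant.
    synopses = {doc_id: doc["synopsis"] for doc_id, doc in LIBRARY.items()}
    chosen = pick_relevant(question, synopses)
    # Pass 2: pull the full text of just the chosen documents.
    return [LIBRARY[doc_id]["full_text"] for doc_id in chosen]


print(two_pass_context("What is the deploy schedule?"))
```

Only one full text is ever loaded, even though all three synopses were considered.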
One honest limit, because it is easy to oversell RAG. Nothing in this system forces the model to actually use the documents. There are two pressures that push it to. First, instruction tuning and the post-training these models go through generally reward using the context you provide, so a well-trained model leans on context by habit. Second, you say so in the prompt: tell it to answer only from the context, and these days you can add reasoning and ask it to weigh what is in the context against what it thinks it already knows. But there is no hard switch. The model can still fall back on its own weights, and still hallucinate. RAG lowers the odds. It does not remove them.
There is also a speed cost worth knowing. When you send a big context, the model first has to build the attention matrix over all those tokens, and the size of that matrix grows roughly with the number of tokens squared. That first build is called prefill, and it is usually the main reason a reply takes a beat to start streaming. After that, with the earlier work cached, each new token only adds one row. So more context does not just cost money. It costs latency up front.
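The squared growth is easy to underestimate, so here is the arithmetic as a sketch. It counts token-to-token scores for a single attention matrix and ignores constant factors such as causal masking, heads, and layers. The function names are made up.

```python
def prefill_scores(num_tokens: int) -> int:
    """Token-to-token scores in one full attention matrix."""
    return num_tokens * num_tokens


def next_token_scores(context_tokens: int) -> int:
    """One new row: the new token scored against the context and itself."""
    return context_tokens + 1


small_ctx, big_ctx = 1_000, 100_000

print(prefill_scores(small_ctx))                            # 1000000
print(prefill_scores(big_ctx))                              # 10000000000
print(prefill_scores(big_ctx) // prefill_scores(small_ctx)) # 10000
print(next_token_scores(big_ctx))                           # 100001
```

A context 100 times longer means about 10,000 times as many scores up front, while each later token adds only a single row.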
The idea worth keeping
The last post's takeaway was that the architecture is the easy half. Here is the companion for this one: retrieval is just search, and RAG is the plumbing around it.
There is no new math here, and barely a new model. The embedding model is close cousin to the transformer you already know, trained to turn text into coordinates. The vector database is a database. Semantic search is nearest-neighbor lookup dressed up. What makes RAG powerful is not any one clever piece. It is that you stopped asking the model to know everything, and started handing it the right thing to read.
So when someone says an AI product "uses a custom-trained model," check. More often it is a strong off-the-shelf model with good retrieval in front of it. That front half, the part that decides what the model gets to see, is where most of the real engineering lives now.
Glossary
- RAG (retrieval-augmented generation): fetching relevant documents and giving them to the model as context before it answers.
- Embedding model / vectorizer: a model that turns text into a fixed list of numbers instead of generating tokens.
- Embedding / vector: that list of numbers. A point in a high-dimensional space.
- Chunk: a slice of a longer document (a few paragraphs), embedded and retrieved on its own so you fetch only the relevant passage.
- Semantic search: finding related items by comparing embeddings, so you match on meaning rather than exact words.
- Vector database: storage built to find the nearest vectors to a query vector quickly.
- Knowledge cutoff: the date the model's training data stops. It knows nothing after it.
- Hallucination: the model confidently stating something false or unsupported.
- Context engineering: deciding how much and which context to send, balancing quality against cost and speed.
- Prefill: the up-front work of building the attention matrix over the input, the main source of first-token latency.
- Zero-shot / few-shot: prompting a task with no examples, versus prompting it with a few worked examples first.
References and further reading
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: the paper that named RAG.
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs: the HNSW paper, describing the graph structure behind fast vector search.
- Lost in the Middle: how models use, and fail to use, long contexts, which is why context engineering matters.
- Pinecone's learning center: practical guides to vector databases and retrieval.
- FAISS: the in-memory similarity search library to prototype with.
- Earlier in this series: transformers and attention and from transformer to LLM.