Five Places RAG Shows Up in Agentic Systems
Ask most people what RAG is and they’ll tell you it’s semantic document search. You chunk up a pile of text, embed it, stuff it in a vector database, and pull the relevant bits back at query time. That’s the textbook example, and it’s a good one. But retrieval-augmented generation does a lot more work inside agentic systems than “search the docs,” and I think it’s worth walking through where it actually shows up. So let’s talk about it.
1. High-Precision Semantic Search
This is the one everybody knows, so let’s get it out of the way first. You take raw text and convert it into high-dimensional vectors, where distance corresponds to conceptual similarity. Store those vectors, index them, and you can look things up by meaning instead of exact keywords.
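Here is a minimal sketch of lookup by meaning, assuming the embeddings already exist. The three-dimensional vectors and the document titles are made up by hand for illustration; a real embedding model would produce hundreds or thousands of dimensions. The search is a brute-force scan, which is the exact computation that a vector index approximates at scale.

```python
import math

# Hypothetical, hand-assigned 3-d "embeddings" (illustration only).
DOCS = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Quarterly revenue report": [0.0, 0.2, 0.9],
    "Setting up two-factor authentication": [0.8, 0.3, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query_vec, k=2):
    """Brute-force nearest neighbors: score every document, keep the top k."""
    ranked = sorted(DOCS, key=lambda title: cosine(query_vec, DOCS[title]), reverse=True)
    return ranked[:k]

# Pretend this vector came from embedding "I forgot my login credentials".
query_vec = [0.85, 0.15, 0.05]
results = search(query_vec)
print(results)
```

The query shares no keywords with the password document, yet that document ranks first because its vector points in nearly the same direction.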
The interesting part is how you index them, because the algorithm you pick is a real tradeoff.
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph. The upper layers have long-distance links for fast routing across the space, and the lower layers have short-distance links for local search. You get low query latency and high recall, and you can trade one against the other by tuning the graph degree and the search breadth. The catch is memory. The index typically keeps the full-precision vectors plus the graph links in RAM, so the footprint gets big.
IVF-PQ (Inverted File with Product Quantization) goes the other direction. It partitions the vector space into cells using k-means clustering, then compresses the high-dimensional vectors into compact quantized codes. Partition, then squish. That cuts memory consumption dramatically, which makes it a great fit for massive datasets, from many millions of vectors up into the billions. The price you pay is recall accuracy, since all that compression throws away detail. Maintenance is also heavier: the cells and codebooks are trained on a sample of your data, so when new data drifts away from that sample you have to retrain and rebuild.
Neither one is “correct.” You pick based on whether you’re optimizing for recall or for fitting the index in memory.
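To put rough numbers on that tradeoff, here is a back-of-the-envelope estimate. Every parameter is an assumption chosen for illustration: float32 vectors, an HNSW graph with about 2*m neighbor ids per vector on the base layer (upper layers ignored), and an IVF-PQ index with 96 one-byte sub-codes per vector. Real libraries add their own overhead, such as vector ids, so treat the results as orders of magnitude.

```python
def flat_vector_bytes(n, dim, bytes_per_float=4):
    """Storage for n uncompressed float32 vectors."""
    return n * dim * bytes_per_float

def hnsw_bytes(n, dim, m=16, bytes_per_link=4):
    """Raw vectors plus roughly 2*m neighbor ids per vector (base layer only)."""
    return flat_vector_bytes(n, dim) + n * 2 * m * bytes_per_link

def ivfpq_bytes(n, dim, n_subvectors=96, bits_per_code=8, n_cells=4096):
    """Compressed codes plus the float32 centroids and codebooks."""
    codes = n * n_subvectors * bits_per_code // 8
    coarse_centroids = n_cells * dim * 4          # k-means cell centers
    codebooks = (2 ** bits_per_code) * dim * 4    # PQ centroids across all sub-spaces
    return codes + coarse_centroids + codebooks

n, dim = 1_000_000, 768
hnsw_gb = hnsw_bytes(n, dim) / 1e9
ivfpq_gb = ivfpq_bytes(n, dim) / 1e9
print(f"HNSW: {hnsw_gb:.2f} GB, IVF-PQ: {ivfpq_gb:.2f} GB")
```

Under these assumptions a million 768-dimensional vectors cost about 3.2 GB with HNSW and about 0.11 GB with IVF-PQ, roughly a 29x difference.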
2. Tool Selection (RATS)
Here’s where it gets less obvious. Picture a CLI harness. As your developer toolkit grows, your agent slowly gets “equipped” with dozens or hundreds of possible actions. APIs, database calls, helpers, command executors. At some point you’ve just overloaded the thing with too much stuff.
Three bad things happen when you do that:
- Tool space interference (TSI). Overlapping tool descriptions confuse the agent, and it calls the wrong one.
- Context window saturation. Every tool schema, whether it’s JSON or Markdown, costs tokens, and a large catalog adds up to thousands of them. Pile up enough MCP servers and custom skills and you’re soaking the context window, which drives up cost and latency.
- The lost-middle problem. Models tend to ignore tools and instructions buried in the middle of a very long prompt.
Retrieval-augmented tool selection (RATS) fixes this by treating your tools like a corpus. Instead of dumping every schema into the prompt, you retrieve only the handful of tools relevant to the current task. The agent sees a short, sharp menu instead of the entire pantry.
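A minimal sketch of that idea follows. The tool names and descriptions are hypothetical, and `embed` is a toy bag-of-words stand-in so the example runs with the standard library alone; a real system would call an embedding model and precompute the tool vectors once.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog: name -> the description the agent would normally see.
TOOLS = {
    "run_sql": "Execute a SQL query against the analytics database and return rows",
    "send_email": "Send an email message to a recipient with subject and body",
    "git_commit": "Stage files and create a git commit with a message",
    "http_get": "Fetch a URL over HTTP and return the response body",
}

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(task, k=2):
    """Rank tools by similarity between the task and each description."""
    query = embed(task)
    ranked = sorted(TOOLS, key=lambda name: cosine(query, embed(TOOLS[name])), reverse=True)
    return ranked[:k]

def build_tool_menu(task, k=2):
    """Render only the retrieved tools into the text that goes in the prompt."""
    return "\n".join(f"- {name}: {TOOLS[name]}" for name in select_tools(task, k))

menu = build_tool_menu("query the database for last week's rows", k=1)
print(menu)
```

The prompt now carries one tool description instead of four. With a catalog of hundreds, the savings in tokens and in confusion grow accordingly.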
3. Dynamic Few-Shot Prompting
Few-shot prompting is a reliable way to enforce formatting constraints like a strict JSON schema, teach reasoning paradigms like chain of thought, or show an agent how to recover from errors. The problem is that static examples baked into a prompt are a guess. They might not match the task in front of you.
RAG lets you select the examples at runtime. You curate a database of gold-standard trajectories, each one pairing a specific query or error case with the correct step-by-step reasoning, tool calls, and final output that solved it. When a new task comes in, you search that database using the user’s intent, grab the top few most similar past trajectories, and prepend them to the system instructions.
So the agent always gets examples that actually resemble what it’s being asked to do, instead of whatever examples you happened to hardcode three weeks ago.
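Sketched in code, the runtime selection might look like this. The trajectories, the field names `query` and `solution`, and the prompt layout are all hypothetical, and `embed` is again a toy bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical gold-standard trajectories: a past query and what solved it.
TRAJECTORIES = [
    {"query": "convert a csv file to json",
     "solution": "Read rows with csv.DictReader, then serialize with json.dumps."},
    {"query": "retry a failed http request after a timeout",
     "solution": "Catch the timeout, wait with backoff, then call the endpoint again."},
    {"query": "rename a git branch",
     "solution": "Run git branch -m old new, then push the new name."},
]

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(task, system_instructions, k=1):
    """Prepend the k most similar past trajectories to the system instructions."""
    query = embed(task)
    ranked = sorted(TRAJECTORIES, key=lambda t: cosine(query, embed(t["query"])), reverse=True)
    examples = "\n\n".join(
        f"Example task: {t['query']}\nExample solution: {t['solution']}" for t in ranked[:k]
    )
    return f"{examples}\n\n{system_instructions}"

prompt = build_prompt("the http request failed with a timeout, what now",
                      "You are a careful agent.")
print(prompt)
```

Only the retry trajectory makes it into the prompt, because it is the closest match to the incoming task.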
4. Long-Term Agent Memory
Work directly with a model and it forgets everything the moment the session closes. Your preferences, your corrections, the choices you already made. Gone. For an agent to be useful, it needs persistent memory across sessions. I’ve written about this before.
One system here is mem0, which uses a hybrid RAG architecture to persist state. It extracts facts asynchronously, resolves conflicts when new information contradicts old, and grounds the retrieved memories back into the prompt. The retrieval layer is what lets the agent surface the right past fact at the right time instead of replaying the entire history.
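To make the shape of that concrete, here is a toy memory store. This is not mem0's API or implementation. `MemoryStore`, `remember`, and `recall` are names I made up, conflict resolution is reduced to last-write-wins on a key, extraction is skipped entirely, and `embed` is a bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryStore:
    """Hypothetical long-term memory: one current fact per key."""

    def __init__(self):
        self._facts = {}  # key -> fact text

    def remember(self, key, fact):
        # Simplified conflict resolution: a newer fact on the same key replaces the old one.
        self._facts[key] = fact

    def recall(self, query, k=1):
        """Return the k stored facts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._facts.values(), key=lambda f: cosine(q, embed(f)), reverse=True)
        return ranked[:k]

store = MemoryStore()
store.remember("editor", "User prefers the vim editor")
store.remember("language", "User writes services in Go")
store.remember("editor", "User prefers the helix editor")  # contradicts the first fact

recalled = store.recall("which editor does the user prefer")
print(recalled)
```

The store never replays the whole history. It answers with the single fact that matches the question, and the outdated editor preference is gone.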
5. Evaluation and Test Harnesses
Testing AI in production is hard because the outputs aren’t deterministic. So you build evaluation harnesses that run your agent across hundreds of test cases, and RAG turns out to be a quiet workhorse in that loop.
Two ways it helps:
- Diffing the test suite. Running every eval on every pull request is slow and expensive. Instead, query a vector index of your test suite using the git diff as the query, and run only the cases relevant to the code you actually touched.
- Semantic assertions. Exact string matching is useless when you’re verifying something like an agent’s summary. Instead, the harness retrieves historic successful runs and uses vector similarity to ask whether the new output matches the intent and tone of the target, rather than matching it character for character.
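Both ideas can be sketched in a few lines. The test names, the diff text, and the 0.5 threshold are assumptions for illustration, and `embed` is a toy bag-of-words stand-in, so the similarity here is really word overlap. A real harness would use an embedding model to get at intent and tone.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical eval suite: test name -> description of what it covers.
TESTS = {
    "test_refund_flow": "agent issues a refund for a cancelled order",
    "test_login_flow": "agent resets a password and logs the user in",
    "test_search_flow": "agent searches the product catalog by keyword",
}

def select_tests(diff, k=1):
    """Use the diff text as the query and keep the k closest test cases."""
    query = embed(diff)
    ranked = sorted(TESTS, key=lambda name: cosine(query, embed(TESTS[name])), reverse=True)
    return ranked[:k]

def semantically_matches(output, reference, threshold=0.5):
    """Pass when the new output is close enough to a known-good run."""
    return cosine(embed(output), embed(reference)) >= threshold

diff = "+ def issue_refund(order):  # refund cancelled order to original payment"
to_run = select_tests(diff)

reference = "The order was refunded to the original card"
passed = semantically_matches("Refunded the order to the original card", reference)
failed = semantically_matches("Your password has been reset", reference)
print(to_run, passed, failed)
```

The refund diff selects only the refund test, and the reworded summary passes even though it would fail an exact string comparison.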
None of this replaces the document-search version of RAG. It’s the same core trick (embed things, retrieve by similarity, ground the result) pointed at different problems: which tool, which example, which memory, which test. Once you start seeing retrieval as a general-purpose way to feed an agent the relevant slice of a much bigger pile, it shows up everywhere. I’ll probably keep finding more.
Sources
- Yury A. Malkov and Dmitry A. Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” (arXiv:1603.09320, 2016) — details on the multi-layer graph architecture and logarithmic complexity of the HNSW index.
- Hervé Jégou, Matthijs Douze, and Cordelia Schmid, “Product Quantization for Nearest Neighbor Search” (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011) — explains compressing high-dimensional vectors and combining product quantization with inverted file indexing (IVF-PQ).
- mem0ai, “mem0: The Memory Layer for Personalized AI” (GitHub) — documentation and codebase for the persistent, self-improving memory layer for AI agents.
- Mostafa Ibrahim, “Agentic RAG vs Classic RAG: From a Pipeline to a Control Loop” (Towards Data Science, March 2026) — commentary on the shift from static document retrieval to agentic control loops and its associated system failure modes.
- Microsoft Research, “Tool-space interference in the MCP era: Designing for agent compatibility at scale” — the tool-space interference (TSI) problem from section 2, where overlapping tool descriptions degrade agent tool selection.
- rewire.it, “Dynamic Tool Allocation for AI Agents (The RATS Pattern)” — the retrieval-augmented tool selection (RATS) pattern from section 2: a router retrieves a relevant subset of tools from a larger catalog.