Five Places RAG Shows Up in Agentic Systems

Ask most people what RAG is and they’ll tell you it’s semantic document search. You chunk up a pile of text, embed it, stuff it in a vector database, and pull the relevant bits back at query time. That’s the textbook example, and it’s a good one. But retrieval-augmented generation does a lot more work inside agentic systems than “search the docs,” and I think it’s worth walking through where it actually shows up. So let’s talk about it.

1. Semantic Document Search

This is the one everybody knows, so let’s get it out of the way first. You take raw text and convert it into high-dimensional vectors, where distance corresponds to conceptual similarity. Store those vectors, index them, and you can look things up by meaning instead of exact keywords.
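
Here is a minimal sketch of that lookup, with no index at all, just a brute-force scan. The `embed` function is an assumption made for illustration: it counts words so the example runs without a model, which means it matches on shared vocabulary rather than meaning. In a real system you would swap in an embedding model and keep the rest of the shape.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts, not a learned dense vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Brute-force scan: score every chunk against the query, keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical document chunks.
chunks = [
    "reset your password from the account settings page",
    "invoices are emailed on the first of each month",
    "the api rate limit is 100 requests per minute",
]
top = search("how do i reset my password", chunks, k=1)
```

The brute-force scan is exactly what an index like HNSW or IVF-PQ exists to avoid once the corpus gets large.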

The interesting part is how you index them, because the algorithm you pick is a real tradeoff.

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph. The upper layers have long-distance links for fast routing across the space, and the lower layers have short-distance links for local search. You get low query latency and high recall. The catch is memory. It keeps the raw vectors around, plus the graph links on top, so the footprint gets big.

IVF-PQ (Inverted File with Product Quantization) goes the other direction. It partitions the vector space into cells using k-means clustering, then compresses the high-dimensional vectors into compact quantized codes. Partition, then squish. That cuts memory consumption dramatically, which makes it a great fit for massive datasets with millions of vectors. The price you pay is recall accuracy, since all that compression throws away detail. And because the cells and codebooks are trained on the data you had at build time, adding a lot of new data can mean retraining and rebuilding the index.

Neither one is “correct.” You pick based on whether you’re optimizing for recall or for fitting the index in memory.
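
To put rough numbers on the memory side of that tradeoff, here is a back-of-the-envelope sketch. The figures are assumptions chosen for illustration (one million vectors, 768 dimensions, float32 storage, 96 sub-vectors at 8 bits each), and the arithmetic ignores HNSW graph links and IVF-PQ codebooks, so treat it as the vector storage only.

```python
def raw_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    # Full-precision storage: one float per dimension per vector.
    return n_vectors * dim * bytes_per_float

def pq_bytes(n_vectors: int, n_subvectors: int, bits_per_code: int = 8) -> int:
    # Product quantization: each sub-vector is replaced by a short code.
    return n_vectors * n_subvectors * bits_per_code // 8

n, dim = 1_000_000, 768             # assumed corpus size and embedding width
raw = raw_bytes(n, dim)             # 3_072_000_000 bytes, about 3 GB
pq = pq_bytes(n, n_subvectors=96)   # 96_000_000 bytes, about 96 MB
ratio = raw / pq                    # 32.0
```

Under these assumptions the quantized codes are 32 times smaller, which is the whole appeal, and also why some detail has to go missing.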

2. Tool Selection (RATS)

Here’s where it gets less obvious. Picture a CLI harness. As your developer toolkit grows, your agent slowly gets “equipped” with dozens or hundreds of possible actions. APIs, database calls, helpers, command executors. At some point you’ve just overloaded the thing with too much stuff.

Three bad things happen when you do that:

  • Tool space interference (TSI). Overlapping tool descriptions confuse the agent, and it calls the wrong one.
  • Context window saturation. Every tool schema, whether it’s JSON or Markdown, eats thousands of tokens. Pile up enough MCP servers and custom skills and you’re soaking the context window, which drives up cost and latency.
  • The lost-middle problem. Models tend to ignore tools and instructions buried in the middle of a very long prompt.

Retrieval-augmented tool selection (RATS) fixes this by treating your tools like a corpus. Instead of dumping every schema into the prompt, you retrieve only the handful of tools relevant to the current task. The agent sees a short, sharp menu instead of the entire pantry.
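
A minimal sketch of the idea, assuming a small hypothetical tool registry and the same toy word-count `embed` standing in for a real embedding model. The tool names and descriptions are made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical tool registry: name -> description that gets embedded.
TOOLS = {
    "run_sql": "run a sql query against the analytics database",
    "send_email": "send an email message to a recipient",
    "git_commit": "commit staged changes to the git repository",
    "http_get": "fetch a url over http and return the response body",
}

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(task: str, k: int = 2) -> list[str]:
    # Rank tool descriptions against the task and keep only the top k names.
    q = embed(task)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]

def build_tool_menu(task: str, k: int = 2) -> str:
    # Only the selected tools make it into the prompt.
    return "\n".join(f"- {name}: {TOOLS[name]}" for name in select_tools(task, k))

menu = build_tool_menu("commit my changes to git", k=2)
```

In practice you would probably also want a minimum similarity score, so a task that matches nothing does not drag in two arbitrary tools.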

3. Dynamic Few-Shot Prompting

Few-shot prompting is a reliable way to enforce formatting constraints like a strict JSON schema, teach reasoning paradigms like chain of thought, or show an agent how to recover from errors. The problem is that static examples baked into a prompt are a guess. They might not match the task in front of you.

RAG lets you select the examples at runtime. You curate a database of gold-standard trajectories, each one pairing a specific query or error case with the correct step-by-step reasoning, tool calls, and final output that solved it. When a new task comes in, you search that database using the user’s intent, grab the top few most similar past trajectories, and prepend them to the system instructions.

So the agent always gets examples that actually resemble what it’s being asked to do, instead of whatever examples you happened to hardcode three weeks ago.
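
Sketched in code, the runtime selection might look like this. The trajectory store, its contents, and the system instructions are all hypothetical, and `embed` is again a toy word-count stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical gold-standard trajectories: a past query plus what solved it.
TRAJECTORIES = [
    {"query": "convert a csv file into json records",
     "solution": "read rows with csv.DictReader, then json.dumps the list"},
    {"query": "retry a failed http request with backoff",
     "solution": "loop up to 3 times, sleeping 2**attempt seconds between tries"},
    {"query": "rename a column in a sql table",
     "solution": "ALTER TABLE ... RENAME COLUMN old TO new"},
]
SYSTEM_INSTRUCTIONS = "You are a careful coding agent. Reply with JSON only."

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(task: str, k: int = 1) -> str:
    # Pick the k past trajectories most similar to the new task.
    q = embed(task)
    best = sorted(TRAJECTORIES, key=lambda t: cosine(q, embed(t["query"])), reverse=True)[:k]
    shots = "\n\n".join(
        f"Example task: {t['query']}\nExample solution: {t['solution']}" for t in best
    )
    # Prepend the retrieved examples to the system instructions.
    return f"{shots}\n\n{SYSTEM_INSTRUCTIONS}\n\nTask: {task}"

prompt = build_prompt("convert csv file to json")
```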

4. Long-Term Agent Memory

Work directly with a model and it forgets everything the moment the session closes. Your preferences, your corrections, the choices you already made. Gone. For an agent to be useful, it needs persistent memory across sessions. I’ve written about this before.

One system here is mem0, which uses a hybrid RAG architecture to persist state. It extracts facts asynchronously, resolves conflicts when new information contradicts old, and grounds the retrieved memories back into the prompt. The retrieval layer is what lets the agent surface the right past fact at the right time instead of replaying the entire history.
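
To make the moving parts concrete, here is a toy memory store. This is not mem0’s API or its internals. `MemoryStore`, its methods, and the “latest fact wins” rule are assumptions made for illustration, the extraction step is skipped entirely, and `embed` is a word-count stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryStore:
    """Hypothetical long-term memory: one fact per key, retrieved by similarity."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def remember(self, key: str, fact: str) -> None:
        # Simplistic conflict resolution: a new fact for the same key replaces the old one.
        self._facts[key] = fact

    def recall(self, query: str, k: int = 1) -> list[str]:
        # Return only the k most relevant facts, not the whole history.
        q = embed(query)
        ranked = sorted(self._facts.values(), key=lambda f: cosine(q, embed(f)), reverse=True)
        return ranked[:k]

memory = MemoryStore()
memory.remember("editor", "user prefers vim as their editor")
memory.remember("meetings", "meetings should be scheduled in berlin time")
memory.remember("editor", "user prefers emacs as their editor")  # contradicts the first fact

recalled = memory.recall("which editor does the user prefer")
```

A real system has to decide which facts conflict in the first place, which is the hard part this sketch hands off to the caller through the `key` argument.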

5. Evaluation and Test Harnesses

Testing AI in production is hard because the outputs aren’t deterministic. So you build evaluation harnesses that run your agent across hundreds of test cases, and RAG turns out to be a quiet workhorse in that loop.

Two ways it helps:

  • Diffing the test suite. Running every eval on every pull request is slow and expensive. Instead, query a vector index of your test suite using the git diff as the query, and run only the cases relevant to the code you actually touched.
  • Semantic assertions. Exact string matching is useless when you’re verifying something like an agent’s summary. Instead, the harness retrieves historic successful runs and uses vector similarity to ask whether the new output matches the intent and tone of the target, rather than matching it character for character.
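
Both ideas fit in a short sketch. Everything named here is hypothetical: the test index, the thresholds, and the helper functions. As before, `embed` is a toy word-count stand-in for a real embedding model, so the similarity it measures is vocabulary overlap, not intent or tone.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical index of the eval suite: test name -> what the test covers.
TEST_INDEX = {
    "test_login_rejects_bad_password": "login rejects an invalid password",
    "test_invoice_totals": "invoice totals include tax and discounts",
}

def select_tests(diff: str, threshold: float = 0.2) -> list[str]:
    # Use the diff text as the query; keep tests whose description is close enough.
    q = embed(diff)
    return [name for name, desc in TEST_INDEX.items() if cosine(q, embed(desc)) >= threshold]

def semantically_matches(output: str, reference: str, threshold: float = 0.5) -> bool:
    # Semantic assertion: pass if the output is similar enough to a known-good run.
    return cosine(embed(output), embed(reference)) >= threshold

diff = "+ def check_password(user, password):\n+     return reject_login('invalid password')"
selected = select_tests(diff)

reference = "meeting moved to friday noon"  # stand-in for a retrieved successful run
ok = semantically_matches("the meeting was moved to friday at noon", reference)
bad = semantically_matches("the invoice is overdue", reference)
```

The thresholds are the fragile part. They are guesses here, and in a real harness you would likely tune them against runs you have already labeled as passing or failing.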

None of this replaces the document-search version of RAG. It’s the same core trick (embed things, retrieve by similarity, ground the result) pointed at different problems: which tool, which example, which memory, which test. Once you start seeing retrieval as a general-purpose way to feed an agent the relevant slice of a much bigger pile, it shows up everywhere. I’ll probably keep finding more.

