How to Pick an Embedding Model (Without Overthinking It)
It’s easy to get deep into vector database comparisons (HNSW vs. IVF, pgvector vs. Pinecone, Qdrant vs. Chroma) and completely skip over the thing that actually matters most: the embedding model.
The way I think about it, the embedding model is the brain of your retrieval system. The vector database is just its filing cabinet. If the model creates poor mathematical representations of your data, no amount of indexing strategy or database performance is going to save you. You’ll get fast, confident, wrong results.
So let’s talk about how to pick a model.
Dimensionality: More Isn’t Always Better
Embeddings are high-dimensional vectors. Common sizes are 384, 768, 1536, or 3072 dimensions. Higher dimensions capture more nuance, but they also mean more storage, more memory, and slower search.
For a lean, local-first setup, something like all-MiniLM-L6-v2 at 384 dimensions gives you a surprisingly good balance of speed and accuracy. You don’t need 3072 dimensions to search your notes. Save the big vectors for when you actually have a reason.
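If you want to see the number for yourself, here’s a minimal sketch, assuming the sentence-transformers library (one common way to run this model locally):

```python
# Minimal check of what a 384-dimension model actually hands back.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Where do we document the deploy process?")

print(vector.shape)                              # (384,)
print(model.get_sentence_embedding_dimension())  # 384
```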
Sequence Length: The Silent Data Killer
Sequence length determines how much text the model can look at to create a single vector. If you’re embedding long technical docs or sprawling Markdown files and your model caps out at 512 tokens, it’s just truncating everything past that point. Your carefully written documentation gets chopped, and the embedding only represents the first few paragraphs.
Modern long-context embedding models handle 8k to 32k tokens, which lets you embed entire chapters or large code blocks as single semantic units. If your content is longer than a few paragraphs, check this number before anything else.
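If you’re not sure where your model cuts off, check before you embed anything. A rough sketch, again assuming a sentence-transformers model (the exact limit depends on the model you load, and the file name here is just an example):

```python
# See the silent truncation before it happens: anything past
# max_seq_length tokens is dropped before the embedding is computed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # the token budget for a single embedding

long_doc = open("architecture-notes.md").read()  # hypothetical long document
token_count = len(model.tokenizer(long_doc)["input_ids"])
if token_count > model.max_seq_length:
    print(f"{token_count - model.max_seq_length} tokens will be silently ignored")
```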
Domain Matters More Than You Think
General-purpose models like OpenAI’s text-embedding-3-small work well across most tasks. They’ve been trained on massive, diverse datasets and they’re solid defaults.
If you’re searching a codebase or technical documentation, models fine-tuned on programming languages (like voyage-code-2) will outperform the general ones. The same applies to medical or legal text, where domain-specific jargon means the difference between a relevant result and a completely wrong one.
Check MTEB Before You Commit
The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing models. It breaks performance into sub-categories like Retrieval, Summarization, and Clustering. If you’re building RAG, look at the Retrieval scores specifically. A model that ranks well for clustering might be mediocre at retrieval, and vice versa.
Local vs. API: Pick Your Tradeoff
This decision is as important as the model itself.
- Local models (via HuggingFace or Ollama) keep everything offline. Zero per-request costs, full privacy. Something like bge-small-en-v1.5 running locally is perfect for personal knowledge management or anything where your data shouldn’t leave your machine.
- Hosted APIs (OpenAI, Voyage, Cohere) give you the highest performance and longest context windows without managing GPU infrastructure. Better for enterprise scale where you’re willing to trade privacy and recurring costs for accuracy.
Local models make sense for personal projects and hosted APIs make sense when the scale demands it. There’s no universal right answer, but there is a wrong one: picking a deployment model without thinking about where your data lives.
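In code, the two paths look roughly like this. It’s a sketch, not a recommendation of either client: it assumes the sentence-transformers library for the local side and the official OpenAI Python client for the hosted side, with the model names mentioned above.

```python
# Local path: model weights on your disk, nothing leaves the machine.
from sentence_transformers import SentenceTransformer

local_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
local_vec = local_model.encode("how do I rotate the API keys?",
                               normalize_embeddings=True)

# Hosted path: per-request cost, and your text goes to the provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small",
                                input="how do I rotate the API keys?")
api_vec = resp.data[0].embedding
```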
The vector database conversation is important, but it’s second in line to getting the embedding model right first. Everything downstream depends on it.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
NotebookLM Is Just RAG With a Nice UI
I’ve been watching AI YouTubers recommend NotebookLM integrations that involve authenticating your Claude instance with some random skill they built. “Download my thing, hook it up, trust me bro.” No details on how it works under the hood. No mention of why piping your credentials through someone else’s code might be a terrible idea. Let’s just gloss over that, I guess.
So here we are. Let me explain what NotebookLM actually is, because once you understand RAG, the magic disappears pretty quickly.
What Is RAG?
RAG stands for Retrieval Augmented Generation. It’s an AI framework that improves LLM accuracy by retrieving data from trusted sources before generating a response.
The LLM provides the reasoning and token generation. RAG provides specific, trusted context. Combining the two gives you general reasoning grounded in your actual data instead of whatever the model remembers, or hallucinates, from its training set.
The core pipeline looks like this:
1. Take your trusted data (docs, PDFs, YouTube transcripts, whatever)
2. Chunk it into pieces
3. Create vector embeddings from those chunks
4. Store the vectors in a database
5. When you ask a question, embed the question into the same vector space
6. Find the most similar chunks
7. Feed those chunks into the LLM as context alongside your question
That’s it. That’s NotebookLM. Steps 1 through 6 are the retrieval half. Step 7 is where the LLM synthesizes an answer. The nice UI on top doesn’t change what’s happening underneath.
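To make that concrete, here’s a minimal in-memory sketch of the whole pipeline. It assumes the sentence-transformers library and uses a plain numpy array where a real vector database would sit; the file name, the chunking strategy, and the top-k of 3 are illustrative, not prescriptive.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1-2: trusted data, chunked (here: one chunk per paragraph).
docs = open("notes.md").read().split("\n\n")

# Steps 3-4: embed the chunks and "store" them (numpy stands in for the DB).
chunk_vectors = model.encode(docs, normalize_embeddings=True)

# Step 5: embed the question into the same vector space.
question = "How do I rotate the API keys?"
query_vector = model.encode(question, normalize_embeddings=True)

# Step 6: find the most similar chunks (the dot product is cosine
# similarity here, because the vectors are normalized).
scores = chunk_vectors @ query_vector
top_chunks = [docs[i] for i in np.argsort(scores)[::-1][:3]]

# Step 7: feed those chunks to the LLM as context alongside the question.
prompt = ("Answer using only this context:\n\n"
          + "\n---\n".join(top_chunks)
          + f"\n\nQuestion: {question}")
```

Swap the numpy array for pgvector or Qdrant and you have the same architecture at scale. The shape of the pipeline doesn’t change.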
I Accidentally Built Half of It?
I was interested in the semantic embeddings portion of this pipeline and ended up building something I called Semantic Docs. It handles the retrieval half, steps 1 through 6.
You point it at a knowledge base: internal company docs, research papers, whatever you’re interested in. It chunks the content, creates vector embeddings, and stores them in a database. When you search, it creates a new embedding from your query, finds the most similar chunks, and returns those as search results.
The difference between Semantic Docs and NotebookLM is that last step. Semantic Docs gives you the relevant files and passages. It says “here’s where the answers live, go read it.” It doesn’t pipe everything through an LLM to generate a synthesized response. This is a choice, a deliberate choice, not a missing feature.
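To illustrate the distinction (this is a sketch of the idea, not Semantic Docs’ actual code), a retrieval-only search ends by returning the sources themselves:

```python
# Retrieval-only: hand back "here's where the answers live" and stop.
def search(query_vector, index, top_k=5):
    """index: list of (path, chunk_text, chunk_vector) built at embed time,
    with normalized vectors so the dot product is cosine similarity."""
    scored = [(path, text, float(vec @ query_vector)) for path, text, vec in index]
    scored.sort(key=lambda item: item[2], reverse=True)
    # No LLM call, no synthesis: just the passages and where they came from.
    return [{"path": p, "passage": t, "score": s} for p, t, s in scored[:top_k]]
```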
Why No Official API Is a Problem
NotebookLM doesn’t have an official API. People have reverse-engineered how it works, which means every integration you see is built on undocumented behavior that could break at any time. The AI YouTubers recommending these workflows are essentially saying “trust this unofficial thing with your data and credentials.” That should make you uncomfortable.
If you understand RAG, you can build the parts you actually need. The retrieval half is genuinely useful on its own, and you control the whole pipeline. No third-party authentication. No undocumented APIs. No wondering what happens to your data.
I’ll probably write more about RAG in the future. It’s a good topic and there’s a lot of noise to cut through. For now, just know that the next time someone tells you NotebookLM is magic, it’s really just vector search with a chat interface on top.
If you’re a developer, I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week. Or you can find me on Mastodon at @[email protected].