-
SAST vs AI PR Review: Two Tools, Different Jobs
If you have worked in DevSecOps, you might be wondering if AI pull request review tools are going to replace traditional SAST scanners. Short answer: no. Longer answer: they’re solving different problems, and if you’re picking one over the other, you might be making a mistake.
Here is how I think about it.
SAST is the Compliance Gatekeeper
Static Application Security Testing tools, think Semgrep, SonarQube, Checkmarx, Fortify, parse your source code (usually into an Abstract Syntax Tree) and hunt for known vulnerability patterns. They don’t run the code. They just read it and “pattern-match” against rules.
The focus here is security, compliance, and strict rule enforcement. SAST is the automated gatekeeper that makes sure your code clears the OWASP Top 10 bar before it merges.
What SAST does well:
- It’s deterministic. If a rule matches a pattern, the engine flags it every single time. Run it twice on the same code, get the same result.
- It satisfies auditors. Frameworks like PCI-DSS, SOC 2, and HIPAA expect documented secure-development practices, and a formal SAST scanner is the easiest way to produce that evidence. AI agents don’t count here, at least not yet.
- It can do real taint analysis. Enterprise tools can track untrusted input from the moment it enters your app to the moment it hits a dangerous sink.
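The determinism is easy to see in miniature. This toy rule, a loose sketch of how engines like Semgrep match patterns against the AST (not any real tool’s code), flags every call to `eval` and nothing else:

```python
import ast

# Toy SAST rule: flag every call to eval() by pattern-matching
# the parsed syntax tree. Same input always yields the same finding.
def find_eval_calls(source: str) -> list[int]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            findings.append(node.lineno)
    return findings

snippet = "x = input()\nresult = eval(x)\n"
print(find_eval_calls(snippet))  # [2], every single run
```

A real engine layers taint tracking and hundreds of rules on top, but the core loop is exactly this mechanical.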
Where SAST falls down:
- The false positive rate is brutal. Rigid rules with no context mean a lot of noise. Developer fatigue is real, and once your team starts ignoring scanner output, you’ve lost the game.
- It can’t see your business logic. A SAST tool has no idea what your application is supposed to do, so it can’t tell you when the logic itself is broken.
- Comprehensive scans are slow. Hours on large codebases isn’t unusual, though Semgrep has been doing good work on this front.
AI PR Agents are the Peer Reviewer
Tools like CodeRabbit, Qodo, Greptile, GitHub Copilot Code Review, Cursor Bugbot, and Claude Code (set up as a review skill) plug into your version control and read the PR diff with the surrounding code context. They behave less like a scanner and more like a colleague who actually read your changes.
The focus is developer productivity, code quality, logic bugs, and contextual feedback.
What they do well:
- They understand intent. LLMs can reason about why the code is changing, not just whether it matches a rule. That’s a different category of feedback.
- The signal-to-noise ratio is good. When an AI flags something, it usually comes with an explanation that makes sense. Less noise, more useful comments.
- They suggest fixes. Not just “this is wrong” but “here’s a diff you can apply.” That’s huge for actually closing the loop on review feedback.
- The scope is broader. Architecture, performance, style, security, all in one pass.
Where they fall down:
- They’re non-deterministic. Same vulnerability, two PRs, two different outcomes. That’s not a bug, that’s how LLMs work, and it’s why auditors don’t trust them.
- They don’t satisfy compliance. No auditor is going to accept “the AI looked at it” as a substitute for a formal scanner.
- Hallucinations happen. Invented issues, misread intent, suggestions that refactor things that didn’t need refactoring. You still need a human filtering the output.
The Quick Comparison
| Feature | SAST | AI PR Review |
| --- | --- | --- |
| Primary Goal | Security & Compliance | Code Quality & Productivity |
| Analysis Method | Deterministic rules & AST | Non-deterministic LLMs |
| Business Logic | Blind | Context-aware |
| False Positives | Often high | Usually low |
| Compliance Proof | Accepted as evidence | Not accepted |
| Feedback Loop | Dashboard / CI output | PR comments / chat |

The Lines Are Starting to Blur
The interesting thing happening right now is convergence from both directions.
On the SAST side, tools like DryRun Security are pitching themselves as “AI-native SAST,” trying to keep the deterministic backbone while using LLMs to filter out the false positives that make traditional scanners painful to live with.
On the AI agent side, CodeRabbit and Greptile keep getting better at catching real security vulnerabilities, not just style issues. They’re slowly creeping into territory that used to belong exclusively to SAST.
This is going somewhere, but it’s not there yet.
Where to Start Your Evaluation
Treat them as complementary, not competitive.
For SAST, evaluate against your audit footprint, the languages in your codebase, and how much false-positive triage your team can absorb. Semgrep, SonarQube, Checkmarx, and Fortify all sit in different price-and-friction zones, and the right one depends on what your business actually needs to prove.
For AI PR review, evaluate based on how it fits your existing review workflow, what languages and frameworks it understands well, and the signal-to-noise ratio in practice on your codebase. CodeRabbit, Qodo, Greptile, Copilot Code Review, Bugbot, and a Claude Code review skill all approach the problem differently.
If you pick one category and skip the other, you’re either passing compliance with mediocre code review, or getting great review feedback while failing your next audit. Neither is a win.
The AI tools aren’t replacing SAST. They’re filling in the gap SAST was never designed to cover.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ DevOps / AI / Programming / security
-
AI Code Reviewers Won't Save You
Dropping an AI reviewer into your pull request pipeline is just a band-aid. Tools like CodeRabbit or Greptile are great for catching syntax errors or basic anti-patterns, but they can’t assess architectural intent or domain-specific business logic. They’re spell-checkers for code. Useful, sure. But nobody ever said “our codebase is solid because we run spell check.”
AI doesn’t change your engineering baseline. It just accelerates it. If your foundational guardrails are weak, agentic tools will help your team generate technical debt at unprecedented speeds. So the real question isn’t “how do we review AI code?” It’s “how do we build systems that prevent slop from ever reaching production?”
Shift Left, Hard
When engineers use agents to scaffold a new Go service or spin up a SvelteKit frontend, they’re inevitably pulling in generated dependencies or utilizing unfamiliar libraries. Models hallucinate packages. They suggest insecure patterns with total confidence.
Your CI pipeline needs to be ruthless before a human ever looks at the code. Aggressive SAST and SCA should automatically block PRs that introduce vulnerable dependencies or hardcoded secrets. If the agent generates slop, the pipeline rejects it instantly. No discussion.
Make the Agents Write the Tests
Agents are incredibly eager to generate feature code, but humans are historically lazy about writing the tests for it. The influx of AI-generated code means human reviewers can’t possibly step through every logic branch manually.
So flip the script. Use the agentic tools to build the guardrails themselves. Mandate that any generated feature code must be accompanied by generated, human-verified unit tests. If an agent writes a sprawling TypeScript function, the build should fail if the test coverage doesn’t meet a strict threshold. You’re already using AI to write the code. Use it to prove the code works, too.
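As a sketch of what that gate can look like using coverage.py’s JSON report (the `totals.percent_covered` field is its real output shape; the threshold here is arbitrary):

```python
import json

THRESHOLD = 80.0  # arbitrary; pick what your team can actually sustain

# Reads the report produced by `coverage json` and decides pass/fail.
def gate(report: dict, threshold: float = THRESHOLD) -> bool:
    return report["totals"]["percent_covered"] >= threshold

# Toy report standing in for a real coverage.json file:
report = json.loads('{"totals": {"percent_covered": 72.4}}')
print(gate(report))  # False -> the CI job exits nonzero and the PR is blocked
```

Wire the boolean to the process exit code in CI and the rule enforces itself. No discussion, as promised.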
Context Boundaries Matter
Bloated AI output often happens because the model is given too much context or allowed to generate too much at once. Heavyweight IDEs with aggressive multi-file auto-completion can easily create cascading messes across a codebase.
Define strict architectural boundaries and API contracts upfront. Agents should be tasked with solving small, well-defined, modular problems. “Write a function that parses this specific JSON schema” is a good prompt. “Build the backend” is not. The tighter the scope, the less room for generated nonsense.
Observability Is Your Safety Net
You can’t catch all generated slop at the PR level. Some of it only reveals itself under load. An agent might write a technically correct query that causes an N+1 database issue, or introduce a subtle memory leak that passes all unit tests.
Your ultimate safety net is what happens at runtime. You need an airtight observability stack to trust the velocity AI brings. Logs, distributed tracing, metrics, all feeding into dashboards your team actually watches. When generated code hits staging, you need the immediate telemetry to spot performance regressions before they reach production.
Redefine the Human Review
Because AI makes the “typing” part of coding trivial, the human code review needs to fundamentally shift. Reviewers should no longer be looking for missing semicolons. They should be asking: “Does this component fit our architecture?” and “Did the agent over-engineer this solution?”
Train your senior engineers to review for intent and systemic impact. That’s the stuff AI genuinely can’t do yet. Leave the syntax checking to the robots.
/ DevOps / AI / Software-development / Code-review
-
Leading Teams When AI Does the Typing
AI is changing how teams build things. That part is obvious. What’s less obvious is that it doesn’t change what makes people want to follow you.
If anything, as the technical execution gets more automated, the human stuff becomes more important. Empathy, vision, strategic judgment. This has been top of mind for me a lot lately. I think the leaders who are going to thrive aren’t the ones who adopt AI the fastest. They’re the ones who understand what AI can’t do.
Focus on Outcomes, Not Output
AI accelerates generation. Code, docs, data parsing. All of it gets faster. That means individual output spikes, and measuring your team’s success by raw productivity becomes a losing game.
Your job is to provide the why. AI can handle the how, but it cannot determine why you’re building something in the first place. Crystal-clear business context and architectural vision are what keep your team’s AI-augmented velocity pointed in the right direction.
Something worth thinking about: when engineers can generate solutions faster, system complexity increases. Fast. Leadership means making sure the team is building the right things, not just building things quickly. Speed without direction is just expensive chaos.
Build the Guardrails Before You Need Them
Before you hand a team tools that let them move at lightspeed, make sure the brakes and steering are in place.
Automate your compliance. Strong SAST/SCA tooling, foolproof secret management, the boring stuff that lets people experiment without risking the infrastructure. As AI assists with more logic and agents take on more autonomous tasks, these guardrails become non-negotiable.
Same goes for observability. You can’t manage what you can’t see. When AI agents are handling high-volume tasks, a solid observability stack is what lets you trust the automation and catch it when a model hallucinates or a system drifts.
Automate the Toil, Protect the Thinking
A good leader uses AI to elevate human effort, not replace it. Look for the most repetitive, low-joy tasks in your team’s workflow. Scaffolding test environments, parsing logs, summarizing meetings. Deploy AI to handle the toil so your people can do the work that actually requires a brain.
Then protect that time. With AI handling the busywork, guard your team’s calendar for deep architectural thinking, complex debugging, creative system design. The stuff AI still struggles with. That’s where your team’s real value lives.
The Human Skills Are the Whole Game Now
The most valuable skills in an AI-driven org are the ones algorithms can’t replicate.
- Psychological safety. AI tools are changing fast. Your team needs to feel safe experimenting, failing, and learning. Punish well-intentioned experimentation and you’ll kill innovation before it starts.
- Mentorship. AI can answer factual questions all day long. It cannot mentor a junior engineer through a crisis of confidence or help a senior navigate organizational politics. Put your energy into 1:1s, career mapping, and active listening.
- Ethical judgment. As AI agents take on more responsibility in domains like finance, underwriting, or automated operations, you are the moral compass. You’re the one who needs to ask the hard questions about bias, fairness, and unintended consequences.
So What’s the Job Now?
The job is the same as it’s always been: create clarity, remove obstacles, and give a damn about your people. AI just raises the stakes on all three.
The leaders who get this right won’t be the ones with the best AI strategy deck. They’ll be the ones whose teams actually want to show up and build something together.
/ AI / Leadership / Management
-
How to Pick an Embedding Model (Without Overthinking It)
It’s easy to get deep into vector database comparisons, HNSW vs. IVF, pgvector vs. Pinecone, Qdrant vs. Chroma, and completely skip over the thing that actually matters most: the embedding model.
The way I think about it, the embedding model is the brain of your retrieval system. The vector database is just its filing cabinet. If the model creates poor mathematical representations of your data, no amount of indexing strategy or database performance is going to save you. You’ll get fast, confident, wrong results.
So let’s talk about how to pick a model.
Dimensionality: More Isn’t Always Better
Embeddings are high-dimensional vectors. Common sizes are 384, 768, 1536, or 3072 dimensions. Higher dimensions capture more nuance, but they also mean more storage, more memory, and slower search.
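The storage cost is easy to put numbers on. For float32 vectors (4 bytes per dimension), a back-of-envelope sketch:

```python
# Raw vector storage for n float32 vectors, before index overhead.
def index_size_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    return n_vectors * dims * bytes_per_dim / 1024**3

print(round(index_size_gb(1_000_000, 384), 2))   # 1.43 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # 11.44 GB
```

An 8x jump in dimensions is an 8x jump in memory, and HNSW indexes add their own overhead on top of the raw vectors.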
For a lean, local-first setup, something like `all-MiniLM-L6-v2` at 384 dimensions gives you a surprisingly good balance of speed and accuracy. You don’t need 3072 dimensions to search your notes. Save the big vectors for when you actually have a reason.

Sequence Length: The Silent Data Killer
Sequence length determines how much text the model can look at to create a single vector. If you’re embedding long technical docs or sprawling Markdown files and your model caps out at 512 tokens, it’s just truncating everything past that point. Your carefully written documentation gets chopped, and the embedding only represents the first few paragraphs.
Modern long-context embedding models handle 8k to 32k tokens, which lets you embed entire chapters or large code blocks as single semantic units. If your content is longer than a few paragraphs, check this number before anything else.
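If you’re stuck with a short-context model, the standard workaround is chunking with overlap. A minimal sketch, using whitespace words as a stand-in for real subword tokens:

```python
# Whitespace "tokens" stand in for a real tokenizer here; actual
# models count subword tokens, so treat max_tokens as approximate.
def chunk(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), step)]

doc = "word " * 1200  # a document well past a 512-token cap
print(len(chunk(doc)))  # 3 chunks; nothing past token 512 is silently lost
```

The overlap keeps sentences that straddle a chunk boundary represented in both neighbors, at the cost of a little duplicate storage.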
Domain Matters More Than You Think
General-purpose models like OpenAI’s `text-embedding-3-small` work well across most tasks. They’ve been trained on massive, diverse datasets and they’re solid defaults.
If you’re searching a codebase or technical documentation, models fine-tuned on programming languages (like `voyage-code-2`) will outperform the general ones. The same applies to medical or legal text, where domain-specific jargon means the difference between a relevant result and a completely wrong one.
Check MTEB Before You Commit
The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing models. It breaks performance into sub-categories like Retrieval, Summarization, and Clustering. If you’re building RAG, look at the Retrieval scores specifically. A model that ranks well for clustering might be mediocre at retrieval, and vice versa.
Local vs. API: Pick Your Tradeoff
This decision is as important as the model itself.
- Local models (via HuggingFace or Ollama) keep everything offline. Zero per-request costs, full privacy. Something like `bge-small-en-v1.5` running locally is perfect for personal knowledge management or anything where your data shouldn’t leave your machine.
- Hosted APIs (OpenAI, Voyage, Cohere) give you the highest performance and longest context windows without managing GPU infrastructure. Better for enterprise scale where you’re willing to trade privacy and recurring costs for accuracy.
Local models make sense for personal projects and hosted APIs make sense when the scale demands it. There’s no universal right answer, but there is a wrong one: picking a deployment model without thinking about where your data lives.
The vector database conversation is important, but it’s second in line to getting the embedding model right first. Everything downstream depends on it.
/ AI / Rag / Embeddings / Vector-databases
-
Pandoc vs MarkItDown: Two Tools, Two Eras
Pandoc has been the gold standard for document conversion for nearly two decades. But there’s a newer tool from Microsoft called MarkItDown, and while the names sound like they do similar things, they were built for completely different reasons.
Pandoc is a universal document converter designed for human publishing. It converts almost any format into almost any other format while preserving complex typography, citations, and formatting. MarkItDown is a specialized extraction tool designed for AI. It converts various files strictly into Markdown so that LLMs and RAG pipelines can read and process the text.
Same input files, very different goals.
Pandoc: The Universal Translator
Pandoc has been around since 2006, written in Haskell, and it operates on an Abstract Syntax Tree. It reads a document, builds a complex internal model of its structure, and then translates that structure into your desired output. We’re talking 40+ output formats here. PDF, Word, HTML, LaTeX, EPUB, you name it.
Where it really shines is academic and technical writing. It natively understands LaTeX math, footnotes, bibliographies, and cross-referencing. You can turn a Word doc into Markdown, edit it, and use Pandoc to turn it back into a perfectly formatted PDF. Two-way conversion that actually works.
You can also write custom filters in Lua or Python to programmatically alter documents during conversion. Want to automatically downgrade all your H2s to H3s? Pandoc has you covered.
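That H2-to-H3 example is only a few lines as a JSON filter. The element shape (`{"t": "Header", "c": [level, attr, inlines]}`) is pandoc’s actual AST encoding; the script itself is a minimal sketch you’d hook up with `--filter`:

```python
import json
import sys

# Pandoc pipes the document AST as JSON through stdin/stdout.
# This filter demotes every level-2 header to level 3.
def demote(block: dict) -> dict:
    if block.get("t") == "Header" and block["c"][0] == 2:
        block["c"][0] = 3
    return block

def run_filter() -> None:
    doc = json.load(sys.stdin)
    doc["blocks"] = [demote(b) for b in doc["blocks"]]
    json.dump(doc, sys.stdout)

# A level-2 header element, as pandoc encodes it:
h2 = {"t": "Header", "c": [2, ["", [], []], [{"t": "Str", "c": "Title"}]]}
print(demote(h2)["c"][0])  # 3
```

In practice most people reach for panflute, which handles the walking and the stdin plumbing for you.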
MarkItDown: The LLM Feeder
MarkItDown was released by Microsoft in late 2024 to solve a very modern problem. LLMs need clean, structured text to “read” documents, but corporate data is locked inside messy formats like multi-tab Excel spreadsheets, image-heavy PowerPoints, and ZIP archives.
It’s a Python library first, CLI second. It drops into your scripts in a few lines of code, which makes it easy to wire up with LangChain, LlamaIndex, or raw API calls. The output is always Markdown. That’s it. No PDF generation, no Word docs, no EPUB. Just clean text that an AI can process.
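To make the “clean text for AI” goal concrete, here’s the idea in miniature: a toy CSV-to-Markdown flattener. This is illustrative only, not MarkItDown’s code; the real library’s entry point is roughly `MarkItDown().convert(path)`.

```python
import csv
import io

# Toy version of spreadsheet ingestion: flatten tabular data into a
# Markdown pipe table that an LLM can read as plain structured text.
def csv_to_markdown(raw: str) -> str:
    rows = list(csv.reader(io.StringIO(raw)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

print(csv_to_markdown("region,revenue\nEMEA,1200\nAPAC,950"))
```

The point is the direction of travel: structure in, plain Markdown out, nothing preserved that an LLM doesn’t need.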
The interesting trick is what it does with images and audio. Feed it a PDF with diagrams and MarkItDown can connect to an LLM like GPT-4o to look at the image and write a Markdown description of what it sees. It can also transcribe audio files. That’s a fundamentally different approach from Pandoc, which preserves images as files rather than describing them.
Quick Comparison
| Feature | Pandoc | MarkItDown |
| --- | --- | --- |
| Primary Goal | Universal document conversion | Document ingestion for AI |
| Output Formats | 40+ (PDF, Word, HTML, LaTeX, etc.) | Only Markdown |
| Language | Haskell (standalone CLI) | Python (library-first) |
| Image Handling | Preserves and extracts image files | Uses OCR/LLM Vision to describe images as text |
| Complex Formatting | Citations, bibliographies, LaTeX math, custom filters | Basic structural support (headings, tables, slides) |

So Which One Do You Want?
Pandoc if you’re writing a book, research paper, or blog and need polished output in multiple formats. If you need to maintain citations, complex formatting, or convert files out of Markdown into something else, Pandoc is your tool.
MarkItDown if you’re building an AI agent, chatbot, or search tool and need to extract text from a pile of PDFs, Excel files, and PowerPoints. If you only care about getting raw structured text and don’t care about the visual layout of the original document, MarkItDown is purpose-built for that.
They’re not competitors. Pandoc is for publishing. MarkItDown is for feeding AI. Pick the one that matches what you’re actually trying to do.
/ AI / Tools / Development
-
pgvector vs Pinecone: You Probably Don't Need a Separate Vector Database
Every time someone starts building a RAG pipeline, the same question comes up: do I need a “real” vector database like Pinecone, or can I just use pgvector with the Postgres I already have?
I can imagine teams agonizing over this decision for weeks. So maybe this will save you some time?
The Case for Staying Put
If you already have a PostgreSQL instance in your stack, adding `pgvector` is almost always the right first move.
You manage one stateful service instead of two. Your existing backup strategy, monitoring, and security all stay the same. Your vector embeddings live next to your metadata, so you get ACID compliance and standard SQL joins. No syncing between two data stores. No eventual consistency headaches.
Performance? From what I found, for datasets under a few million vectors, `pgvector` with HNSW indexes is fast. Really fast. It satisfies the latency requirements of most applications without breaking a sweat.
And you’re not paying for another SaaS subscription…
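The join story deserves emphasis, because it’s the thing a standalone vector store can’t give you. A hedged sketch (the table and column names are invented; `<=>` is pgvector’s cosine-distance operator):

```python
# Hypothetical schema: `docs` holds business metadata, `doc_embeddings`
# holds the pgvector column. One query, one store, plain SQL filters.
NEAREST_WITH_METADATA = """
SELECT d.title,
       d.owner_team,
       e.embedding <=> %(query_vec)s AS distance
FROM docs d
JOIN doc_embeddings e ON e.doc_id = d.id
WHERE d.owner_team = %(team)s
ORDER BY e.embedding <=> %(query_vec)s
LIMIT 10;
"""
```

You’d run this through psycopg with a query embedding as the parameter. With a separate vector database, the metadata half of that query lives in a second system you have to keep in sync.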
When Pinecone Actually Makes Sense
Pinecone is a purpose-built vector database designed for high-dimensional data at massive scale. It’s serverless and fully managed.
If you’re dealing with hundreds of millions or billions of vectors, a specialized engine handles memory and disk I/O for similarity searches more efficiently than Postgres can. Pinecone also gives you native namespace support, metadata filtering optimized for vector search, and live index updates that are faster than re-indexing a large Postgres table.
Those are real advantages. At a certain scale.
The Decision Is Simpler Than You Think
Stay with Postgres + pgvector if:
- You want to minimize infra sprawl and moving parts
- Your vector dataset is under 5 to 10 million records
- You rely on relational joins between vectors and other business data
- You have existing observability and DBA expertise for Postgres
Consider Pinecone if:
- Your Postgres instance needs massive, expensive vertical scaling just to keep the vector index in memory
- You don’t want to tune HNSW parameters, `mmap` settings, or vacuuming schedules for large vector tables
- You need sub-millisecond similarity search at a scale where Postgres starts to struggle
That is what I would use to make that decision.
Most teams are probably nowhere near the scale where Pinecone becomes necessary. They have a few hundred thousand vectors, maybe a million or two. Postgres handles that without flinching. Adding a separate managed vector database at that point is just adding operational complexity for no measurable benefit.
The trap is thinking you need to “plan ahead” for scale you don’t have yet. You can always migrate later if you actually hit the ceiling. Moving from pgvector to Pinecone is a well-documented path. But moving from two services back to one because you overengineered your stack? That’s a conversation nobody wants to have.
Start with what you have. Add complexity when the numbers force you to, not when a vendor’s marketing page makes you nervous.
/ DevOps / AI / Programming / Databases
-
LangChain and LLM Routers, the Short Version
LangChain is important to know and understand in the age of agents. Also, LLM routing. They’re related but they’re not the same thing, and the distinction matters.
So let’s break it down.
LangChain is the Plumbing
Out of the box, an LLM is a text-in, text-out engine. It only knows what it was trained on. That’s it. LangChain is an open-source framework that connects that engine to the outside world.
It gives you standardized tools to build pipelines:
- Models: Interfaces for talking to different LLMs (Gemini, Claude, OpenAI, whatever you’re using)
- Prompts: Templates for dynamically constructing instructions based on user input
- Memory: Letting the LLM remember past turns in a conversation
- Retrieval (RAG): Connecting the LLM to external databases, PDFs, or the internet so it can answer questions about your data
- Agents & Tools: Letting the LLM actually do things, like execute code, run a SQL query, or send an email
You could wire all of this up yourself, but LangChain gives you the standard pieces so you’re not reinventing the plumbing every time.
LLM Routers are the Traffic Controller
A router is an architectural pattern you build on top of that plumbing. Instead of sending every request through the same prompt to the same massive model, a router evaluates the request and directs it to the right destination. Simple concept, big impact.
Three reasons you’d want one:
- Cost: You don’t need a giant, expensive model to answer “Hello!” or look up a basic fact. Send simple queries to a smaller, cheaper model. Save the heavy model for complex reasoning.
- Specialization: Maybe you have one prompt for writing code and another for searching a company HR manual. The router makes sure the query hits the right expert system.
- Speed: Smaller models and direct database lookups are faster. Routing makes your whole application more responsive.
How Routing Actually Works
In LangChain, there are two main approaches:
Logical Routing uses a fast LLM to read the user’s prompt and categorize it. You tell the router LLM something like: “If the user asks about math, output MATH. If they ask about history, output HISTORY.” LangChain then branches to a specialized chain based on that output.
Semantic Routing skips the LLM entirely for the routing decision. It converts the user’s text into a vector (an array of numbers representing the meaning of the text) and compares it to predefined routes to find the closest match. This is significantly faster and cheaper than asking an LLM to make the call.
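A semantic router fits in a screenful of code. This sketch swaps the embedding model for a crude word-count vector so the mechanics are visible; a real system would call an actual embedding model, but the compare-and-pick-closest logic is the same:

```python
import math

# Crude "embedding": word counts. Real routers embed with a model.
def embed(text: str) -> dict[str, int]:
    vec: dict[str, int] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each route is described by representative text, embedded once upfront.
ROUTES = {
    "code": embed("write debug refactor function code bug error stack"),
    "hr": embed("vacation policy benefits manager leave handbook payroll"),
}

def route(query: str) -> str:
    scores = {name: cosine(embed(query), vec) for name, vec in ROUTES.items()}
    return max(scores, key=scores.get)

print(route("how do I debug this function"))   # code
print(route("what is the vacation policy"))    # hr
```

No LLM call in the routing hot path, which is exactly why this approach is fast and cheap.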
LangChain provides `RunnableBranch` in LCEL (LangChain Expression Language, their declarative syntax for chaining components) for this, basically if/then/else logic for your AI pipelines. Worth digging into if you’re building with LangChain.
Routing is what makes AI applications practical at scale. LangChain is one way to build it. They’re complementary, not interchangeable.
/ AI / Programming / Langchain / LLM
-
Your Brain vs. a Large Language Model
We don’t fully understand the human brain. That’s just how things are. But we know enough about its structure to make some genuinely interesting comparisons to how large language models work. So let’s walk through the major components of your brain and see where the parallels land.
The Neocortex and the Transformer
The neocortex is the outer layer of your brain, responsible for the higher-order stuff: sensory perception, spatial reasoning, language. The prefrontal cortex (PFC) sits within it as the orchestrator. It handles executive function, decision-making, and complex thought.
The LLM equivalent here is the transformer architecture itself. And the PFC’s role maps surprisingly well to the attention mechanism. The attention mechanism decides what information matters most given the current context, which is essentially what your prefrontal cortex does all day.
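The “deciding what matters” part is concrete math. Here’s scaled dot-product attention on toy two-dimensional vectors, just to show the shape of it:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One query attends over keys; the output is a weighted mix of values.
def attention(query, keys, values):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # "how much does each position matter?"
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)
# The query aligns with the first key, so the output leans
# toward the first value: out[0] > out[1].
```

The softmax weights are the “executive function” here: a learned, context-dependent allocation of focus.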
If you’ve worked with agentic AI systems, you’ve probably seen this pattern play out directly. You typically have an orchestration agent managing specialized sub-agents, each built for a specific task. That management layer is doing PFC work, deciding which agent to activate, what context to pass along, and how to synthesize the results.
The Hippocampus and Memory
The hippocampus is your storage unit. It’s critical for forming new memories and converting short-term experiences into long-term ones. Think of it as a buffer between what just happened and what you’ll remember later.
The LLM equivalent splits into two pieces. The model weights are your long-term memory, everything learned during training. The context window is your working memory, what the model can hold in its head right now for the current conversation.
LLMs don’t natively have long-term memory. The weights are baked in during training and that’s it. But memory systems get bolted on as part of the harness, and this is where retrieval-augmented generation (RAG) comes in. RAG lets the model pull in external data to contextualize its responses, which is functionally the same thing your hippocampus does when it retrieves a stored memory to help you make sense of something new.
Synapses and Parameters
Synapses are the gaps between neurons where signals pass, chemical or electrical. The strength of those connections determines how information flows through your brain. Stronger connections mean faster, more reliable signal paths.
This maps directly to model weights and parameters. Stronger connections between data points in the model mean those patterns carry more influence over the output. When we say a model has 170 billion parameters, we’re effectively describing the synaptic density of a digital brain. It’s not a perfect analogy, but it gives you an intuitive sense of scale.
Dopamine and RLHF
Your brain’s dopamine system is its reward circuit. It fires when an outcome is better than expected, reinforcing beneficial behaviors over harmful ones. It’s how you learn that some choices are worth repeating.
The LLM equivalent is reinforcement learning from human feedback, or RLHF. During training, humans rank the model’s responses. Good answers get a mathematical reward signal that makes similar outputs more likely in the future. Bad answers get penalized. This is the alignment problem in a nutshell: teaching the model what we find valuable and useful, the same way dopamine teaches your brain what’s worth pursuing.
This is also where the analogy breaks down the most. Dopamine is intrinsic. It’s wired into your survival. You don’t choose to feel rewarded when you eat, your brain just does that. RLHF is a proxy. The model isn’t learning what’s actually helpful, it’s learning what a secondary reward model scores as helpful. The result is a system that optimizes to appear useful rather than be useful. That’s why models can be confidently wrong or agree with you when you’re clearly mistaken. The reward signal says “the human liked that,” not “that was true.”
The Basal Ganglia and Routing
The basal ganglia are your gating mechanism: a group of structures involved in motor control, habit formation, and deciding which thoughts or movements should surface and which should be suppressed. They’re basically your brain’s security and routing layer.
The LLM equivalent is the routing logic in mixture-of-experts (MoE) models. Every major provider uses some degree of MoE at this point. Different parts of the network activate depending on the task at hand, which is exactly what the basal ganglia does. System prompts play a similar role too, shaping how the model decides to respond given a particular input or situation.
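A top-1 MoE gate, reduced to its skeleton. The scoring below is a hand-written stand-in for a learned gating network, but the shape, score the experts and run only the winner, is the real pattern:

```python
# Two "experts"; in a real MoE these are separate feed-forward blocks.
EXPERTS = {
    "math": lambda x: f"math expert handles: {x}",
    "prose": lambda x: f"prose expert handles: {x}",
}

# Stand-in for a learned gating network.
def gate_scores(token: str) -> dict[str, float]:
    return {"math": 1.0 if any(c.isdigit() for c in token) else 0.0,
            "prose": 0.5}

def forward(token: str) -> str:
    scores = gate_scores(token)
    winner = max(scores, key=scores.get)
    return EXPERTS[winner](token)

print(forward("3+4"))    # math expert handles: 3+4
print(forward("hello"))  # prose expert handles: hello
```

Most of the network stays dark on any given token, which is the whole efficiency argument for MoE, and a decent cartoon of basal ganglia gating.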
So What?
None of these comparisons are perfect. The brain is biological, messy, and shaped by millions of years of evolution. LLMs are mathematical, deterministic (mostly), and shaped by a few years of engineering. But the structural parallels are hard to ignore. Attention mechanisms, memory systems, reward signals, gating logic. We keep arriving at similar architectural patterns, just built differently.
I don’t think that’s a coincidence, it tells us something about what intelligence requires, regardless of whether it’s running on neurons or GPUs.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ AI / Llms / Neuroscience
-
NotebookLM Is Just RAG With a Nice UI
I’ve been watching AI YouTubers recommend NotebookLM integrations that involve authenticating your Claude instance with some random skill they built. “Download my thing, hook it up, trust me bro.” No details on how it works under the hood. No mention of why piping your credentials through someone else’s code might be a terrible idea. Let’s just gloss over that, I guess.
So here we are. Let me explain what NotebookLM actually is, because once you understand RAG, the magic disappears pretty quickly.
What Is RAG?
RAG stands for Retrieval Augmented Generation. It’s an AI framework that improves LLM accuracy by retrieving data from trusted sources before generating a response.
The LLM provides the reasoning and token generation. RAG provides specific, trusted context. Combining the two gives you general reasoning grounded in your actual data instead of whatever the model memorized or hallucinated from its training set.
The core pipeline looks like this:
- Take your trusted data (docs, PDFs, YouTube transcripts, whatever)
- Chunk it into pieces
- Create vector embeddings from those chunks
- Store the vectors in a database
- When you ask a question, embed the question into the same vector space
- Find the most similar chunks
- Feed those chunks into the LLM as context alongside your question
That’s it. That’s NotebookLM. Steps 1 through 6 are the retrieval half. Step 7 is where the LLM synthesizes an answer. The nice UI on top doesn’t change what’s happening underneath.
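The steps above can be sketched in a few dozen lines. This is a minimal toy, not NotebookLM's actual implementation: bag-of-words counts stand in for real embeddings, and the documents are made up.

```python
import math
from collections import Counter

# Minimal sketch of the retrieval half of RAG. Bag-of-words counts stand in
# for real vector embeddings; a production system would use an embedding
# model and a vector database.
docs = [
    "RAG retrieves trusted context before generation",
    "Vector embeddings map text into a similarity space",
    "NotebookLM is a RAG pipeline with a chat interface",
]

def embed(text):
    # Steps 2-3: "chunk" (one doc per chunk here) and embed
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

index = [(doc, embed(doc)) for doc in docs]  # step 4: store the vectors

def retrieve(question, k=1):
    q = embed(question)                      # step 5: embed the question
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]    # step 6: most similar chunks

context = retrieve("what is NotebookLM?")
# Step 7 would feed `context` to the LLM alongside the question.
```

Swap the Counter for a real embedding model and the list for a vector database and you have the production version of the same pipeline.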
I Accidentally Built Half of It?
I was interested in the semantic embeddings portion of this pipeline and ended up building something I called Semantic Docs. It handles the retrieval half, steps 1 through 6.
You point it at a knowledge base, internal company docs, research papers, whatever you’re interested in. It chunks the content, creates vector embeddings, and stores them in a database. When you search, it creates a new embedding from your query, finds the most similar chunks, and returns those as search results.
The difference between Semantic Docs and NotebookLM is that last step. Semantic Docs gives you the relevant files and passages. It says “here’s where the answers live, go read it.” It doesn’t pipe everything through an LLM to generate a synthesized response. This is a choice, a deliberate choice, not a missing feature.
Why No Official API Is a Problem
NotebookLM doesn’t have an official API. People have reverse-engineered how it works, which means every integration you see is built on undocumented behavior that could break at any time. The AI YouTubers recommending these workflows are essentially saying “trust this unofficial thing with your data and credentials.” That should make you uncomfortable.
If you understand RAG, you can build the parts you actually need. The retrieval half is genuinely useful on its own, and you control the whole pipeline. No third-party authentication. No undocumented APIs. No wondering what happens to your data.
I’ll probably write more about RAG in the future. It’s a good topic and there’s a lot of noise to cut through. For now, just know that the next time someone tells you NotebookLM is magic, it’s really just vector search with a chat interface on top.
If you’re a developer, I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week. Or you can find me on Mastodon at @[email protected].
/ AI / Programming / Rag / Notebooklm
-
AI-Assisted vs AI-Agentic Coding
There are two ways to work (code) with AI tools right now. I think most people know the second one exists, but they haven’t taken the time to try it. You should know how to do both. And when to do both.
Assisted Mode
Everybody knows this one. You write some code, you get stuck, you ask a question.
How does date parsing work in Python? What’s this function do? Haven’t we built this already? I need some fucking Regex again.
The AI answers. You copy-paste or accept the suggestion. You keep going. You’re driving. The AI is in the passenger seat reading the map.
I mean, this is really useful. I’m not going to pretend it isn’t. It’s also just autocomplete with opinions. Fancy autocomplete. Smart autocomplete.
Great. You’re doing the thinking. You’re deciding what gets built and how to structure it and what order to do things in. You’re just asking for help on some of the blanks. That’s assisted mode.
Agentic Mode
This is different.
You describe what you want. You need to know how to describe what you want.
That is extremely important. Let me say that again. You need to know how to describe what you want.
You need to build an agent that understands how to interpret your description as what you want.
Sometimes it’s going to get it correct and sometimes it’s not. It’s going to go in a different direction than you wanted and you’re going to have to correct it. That’s the job now. You’re reviewing the output, the code, and how it’s producing the code. What are the gaps? You have to find the gaps and improve the agent so that it understands you better.
When I Use Which
I wish I had a clean rule for this. I don’t. That’s the vibes part.
Small or specific things can be assisted. Quick answers. Great. Easy. Move on.
Once you start wanting to touch multiple files, agentic. Major features like commands or parser changes or handler rewrites, recipes or tests. I’m not writing all that by hand. I can describe what I want way better than I can autocomplete it.
Bug fixes? Depends. If I already know where the bug is, assisted. If I don’t, agentic. Let the agent grep around and figure it out. It’s better at reading a whole codebase quickly than I am. Not better at understanding it. Better at reading it.
New features? Almost always agentic. I describe the feature, point it at similar code in the repo, and let it go.
Again, review is super important. Sometimes you have to send it back or start over or change major portions of it. And if you build a system that learns, it’ll get better along the way.
The Review Problem
Switching to agentic mode, your entire job is code review. All day, all the time, constant. That’s the human’s job. Code review.
Are you good at code review? You should get better at it. You need to get better at it.
This is not whether or not the tests pass. You need to identify possible issues and then describe tests that can check for those issues.
The nuanced bugs are the worst. And if those make it to production, you’re going to have problems.
Don’t skim the diff.
That should be the new motto. Read the code. Get better at code comprehension. It’s extremely important. You may be writing less code but you need to sure as shit understand what the code is doing and how it can be bad.
The Hybrid Reality
It’s totally fine to switch between modes depending on what you’re doing or your work session. Agentic can be way more impactful, but assisted mode is way better at helping you understand what the code is doing because you can select code blocks and easily ask questions about it.
So it’s not a toggle, it’s a spectrum. Now isn’t that funny? I’m on the spectrum of agentic development.
Where are you on the spectrum of agentic development?
So Which Is Better?
Neither. Both. It depends. Whatever, just build stuff.
Is assisted mode safer? Really? Like, does the human actually write better code this way? I don’t know. Agentic mode can be faster and you need to be super careful that it’s not gaslighting you into thinking it knows what it’s doing.
Build software for you. And when it makes sense, help out with the community stuff. Support open source.
If you’re a developer, I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week. Or you can find me on Mastodon at @[email protected].
/ AI / Development / Claude / Agents
-
Claude Opus 4.7 Is Here
Anthropic just announced Claude Opus 4.7 yesterday, and here is my take on the new model after reading the blog post and doing a bit of research on their rollout plans from previous models.
What’s New
The headline is a 13% improvement on a 93-task coding benchmark over Opus 4.6. On Rakuten’s SWE-Bench evaluation, 3x more production tasks were resolved, which is the kind of real-world metric that actually matters. Benchmarks are one thing, but “can it handle my actual codebase” is another.
The big quality-of-life improvement is that Opus 4.7 is better at verifying its own output before telling you it’s done. If you’ve ever had a model confidently hand you broken code and say “there you go,” you know why this matters. It handles long-running tasks with more precision, and the instruction following is noticeably tighter.
There’s also a major vision upgrade. The new model accepts images up to 2,576 pixels on the long edge, which is more than 3x the resolution of previous Claude models. If you’re working with technical diagrams, architecture charts, or screenshots of code, that’s a real improvement.
When Can You Actually Use It?
For enterprise customers, Anthropic says Opus 4.7 is available from your cloud vendor: the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. But most of us aren’t using the API directly.
As of right now, Opus 4.7 is not yet available in Claude Code or the desktop app. It’s also not showing up in the model picker on claude.ai for Pro plan users. Anthropic’s announcement says “available today across all Claude products,” but that doesn’t seem to have fully rolled out yet for consumer plans.
Looking at previous releases, Opus 4.6 launched on February 5th and was accessible on claude.ai and the API the same day. Historically, Anthropic hasn’t gated new Opus models behind higher tiers, so there’s no reason to think Pro, Max, Team, and Enterprise won’t all get access. The question is just when. If past patterns hold, it should show up within a few days. Keep checking your model picker.
Claude Code Users
As of today, Claude Code on the stable release is still on Opus 4.6. I’m not sure if it’s available on the bleeding edge builds, but for most people it’s not there yet.
The announcement mentions a few Claude Code features coming with 4.7:
- `/ultrareview` is a new slash command for dedicated code review sessions. Pro and Max users get three free ultrareviews to try it out.
- Auto mode has been extended to Max plan users, letting Claude make more decisions autonomously.
- The default effort level is being bumped to `xhigh` (a new level between `high` and `max`), which means the model will spend more time reasoning through harder problems.
Once Opus 4.7 does show up in Claude Code, remember to check any custom agents or skills that have a model hardcoded in the frontmatter. If you’ve got `claude-opus-4-6` specified in your `.claude/commands/` directory or agent configurations, those will keep using the old model until you update them.

Anthropic also notes that Opus 4.7 follows instructions more literally than previous models. Prompts written for earlier models can sometimes produce unexpected results. So if something feels off after switching, it’s worth re-tuning your prompts.
The Tokenizer and Cost Changes
One thing to be aware of: the tokenizer has been updated. The same input text will produce 1.0 to 1.35x more tokens than before. That means your costs could go up slightly even at the same per-token pricing ($5/million input, $25/million output, unchanged from 4.6). Not a dealbreaker, but worth watching if you’re running high-volume workloads.
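Here's the back-of-envelope math at the listed prices (the session sizes below are made-up examples):

```python
# Rough cost check of the tokenizer change at the unchanged per-token prices.
INPUT_PRICE = 5 / 1_000_000    # $ per input token
OUTPUT_PRICE = 25 / 1_000_000  # $ per output token

def session_cost(input_tokens, output_tokens, tokenizer_factor=1.0):
    # tokenizer_factor models the 1.0x-1.35x inflation in token counts
    return (input_tokens * tokenizer_factor * INPUT_PRICE
            + output_tokens * tokenizer_factor * OUTPUT_PRICE)

old = session_cost(2_000_000, 500_000)          # $22.50 under the old tokenizer
worst = session_cost(2_000_000, 500_000, 1.35)  # same text, worst case
```

Same text, same per-token prices, up to a 35% larger bill in the worst case.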
Pricing hasn’t changed, the coding improvements look useful, and it’s worth knowing that the model ID is `claude-opus-4-7`. Keep an eye on your model picker over the next few days.
-
Agentic Development Trends: What's Changed in Early 2026
I’ve been following the agentic development space around Claude Code and similar tools and the last couple months have been interesting. Here’s what I’m seeing as we move through March and April 2026.
From Solo Agents to Coordinated Teams
The biggest shift is that more people are moving away from trying to build one agent that does everything. Instead, we’re seeing coordinated teams of specialized agents managed by an orchestrator, often running tasks in parallel. I think this is the more proper use of these systems, and it’s great to see the community arriving here.
If you’re curious about the different levels of working with agentic software development, I created an agentic maturity model on GitHub that goes into more detail on this progression.
Long-Running Autonomous Workflows
Early on, agents handled what were essentially one-shot tasks. Now in 2026, agents can be configured to work for days at a time, requiring only strategic oversight at key decision points. Doesn’t that sound fun? You’re still the bottleneck, but at least now you’re a strategic bottleneck.
Graph-Based Orchestration
Frameworks like LangGraph and AutoGen are converging on graph-based state management to handle the complex logic of multi-agent workflows. I think this makes sense when you consider the branching and conditional logic of real-world tasks could map naturally to graphs.
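As a toy illustration of the idea (generic Python, not LangGraph's or AutoGen's actual API): nodes transform a shared state, and conditional edges decide which node runs next.

```python
# Minimal graph-based orchestration sketch: each node is (function, router),
# where the function transforms shared state and the router picks the next node.

def start(state):
    state["count"] = 0
    return state

def work(state):
    state["count"] += 1  # stand-in for an agent doing one unit of work
    return state

graph = {
    "start": (start, lambda s: "work"),
    "work":  (work,  lambda s: "work" if s["count"] < 3 else "end"),
}

def run(graph, node="start", state=None):
    state = state or {}
    while node != "end":
        fn, route = graph[node]
        state = fn(state)   # node transforms the shared state
        node = route(state)  # conditional edge picks the next node
    return state

result = run(graph)
```

The loop-until-done edge is the part that maps naturally onto real-world agent workflows: retry, branch, or hand off depending on what the state says.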
MCP Is Everywhere
MCP (Model Context Protocol) has become the industry standard for tool integration. All vendors fully support it, and there’s no sign of slowing down. Every week there are new MCP servers popping up for connecting agents to different services and tools.
Unified Agentic Stacks
The developer tooling is becoming more consistent. Cursor is becoming more like Claude Code, and Codex is becoming more like Claude Code. Maybe you see a pattern there… might tell you something about who’s setting the pace.
Also notable: people are experimenting with using different tools for different parts of the workflow. You might use Cursor to build the interface, Claude Code for the reasoning and main logic, and Codex for specific isolated tasks. Mix and match based on strengths.
Scheduled Agents and Routines
Claude Code recently released routines: scheduled or trigger-based automations that can run 24/7 on cloud infrastructure without needing your laptop. Microsoft is working on similar capabilities with GitHub Copilot, and Cursor had something like this a while back too.
Security Gets Serious
Two things happening here. First, people are getting better at leveraging agents for security reviews and monitoring. Tasks that previously required highly specialized InfoSec expertise. You no longer need to be a hacker to find vulnerabilities; you can let your AI try to hack you.
However, the same capabilities that harden defenses can also be used for offensive attacks. We’re seeing a major push for security-first architecture as a requirement for all new applications, specifically to defend against the rise of agentic offensive attacks. Red team and blue team are both getting AI-pilled.
FinOps: Watching the Bill
Last on the list is financial operations. Inference costs now account for over half of AI cloud spending according to recent estimates. Organizations are prioritizing frameworks that offer explicit cost monitoring and cost-per-task alerts. Getting granular about how much you’re spending to solve specific problems and optimizing at the task level. I think that’s pretty interesting and something we’ll see a lot more tooling around.
The common thread across all of these trends is maturity. We’re past the “wow, an AI wrote code” phase and into “how do we make this reliable, secure, and cost-effective at scale.” That’s a good place to be.
/ DevOps / AI / Development / Claude
-
What Is an AI Agent, Actually?
We need some actual definitions. The word “agent” is getting slapped onto every product and service, and marketers aren’t doing anybody favors as they SEO-optimize for the new agentic world we live in. There’s a huge range in what these things can actually do. Here is my attempt at clarity.
The Spectrum of AI Capabilities
Chatbot / Assistant — This is a single conversation with no persistent goals and no tool use. You ask it questions, it answers from a knowledge base. Think of the little chat widget on a product page that helps you find pricing info or troubleshoot a common issue. It talks with you, and that’s about it.
LLM with Tool Use — This is what you get when you open “agent mode” in your IDE. Your LLM can read files, run commands, edit code. A lot of IDE vendors call this an agent, but it’s not really one. It’s a language model that can use tools when you ask it to. The key difference: you are still driving. You give it a task, it does that task, you give it the next one.
Agent — Given a goal, it can plan and execute multi-step workflows autonomously. By “workflow” I mean a sequence of actions that depend on each other: read a file, decide what to change, make the edit, run the tests, fix what broke, repeat. It has reasoning, memory, and some degree of autonomy in completing an objective. You don’t hand it step-by-step instructions. You describe what you want done, and it figures out how to get there.
Sub-Agent — An agent that gets dispatched by another agent, a command, or an “LLM with Tool Use” to handle a specific piece of a larger task. If you’ve used Claude Code or Cursor, you know what I’m talking about. The main chat coordinator kicks off a sub-agent to go research something, review code, or run tests in parallel while it keeps working on the bigger picture. The sub-agent has its own context and tools, but it reports back to the parent. It’s not a separate autonomous agent with its own goals. It’s more like delegating a subtask.
Multi-Agent System — Multiple independent agents coordinating together, either directly or through an orchestrator. The key difference from sub-agents: these agents have their own goals and specialties. They negotiate, hand off work, and make decisions independently. Think of a system where one agent monitors your infrastructure, another handles incident response, and a third writes the postmortem. Each agent operates autonomously but stays aware of the others.
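The "Agent" level above can be sketched as a loop: plan, act, evaluate, repeat until the goal is met. The planner and executor below are stand-ins for LLM calls and tool use, and the toy goal is deliberately trivial.

```python
# Minimal agent loop sketch: given a goal, plan the next step, execute it,
# check progress, and repeat. Real agents swap in LLM calls and tools here.

def agent(goal, plan, execute, is_done, max_steps=10):
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):
        if is_done(state):
            return state
        action = plan(state)      # decide the next step from goal + history
        result = execute(action)  # tool use: edit a file, run tests, etc.
        state["history"].append((action, result))
    return state

# Toy usage: "do 3 steps" stands in for a real multi-step objective.
done = agent(
    goal=3,
    plan=lambda s: len(s["history"]) + 1,
    execute=lambda a: f"step {a} done",
    is_done=lambda s: len(s["history"]) >= s["goal"],
)
```

The defining feature is that loop: you hand over the goal and the evaluation criteria, not the step-by-step instructions.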
So How Is Something Like OpenClaw Different From a Chatbot?
A chatbot is designed to talk with you, similar to how you’d just talk with an LLM directly. OpenClaw is designed to work for you. It has agency. It can take actions. It’s more than just a conversation.
Obviously, how much it can do depends on what skills and plugins you enable, and what degree of risk you’re comfortable with. But here’s the interesting part: it’s proactive. It has a heartbeat mechanism that keeps it running continuously in the background. It’ll automatically check on things or take action on a schedule you specify, without you having to prompt it.
A Few Misconceptions Worth Clearing Up
OpenClaw is just one specific framework for building and orchestrating agents, but the misconceptions around it apply broadly.
“Agents have to run locally.” That’s how OpenClaw works, sure. But in reality, the enterprise agents are running invisibly in the background all the time. Your agent doesn’t need to live on your laptop.

“Agents need a chat interface.” Because you can talk to an agent, people assume you must have a chat interface for it to be an agent. But by definition, agents don’t require a conversation. They can just run in the background doing things. No chat window needed.

“Sub-agents are just function calls.” This one trips up developers. When your agent spawns a sub-agent, it’s not the same as calling a function. The sub-agent gets its own context window, its own reasoning loop, its own tool access. It can make judgment calls the parent didn’t anticipate. That’s fundamentally different from passing arguments to a function and getting a return value.
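A rough way to see the contrast in code (hypothetical names, not any real framework's API): a function maps arguments to a value and remembers nothing, while a sub-agent object carries its own context and its own loop, then reports back.

```python
# Illustrative contrast between a function call and a sub-agent.

def plain_function(x):
    return x * 2  # arguments in, value out, nothing remembered

class SubAgent:
    def __init__(self, task):
        self.context = [f"task: {task}"]  # its own context, separate from the parent

    def step(self, observation):
        self.context.append(observation)  # its own loop accumulates state
        # A real sub-agent would reason over self.context and pick a tool here.
        return len(self.context)

    def report(self):
        return self.context[-1]  # reports back to the parent when done

sub = SubAgent("review the auth module")
sub.step("found a missing input check")
```

The function's behavior is fully determined by its arguments; the sub-agent's next move depends on everything it has seen so far, which is where the unanticipated judgment calls come from.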
Why Write This Down
I mainly wrote this for myself. I keep running into these terms and needing a mental model to put them in context, especially as I’m thinking about building agentic systems and trying to decide what level of capability I actually need for a given problem. The process of writing it down makes those decisions somewhat easier.
-
A Concrete Definition of an AI Agent
An AI agent pursues a goal by iteratively taking actions, evaluating progress, and deciding next steps. Useful agents must be reliable, adaptive, and accurate.
/ AI / links / agent / automation
-
The Death of Clever Code
One positive byproduct of working with agentic tools is that they rarely suggest clever code. No arcane one-liners, no “look how smart I am” abstractions. And, well, I’m here for it.
Before we continue it helps to understand a bit about how LLMs work. These models are optimized for pattern recognition. They’ve been trained on massive amounts of code and learned what patterns appear most frequently.
Clever code, by definition, is bespoke. It’s the unusual pattern, the one-off trick. There just isn’t enough training data for cleverness. The AI gravitates toward the common, readable solution instead.
Let me give you an example.
Show Me the Code
Here’s a nested ternary:
```js
const result = a > b ? (c > d ? 'high' : 'mid') : (e > f ? 'low' : 'none');
```

I’d be impressed if you could explain that correctly on your first try. What happens when there’s a bug in one of those conditions? Good luck debugging that.
Now here’s the same logic:
```js
let result;
if (a > b) {
  if (c > d) {
    result = 'high';
  } else {
    result = 'mid';
  }
} else {
  if (e > f) {
    result = 'low';
  } else {
    result = 'none';
  }
}
```

A lot easier, right? If it’s easy to read, it’s easy to maintain. The AI tooling doesn’t struggle to read either version, but you might, and when there is a bug, explaining exactly what needs to change becomes the hard part.
Actually wait. It turns out, not all complexity is created equal.
Two Kinds of Complexity
Essential complexity is the complexity of the problem itself. If you’re building a mortgage calculator or doing tax calculations, there’s inherent complexity in understanding the domain. You can’t simplify that away, and you shouldn’t try.
Accidental complexity is the stuff you introduce. The nested ternary instead of the if/else. Five layers of abstraction for the sake of abstraction that only runs in a specific edge case. Generic utility functions where you’ve tried to cover every possible scenario, but realistically you only need two or three cases.
Ok but what about abstraction, since abstraction is where accidental complexity loves to hide?
Good Abstraction vs. Bad Abstraction
Abstraction shows up everywhere in programming, but let’s think about it in two flavors.
Good abstraction hides details the caller doesn’t need to care about. The interface clearly communicates what it does. Think `array.sort()`: you look at it and immediately know what’s happening. Those dang arrays getting some sort of sorted. You know exactly what it does without caring about the implementation.

Bad abstraction hides details you do need to understand in order to use it correctly. Think of a `processData()` method that’s doing six different things with an internal state that’s nearly impossible to test. And splitting it into `processData1()` through `processData6()` doesn’t help either. That’s just moving the vegetables around on your plate, which doesn’t mean you’ve actually finished dinner.

AI Signals
So why does any of this matter for working with AI coding tools?
Because if the agents keep getting your code wrong, if they consistently misunderstand what a function does or keep making incorrect modifications, that’s a signal.
It’s telling you that your code has some flavor of cleverness that makes it hard to reason about. Not just for the AI, but for your team, and for you six months from now.
The goal is to code where the complexity comes from the problem, not from the solution. The AI struggling with your code is like a canary in the coal mine for maintainability.
/ AI / Programming / Code-quality
-
Your AI Agent Needs a Task Manager
If you’ve spent time working with AI coding tools, you’ve probably hit the compaction wall. Suddenly, your agent knows what it’s currently working on but has completely forgotten the five other things connected to it.
This is the memory problem, and it’s a big one.
The Context Window Isn’t Enough
Your AI agent needs some sort of memory system that lives outside the context window. When you’re working on simple, one-off tasks, the chat-as-workspace approach works fine. You ask a question, you get an answer, you move on. But the moment you’re tackling a complex set of related tasks? It breaks down fast.
I’ve been thinking about this through the lens of a framework I’m calling the Agentic Maturity Model. The short version is that there are distinct levels to how teams and developers use AI agents, and moving between levels isn’t about using “better” tools, but rather it’s a shift in how you approach the work.
Four months ago, there were no real options. The good news? It seems like all the model providers recognize this is the next frontier. Memory and persistence are where I’m looking for the actual progress to happen next.
Claude Code has certainly gotten better in these areas over the last couple of months. They’ve added an auto memory feature in beta. They added a lightweight Tasks system based on a Todo system called Beads built by Steve Yegge. His key idea was that the task state should live outside the context window.
These are meaningful building blocks towards an actual working memory system that persists across sessions and survives compaction.
We’re Almost There
The tooling and harnesses we built on top of the LLMs are already changing how software gets built, but where are we headed? Here is what I think:
- auto-improving memory: where the agent learns your patterns, your codebase, your preferences
- persistent task tracking that survives compaction: Tasks, todos, issues, whatever you want to call them. The point is they exist outside the conversation.
When those two pieces come together properly, the workflow for everyone will change again.
Your agent doesn’t just respond to the current prompt. It knows where it is in a larger plan, what’s been done, what’s blocked, and what’s next. That’s the difference between a helpful chatbot and an actual collaborator.
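One minimal shape for that kind of persistence, sketched (the file name and fields below are made up, not any real tool's format): task state in a plain JSON file that any fresh session can reload.

```python
import json
import os
import tempfile

# Sketch of task state that lives outside the context window: a plain JSON
# file the agent rewrites as it works. It survives compaction because it was
# never in the conversation at all.
path = os.path.join(tempfile.gettempdir(), "agent_tasks.json")

tasks = [
    {"id": 1, "title": "write parser", "status": "done"},
    {"id": 2, "title": "add tests", "status": "blocked", "blocked_on": 1},
    {"id": 3, "title": "update docs", "status": "todo"},
]
with open(path, "w") as f:
    json.dump(tasks, f)

# A fresh session (new context window) reconstructs where things stand:
with open(path) as f:
    reloaded = json.load(f)
next_up = [t for t in reloaded if t["status"] == "todo"]
```

Everything the agent needs to resume, what's done, what's blocked, what's next, lives on disk instead of in the conversation.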
We are so close I can taste the blood in the water, oh wait, that’s mine. ☠️
-
Using Claude to Think Through a Space Elevator
When I say I wanted to understand the engineering problems behind building a space elevator, I mean I really wanted to dig in. Not just read about it. I wanted to work through the challenges, piece by piece, with actual math backing things up.
So I decided to see what Claude and I could do with this kind of problem.
Setting it Up
I have an Obsidian vault that Claude Code/CoWork has access to, and I started by asking it to help me understand the core challenges of building a space elevator. First things first: clearly state all the problems. What are the engineering hurdles? What makes this so hard?
From there, I started asking questions. Could we use an asteroid as the anchor point and manufacture the cable in space? How would we spool enough cable to reach all the way down to Earth? Would it make more sense to build up from the ground, down from orbit, or meet somewhere in the middle?
I’ll admit I made some mistakes along the way. I confused low Earth orbit with geostationary orbit at one point but Claude corrected me and explained the difference. That’s part of what makes this approach work. You’re not just passively reading; you’re actively thinking through problems and getting corrected when your mental model is off.
Backing It Up With Math
Here’s where it got really interesting. I told Claude: don’t just describe the problems. Prove them. Back up every challenge with actual math and physics calculations.
I also told it not to try cramming everything into one massive document. Write an overview document first, then create supporting documents for each problem so we could work through them individually.
So Claude started writing Python code to validate all the calculations. I hadn’t planned on that initially, but once it started writing code, I jumped in with my typical guidance. Use a package manager, write tests for all the code.
What we ended up with is a Python module covering about 12 of the hardest engineering challenges for a space elevator. There’s a script that calls into the module, runs all the math, and spits out the results. It’s not a complete formal proof of anything, but it’s a structured way to think through problems where the code can actually catch mistakes in the reasoning.
And it did catch mistakes. That’s the whole point of this approach, you’re using the calculations as a check on the thinking, not just trusting the narrative.
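One small example of the kind of check involved (my own sketch here, not the actual module): deriving the geostationary radius from first principles, since that's the altitude a space elevator's design is anchored around.

```python
import math

# Derive the geostationary orbital radius from Kepler's third law:
# T^2 = 4 pi^2 r^3 / (G M), solved for r. The cable's center of mass has to
# sit near this altitude, so the number anchors many other calculations.
G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24    # Earth's mass, kg
T = 86164.1     # sidereal day, s

r_geo = (G * M * T**2 / (4 * math.pi**2)) ** (1 / 3)
altitude_km = (r_geo - 6.371e6) / 1000  # above Earth's mean radius

# About 42,164 km from Earth's center, roughly 35,800 km up.
```

A calculation like this is exactly the kind of thing that catches a confused mental model, like mixing up low Earth orbit with geostationary orbit.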
Working Through Problems Together
As we worked through each challenge, I kept asking clarifying questions. What about this edge case? How would we handle that constraint?
It was genuinely collaborative, me bringing curiosity and some engineering intuition, Claude bringing the ability to quickly formalize ideas into code and calculations.
The code isn’t public or anything. But the approach is what I think is worth sharing.
The Hard Part Is Still Hard
My main limiting factor is time. The math looks generally fine to me, but if I really wanted to verify everything thoroughly, I’d need to spend a lot more time with it. A mathematician or physicist who’s deeply familiar with these calculations would be much faster at spotting issues. Providing guidance like, “no, you shouldn’t use this formula here, that approach is wrong.”
I can do that work. It’s just going to take me significantly longer than someone with that specialized background.
This is what I mean when I talk about working with agentic tools on hard problems. It’s not about asking an AI for the answer. It’s about using it as a thinking partner; one that can write code, run calculations, and help you check your reasoning as you go.
For me, that’s the real power of tools like Claude. Not replacing expertise, but amplifying curiosity.
/ AI / Claude / Space / Engineering
-
Voice-to-Text in 2026: The Tools and Models Worth Knowing About
As natural language becomes a bigger part of how we build software, it’s worth looking at the state of transcription models. What’s the best way to get voice to text right now?
For a lot of people, talking to your computer is faster than typing. You can stream-of-thought your way through an idea, prompt your tools, and get things moving without your fingers being the bottleneck. If you haven’t tried it yet, it will change how you work with your machine. I’m not exaggerating.
The Tools
Here’s what people are actually using for desktop voice-to-text:
- Willow Voice — Popular choice, lots of people swear by it
- SuperWhisper — My current pick
- Wispr Flow — Another well-regarded option
- Voice Ink — Worth a look?
- Aiko — From an Open Source dev, Sindre Sorhus
- MacWhisper — Solid Mac-native option
I’ve tried several of these, and the biggest pain point is that many require monthly subscriptions. I’ve been happy with SuperWhisper, and it’s worth mentioning they still have a pay-once (lifetime) option, so you don’t get locked into monthly payments forever. That said, Willow Voice and Wispr Flow both have strong followings.
The Models Behind the Magic
Most of these tools started with OpenAI’s Whisper, the voice model released and open-sourced back in 2022. With Whisper, you could run solid transcription locally on your own hardware.
But we’re a few years past that now, and there are some more models to choose from. Here is a summary table of the current state of the transcription models.
| Model | Company | Released | Local Run? | Used in Desktop Tools? | Best For |
| --- | --- | --- | --- | --- | --- |
| Whisper Large-v3 | OpenAI | Nov 2023 | Yes | Yes (The Standard) | Multilingual accuracy (99+ langs) |
| Whisper v3 Turbo | OpenAI | Oct 2024 | Yes | Yes (Fast Settings) | Best speed-to-accuracy ratio for local use |
| Nova-3 | Deepgram | Apr 2025 | Self-Host | Limited (API-based) | Real-time agents; handling messy background noise |
| Parakeet TDT 1.1B | NVIDIA | May 2025 | Yes | Developer-focused / CLI | Ultra-low latency; significantly faster than Whisper |
| SenseVoice-Small | Alibaba | July 2024 | Yes | Emerging (Fringe) | High-precision Mandarin/English and emotion detection |
| Canary-1B | NVIDIA | Oct 2025 | Yes | Developer-focused | Beating Whisper on technical jargon & punctuation |
| Voxtral Mini V2 | Mistral | Feb 2026 | Yes | Yes (Privacy apps) | High-speed local transcription on low-VRAM devices |
| Granite Speech 3.3 | IBM | Jan 2026 | Yes | No (Enterprise focus) | Reliable technical ASR with an Apache 2.0 license |
| Scribe v2 | ElevenLabs | Jan 2026 | No | Via API | Extremely lifelike punctuation and speaker labels |

We’re at an interesting inflection point. You can articulate your thoughts faster by speaking than typing, and it’s becoming a real productivity gain. It’s not just an accessibility aid anymore. People who can type well enough are using these tools on a daily basis.
That’s all for now!
/ Productivity / AI / Tools / Voice
-
Your Context Window Is a Budget — Here's How to Stop Blowing It
If you’re using agentic coding tools like Claude Code, there’s one thing you should know by now: your context window is a budget, and everything you do spends it.
I’ve been thinking about how to manage that budget. As we learn to use sub-agents, MCP servers, and all these powerful capabilities, we haven’t been thinking enough about the cost of using them. The dollars and cents matter too if you’re using API access, but the raw token budget you burn through in a single session affects everyone regardless. Once it’s gone, compaction kicks in, and it’s a crapshoot whether the new session knows how to pick up where you left off.
Before we talk about what you can do about it, let’s look at where your tokens actually go.
Why Sub-Agents Are Worth It (But Not Free)
Sub-agents are one of the best things to have in agentic coding. The whole idea is that work happens in a separate context window, leaving your primary session clean for orchestration and planning. You stay focused on what needs to change while the sub-agent figures out how.
Sub-agents still burn through your session limits faster than you might expect. There are actually two limits at play here:
- the context window of your main discussion
- the session-level caps on how many exchanges you can have in a given time period.
Sub-agents hit both. They’re still absolutely worth using and working without them isn’t an option, but you need to be aware of the cost.
The MCP Server Problem
MCP servers are another area where things get interesting. They’re genuinely useful for giving agentic tools quick access to external services and data. But if you’ve loaded up a dozen or two of them? You’re paying a tax at the start of every session just to load their metadata and tool definitions. That’s tokens spent before you’ve even asked your first question.
My suspicion, and I haven’t formally benchmarked this, is that we’re headed toward a world where you swap between groups of MCP servers depending on the task at hand. You load the file system tools when you’re coding, the database tools when you’re migrating, and the deployment tools when you’re shipping. Not all of them, all the time.
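I haven’t benchmarked this either, but the swap could be as simple as symlinking a task-specific config into place. Claude Code reads project-scoped servers from .mcp.json; the per-task file names and server entries below are invented for illustration:

```shell
# Sketch: keep one MCP config per task and activate only what you need.
# mcp.coding.json and mcp.deploy.json are hypothetical names.
cat > mcp.coding.json <<'EOF'
{ "mcpServers": { "filesystem": { "command": "fs-server" } } }
EOF
cat > mcp.deploy.json <<'EOF'
{ "mcpServers": { "deploy": { "command": "deploy-server" } } }
EOF

# Coding session: load only the filesystem tools.
ln -sf mcp.coding.json .mcp.json

# Shipping day: swap in the deployment tools instead.
# ln -sf mcp.deploy.json .mcp.json
```

The point isn’t the exact mechanism; it’s that the active config stays small, so the per-session metadata tax stays small.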
There are likely more subtle problems too. When you have overlapping MCP servers that can accomplish similar things, the agent could get confused about which tool to call. It might head down the wrong path, try something that doesn’t work, backtrack, and try something else. Every one of those steps spends your token budget on nothing productive.
The Usual Suspects
Beyond sub-agents and MCP servers, there are the classic context window killers:
- Web searches that pull back pages of irrelevant results
- Log dumps that flood your context with thousands of lines
- Raw command output that’s 95% noise
- Large file reads when you only needed a few lines
The pattern is the same every time: you need a small slice of data, but the whole thing gets loaded into your context window. You’re paying full price for information you’ll never use.
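When you do know what you’re after, a targeted slice is dramatically cheaper than a dump. A toy illustration (the log file and its contents are made up):

```shell
# Simulate a noisy 2,001-line log where only one line matters.
seq 1 2000 | sed 's/^/INFO heartbeat /' > app.log
echo "ERROR database connection refused" >> app.log

# Dumping the whole file costs ~2,001 lines of context:
wc -l < app.log

# A targeted filter costs exactly one:
grep ERROR app.log
```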
And here’s the frustrating part — you don’t know what the relevant data is until after you’ve loaded it. It’s a classic catch-22.
Enter Context Mode
Somebody (Mert Köseoğlu - mksglu) built a really clever solution to this problem. It’s available as a Claude Code plugin called context-mode. The core idea is simple: keep raw data out of your context window.
Instead of dumping command output, file contents, or web responses directly into your conversation, context-mode runs everything in a sandbox. Only a printed summary enters your actual context. The raw data gets indexed into a SQLite database with full-text search (FTS5), so you can query it later without reloading it.
It gives Claude a handful of new tools that replace the usual chaining of bash and read calls:
- ctx_execute — Run code in a sandbox. Only your summary enters context.
- ctx_execute_file — Read and process a file without loading the whole thing.
- ctx_fetch_and_index — Fetch a URL and index it for searching, instead of pulling everything into context with WebFetch.
- ctx_search — Search previously indexed content without rerunning commands.
- ctx_batch_execute — Run multiple commands and search them all in one call.
There are also slash commands to check how much context you’ve saved in a session, run diagnostics, and update the plugin.
The approach is smart. All the data lives in a SQLite FTS5 database that you can index and search, surfacing only the relevant pieces when you need them. If you’ve worked with full-text search in libSQL or Turso, you’ll appreciate how well this maps to the problem. It’s the right tool for the job.
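To make the idea concrete, here’s a rough sketch of the index-then-search pattern (not context-mode’s actual schema, and the indexed lines are invented), assuming your sqlite3 build includes FTS5 (most modern ones do):

```shell
# Stash bulky output in an FTS5 table once...
sqlite3 ctx.db <<'EOF'
CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(content);
INSERT INTO chunks(content) VALUES
  ('INFO server started on port 8080'),
  ('INFO heartbeat ok'),
  ('ERROR connection reset by peer'),
  ('INFO heartbeat ok');
EOF

# ...then full-text search it later instead of re-running the command.
sqlite3 ctx.db "SELECT content FROM chunks WHERE chunks MATCH 'connection';"
# prints: ERROR connection reset by peer
```

Only the matching rows ever need to enter the conversation; the rest of the raw output stays on disk.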
The benchmarks are impressive. The author reports overall context savings of around 96%. When you think about how much raw output typically gets dumped into a session, it makes sense. Most of that data was never being used anyway.
What This Means for Your Workflow
I think the broader lesson here is that context management is becoming a first-class concern for anyone doing serious work with agentic tools. It’s not just about having the most powerful model; it’s about using your token budget wisely so you can sustain longer, more complex sessions without hitting the wall.
A few practical takeaways:
- Be intentional about MCP servers. Load what you need, not everything you have.
- Use sub-agents for heavy lifting, but recognize they cost session tokens.
- Avoid dumping raw output into your main context whenever possible.
- Tools like context-mode can dramatically extend how much real work you get done per session.
We’re still early in figuring out the best practices for working with these tools. But managing your context window? That’s one of the things that separates productive sessions from frustrating ones.
Hopefully something here saves you some tokens.
/ AI / Programming / Developer-tools / Claude
-
AI-Powered Process Orchestration Across the Enterprise | Appian
Simplify digital operations with Appian’s agentic automation platform - purpose-built for enterprise growth.
/ AI / links / agent / automation / platform
-
How to Write a Good CLAUDE.md File
Every time you start a new chat session with Claude Code, it’s starting from zero knowledge about your project. It doesn’t know your tech stack, your conventions, or where anything lives. A well-written CLAUDE.md file fixes that by giving Claude the context it needs before it writes a single line of code.
This is context engineering, and your CLAUDE.md file is one of the most important pieces of it.
Why It Matters
Without a context file, Claude has to discover basic information about your project — what language you’re using, how the CLI works, where tests live, what your preferred patterns are. That discovery process burns tokens and time. A good CLAUDE.md front-loads that knowledge so Claude can get to work immediately.
If you haven’t created one yet, you can generate a starter file with the /init command. Claude will analyze your project and produce a reasonable first draft. It’s a solid starting point, but you’ll want to refine it over time.
The File Naming Problem
If you’re working on a team, people use different tools: Cursor has its own context file, OpenAI has theirs, and Google has theirs. You can easily end up with three separate context files that all contain slightly different information about the same project. That’s a maintenance headache.
It would be nice if Anthropic made the filename a configuration setting in settings.json, but as of now they don’t. Some tools like Cursor do let you configure the default context file, so it’s worth checking.
My recommendation? Look at what tools people on your team are actually using and try to standardize on one file, maybe two. I’ve had good success with the symlink approach, where you pick your primary file and symlink the others to it. So if CLAUDE.md is your default, you can symlink AGENTS.md or GEMINI.md to point at the same file.
It’s not perfect, but it beats maintaining three separate files with diverging information.
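The symlink setup takes two commands per extra tool. A minimal sketch, using CLAUDE.md as the canonical file:

```shell
# Pick CLAUDE.md as the source of truth and point the other names at it.
printf '# Project context\n' > CLAUDE.md
ln -sf CLAUDE.md AGENTS.md
ln -sf CLAUDE.md GEMINI.md

# Every tool now reads the same content; edit CLAUDE.md and all stay in sync.
cat AGENTS.md
```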
Keep It Short
Brevity is crucial. Your context file gets loaded into the context window every single session, so every line costs tokens. Eliminate unnecessary adjectives and adverbs. Cut the fluff.
A general rule of thumb that Anthropic recommends is to keep your CLAUDE.md under 200 lines. If you’re over that, it’s time to trim.
I recently went through this exercise myself. I had a bunch of Python CLI commands documented in my context file, but most of them I rarely needed Claude to know about.
We don’t need to list every single possible command in the context file. That information is better off in a docs/ folder or your project’s documentation. Just add a line in your CLAUDE.md pointing to where that reference lives, so Claude knows where to look when it needs it.
Maintain It Regularly
A context file isn’t something you write once and forget about. Review it periodically. As your project evolves, sections become outdated or irrelevant. Remove them. If a section is only useful for a specific type of task, consider moving it out of the main file entirely.
The goal is to keep only the information that’s frequently relevant. Everything else should live somewhere Claude can find it on demand, not somewhere it has to read every single time.
Where to Put It
Something that’s easy to miss: you can put your project-level CLAUDE.md in two places:
- ./CLAUDE.md (project root)
- ./.claude/CLAUDE.md (inside the .claude directory)
A common pattern is to .gitignore the .claude/ folder. So if you don’t want to check in the context file — maybe it contains personal preferences or local paths — putting it in .claude/ is a good option.
Rules Files for Large Projects
If your context file is getting too large and you genuinely can’t cut more, you have another option: rules files. These go in the .claude/rules/ directory and act as supplemental context that gets loaded on demand rather than every session.
You might have one rule file for style guidelines, another for testing conventions, and another for security requirements. This way, Claude gets the detailed context when it’s relevant without bloating the main file.
Auto Memory: The Alternative Approach
Something you might not be aware of is that Claude Code now has auto memory, where it automatically writes and maintains its own memory files. If you’re using Claude Code frequently and don’t want to manually maintain a context file, auto memory can be a good option.
The key thing to know is that you should generally use one approach or the other. If you’re relying on auto memory, delete the CLAUDE.md file, and vice versa.
Auto memory is something I’ll cover in more detail in another post, but it’s worth knowing the feature exists. Just make sure you enable it in your settings.json if you want to try it.
Quick Checklist
If you’re writing or revising your CLAUDE.md right now, here’s what I’d focus on:
- Keep it under 200 lines — move detailed references to docs
- Include your core conventions — package manager, runtime, testing approach
- Document key architecture — how the project is structured, where things live
- Add your preferences — things Claude should always or never do
- Review monthly — cut what’s no longer relevant
- Consider symlinks — if your team uses multiple AI tools
- Use rules files — for detailed, task-specific context
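Pulling the checklist together, here’s a sketch of what a lean CLAUDE.md might look like. The project name, stack, and paths are invented for illustration:

```markdown
# MyApp

## Stack
- Python 3.12, managed with uv
- CLI entry point: myapp (full command reference lives in docs/cli.md)

## Conventions
- Run tests with pytest; tests live in tests/, mirroring src/
- Never commit directly to main; always work on a feature branch

## Architecture
- src/myapp/core/: business logic
- src/myapp/adapters/: external integrations (details in docs/architecture.md)
```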
That’s All For Now. 👋
/ AI / Programming / Claude-code / Developer-tools
-
Claude Code Skills vs Plugins: What's the Difference?
If you’ve been building with Claude Code, you’ve probably seen the terms “skill,” “plugin,” and “agent” thrown around. They’re related but distinct concepts, and understanding the difference will help you build better tooling. Let’s focus on skills versus plugins since those two are the most closely related.
Skills: Reusable Slash Commands
Skills are user-invocable slash commands, essentially reusable prompts that run directly in your main conversation. You trigger them with /skill-name and they execute inline. They can be workflows or common tasks that are done frequently.
Skills can live inside your .claude/skills/ folder, or they can live inside a plugin (where they’re called “commands” instead). Same concept, different home.
The important frontmatter you should pay attention to is the allowed-tools property. This defines which tool calls the skill can access, and there are three formats you can use:
- Comma-separated names — Bash, Read, Grep
- Comma-separated with filters — Bash(gh pr view:*), Bash(gh pr diff:*)
- JSON array — ["Bash", "Glob", "Grep"]
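For instance, a SKILL.md using the filtered format might open with frontmatter like this (the skill name, description, and prompt body are made up for illustration):

```markdown
---
name: pr-summary
description: Summarize the current pull request
allowed-tools: Bash(gh pr view:*), Bash(gh pr diff:*), Read
---

Fetch the open pull request's description and diff, then produce a
short, review-ready summary of the changes.
```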
I don’t think there’s a meaningful speed difference between them. The filtered format might take slightly longer to parse if you have a huge list, but in practice it’s negligible. Pick whichever is most readable for your use case.
The real power here is that skills can define tool calls and launch subagents. That turns a simple slash command into something that can orchestrate complex workflows.
Plugins: The Full Package
A plugin is a bigger container. It can bundle commands (skills), agents, hooks, and MCP servers together as a single distributable unit. Every plugin needs a .claude-plugin/plugin.json file, which is just a name, description, and author.
Plugins are a good way to bundle agents with skills. If your workflow needs a specialized agent that gets triggered by a slash command, a plugin is a good option for that.
Pushing the Boundaries of Standalone Skills
However, I wanted to experiment with what’s actually possible using standalone skills, so I built upkeep. It turns out that you can bundle actual compiled binaries inside a skill directory and call them from the skill. That opens up a lot of possibilities.
Here’s how I did it:
- The skill has a prerequisite section that checks for a bin/ folder containing the binary
- A workflow calls the binary, passing in the commands to run
- Each step defines what we expect back from the binary
You can see the full implementation in the SKILL.md file. It’s a pattern that lets you distribute real functionality, not just prompts, through the skill.
Quick Summary
- Skills are slash commands. Reusable prompts with tool access that run in your conversation.
- Plugins bundle skills, agents, hooks, and MCP servers together with a plugin.json.
- Skills are more flexible than you might expect: you can call subagents, distribute binaries, and build real workflows.
If you’re just getting started, skills are the easier entry point. When you need to package multiple pieces together or distribute agents alongside commands, that’s when you reach for a plugin.
Have fun building!
/ AI / Development / Claude-code
-
Claude Code Now Has Two Different Security Review Tools
If you’re using Claude Code, you might have noticed that Anthropic has been quietly building out security tooling. There are now two distinct features worth knowing about. They sound similar but do very different things, so let’s break it down.
The /security-review Command
Back in August 2025, Anthropic added a /security-review slash command to Claude Code. This one is focused on reviewing your current changes. Think of it as a security-aware code reviewer for your pull requests. It looks at what you’ve modified and flags potential security issues before you merge.
It’s useful, but it’s scoped to your diff. It’s not going to crawl through your entire codebase looking for problems that have been sitting there for months.
The New Repository-Wide Security Scanner
Near the end of February 2026, Anthropic announced something more ambitious: a web-based tool that scans your entire repository and operates more like a security researcher than a linter. This is the thing that will help you identify and fix security issues across your entire codebase.
First we need to look at what already exists to understand why it matters.
SAST tools — Static Application Security Testing. SAST tools analyze your source code without executing it, looking for known vulnerability patterns. They’re great at catching things like SQL injection, hardcoded credentials, or buffer overflows based on pattern matching rules.
If a vulnerability doesn’t match a known pattern, it slips through. SAST tools also tend to generate a lot of false positives, which means teams start ignoring the results.
What Anthropic built is different. Instead of pattern matching, it uses Claude to actually reason about your code the way a security researcher would. It can understand context, follow data flows across files, and identify logical vulnerabilities that a rule-based scanner would never catch. Think things like:
- Authentication bypass through unexpected code paths
- Authorization logic that works in most cases but fails at edge cases
- Business logic flaws that technically “work” but create security holes
- Race conditions that only appear under specific timing
These are the kinds of issues that usually require a human security expert, or a real attacker, to find.
SAST tools aren’t going away, and you should still use them. They’re fast, they catch the common stuff, and they integrate easily into CI/CD pipelines.
Also, the new repository-wide security scanner isn’t out yet, so stick with what you’ve got until it’s ready.
/ DevOps / AI / Claude-code / security
-
Ever wanted your CLAUDE.md to automatically update from your current session before the next compact? There’s a skill for that and it’s been helpful. In case you missed it, here’s a link to the skill:
/ AI / Claude-code
-
Managing Your Context Window in Claude Code
If you’re using Claude Code, there’s a feature you should know about that gives you visibility into how your context window is being used. The /context command breaks everything down so you can see exactly where your tokens are going.
Here’s what it shows you:
- System prompt – the base instructions Claude Code operates with
- System tools – the built-in tool definitions
- Custom agents – any specialized agents you’ve configured
- Memory files – your CLAUDE.md files and auto-memory
- Skills – any skills loaded into the session
- Messages – your entire conversation history
Messages is where you have the most control, and it’s also what grows the fastest. Every prompt you send, every response you get back, every file read, every tool output: it all shows up in your message history.
Then there’s the free space, which is what’s left for actual work before a compaction occurs. This is the breathing room Claude Code has to think, generate responses, and use tools.
You’ll also see a buffer amount that’s reserved for auto-compaction. You can’t use this space directly; it’s set aside so Claude Code has enough room to summarize the conversation and hand things off cleanly.
Why This Matters
Understanding your context usage helps you work more efficiently. A few ways to keep your context lean:
- Start fresh sessions for new tasks instead of reusing a long-running one
- Be intentional about file reads — only read what you need, not entire directories
- Use sub-agents — when you delegate work to a sub-agent, it runs in its own context window instead of yours. All those file reads, tool calls, and intermediate reasoning happen over there, and you just get the result back. It’s one of the best ways to preserve your primary context for the work that actually needs it.
- Trim your CLAUDE.md — everything in your memory files loads every session, so keep it tight
I’ll dig into sub-agents more in a future post. For now, don’t forget about /context.
/ AI / Claude-code / Developer-tools