-
SAST vs AI PR Review: Two Tools, Different Jobs
If you have worked in DevSecOps, you might be wondering if AI pull request review tools are going to replace traditional SAST scanners. Short answer: no. Longer answer: they’re solving different problems, and if you’re picking one over the other, you might be making a mistake.
Here is how I think about it.
SAST is the Compliance Gatekeeper
Static Application Security Testing tools, think Semgrep, SonarQube, Checkmarx, Fortify, parse your source code (usually into an Abstract Syntax Tree) and hunt for known vulnerability patterns. They don’t run the code. They just read it and “pattern-match” against rules.
The focus here is security, compliance, and strict rule enforcement. SAST is the automated gatekeeper that makes sure your code clears the OWASP Top 10 bar before it merges.
What SAST does well:
- It’s deterministic. If a rule matches a pattern, the engine flags it every single time. Run it twice on the same code, get the same result.
- It satisfies auditors. Frameworks like PCI-DSS, SOC 2, and HIPAA expect documented secure-development practices, and a formal SAST scanner is the easiest way to produce that evidence. AI agents don’t count here, at least not yet.
- It can do real taint analysis. Enterprise tools can track untrusted input from the moment it enters your app to the moment it hits a dangerous sink.
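The determinism is easy to see in miniature. This toy rule, a loose sketch of how engines like Semgrep match patterns against the AST (not any real tool’s code), flags every call to `eval` and nothing else:

```python
import ast

# Toy SAST rule: flag every call to eval() by pattern-matching
# the parsed syntax tree. Same input always yields the same finding.
def find_eval_calls(source: str) -> list[int]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            findings.append(node.lineno)
    return findings

snippet = "x = input()\nresult = eval(x)\n"
print(find_eval_calls(snippet))  # [2], every single run
```

A real engine layers taint tracking and hundreds of rules on top, but the core loop is exactly this mechanical.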
Where SAST falls down:
- The false positive rate is brutal. Rigid rules with no context mean a lot of noise. Developer fatigue is real, and once your team starts ignoring scanner output, you’ve lost the game.
- It can’t see your business logic. A SAST tool has no idea what your application is supposed to do, so it can’t tell you when the logic itself is broken.
- Comprehensive scans are slow. Hours on large codebases isn’t unusual, though Semgrep has been doing good work on this front.
AI PR Agents are the Peer Reviewer
Tools like CodeRabbit, Qodo, Greptile, GitHub Copilot Code Review, Cursor Bugbot, and Claude Code (set up as a review skill) plug into your version control and read the PR diff with the surrounding code context. They behave less like a scanner and more like a colleague who actually read your changes.
The focus is developer productivity, code quality, logic bugs, and contextual feedback.
What they do well:
- They understand intent. LLMs can reason about why the code is changing, not just whether it matches a rule. That’s a different category of feedback.
- The signal-to-noise ratio is good. When an AI flags something, it usually comes with an explanation that makes sense. Less noise, more useful comments.
- They suggest fixes. Not just “this is wrong” but “here’s a diff you can apply.” That’s huge for actually closing the loop on review feedback.
- The scope is broader. Architecture, performance, style, security, all in one pass.
Where they fall down:
- They’re non-deterministic. Same vulnerability, two PRs, two different outcomes. That’s not a bug, that’s how LLMs work, and it’s why auditors don’t trust them.
- They don’t satisfy compliance. No auditor is going to accept “the AI looked at it” as a substitute for a formal scanner.
- Hallucinations happen. Invented issues, misread intent, suggestions that refactor things that didn’t need refactoring. You still need a human filtering the output.
The Quick Comparison
| Feature | SAST | AI PR Review |
| --- | --- | --- |
| Primary Goal | Security & Compliance | Code Quality & Productivity |
| Analysis Method | Deterministic rules & AST | Non-deterministic LLMs |
| Business Logic | Blind | Context-aware |
| False Positives | Often high | Usually low |
| Compliance Proof | Accepted as evidence | Not accepted |
| Feedback Loop | Dashboard / CI output | PR comments / chat |

The Lines Are Starting to Blur
The interesting thing happening right now is convergence from both directions.
On the SAST side, tools like DryRun Security are pitching themselves as “AI-native SAST,” trying to keep the deterministic backbone while using LLMs to filter out the false positives that make traditional scanners painful to live with.
On the AI agent side, CodeRabbit and Greptile keep getting better at catching real security vulnerabilities, not just style issues. They’re slowly creeping into territory that used to belong exclusively to SAST.
This is going somewhere, but it’s not there yet.
Where to Start Your Evaluation
Treat them as complementary, not competitive.
For SAST, evaluate against your audit footprint, the languages in your codebase, and how much false-positive triage your team can absorb. Semgrep, SonarQube, Checkmarx, and Fortify all sit in different price-and-friction zones, and the right one depends on what your business actually needs to prove.
For AI PR review, evaluate based on how it fits your existing review workflow, what languages and frameworks it understands well, and the signal-to-noise ratio in practice on your codebase. CodeRabbit, Qodo, Greptile, Copilot Code Review, Bugbot, and a Claude Code review skill all approach the problem differently.
If you pick one category and skip the other, you’re either passing compliance with mediocre code review, or getting great review feedback while failing your next audit. Neither is a win.
The AI tools aren’t replacing SAST. They’re filling in the gap SAST was never designed to cover.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ DevOps / AI / Programming / security
-
AI Code Reviewers Won't Save You
Dropping an AI reviewer into your pull request pipeline is just a band-aid. Tools like CodeRabbit or Greptile are great for catching syntax errors or basic anti-patterns, but they can’t assess architectural intent or domain-specific business logic. They’re spell-checkers for code. Useful, sure. But nobody ever said “our codebase is solid because we run spell check.”
AI doesn’t change your engineering baseline. It just accelerates it. If your foundational guardrails are weak, agentic tools will help your team generate technical debt at unprecedented speeds. So the real question isn’t “how do we review AI code?” It’s “how do we build systems that prevent slop from ever reaching production?”
Shift Left, Hard
When engineers use agents to scaffold a new Go service or spin up a SvelteKit frontend, they’re inevitably pulling in generated dependencies or utilizing unfamiliar libraries. Models hallucinate packages. They suggest insecure patterns with total confidence.
Your CI pipeline needs to be ruthless before a human ever looks at the code. Aggressive SAST and SCA should automatically block PRs that introduce vulnerable dependencies or hardcoded secrets. If the agent generates slop, the pipeline rejects it instantly. No discussion.
Make the Agents Write the Tests
Agents are incredibly eager to generate feature code, but humans are historically lazy about writing the tests for it. The influx of AI-generated code means human reviewers can’t possibly step through every logic branch manually.
So flip the script. Use the agentic tools to build the guardrails themselves. Mandate that any generated feature code must be accompanied by generated, human-verified unit tests. If an agent writes a sprawling TypeScript function, the build should fail if the test coverage doesn’t meet a strict threshold. You’re already using AI to write the code. Use it to prove the code works, too.
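As a sketch of what that gate can look like using coverage.py’s JSON report (the `totals.percent_covered` field is its real output shape; the threshold here is arbitrary):

```python
import json

THRESHOLD = 80.0  # arbitrary; pick what your team can actually sustain

# Reads the report produced by `coverage json` and decides pass/fail.
def gate(report: dict, threshold: float = THRESHOLD) -> bool:
    return report["totals"]["percent_covered"] >= threshold

# Toy report standing in for a real coverage.json file:
report = json.loads('{"totals": {"percent_covered": 72.4}}')
print(gate(report))  # False -> the CI job exits nonzero and the PR is blocked
```

Wire the boolean to the process exit code in CI and the rule enforces itself. No discussion, as promised.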
Context Boundaries Matter
Bloated AI output often happens because the model is given too much context or allowed to generate too much at once. Heavyweight IDEs with aggressive multi-file auto-completion can easily create cascading messes across a codebase.
Define strict architectural boundaries and API contracts upfront. Agents should be tasked with solving small, well-defined, modular problems. “Write a function that parses this specific JSON schema” is a good prompt. “Build the backend” is not. The tighter the scope, the less room for generated nonsense.
Observability Is Your Safety Net
You can’t catch all generated slop at the PR level. Some of it only reveals itself under load. An agent might write a technically correct query that causes an N+1 database issue, or introduce a subtle memory leak that passes all unit tests.
Your ultimate safety net is what happens at runtime. You need an airtight observability stack to trust the velocity AI brings. Logs, distributed tracing, metrics, all feeding into dashboards your team actually watches. When generated code hits staging, you need the immediate telemetry to spot performance regressions before they reach production.
Redefine the Human Review
Because AI makes the “typing” part of coding trivial, the human code review needs to fundamentally shift. Reviewers should no longer be looking for missing semicolons. They should be asking: “Does this component fit our architecture?” and “Did the agent over-engineer this solution?”
Train your senior engineers to review for intent and systemic impact. That’s the stuff AI genuinely can’t do yet. Leave the syntax checking to the robots.
/ DevOps / AI / Software-development / Code-review
-
Leading Teams When AI Does the Typing
AI is changing how teams build things. That part is obvious. What’s less obvious is that it doesn’t change what makes people want to follow you.
If anything, as the technical execution gets more automated, the human stuff becomes more important. Empathy, vision, strategic judgment. This has been top of mind for me a lot lately. I think the leaders who are going to thrive aren’t the ones who adopt AI the fastest. They’re the ones who understand what AI can’t do.
Focus on Outcomes, Not Output
AI accelerates generation. Code, docs, data parsing. All of it gets faster. That means individual output spikes, and measuring your team’s success by raw productivity becomes a losing game.
Your job is to provide the why. AI can handle the how, but it cannot determine why you’re building something in the first place. Crystal-clear business context and architectural vision are what keep your team’s AI-augmented velocity pointed in the right direction.
Something worth thinking about: when engineers can generate solutions faster, system complexity increases. Fast. Leadership means making sure the team is building the right things, not just building things quickly. Speed without direction is just expensive chaos.
Build the Guardrails Before You Need Them
Before you hand a team tools that let them move at lightspeed, make sure the brakes and steering are in place.
Automate your compliance. Strong SAST/SCA tooling, foolproof secret management, the boring stuff that lets people experiment without risking the infrastructure. As AI assists with more logic and agents take on more autonomous tasks, these guardrails become non-negotiable.
Same goes for observability. You can’t manage what you can’t see. When AI agents are handling high-volume tasks, a solid observability stack is what lets you trust the automation and catch it when a model hallucinates or a system drifts.
Automate the Toil, Protect the Thinking
A good leader uses AI to elevate human effort, not replace it. Look for the most repetitive, low-joy tasks in your team’s workflow. Scaffolding test environments, parsing logs, summarizing meetings. Deploy AI to handle the toil so your people can do the work that actually requires a brain.
Then protect that time. With AI handling the busywork, guard your team’s calendar for deep architectural thinking, complex debugging, creative system design. The stuff AI still struggles with. That’s where your team’s real value lives.
The Human Skills Are the Whole Game Now
The most valuable skills in an AI-driven org are the ones algorithms can’t replicate.
- Psychological safety. AI tools are changing fast. Your team needs to feel safe experimenting, failing, and learning. Punish well-intentioned experimentation and you’ll kill innovation before it starts.
- Mentorship. AI can answer factual questions all day long. It cannot mentor a junior engineer through a crisis of confidence or help a senior navigate organizational politics. Put your energy into 1:1s, career mapping, and active listening.
- Ethical judgment. As AI agents take on more responsibility in domains like finance, underwriting, or automated operations, you are the moral compass. You’re the one who needs to ask the hard questions about bias, fairness, and unintended consequences.
So What’s the Job Now?
The job is the same as it’s always been: create clarity, remove obstacles, and give a damn about your people. AI just raises the stakes on all three.
The leaders who get this right won’t be the ones with the best AI strategy deck. They’ll be the ones whose teams actually want to show up and build something together.
/ AI / Leadership / Management
-
How to Pick an Embedding Model (Without Overthinking It)
It’s easy to get deep into vector database comparisons, HNSW vs. IVF, pgvector vs. Pinecone, Qdrant vs. Chroma, and completely skip over the thing that actually matters most: the embedding model.
The way I think about it, the embedding model is the brain of your retrieval system. The vector database is just its filing cabinet. If the model creates poor mathematical representations of your data, no amount of indexing strategy or database performance is going to save you. You’ll get fast, confident, wrong results.
So let’s talk about how to pick a model.
Dimensionality: More Isn’t Always Better
Embeddings are high-dimensional vectors. Common sizes are 384, 768, 1536, or 3072 dimensions. Higher dimensions capture more nuance, but they also mean more storage, more memory, and slower search.
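The storage cost is easy to put numbers on. For float32 vectors (4 bytes per dimension), a back-of-envelope sketch:

```python
# Raw vector storage for n float32 vectors, before index overhead.
def index_size_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    return n_vectors * dims * bytes_per_dim / 1024**3

print(round(index_size_gb(1_000_000, 384), 2))   # 1.43 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # 11.44 GB
```

An 8x jump in dimensions is an 8x jump in memory, and HNSW indexes add their own overhead on top of the raw vectors.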
For a lean, local-first setup, something like `all-MiniLM-L6-v2` at 384 dimensions gives you a surprisingly good balance of speed and accuracy. You don’t need 3072 dimensions to search your notes. Save the big vectors for when you actually have a reason.

Sequence Length: The Silent Data Killer
Sequence length determines how much text the model can look at to create a single vector. If you’re embedding long technical docs or sprawling Markdown files and your model caps out at 512 tokens, it’s just truncating everything past that point. Your carefully written documentation gets chopped, and the embedding only represents the first few paragraphs.
Modern long-context embedding models handle 8k to 32k tokens, which lets you embed entire chapters or large code blocks as single semantic units. If your content is longer than a few paragraphs, check this number before anything else.
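If you’re stuck with a short-context model, the standard workaround is chunking with overlap. A minimal sketch, using whitespace words as a stand-in for real subword tokens:

```python
# Whitespace "tokens" stand in for a real tokenizer here; actual
# models count subword tokens, so treat max_tokens as approximate.
def chunk(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), step)]

doc = "word " * 1200  # a document well past a 512-token cap
print(len(chunk(doc)))  # 3 chunks; nothing past token 512 is silently lost
```

The overlap keeps sentences that straddle a chunk boundary represented in both neighbors, at the cost of a little duplicate storage.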
Domain Matters More Than You Think
General-purpose models like OpenAI’s `text-embedding-3-small` work well across most tasks. They’ve been trained on massive, diverse datasets and they’re solid defaults.
If you’re searching a codebase or technical documentation, models fine-tuned on programming languages (like `voyage-code-2`) will outperform the general ones. The same applies to medical or legal text, where domain-specific jargon means the difference between a relevant result and a completely wrong one.
Check MTEB Before You Commit
The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing models. It breaks performance into sub-categories like Retrieval, Summarization, and Clustering. If you’re building RAG, look at the Retrieval scores specifically. A model that ranks well for clustering might be mediocre at retrieval, and vice versa.
Local vs. API: Pick Your Tradeoff
This decision is as important as the model itself.
- Local models (via HuggingFace or Ollama) keep everything offline. Zero per-request costs, full privacy. Something like `bge-small-en-v1.5` running locally is perfect for personal knowledge management or anything where your data shouldn’t leave your machine.
- Hosted APIs (OpenAI, Voyage, Cohere) give you the highest performance and longest context windows without managing GPU infrastructure. Better for enterprise scale where you’re willing to trade privacy and recurring costs for accuracy.
Local models make sense for personal projects and hosted APIs make sense when the scale demands it. There’s no universal right answer, but there is a wrong one: picking a deployment model without thinking about where your data lives.
The vector database conversation is important, but it’s second in line to getting the embedding model right first. Everything downstream depends on it.
/ AI / Rag / Embeddings / Vector-databases
-
Pandoc vs MarkItDown: Two Tools, Two Eras
Pandoc has been the gold standard for document conversion for nearly two decades. But there’s a newer tool from Microsoft called MarkItDown, and while the names sound like they do similar things, they were built for completely different reasons.
Pandoc is a universal document converter designed for human publishing. It converts almost any format into almost any other format while preserving complex typography, citations, and formatting. MarkItDown is a specialized extraction tool designed for AI. It converts various files strictly into Markdown so that LLMs and RAG pipelines can read and process the text.
Same input files, very different goals.
Pandoc: The Universal Translator
Pandoc has been around since 2006, written in Haskell, and it operates on an Abstract Syntax Tree. It reads a document, builds a complex internal model of its structure, and then translates that structure into your desired output. We’re talking 40+ output formats here. PDF, Word, HTML, LaTeX, EPUB, you name it.
Where it really shines is academic and technical writing. It natively understands LaTeX math, footnotes, bibliographies, and cross-referencing. You can turn a Word doc into Markdown, edit it, and use Pandoc to turn it back into a perfectly formatted PDF. Two-way conversion that actually works.
You can also write custom filters in Lua or Python to programmatically alter documents during conversion. Want to automatically downgrade all your H2s to H3s? Pandoc has you covered.
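That H2-to-H3 example is only a few lines as a JSON filter. The element shape (`{"t": "Header", "c": [level, attr, inlines]}`) is pandoc’s actual AST encoding; the script itself is a minimal sketch you’d hook up with `--filter`:

```python
import json
import sys

# Pandoc pipes the document AST as JSON through stdin/stdout.
# This filter demotes every level-2 header to level 3.
def demote(block: dict) -> dict:
    if block.get("t") == "Header" and block["c"][0] == 2:
        block["c"][0] = 3
    return block

def run_filter() -> None:
    doc = json.load(sys.stdin)
    doc["blocks"] = [demote(b) for b in doc["blocks"]]
    json.dump(doc, sys.stdout)

# A level-2 header element, as pandoc encodes it:
h2 = {"t": "Header", "c": [2, ["", [], []], [{"t": "Str", "c": "Title"}]]}
print(demote(h2)["c"][0])  # 3
```

In practice most people reach for panflute, which handles the walking and the stdin plumbing for you.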
MarkItDown: The LLM Feeder
MarkItDown was released by Microsoft in late 2024 to solve a very modern problem. LLMs need clean, structured text to “read” documents, but corporate data is locked inside messy formats like multi-tab Excel spreadsheets, image-heavy PowerPoints, and ZIP archives.
It’s a Python library first, CLI second. It drops into your scripts in a few lines of code, which makes it easy to wire up with LangChain, LlamaIndex, or raw API calls. The output is always Markdown. That’s it. No PDF generation, no Word docs, no EPUB. Just clean text that an AI can process.
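To make the “clean text for AI” goal concrete, here’s the idea in miniature: a toy CSV-to-Markdown flattener. This is illustrative only, not MarkItDown’s code; the real library’s entry point is roughly `MarkItDown().convert(path)`.

```python
import csv
import io

# Toy version of spreadsheet ingestion: flatten tabular data into a
# Markdown pipe table that an LLM can read as plain structured text.
def csv_to_markdown(raw: str) -> str:
    rows = list(csv.reader(io.StringIO(raw)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

print(csv_to_markdown("region,revenue\nEMEA,1200\nAPAC,950"))
```

The point is the direction of travel: structure in, plain Markdown out, nothing preserved that an LLM doesn’t need.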
The interesting trick is what it does with images and audio. Feed it a PDF with diagrams and MarkItDown can connect to an LLM like GPT-4o to look at the image and write a Markdown description of what it sees. It can also transcribe audio files. That’s a fundamentally different approach from Pandoc, which preserves images as files rather than describing them.
Quick Comparison
| Feature | Pandoc | MarkItDown |
| --- | --- | --- |
| Primary Goal | Universal document conversion | Document ingestion for AI |
| Output Formats | 40+ (PDF, Word, HTML, LaTeX, etc.) | Only Markdown |
| Language | Haskell (standalone CLI) | Python (library-first) |
| Image Handling | Preserves and extracts image files | Uses OCR/LLM Vision to describe images as text |
| Complex Formatting | Citations, bibliographies, LaTeX math, custom filters | Basic structural support (headings, tables, slides) |

So Which One Do You Want?
Pandoc if you’re writing a book, research paper, or blog and need polished output in multiple formats. If you need to maintain citations, complex formatting, or convert files out of Markdown into something else, Pandoc is your tool.
MarkItDown if you’re building an AI agent, chatbot, or search tool and need to extract text from a pile of PDFs, Excel files, and PowerPoints. If you only care about getting raw structured text and don’t care about the visual layout of the original document, MarkItDown is purpose-built for that.
They’re not competitors. Pandoc is for publishing. MarkItDown is for feeding AI. Pick the one that matches what you’re actually trying to do.
/ AI / Tools / Development
-
pgvector vs Pinecone: You Probably Don't Need a Separate Vector Database
Every time someone starts building a RAG pipeline, the same question comes up: do I need a “real” vector database like Pinecone, or can I just use pgvector with the Postgres I already have?
I can imagine teams agonizing over this decision for weeks. So maybe this will save you some time?
The Case for Staying Put
If you already have a PostgreSQL instance in your stack, adding `pgvector` is almost always the right first move.
You manage one stateful service instead of two. Your existing backup strategy, monitoring, and security all stay the same. Your vector embeddings live next to your metadata, so you get ACID compliance and standard SQL joins. No syncing between two data stores. No eventual consistency headaches.
Performance? From what I found, for datasets under a few million vectors, `pgvector` with HNSW indexes is fast. Really fast. It satisfies the latency requirements of most applications without breaking a sweat.
And you’re not paying for another SaaS subscription…
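The join story deserves emphasis, because it’s the thing a standalone vector store can’t give you. A hedged sketch (the table and column names are invented; `<=>` is pgvector’s cosine-distance operator):

```python
# Hypothetical schema: `docs` holds business metadata, `doc_embeddings`
# holds the pgvector column. One query, one store, plain SQL filters.
NEAREST_WITH_METADATA = """
SELECT d.title,
       d.owner_team,
       e.embedding <=> %(query_vec)s AS distance
FROM docs d
JOIN doc_embeddings e ON e.doc_id = d.id
WHERE d.owner_team = %(team)s
ORDER BY e.embedding <=> %(query_vec)s
LIMIT 10;
"""
```

You’d run this through psycopg with a query embedding as the parameter. With a separate vector database, the metadata half of that query lives in a second system you have to keep in sync.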
When Pinecone Actually Makes Sense
Pinecone is a purpose-built vector database designed for high-dimensional data at massive scale. It’s serverless and fully managed.
If you’re dealing with hundreds of millions or billions of vectors, a specialized engine handles memory and disk I/O for similarity searches more efficiently than Postgres can. Pinecone also gives you native namespace support, metadata filtering optimized for vector search, and live index updates that are faster than re-indexing a large Postgres table.
Those are real advantages. At a certain scale.
The Decision Is Simpler Than You Think
Stay with Postgres + pgvector if:
- You want to minimize infra sprawl and moving parts
- Your vector dataset is under 5 to 10 million records
- You rely on relational joins between vectors and other business data
- You have existing observability and DBA expertise for Postgres
Consider Pinecone if:
- Your Postgres instance needs massive, expensive vertical scaling just to keep the vector index in memory
- You don’t want to tune HNSW parameters, `mmap` settings, or vacuuming schedules for large vector tables
- You need sub-millisecond similarity search at a scale where Postgres starts to struggle
That is what I would use to make that decision.
Most teams are probably nowhere near the scale where Pinecone becomes necessary. They have a few hundred thousand vectors, maybe a million or two. Postgres handles that without flinching. Adding a separate managed vector database at that point is just adding operational complexity for no measurable benefit.
The trap is thinking you need to “plan ahead” for scale you don’t have yet. You can always migrate later if you actually hit the ceiling. Moving from pgvector to Pinecone is a well-documented path. But moving from two services back to one because you overengineered your stack? That’s a conversation nobody wants to have.
Start with what you have. Add complexity when the numbers force you to, not when a vendor’s marketing page makes you nervous.
/ DevOps / AI / Programming / Databases
-
LangChain and LLM Routers, the Short Version
LangChain is important to know and understand in the age of agents. Also, LLM routing. They’re related but they’re not the same thing, and the distinction matters.
So let’s break it down.
LangChain is the Plumbing
Out of the box, an LLM is a text-in, text-out engine. It only knows what it was trained on. That’s it. LangChain is an open-source framework that connects that engine to the outside world.
It gives you standardized tools to build pipelines:
- Models: Interfaces for talking to different LLMs (Gemini, Claude, OpenAI, whatever you’re using)
- Prompts: Templates for dynamically constructing instructions based on user input
- Memory: Letting the LLM remember past turns in a conversation
- Retrieval (RAG): Connecting the LLM to external databases, PDFs, or the internet so it can answer questions about your data
- Agents & Tools: Letting the LLM actually do things, like execute code, run a SQL query, or send an email
You could wire all of this up yourself, but LangChain gives you the standard pieces so you’re not reinventing the plumbing every time.
LLM Routers are the Traffic Controller
A router is an architectural pattern you build on top of that plumbing. Instead of sending every request through the same prompt to the same massive model, a router evaluates the request and directs it to the right destination. Simple concept, big impact.
Three reasons you’d want one:
- Cost: You don’t need a giant, expensive model to answer “Hello!” or look up a basic fact. Send simple queries to a smaller, cheaper model. Save the heavy model for complex reasoning.
- Specialization: Maybe you have one prompt for writing code and another for searching a company HR manual. The router makes sure the query hits the right expert system.
- Speed: Smaller models and direct database lookups are faster. Routing makes your whole application more responsive.
How Routing Actually Works
In LangChain, there are two main approaches:
Logical Routing uses a fast LLM to read the user’s prompt and categorize it. You tell the router LLM something like: “If the user asks about math, output MATH. If they ask about history, output HISTORY.” LangChain then branches to a specialized chain based on that output.
Semantic Routing skips the LLM entirely for the routing decision. It converts the user’s text into a vector (an array of numbers representing the meaning of the text) and compares it to predefined routes to find the closest match. This is significantly faster and cheaper than asking an LLM to make the call.
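A semantic router fits in a screenful of code. This sketch swaps the embedding model for a crude word-count vector so the mechanics are visible; a real system would call an actual embedding model, but the compare-and-pick-closest logic is the same:

```python
import math

# Crude "embedding": word counts. Real routers embed with a model.
def embed(text: str) -> dict[str, int]:
    vec: dict[str, int] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each route is described by representative text, embedded once upfront.
ROUTES = {
    "code": embed("write debug refactor function code bug error stack"),
    "hr": embed("vacation policy benefits manager leave handbook payroll"),
}

def route(query: str) -> str:
    scores = {name: cosine(embed(query), vec) for name, vec in ROUTES.items()}
    return max(scores, key=scores.get)

print(route("how do I debug this function"))   # code
print(route("what is the vacation policy"))    # hr
```

No LLM call in the routing hot path, which is exactly why this approach is fast and cheap.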
LangChain provides `RunnableBranch` in LCEL (LangChain Expression Language, their declarative syntax for chaining components) for this, basically if/then/else logic for your AI pipelines. Worth digging into if you’re building with LangChain.
Routing is what makes AI applications practical at scale. LangChain is one way to build it. They’re complementary, not interchangeable.
/ AI / Programming / Langchain / LLM
-
Your Brain vs. a Large Language Model
We don’t fully understand the human brain. That’s just how things are. But we know enough about its structure to make some genuinely interesting comparisons to how large language models work. So let’s walk through the major components of your brain and see where the parallels land.
The Neocortex and the Transformer
The neocortex is the outer layer of your brain, responsible for the higher-order stuff: sensory perception, spatial reasoning, language. The prefrontal cortex (PFC) sits within it as the orchestrator. It handles executive function, decision-making, and complex thought.
The LLM equivalent here is the transformer architecture itself. And the PFC’s role maps surprisingly well to the attention mechanism. The attention mechanism decides what information matters most given the current context, which is essentially what your prefrontal cortex does all day.
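The “deciding what matters” part is concrete math. Here’s scaled dot-product attention on toy two-dimensional vectors, just to show the shape of it:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One query attends over keys; the output is a weighted mix of values.
def attention(query, keys, values):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # "how much does each position matter?"
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)
# The query aligns with the first key, so the output leans
# toward the first value: out[0] > out[1].
```

The softmax weights are the “executive function” here: a learned, context-dependent allocation of focus.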
If you’ve worked with agentic AI systems, you’ve probably seen this pattern play out directly. You typically have an orchestration agent managing specialized sub-agents, each built for a specific task. That management layer is doing PFC work, deciding which agent to activate, what context to pass along, and how to synthesize the results.
The Hippocampus and Memory
The hippocampus is your storage unit. It’s critical for forming new memories and converting short-term experiences into long-term ones. Think of it as a buffer between what just happened and what you’ll remember later.
The LLM equivalent splits into two pieces. The model weights are your long-term memory, everything learned during training. The context window is your working memory, what the model can hold in its head right now for the current conversation.
LLMs don’t natively have long-term memory. The weights are baked in during training and that’s it. But memory systems get bolted on as part of the harness, and this is where retrieval-augmented generation (RAG) comes in. RAG lets the model pull in external data to contextualize its responses, which is functionally the same thing your hippocampus does when it retrieves a stored memory to help you make sense of something new.
Synapses and Parameters
Synapses are the gaps between neurons where signals pass, chemical or electrical. The strength of those connections determines how information flows through your brain. Stronger connections mean faster, more reliable signal paths.
This maps directly to model weights and parameters. Stronger connections between data points in the model mean those patterns carry more influence over the output. When we say a model has 170 billion parameters, we’re effectively describing the synaptic density of a digital brain. It’s not a perfect analogy, but it gives you an intuitive sense of scale.
Dopamine and RLHF
Your brain’s dopamine system is its reward circuit. It fires when an outcome is better than expected, reinforcing beneficial behaviors over harmful ones. It’s how you learn that some choices are worth repeating.
The LLM equivalent is reinforcement learning from human feedback, or RLHF. During training, humans rank the model’s responses. Good answers get a mathematical reward signal that makes similar outputs more likely in the future. Bad answers get penalized. This is the alignment problem in a nutshell: teaching the model what we find valuable and useful, the same way dopamine teaches your brain what’s worth pursuing.
This is also where the analogy breaks down the most. Dopamine is intrinsic. It’s wired into your survival. You don’t choose to feel rewarded when you eat, your brain just does that. RLHF is a proxy. The model isn’t learning what’s actually helpful, it’s learning what a secondary reward model scores as helpful. The result is a system that optimizes to appear useful rather than be useful. That’s why models can be confidently wrong or agree with you when you’re clearly mistaken. The reward signal says “the human liked that,” not “that was true.”
The Basal Ganglia and Routing
The basal ganglia are your gating mechanism: a group of structures involved in motor control, habit formation, and deciding which thoughts or movements should surface and which should be suppressed. They’re basically your brain’s security and routing layer.
The LLM equivalent is the routing logic in mixture-of-experts (MoE) models. Every major provider uses some degree of MoE at this point. Different parts of the network activate depending on the task at hand, which is exactly what the basal ganglia does. System prompts play a similar role too, shaping how the model decides to respond given a particular input or situation.
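A top-1 MoE gate, reduced to its skeleton. The scoring below is a hand-written stand-in for a learned gating network, but the shape, score the experts and run only the winner, is the real pattern:

```python
# Two "experts"; in a real MoE these are separate feed-forward blocks.
EXPERTS = {
    "math": lambda x: f"math expert handles: {x}",
    "prose": lambda x: f"prose expert handles: {x}",
}

# Stand-in for a learned gating network.
def gate_scores(token: str) -> dict[str, float]:
    return {"math": 1.0 if any(c.isdigit() for c in token) else 0.0,
            "prose": 0.5}

def forward(token: str) -> str:
    scores = gate_scores(token)
    winner = max(scores, key=scores.get)
    return EXPERTS[winner](token)

print(forward("3+4"))    # math expert handles: 3+4
print(forward("hello"))  # prose expert handles: hello
```

Most of the network stays dark on any given token, which is the whole efficiency argument for MoE, and a decent cartoon of basal ganglia gating.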
So What?
None of these comparisons are perfect. The brain is biological, messy, and shaped by millions of years of evolution. LLMs are mathematical, deterministic (mostly), and shaped by a few years of engineering. But the structural parallels are hard to ignore. Attention mechanisms, memory systems, reward signals, gating logic. We keep arriving at similar architectural patterns, just built differently.
I don’t think that’s a coincidence, it tells us something about what intelligence requires, regardless of whether it’s running on neurons or GPUs.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ AI / Llms / Neuroscience
-
NotebookLM Is Just RAG With a Nice UI
I’ve been watching AI YouTubers recommend NotebookLM integrations that involve authenticating your Claude instance with some random skill they built. “Download my thing, hook it up, trust me bro.” No details on how it works under the hood. No mention of why piping your credentials through someone else’s code might be a terrible idea. Let’s just gloss over that, I guess.
So here we are. Let me explain what NotebookLM actually is, because once you understand RAG, the magic disappears pretty quickly.
What Is RAG?
RAG stands for Retrieval Augmented Generation. It’s an AI framework that improves LLM accuracy by retrieving data from trusted sources before generating a response.
The LLM provides the reasoning and token generation. RAG provides specific, trusted context. Combining the two gives you general reasoning grounded in your actual data instead of whatever the model memorized or hallucinated from its training set.
The core pipeline looks like this:
- Take your trusted data (docs, PDFs, YouTube transcripts, whatever)
- Chunk it into pieces
- Create vector embeddings from those chunks
- Store the vectors in a database
- When you ask a question, embed the question into the same vector space
- Find the most similar chunks
- Feed those chunks into the LLM as context alongside your question
That’s it. That’s NotebookLM. Steps 1 through 6 are the retrieval half. Step 7 is where the LLM synthesizes an answer. The nice UI on top doesn’t change what’s happening underneath.
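The steps above can be sketched in a few dozen lines. This is a minimal toy, not NotebookLM's actual implementation: bag-of-words counts stand in for real embeddings, and the documents are made up.

```python
import math
from collections import Counter

# Minimal sketch of the retrieval half of RAG. Bag-of-words counts stand in
# for real vector embeddings; a production system would use an embedding
# model and a vector database.
docs = [
    "RAG retrieves trusted context before generation",
    "Vector embeddings map text into a similarity space",
    "NotebookLM is a RAG pipeline with a chat interface",
]

def embed(text):
    # Steps 2-3: "chunk" (one doc per chunk here) and embed
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

index = [(doc, embed(doc)) for doc in docs]  # step 4: store the vectors

def retrieve(question, k=1):
    q = embed(question)                      # step 5: embed the question
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]    # step 6: most similar chunks

context = retrieve("what is NotebookLM?")
# Step 7 would feed `context` to the LLM alongside the question.
```

Swap the Counter for a real embedding model and the list for a vector database and you have the production version of the same pipeline.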
I Accidentally Built Half of It?
I was interested in the semantic embeddings portion of this pipeline and ended up building something I called Semantic Docs. It handles the retrieval half, steps 1 through 6.
You point it at a knowledge base, internal company docs, research papers, whatever you’re interested in. It chunks the content, creates vector embeddings, and stores them in a database. When you search, it creates a new embedding from your query, finds the most similar chunks, and returns those as search results.
The difference between Semantic Docs and NotebookLM is that last step. Semantic Docs gives you the relevant files and passages. It says “here’s where the answers live, go read it.” It doesn’t pipe everything through an LLM to generate a synthesized response. This is a choice, a deliberate choice, not a missing feature.
Why No Official API Is a Problem
NotebookLM doesn’t have an official API. People have reverse-engineered how it works, which means every integration you see is built on undocumented behavior that could break at any time. The AI YouTubers recommending these workflows are essentially saying “trust this unofficial thing with your data and credentials.” That should make you uncomfortable.
If you understand RAG, you can build the parts you actually need. The retrieval half is genuinely useful on its own, and you control the whole pipeline. No third-party authentication. No undocumented APIs. No wondering what happens to your data.
I’ll probably write more about RAG in the future. It’s a good topic and there’s a lot of noise to cut through. For now, just know that the next time someone tells you NotebookLM is magic, it’s really just vector search with a chat interface on top.
If you’re a developer, I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week. Or you can find me on Mastodon at @[email protected].
/ AI / Programming / Rag / Notebooklm
-
AI-Assisted vs AI-Agentic Coding
There are two ways to work (code) with AI tools right now. I think most people know the second one exists, but they haven’t taken the time to try it. You should know how to do both. And when to do both.
Assisted Mode
Everybody knows this one. You write some code, you get stuck, you ask a question.
How does date parsing work in Python? What’s this function do? Haven’t we built this already? I need some fucking Regex again.
The AI answers. You copy-paste or accept the suggestion. You keep going. You’re driving. The AI is in the passenger seat reading the map.
I mean, this is really useful. I’m not going to pretend it isn’t. It’s also just autocomplete with opinions. Fancy autocomplete. Smart autocomplete.
Great. You’re doing the thinking. You’re deciding what gets built and how to structure it and what order to do things in. You’re just asking for help on some of the blanks. That’s assisted mode.
Agentic Mode
This is different.
You describe what you want. You need to know how to describe what you want.
That is extremely important. Let me say that again. You need to know how to describe what you want.
You need to build an agent that understands how to interpret your description as what you want.
Sometimes it’s going to get it correct and sometimes it’s not. It’s going to go in a different direction than you wanted and you’re going to have to correct it. That’s the job now. You’re reviewing the output, the code, and how it’s producing the code. What are the gaps? You have to find the gaps and improve the agent so that it understands you better.
When I Use Which
I wish I had a clean rule for this. I don’t. That’s the vibes part.
Small or specific things can be assisted. Quick answers. Great. Easy. Move on.
Once you start wanting to touch multiple files, agentic. Major features like commands or parser changes or handler rewrites, recipes or tests. I’m not writing all that by hand. I can describe what I want way better than I can autocomplete it.
Bug fixes? Depends. If I already know where the bug is, assisted. If I don’t, agentic. Let the agent grep around and figure it out. It’s better at reading a whole codebase quickly than I am. Not better at understanding it. Better at reading it.
New features? Almost always agentic. I describe the feature, point it at similar code in the repo, and let it go.
Again, review is super important. Sometimes you have to send it back or start over or change major portions of it. And if you build a system that learns, it’ll get better along the way.
The Review Problem
Switching to agentic mode, your entire job is code review. All day, all the time, constant. That’s the human’s job. Code review.
Are you good at code review? You should get better at it. You need to get better at it.
This is not whether or not the tests pass. You need to identify possible issues and then describe tests that can check for those issues.
The nuanced bugs are the worst. And if those make it to production, you’re going to have problems.
Don’t skim the diff.
That should be the new motto. Read the code. Get better at code comprehension. It’s extremely important. You may be writing less code but you need to sure as shit understand what the code is doing and how it can be bad.
The Hybrid Reality
It’s totally fine to switch between modes depending on what you’re doing or your work session. Agentic can be way more impactful, but assisted mode is way better at helping you understand what the code is doing because you can select code blocks and easily ask questions about it.
So it’s not a toggle, it’s a spectrum. Now isn’t that funny? I’m on the spectrum of agentic development.
Where are you on the spectrum of agentic development?
So Which Is Better?
Neither. Both. It depends. Whatever, just build stuff.
Is assisted mode safer? Really? Like, does the human actually write better code this way? I don’t know. Agentic mode can be faster and you need to be super careful that it’s not gaslighting you into thinking it knows what it’s doing.
Build software for you. And when it makes sense, help out with the community stuff. Support open source.
If you’re a developer, I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week. Or you can find me on Mastodon at @[email protected].
/ AI / Development / Claude / Agents
-
Claude Opus 4.7 Is Here
Anthropic just announced Claude Opus 4.7 yesterday, and here is my take on the new model after reading the blog post and doing a bit of research on their rollout plans from previous models.
What’s New
The headline is a 13% improvement on a 93-task coding benchmark over Opus 4.6. On Rakuten’s SWE-Bench evaluation, 3x more production tasks were resolved, which is the kind of real-world metric that actually matters. Benchmarks are one thing, but “can it handle my actual codebase” is another.
The big quality-of-life improvement is that Opus 4.7 is better at verifying its own output before telling you it’s done. If you’ve ever had a model confidently hand you broken code and say “there you go,” you know why this matters. It handles long-running tasks with more precision, and the instruction following is noticeably tighter.
There’s also a major vision upgrade. The new model accepts images up to 2,576 pixels on the long edge, which is more than 3x the resolution of previous Claude models. If you’re working with technical diagrams, architecture charts, or screenshots of code, that’s a real improvement.
When Can You Actually Use It?
For enterprise customers, Anthropic says Opus 4.7 is available from your cloud vendor: the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. But most of us aren’t using the API directly.
As of right now, Opus 4.7 is not yet available in Claude Code or the desktop app. It’s also not showing up in the model picker on claude.ai for Pro plan users. Anthropic’s announcement says “available today across all Claude products,” but that doesn’t seem to have fully rolled out yet for consumer plans.
Looking at previous releases, Opus 4.6 launched on February 5th and was accessible on claude.ai and the API the same day. Historically, Anthropic hasn’t gated new Opus models behind higher tiers, so there’s no reason to think Pro, Max, Team, and Enterprise won’t all get access. The question is just when. If past patterns hold, it should show up within a few days. Keep checking your model picker.
Claude Code Users
As of today, Claude Code on the stable release is still on Opus 4.6. I’m not sure if it’s available on the bleeding edge builds, but for most people it’s not there yet.
The announcement mentions a few Claude Code features coming with 4.7:
- `/ultrareview` is a new slash command for dedicated code review sessions. Pro and Max users get three free ultrareviews to try it out.
- Auto mode has been extended to Max plan users, letting Claude make more decisions autonomously.
- The default effort level is being bumped to `xhigh` (a new level between `high` and `max`), which means the model will spend more time reasoning through harder problems.
Once Opus 4.7 does show up in Claude Code, remember to check any custom agents or skills that have a model hardcoded in the frontmatter. If you’ve got `claude-opus-4-6` specified in your `.claude/commands/` directory or agent configurations, those will keep using the old model until you update them.

Anthropic also notes that Opus 4.7 follows instructions more literally than previous models. Prompts written for earlier models can sometimes produce unexpected results. So if something feels off after switching, it’s worth re-tuning your prompts.
The Tokenizer and Cost Changes
One thing to be aware of: the tokenizer has been updated. The same input text will produce 1.0 to 1.35x more tokens than before. That means your costs could go up slightly even at the same per-token pricing ($5/million input, $25/million output, unchanged from 4.6). Not a dealbreaker, but worth watching if you’re running high-volume workloads.
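Here's the back-of-envelope math at the listed prices (the session sizes below are made-up examples):

```python
# Rough cost check of the tokenizer change at the unchanged per-token prices.
INPUT_PRICE = 5 / 1_000_000    # $ per input token
OUTPUT_PRICE = 25 / 1_000_000  # $ per output token

def session_cost(input_tokens, output_tokens, tokenizer_factor=1.0):
    # tokenizer_factor models the 1.0x-1.35x inflation in token counts
    return (input_tokens * tokenizer_factor * INPUT_PRICE
            + output_tokens * tokenizer_factor * OUTPUT_PRICE)

old = session_cost(2_000_000, 500_000)          # $22.50 under the old tokenizer
worst = session_cost(2_000_000, 500_000, 1.35)  # same text, worst case
```

Same text, same per-token prices, up to a 35% larger bill in the worst case.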
Pricing hasn’t changed, the coding improvements look useful, and it’s worth knowing that the model ID is `claude-opus-4-7`. Keep an eye on your model picker over the next few days.
-
Agentic Development Trends: What's Changed in Early 2026
I’ve been following the agentic development space around Claude Code and similar tools and the last couple months have been interesting. Here’s what I’m seeing as we move through March and April 2026.
From Solo Agents to Coordinated Teams
The biggest shift is that more people are moving away from trying to build one agent that does everything. Instead, we’re seeing coordinated teams of specialized agents managed by an orchestrator, often running tasks in parallel. I think this is the more proper use of these systems, and it’s great to see the community arriving here.
If you’re curious about the different levels of working with agentic software development, I created an agentic maturity model on GitHub that goes into more detail on this progression.
Long-Running Autonomous Workflows
Early on, agents handled what were essentially one-shot tasks. Now in 2026, agents can be configured to work for days at a time, requiring only strategic oversight at key decision points. Doesn’t that sound fun? You’re still the bottleneck, but at least now you’re a strategic bottleneck.
Graph-Based Orchestration
Frameworks like LangGraph and AutoGen are converging on graph-based state management to handle the complex logic of multi-agent workflows. I think this makes sense when you consider the branching and conditional logic of real-world tasks could map naturally to graphs.
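As a toy illustration of the idea (generic Python, not LangGraph's or AutoGen's actual API): nodes transform a shared state, and conditional edges decide which node runs next.

```python
# Minimal graph-based orchestration sketch: each node is (function, router),
# where the function transforms shared state and the router picks the next node.

def start(state):
    state["count"] = 0
    return state

def work(state):
    state["count"] += 1  # stand-in for an agent doing one unit of work
    return state

graph = {
    "start": (start, lambda s: "work"),
    "work":  (work,  lambda s: "work" if s["count"] < 3 else "end"),
}

def run(graph, node="start", state=None):
    state = state or {}
    while node != "end":
        fn, route = graph[node]
        state = fn(state)   # node transforms the shared state
        node = route(state)  # conditional edge picks the next node
    return state

result = run(graph)
```

The loop-until-done edge is the part that maps naturally onto real-world agent workflows: retry, branch, or hand off depending on what the state says.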
MCP Is Everywhere
MCP (Model Context Protocol) has become the industry standard for tool integration. All vendors fully support it, and there’s no sign of slowing down. Every week there are new MCP servers popping up for connecting agents to different services and tools.
Unified Agentic Stacks
The developer tooling is becoming more consistent. Cursor is becoming more like Claude Code, and Codex is becoming more like Claude Code. Maybe you see a pattern there… might tell you something about who’s setting the pace.
Also notable: people are experimenting with using different tools for different parts of the workflow. You might use Cursor to build the interface, Claude Code for the reasoning and main logic, and Codex for specific isolated tasks. Mix and match based on strengths.
Scheduled Agents and Routines
Claude Code recently released routines: scheduled or trigger-based automations that can run 24/7 on cloud infrastructure without needing your laptop. Microsoft is working on similar capabilities with GitHub Copilot, and Cursor had something like this a while back too.
Security Gets Serious
Two things happening here. First, people are getting better at leveraging agents for security reviews and monitoring. Tasks that previously required highly specialized InfoSec expertise. You no longer need to be a hacker to find vulnerabilities; you can let your AI try to hack you.
However, the same capabilities that harden defenses can also be used for offensive attacks. We’re seeing a major push for security-first architecture as a requirement for all new applications, specifically to defend against the rise of agentic offensive attacks. Red team and blue team are both getting AI-pilled.
FinOps: Watching the Bill
Last on the list is financial operations. Inference costs now account for over half of AI cloud spending according to recent estimates. Organizations are prioritizing frameworks that offer explicit cost monitoring and cost-per-task alerts. Getting granular about how much you’re spending to solve specific problems and optimizing at the task level. I think that’s pretty interesting and something we’ll see a lot more tooling around.
The common thread across all of these trends is maturity. We’re past the “wow, an AI wrote code” phase and into “how do we make this reliable, secure, and cost-effective at scale.” That’s a good place to be.
/ DevOps / AI / Development / Claude
-
What Is an AI Agent, Actually?
We need some actual definitions. The word “agent” is getting slapped onto every product and service, and marketers aren’t doing anybody favors as they SEO-optimize for the new agentic world we live in. There’s a huge range in what these things can actually do. Here is my attempt at clarity.
The Spectrum of AI Capabilities
Chatbot / Assistant — This is a single conversation with no persistent goals and no tool use. You ask it questions, it answers from a knowledge base. Think of the little chat widget on a product page that helps you find pricing info or troubleshoot a common issue. It talks with you, and that’s about it.
LLM with Tool Use — This is what you get when you open “agent mode” in your IDE. Your LLM can read files, run commands, edit code. A lot of IDE vendors call this an agent, but it’s not really one. It’s a language model that can use tools when you ask it to. The key difference: you are still driving. You give it a task, it does that task, you give it the next one.
Agent — Given a goal, it can plan and execute multi-step workflows autonomously. By “workflow” I mean a sequence of actions that depend on each other: read a file, decide what to change, make the edit, run the tests, fix what broke, repeat. It has reasoning, memory, and some degree of autonomy in completing an objective. You don’t hand it step-by-step instructions. You describe what you want done, and it figures out how to get there.
Sub-Agent — An agent that gets dispatched by another agent, a command, or an “LLM with Tool Use” to handle a specific piece of a larger task. If you’ve used Claude Code or Cursor, you know what I’m talking about. The main chat coordinator kicks off a sub-agent to go research something, review code, or run tests in parallel while it keeps working on the bigger picture. The sub-agent has its own context and tools, but it reports back to the parent. It’s not a separate autonomous agent with its own goals. It’s more like delegating a subtask.
Multi-Agent System — Multiple independent agents coordinating together, either directly or through an orchestrator. The key difference from sub-agents: these agents have their own goals and specialties. They negotiate, hand off work, and make decisions independently. Think of a system where one agent monitors your infrastructure, another handles incident response, and a third writes the postmortem. Each agent operates autonomously but stays aware of the others.
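The "Agent" level above can be sketched as a loop: plan, act, evaluate, repeat until the goal is met. The planner and executor below are stand-ins for LLM calls and tool use, and the toy goal is deliberately trivial.

```python
# Minimal agent loop sketch: given a goal, plan the next step, execute it,
# check progress, and repeat. Real agents swap in LLM calls and tools here.

def agent(goal, plan, execute, is_done, max_steps=10):
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):
        if is_done(state):
            return state
        action = plan(state)      # decide the next step from goal + history
        result = execute(action)  # tool use: edit a file, run tests, etc.
        state["history"].append((action, result))
    return state

# Toy usage: "do 3 steps" stands in for a real multi-step objective.
done = agent(
    goal=3,
    plan=lambda s: len(s["history"]) + 1,
    execute=lambda a: f"step {a} done",
    is_done=lambda s: len(s["history"]) >= s["goal"],
)
```

The defining feature is that loop: you hand over the goal and the evaluation criteria, not the step-by-step instructions.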
So How Is Something Like OpenClaw Different From a Chatbot?
A chatbot is designed to talk with you, similar to how you’d just talk with an LLM directly. OpenClaw is designed to work for you. It has agency. It can take actions. It’s more than just a conversation.
Obviously, how much it can do depends on what skills and plugins you enable, and what degree of risk you’re comfortable with. But here’s the interesting part: it’s proactive. It has a heartbeat mechanism that keeps it running continuously in the background. It’ll automatically check on things or take action on a schedule you specify, without you having to prompt it.
A Few Misconceptions Worth Clearing Up
OpenClaw is just one specific framework for building and orchestrating agents, but the misconceptions around it apply broadly.
“Agents have to run locally.” That’s how OpenClaw works, sure. But in reality, the enterprise agents are running invisibly in the background all the time. Your agent doesn’t need to live on your laptop.

“Agents need a chat interface.” Because you can talk to an agent, people assume you must have a chat interface for it to be an agent. But by definition, agents don’t require a conversation. They can just run in the background doing things. No chat window needed.

“Sub-agents are just function calls.” This one trips up developers. When your agent spawns a sub-agent, it’s not the same as calling a function. The sub-agent gets its own context window, its own reasoning loop, its own tool access. It can make judgment calls the parent didn’t anticipate. That’s fundamentally different from passing arguments to a function and getting a return value.
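A rough way to see the contrast in code (hypothetical names, not any real framework's API): a function maps arguments to a value and remembers nothing, while a sub-agent object carries its own context and its own loop, then reports back.

```python
# Illustrative contrast between a function call and a sub-agent.

def plain_function(x):
    return x * 2  # arguments in, value out, nothing remembered

class SubAgent:
    def __init__(self, task):
        self.context = [f"task: {task}"]  # its own context, separate from the parent

    def step(self, observation):
        self.context.append(observation)  # its own loop accumulates state
        # A real sub-agent would reason over self.context and pick a tool here.
        return len(self.context)

    def report(self):
        return self.context[-1]  # reports back to the parent when done

sub = SubAgent("review the auth module")
sub.step("found a missing input check")
```

The function's behavior is fully determined by its arguments; the sub-agent's next move depends on everything it has seen so far, which is where the unanticipated judgment calls come from.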
Why Write This Down
I mainly wrote this for myself. I keep running into these terms and needing a mental model to put them in context, especially as I’m thinking about building agentic systems and trying to decide what level of capability I actually need for a given problem. The process of writing it down makes those decisions somewhat easier.
-
A Concrete Definition of an AI Agent
An AI agent pursues a goal by iteratively taking actions, evaluating progress, and deciding next steps. Useful agents must be reliable, adaptive, and accurate.
/ AI / links / agent / automation
-
The Death of Clever Code
One positive byproduct of working with agentic tools is that they rarely suggest clever code. No arcane one-liners, no “look how smart I am” abstractions. And, well, I’m here for it.
Before we continue it helps to understand a bit about how LLMs work. These models are optimized for pattern recognition. They’ve been trained on massive amounts of code and learned what patterns appear most frequently.
Clever code, by definition, is bespoke. It’s the unusual pattern, the one-off trick. There just isn’t enough training data for cleverness. The AI gravitates toward the common, readable solution instead.
Let me give you an example.
Show Me the Code
Here’s a nested ternary:
```js
const result = a > b ? (c > d ? 'high' : 'mid') : (e > f ? 'low' : 'none');
```

I’d be impressed if you could explain that correctly on your first try. What happens when there’s a bug in one of those conditions? Good luck debugging that.
Now here’s the same logic:
```js
let result;
if (a > b) {
  if (c > d) {
    result = 'high';
  } else {
    result = 'mid';
  }
} else {
  if (e > f) {
    result = 'low';
  } else {
    result = 'none';
  }
}
```

A lot easier, right? If it’s easy to read, it’s easy to maintain. The AI tooling doesn’t struggle to read either version, but you might, and when there is a bug, explaining exactly what needs to change becomes the hard part.
Actually wait. It turns out, not all complexity is created equal.
Two Kinds of Complexity
Essential complexity is the complexity of the problem itself. If you’re building a mortgage calculator or doing tax calculations, there’s inherent complexity in understanding the domain. You can’t simplify that away, and you shouldn’t try.
Accidental complexity is the stuff you introduce. The nested ternary instead of the if/else. Five layers of abstraction for the sake of abstraction that only runs in a specific edge case. Generic utility functions where you’ve tried to cover every possible scenario, but realistically you only need two or three cases.
Ok but what about abstraction, since abstraction is where accidental complexity loves to hide?
Good Abstraction vs. Bad Abstraction
Abstraction shows up everywhere in programming, but let’s think about it in two flavors.
Good abstraction hides details the caller doesn’t need to care about. The interface clearly communicates what it does. Think `array.sort()`: you look at it and immediately know what’s happening. Those dang arrays getting some sort of sorted. You know exactly what it does without caring about the implementation.

Bad abstraction hides details you do need to understand in order to use it correctly. Think of a `processData()` method that’s doing six different things with an internal state that’s nearly impossible to test. And splitting it into `processData1()` through `processData6()` doesn’t help either. That’s just moving the vegetables around on your plate, which doesn’t mean you’ve actually finished dinner.

AI Signals
So why does any of this matter for working with AI coding tools?
Because if the agents keep getting your code wrong, if they consistently misunderstand what a function does or keep making incorrect modifications, that’s a signal.
It’s telling you that your code has some flavor of cleverness that makes it hard to reason about. Not just for the AI, but for your team, and for you six months from now.
The goal is to code where the complexity comes from the problem, not from the solution. The AI struggling with your code is like a canary in the coal mine for maintainability.
/ AI / Programming / Code-quality
-
Your AI Agent Needs a Task Manager
If you’ve spent time working with AI coding tools, you’ve probably hit the compaction wall. Suddenly, your agent knows what it’s currently working on but has completely forgotten the five other things connected to it.
This is the memory problem, and it’s a big one.
The Context Window Isn’t Enough
Your AI agent needs some sort of memory system that lives outside the context window. When you’re working on simple, one-off tasks, the chat-as-workspace approach works fine. You ask a question, you get an answer, you move on. But the moment you’re tackling a complex set of related tasks? It breaks down fast.
I’ve been thinking about this through the lens of a framework I’m calling the Agentic Maturity Model. The short version is that there are distinct levels to how teams and developers use AI agents, and moving between levels isn’t about using “better” tools, but rather it’s a shift in how you approach the work.
Four months ago, there were no real options. The good news? It seems like all the model providers recognize this is the next frontier. Memory and persistence are where I’m looking for the actual progress to happen next.
Claude Code has certainly gotten better in these areas over the last couple of months. They’ve added an auto memory feature in beta. They added a lightweight Tasks system based on a Todo system called Beads built by Steve Yegge. His key idea was that the task state should live outside the context window.
These are meaningful building blocks towards an actual working memory system that persists across sessions and survives compaction.
We’re Almost There
The tooling and harnesses we built on top of the LLMs are already changing how software gets built, but where are we headed? Here is what I think:
- auto-improving memory: where the agent learns your patterns, your codebase, your preferences
- persistent task tracking that survives compaction: Tasks, todos, issues, whatever you want to call them. The point is they exist outside the conversation.
When those two pieces come together properly, the workflow for everyone will change again.
Your agent doesn’t just respond to the current prompt. It knows where it is in a larger plan, what’s been done, what’s blocked, and what’s next. That’s the difference between a helpful chatbot and an actual collaborator.
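One minimal shape for that kind of persistence, sketched (the file name and fields below are made up, not any real tool's format): task state in a plain JSON file that any fresh session can reload.

```python
import json
import os
import tempfile

# Sketch of task state that lives outside the context window: a plain JSON
# file the agent rewrites as it works. It survives compaction because it was
# never in the conversation at all.
path = os.path.join(tempfile.gettempdir(), "agent_tasks.json")

tasks = [
    {"id": 1, "title": "write parser", "status": "done"},
    {"id": 2, "title": "add tests", "status": "blocked", "blocked_on": 1},
    {"id": 3, "title": "update docs", "status": "todo"},
]
with open(path, "w") as f:
    json.dump(tasks, f)

# A fresh session (new context window) reconstructs where things stand:
with open(path) as f:
    reloaded = json.load(f)
next_up = [t for t in reloaded if t["status"] == "todo"]
```

Everything the agent needs to resume, what's done, what's blocked, what's next, lives on disk instead of in the conversation.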
We are so close I can taste the blood in the water, oh wait, that’s mine. ☠️
-
Using Claude to Think Through a Space Elevator
When I say I wanted to understand the engineering problems behind building a space elevator, I mean I really wanted to dig in. Not just read about it. I wanted to work through the challenges, piece by piece, with actual math backing things up.
So I decided to see what Claude and I could do with this kind of problem.
Setting it Up
I have an Obsidian vault that Claude Code/CoWork has access to, and I started by asking it to help me understand the core challenges of building a space elevator. First things first: clearly state all the problems. What are the engineering hurdles? What makes this so hard?
From there, I started asking questions. Could we use an asteroid as the anchor point and manufacture the cable in space? How would we spool enough cable to reach all the way down to Earth? Would it make more sense to build up from the ground, down from orbit, or meet somewhere in the middle?
I’ll admit I made some mistakes along the way. I confused low Earth orbit with geostationary orbit at one point but Claude corrected me and explained the difference. That’s part of what makes this approach work. You’re not just passively reading; you’re actively thinking through problems and getting corrected when your mental model is off.
Backing It Up With Math
Here’s where it got really interesting. I told Claude: don’t just describe the problems. Prove them. Back up every challenge with actual math and physics calculations.
I also told it not to try cramming everything into one massive document. Write an overview document first, then create supporting documents for each problem so we could work through them individually.
So Claude started writing Python code to validate all the calculations. I hadn’t planned on that initially, but once it started writing code, I jumped in with my typical guidance. Use a package manager, write tests for all the code.
What we ended up with is a Python module covering about 12 of the hardest engineering challenges for a space elevator. There’s a script that calls into the module, runs all the math, and spits out the results. It’s not a complete formal proof of anything, but it’s a structured way to think through problems where the code can actually catch mistakes in the reasoning.
And it did catch mistakes. That’s the whole point of this approach, you’re using the calculations as a check on the thinking, not just trusting the narrative.
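One small example of the kind of check involved (my own sketch here, not the actual module): deriving the geostationary radius from first principles, since that's the altitude a space elevator's design is anchored around.

```python
import math

# Derive the geostationary orbital radius from Kepler's third law:
# T^2 = 4 pi^2 r^3 / (G M), solved for r. The cable's center of mass has to
# sit near this altitude, so the number anchors many other calculations.
G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24    # Earth's mass, kg
T = 86164.1     # sidereal day, s

r_geo = (G * M * T**2 / (4 * math.pi**2)) ** (1 / 3)
altitude_km = (r_geo - 6.371e6) / 1000  # above Earth's mean radius

# About 42,164 km from Earth's center, roughly 35,800 km up.
```

A calculation like this is exactly the kind of thing that catches a confused mental model, like mixing up low Earth orbit with geostationary orbit.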
Working Through Problems Together
As we worked through each challenge, I kept asking clarifying questions. What about this edge case? How would we handle that constraint?
It was genuinely collaborative, me bringing curiosity and some engineering intuition, Claude bringing the ability to quickly formalize ideas into code and calculations.
The code isn’t public or anything. But the approach is what I think is worth sharing.
The Hard Part Is Still Hard
My main limiting factor is time. The math looks generally fine to me, but if I really wanted to verify everything thoroughly, I’d need to spend a lot more time with it. A mathematician or physicist who’s deeply familiar with these calculations would be much faster at spotting issues. Providing guidance like, “no, you shouldn’t use this formula here, that approach is wrong.”
I can do that work. It’s just going to take me significantly longer than someone with that specialized background.
This is what I mean when I talk about working with agentic tools on hard problems. It’s not about asking an AI for the answer. It’s about using it as a thinking partner; one that can write code, run calculations, and help you check your reasoning as you go.
For me, that’s the real power of tools like Claude. Not replacing expertise, but amplifying curiosity.
/ AI / Claude / Space / Engineering
-
Voice-to-Text in 2026: The Tools and Models Worth Knowing About
As natural language becomes a bigger part of how we build software, it’s worth looking at the state of transcription models. What’s the best way to get voice to text right now?
For a lot of people, talking to your computer is faster than typing. You can stream-of-thought your way through an idea, prompt your tools, and get things moving without your fingers being the bottleneck. If you haven’t tried it yet, it will change how you work with your machine. I’m not exaggerating.
The Tools
Here’s what people are actually using for desktop voice-to-text:
- Willow Voice — Popular choice, lots of people swear by it
- SuperWhisper — My current pick
- Wispr Flow — Another well-regarded option
- Voice Ink — Worth a look?
- Aiko — From an Open Source dev, Sindre Sorhus
- MacWhisper — Solid Mac-native option
I’ve tried several of these, and the biggest pain point is that many require monthly subscriptions. I’ve been happy with SuperWhisper, and it’s worth mentioning they still have a pay-once (lifetime) option, so you don’t get locked into monthly payments forever. That said, Willow Voice and Wispr Flow both have strong followings.
The Models Behind the Magic
Most of these tools started with OpenAI’s Whisper, the voice model released and open-sourced back in 2022. With Whisper, you could run solid transcription locally on your own hardware.
But we’re a few years past that now, and there are some more models to choose from. Here is a summary table of the current state of the transcription models.
| Model | Company | Released | Local Run? | Used in Desktop Tools? | Best For |
| --- | --- | --- | --- | --- | --- |
| Whisper Large-v3 | OpenAI | Nov 2023 | Yes | Yes (The Standard) | Multilingual accuracy (99+ langs) |
| Whisper v3 Turbo | OpenAI | Oct 2024 | Yes | Yes (Fast Settings) | Best speed-to-accuracy ratio for local use |
| Nova-3 | Deepgram | Apr 2025 | Self-Host | Limited (API-based) | Real-time agents; handling messy background noise |
| Parakeet TDT 1.1B | NVIDIA | May 2025 | Yes | Developer-focused / CLI | Ultra-low latency; significantly faster than Whisper |
| SenseVoice-Small | Alibaba | July 2024 | Yes | Emerging (Fringe) | High-precision Mandarin/English and emotion detection |
| Canary-1B | NVIDIA | Oct 2025 | Yes | Developer-focused | Beating Whisper on technical jargon & punctuation |
| Voxtral Mini V2 | Mistral | Feb 2026 | Yes | Yes (Privacy apps) | High-speed local transcription on low-VRAM devices |
| Granite Speech 3.3 | IBM | Jan 2026 | Yes | No (Enterprise focus) | Reliable technical ASR with an Apache 2.0 license |
| Scribe v2 | ElevenLabs | Jan 2026 | No | Via API | Extremely lifelike punctuation and speaker labels |

We’re at an interesting inflection point. You can articulate your thoughts faster by speaking than typing, and it’s becoming a real productivity gain. It’s not just an accessibility aid anymore. People who can type well enough are using these tools on a daily basis.
That’s all for now!
/ Productivity / AI / Tools / Voice
-
Your Context Window Is a Budget — Here's How to Stop Blowing It
If you’re using agentic coding tools like Claude Code, there’s one thing you should know by now: your context window is a budget, and everything you do spends it.
I’ve been thinking about how to manage that budget. As we learn to use sub-agents, MCP servers, and all these powerful capabilities, we haven’t been thinking enough about the cost of using them. The dollars and cents matter too if you’re using API access, but the raw token budget you burn through in a single session affects everyone regardless. Once it’s gone, compaction kicks in, and it’s a crapshoot whether the new session knows how to pick up where you left off.
Before we talk about what you can do about it, let’s look at where your tokens actually go.
Why Sub-Agents Are Worth It (But Not Free)
Sub-agents are one of the best things to have in agentic coding. The whole idea is that work happens in a separate context window, leaving your primary session clean for orchestration and planning. You stay focused on what needs to change while the sub-agent figures out how.
Sub-agents still burn through your session limits faster than you might expect. There are actually two limits at play here:
- the context window of your main discussion
- the session-level caps on how many exchanges you can have in a given time period.
Sub-agents hit both. They’re still absolutely worth using and working without them isn’t an option, but you need to be aware of the cost.
The MCP Server Problem
MCP servers are another area where things get interesting. They’re genuinely useful for giving agentic tools quick access to external services and data. But if you’ve loaded up a dozen or two of them? You’re paying a tax at the start of every session just to load their metadata and tool definitions. That’s tokens spent before you’ve even asked your first question.
My suspicion, and I haven’t formally benchmarked this, is that we’re headed toward a world where you swap between groups of MCP servers depending on the task at hand. You load the file system tools when you’re coding, the database tools when you’re migrating, and the deployment tools when you’re shipping. Not all of them, all the time.
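I haven’t benchmarked this either, but the swap could be as simple as symlinking a task-specific config into place. Claude Code reads project-scoped servers from .mcp.json; the per-task file names and server entries below are invented for illustration:

```shell
# Sketch: keep one MCP config per task and activate only what you need.
# mcp.coding.json and mcp.deploy.json are hypothetical names.
cat > mcp.coding.json <<'EOF'
{ "mcpServers": { "filesystem": { "command": "fs-server" } } }
EOF
cat > mcp.deploy.json <<'EOF'
{ "mcpServers": { "deploy": { "command": "deploy-server" } } }
EOF

# Coding session: load only the filesystem tools.
ln -sf mcp.coding.json .mcp.json

# Shipping day: swap in the deployment tools instead.
# ln -sf mcp.deploy.json .mcp.json
```

The point isn’t the exact mechanism; it’s that the active config stays small, so the per-session metadata tax stays small.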
There are likely more subtle problems too. When you have overlapping MCP servers that can accomplish similar things, the agent could get confused about which tool to call. It might head down the wrong path, try something that doesn’t work, backtrack, and try something else. Every one of those steps spends your token budget on nothing productive.
The Usual Suspects
Beyond sub-agents and MCP servers, there are the classic context window killers:
- Web searches that pull back pages of irrelevant results
- Log dumps that flood your context with thousands of lines
- Raw command output that’s 95% noise
- Large file reads when you only needed a few lines
The pattern is the same every time: you need a small slice of data, but the whole thing gets loaded into your context window. You’re paying full price for information you’ll never use.
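When you do know what you’re after, a targeted slice is dramatically cheaper than a dump. A toy illustration (the log file and its contents are made up):

```shell
# Simulate a noisy 2,001-line log where only one line matters.
seq 1 2000 | sed 's/^/INFO heartbeat /' > app.log
echo "ERROR database connection refused" >> app.log

# Dumping the whole file costs ~2,001 lines of context:
wc -l < app.log

# A targeted filter costs exactly one:
grep ERROR app.log
```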
And here’s the frustrating part — you don’t know what the relevant data is until after you’ve loaded it. It’s a classic catch-22.
Enter Context Mode
Somebody (Mert Köseoğlu - mksglu) built a really clever solution to this problem. It’s available as a Claude Code plugin called context-mode. The core idea is simple: keep raw data out of your context window.
Instead of dumping command output, file contents, or web responses directly into your conversation, context-mode runs everything in a sandbox. Only a printed summary enters your actual context. The raw data gets indexed into a SQLite database with full-text search (FTS5), so you can query it later without reloading it.
It gives Claude a handful of new tools that replace the usual chaining of bash and read calls:
- ctx_execute — Run code in a sandbox. Only your summary enters context.
- ctx_execute_file — Read and process a file without loading the whole thing.
- ctx_fetch_and_index — Fetch a URL and index it for searching, instead of pulling everything into context with WebFetch.
- ctx_search — Search previously indexed content without rerunning commands.
- ctx_batch_execute — Run multiple commands and search them all in one call.
There are also slash commands to check how much context you’ve saved in a session, run diagnostics, and update the plugin.
The approach is smart. All the data lives in a SQLite FTS5 database that you can index and search, surfacing only the relevant pieces when you need them. If you’ve worked with full-text search in libSQL or Turso, you’ll appreciate how well this maps to the problem. It’s the right tool for the job.
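To make the idea concrete, here’s a rough sketch of the index-then-search pattern (not context-mode’s actual schema, and the indexed lines are invented), assuming your sqlite3 build includes FTS5 (most modern ones do):

```shell
# Stash bulky output in an FTS5 table once...
sqlite3 ctx.db <<'EOF'
CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(content);
INSERT INTO chunks(content) VALUES
  ('INFO server started on port 8080'),
  ('INFO heartbeat ok'),
  ('ERROR connection reset by peer'),
  ('INFO heartbeat ok');
EOF

# ...then full-text search it later instead of re-running the command.
sqlite3 ctx.db "SELECT content FROM chunks WHERE chunks MATCH 'connection';"
# prints: ERROR connection reset by peer
```

Only the matching rows ever need to enter the conversation; the rest of the raw output stays on disk.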
The benchmarks are impressive. The author reports overall context savings of around 96%. When you think about how much raw output typically gets dumped into a session, it makes sense. Most of that data was never being used anyway.
What This Means for Your Workflow
I think the broader lesson here is that context management is becoming a first-class concern for anyone doing serious work with agentic tools. It’s not just about having the most powerful model; it’s about using your token budget wisely so you can sustain longer, more complex sessions without hitting the wall.
A few practical takeaways:
- Be intentional about MCP servers. Load what you need, not everything you have.
- Use sub-agents for heavy lifting, but recognize they cost session tokens.
- Avoid dumping raw output into your main context whenever possible.
- Tools like context-mode can dramatically extend how much real work you get done per session.
We’re still early in figuring out the best practices for working with these tools. But managing your context window? That’s one of the things that separates productive sessions from frustrating ones.
Hopefully something here saves you some tokens.
/ AI / Programming / Developer-tools / Claude
-
AI-Powered Process Orchestration Across the Enterprise | Appian
Simplify digital operations with Appian’s agentic automation platform - purpose-built for enterprise growth.
/ AI / links / agent / automation / platform
-
How to Write a Good CLAUDE.md File
Every time you start a new chat session with Claude Code, it’s starting from zero knowledge about your project. It doesn’t know your tech stack, your conventions, or where anything lives. A well-written CLAUDE.md file fixes that by giving Claude the context it needs before it writes a single line of code.
This is context engineering, and your CLAUDE.md file is one of the most important pieces of it.
Why It Matters
Without a context file, Claude has to discover basic information about your project — what language you’re using, how the CLI works, where tests live, what your preferred patterns are. That discovery process burns tokens and time. A good CLAUDE.md front-loads that knowledge so Claude can get to work immediately.
If you haven’t created one yet, you can generate a starter file with the /init command. Claude will analyze your project and produce a reasonable first draft. It’s a solid starting point, but you’ll want to refine it over time.
The File Naming Problem
If you’re working on a team, people use different tools: Cursor has its own context file, OpenAI has theirs, and Google has theirs. You can easily end up with three separate context files that all contain slightly different information about the same project. That’s a maintenance headache.
It would be nice if Anthropic made the filename a configuration setting in settings.json, but as of now they don’t. Some tools like Cursor do let you configure the default context file, so it’s worth checking.
My recommendation? Look at what tools people on your team are actually using and try to standardize on one file, maybe two. I’ve had good success with the symlink approach, where you pick your primary file and symlink the others to it. So if CLAUDE.md is your default, you can symlink AGENTS.md or GEMINI.md to point at the same file.
It’s not perfect, but it beats maintaining three separate files with diverging information.
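The symlink setup takes two commands per extra tool. A minimal sketch, using CLAUDE.md as the canonical file:

```shell
# Pick CLAUDE.md as the source of truth and point the other names at it.
printf '# Project context\n' > CLAUDE.md
ln -sf CLAUDE.md AGENTS.md
ln -sf CLAUDE.md GEMINI.md

# Every tool now reads the same content; edit CLAUDE.md and all stay in sync.
cat AGENTS.md
```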
Keep It Short
Brevity is crucial. Your context file gets loaded into the context window every single session, so every line costs tokens. Eliminate unnecessary adjectives and adverbs. Cut the fluff.
A general rule of thumb that Anthropic recommends is to keep your CLAUDE.md under 200 lines. If you’re over that, it’s time to trim.
I recently went through this exercise myself. I had a bunch of Python CLI commands documented in my context file, but most of them I rarely needed Claude to know about.
We don’t need to list every single possible command in the context file. That information is better off in a docs/ folder or your project’s documentation. Just add a line in your CLAUDE.md pointing to where that reference lives, so Claude knows where to look when it needs it.
Maintain It Regularly
A context file isn’t something you write once and forget about. Review it periodically. As your project evolves, sections become outdated or irrelevant. Remove them. If a section is only useful for a specific type of task, consider moving it out of the main file entirely.
The goal is to keep only the information that’s frequently relevant. Everything else should live somewhere Claude can find it on demand, not somewhere it has to read every single time.
Where to Put It
Something that’s easy to miss: you can put your project-level CLAUDE.md in two places:
- ./CLAUDE.md (project root)
- ./.claude/CLAUDE.md (inside the .claude directory)
A common pattern is to .gitignore the .claude/ folder. So if you don’t want to check in the context file — maybe it contains personal preferences or local paths — putting it in .claude/ is a good option.
Rules Files for Large Projects
If your context file is getting too large and you genuinely can’t cut more, you have another option: rules files. These go in the .claude/rules/ directory and act as supplemental context that gets loaded on demand rather than every session.
You might have one rule file for style guidelines, another for testing conventions, and another for security requirements. This way, Claude gets the detailed context when it’s relevant without bloating the main file.
Auto Memory: The Alternative Approach
Something you might not be aware of is that Claude Code now has auto memory, where it automatically writes and maintains its own memory files. If you’re using Claude Code frequently and don’t want to manually maintain a context file, auto memory can be a good option.
The key thing to know is that you should generally use one approach or the other. If you’re relying on auto memory, delete the CLAUDE.md file, and vice versa.
Auto memory is something I’ll cover in more detail in another post, but it’s worth knowing the feature exists. Just make sure you enable it in your settings.json if you want to try it.
Quick Checklist
If you’re writing or revising your CLAUDE.md right now, here’s what I’d focus on:
- Keep it under 200 lines — move detailed references to docs
- Include your core conventions — package manager, runtime, testing approach
- Document key architecture — how the project is structured, where things live
- Add your preferences — things Claude should always or never do
- Review monthly — cut what’s no longer relevant
- Consider symlinks — if your team uses multiple AI tools
- Use rules files — for detailed, task-specific context
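Pulling the checklist together, here’s a sketch of what a lean CLAUDE.md might look like. The project name, stack, and paths are invented for illustration:

```markdown
# MyApp

## Stack
- Python 3.12, managed with uv
- CLI entry point: myapp (full command reference lives in docs/cli.md)

## Conventions
- Run tests with pytest; tests live in tests/, mirroring src/
- Never commit directly to main; always work on a feature branch

## Architecture
- src/myapp/core/: business logic
- src/myapp/adapters/: external integrations (details in docs/architecture.md)
```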
That’s All For Now. 👋
/ AI / Programming / Claude-code / Developer-tools
-
Claude Code Skills vs Plugins: What's the Difference?
If you’ve been building with Claude Code, you’ve probably seen the terms “skill,” “plugin,” and “agent” thrown around. They’re related but distinct concepts, and understanding the difference will help you build better tooling. Let’s focus on skills versus plugins since those two are the most closely related.
Skills: Reusable Slash Commands
Skills are user-invocable slash commands, essentially reusable prompts that run directly in your main conversation. You trigger them with /skill-name and they execute inline. They can be workflows or common tasks that are done frequently.
Skills can live inside your .claude/skills/ folder, or they can live inside a plugin (where they’re called “commands” instead). Same concept, different home.
The important frontmatter you should pay attention to is the allowed-tools property. This defines which tool calls the skill can access, and there are three formats you can use:
- Comma-separated names — Bash, Read, Grep
- Comma-separated with filters — Bash(gh pr view:*), Bash(gh pr diff:*)
- JSON array — ["Bash", "Glob", "Grep"]
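For instance, a SKILL.md using the filtered format might open with frontmatter like this (the skill name, description, and prompt body are made up for illustration):

```markdown
---
name: pr-summary
description: Summarize the current pull request
allowed-tools: Bash(gh pr view:*), Bash(gh pr diff:*), Read
---

Fetch the open pull request's description and diff, then produce a
short, review-ready summary of the changes.
```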
I don’t think there’s a meaningful speed difference between them. The filtered format might take slightly longer to parse if you have a huge list, but in practice it’s negligible. Pick whichever is most readable for your use case.
The real power here is that skills can define tool calls and launch subagents. That turns a simple slash command into something that can orchestrate complex workflows.
Plugins: The Full Package
A plugin is a bigger container. It can bundle commands (skills), agents, hooks, and MCP servers together as a single distributable unit. Every plugin needs a .claude-plugin/plugin.json file, which is just a name, description, and author.
Plugins are a good way to bundle agents with skills. If your workflow needs a specialized agent that gets triggered by a slash command, a plugin is a good option for that.
Pushing the Boundaries of Standalone Skills
However, I wanted to experiment with what’s actually possible using standalone skills, so I built upkeep. It turns out that you can bundle actual compiled binaries inside a skill directory and call them from the skill. That opens up a lot of possibilities.
Here’s how I did it:
- The skill has a prerequisite section that checks for a bin/ folder containing the binary
- A workflow calls the binary, passing in the commands to run
- Each step defines what we expect back from the binary
You can see the full implementation in the SKILL.md file. It’s a pattern that lets you distribute real functionality, not just prompts, through the skill.
Quick Summary
- Skills are slash commands. Reusable prompts with tool access that run in your conversation.
- Plugins bundle skills, agents, hooks, and MCP servers together with a plugin.json.
- Skills are more flexible than you might expect: you can call subagents, distribute binaries, and build real workflows.
If you’re just getting started, skills are the easier entry point. When you need to package multiple pieces together or distribute agents alongside commands, that’s when you reach for a plugin.
Have fun building!
/ AI / Development / Claude-code
-
Claude Code Now Has Two Different Security Review Tools
If you’re using Claude Code, you might have noticed that Anthropic has been quietly building out security tooling. There are now two distinct features worth knowing about. They sound similar but do very different things, so let’s break it down.
The /security-review Command
Back in August 2025, Anthropic added a /security-review slash command to Claude Code. This one is focused on reviewing your current changes. Think of it as a security-aware code reviewer for your pull requests. It looks at what you’ve modified and flags potential security issues before you merge.
It’s useful, but it’s scoped to your diff. It’s not going to crawl through your entire codebase looking for problems that have been sitting there for months.
The New Repository-Wide Security Scanner
Near the end of February 2026, Anthropic announced something more ambitious: a web-based tool that scans your entire repository and operates more like a security researcher than a linter. This is the thing that will help you identify and fix security issues across your entire codebase.
First we need to look at what already exists to understand why it matters.
SAST tools — Static Application Security Testing. SAST tools analyze your source code without executing it, looking for known vulnerability patterns. They’re great at catching things like SQL injection, hardcoded credentials, or buffer overflows based on pattern matching rules.
If a vulnerability doesn’t match a known pattern, it slips through. SAST tools also tend to generate a lot of false positives, which means teams start ignoring the results.
What Anthropic built is different. Instead of pattern matching, it uses Claude to actually reason about your code the way a security researcher would. It can understand context, follow data flows across files, and identify logical vulnerabilities that a rule-based scanner would never catch. Think things like:
- Authentication bypass through unexpected code paths
- Authorization logic that works in most cases but fails at edge cases
- Business logic flaws that technically “work” but create security holes
- Race conditions that only appear under specific timing
These are the kinds of issues that usually require a human security expert, or a real attacker, to find.
SAST tools aren’t going away, and you should still use them. They’re fast, they catch the common stuff, and they integrate easily into CI/CD pipelines.
Also, the new repository-wide security scanner isn’t out yet, so stick with what you’ve got until it’s ready.
/ DevOps / AI / Claude-code / security
-
Ever wanted your CLAUDE.md to automatically update from your current session before the next compact? There’s a skill for that and it’s been helpful. In case you missed it, here’s a link to the skill:
/ AI / Claude-code
-
Managing Your Context Window in Claude Code
If you’re using Claude Code, there’s a feature you should know about that gives you visibility into how your context window is being used. The /context command breaks everything down so you can see exactly where your tokens are going.
Here’s what it shows you:
- System prompt – the base instructions Claude Code operates with
- System tools – the built-in tool definitions
- Custom agents – any specialized agents you’ve configured
- Memory files – your CLAUDE.md files and auto-memory
- Skills – any skills loaded into the session
- Messages – your entire conversation history
Messages is where you have the most control, and it’s also what grows the fastest. Every prompt you send, every response you get back, every file read, every tool output: it all shows up in your message history.
Then there’s the free space, which is what’s left for actual work before a compaction occurs. This is the breathing room Claude Code has to think, generate responses, and use tools.
You’ll also see a buffer amount that’s reserved for auto-compaction. You can’t use this space directly; it’s set aside so Claude Code has enough room to summarize the conversation and hand things off cleanly.
Why This Matters
Understanding your context usage helps you work more efficiently. A few ways to keep your context lean:
- Start fresh sessions for new tasks instead of reusing a long-running one
- Be intentional about file reads — only read what you need, not entire directories
- Use sub-agents — when you delegate work to a sub-agent, it runs in its own context window instead of yours. All those file reads, tool calls, and intermediate reasoning happen over there, and you just get the result back. It’s one of the best ways to preserve your primary context for the work that actually needs it.
- Trim your CLAUDE.md — everything in your memory files loads every session, so keep it tight
I’ll dig into sub-agents more in a future post. For now, don’t forget about /context.
/ AI / Claude-code / Developer-tools