Building a Linter for Your Obsidian Vault

In software we don’t trust ourselves to keep a codebase clean by hand. We run linters. They catch the dead code, the unused import, the function nobody calls anymore, the link that points at a file that got deleted three commits ago. We’ve decided that rot is inevitable and that a machine should hunt for it on every save.

Our note vaults get no such treatment. They rot exactly the same way: orphan notes with no links in or out, dead [[wikilinks]] pointing at notes you renamed, half-distilled drafts that have sat untouched for months. The difference is that nobody’s running a linter on them. So let’s write one.

The instinct, when a vault gets messy, is to reorganize the folders. Resist it. Folders force a note to live in exactly one place, which is a lie, because most ideas belong to several contexts at once. A note on caching strategy is relevant to a performance project, a systems-design reference, and a half-formed blog post, all at the same time. A folder makes you pick one and bury it.

The health of a vault lives in its links, not its hierarchy. A well-linked note is reachable from a dozen directions. An orphan, a note with zero links in or out, is functionally invisible. You will never stumble back into it. It’s dead the moment you save it. So the single most useful thing a vault linter can report is: which notes are orphans, and which links are broken.

Step one: parse the markdown, not just grep it

You could grep for [[ and call it a day, but you’ll get fooled by links inside code blocks, escaped brackets, and frontmatter. Parse it properly.

Two passes per file. First, pull the YAML frontmatter off the top with a library like python-frontmatter, which hands you the metadata as a dict and the body as a string. That’s where your status, tags, and aliases live. Second, run the body through a real markdown parser such as markdown-it-py and walk the token stream. Wikilinks are Obsidian syntax rather than CommonMark, so a stock parser won’t emit them as link tokens; collect them from the text tokens instead, and skip anything inside a fenced code block or an inline code span.

Now you have a clean model of the vault: a dict of every note keyed by filename, each with its frontmatter and its outbound links. From there the checks are short:

  • Broken links. For every outbound link, confirm the target file exists. If it doesn’t, the note was renamed or deleted. Flag it.
  • Orphans. Build the reverse index of inbound links. Any note with no inbound and no outbound links is an orphan. Flag it.
  • Stale drafts. Any note still at status: raw whose modified time is older than thirty days. Flag it.
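Sketched in Python, the three checks might look like this. The sketch assumes the vault is a dict keyed by note name, that link targets use those same names, and that each note carries a modification time read from the file. The Note class and the function names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Note:
    """Hypothetical stand-in for whatever the parsing pass produces."""
    metadata: dict = field(default_factory=dict)
    links: list = field(default_factory=list)
    modified: datetime = datetime.min


def broken_links(vault: dict) -> list:
    """Return (source, target) pairs whose target is not a note in the vault."""
    return [
        (name, target)
        for name, note in vault.items()
        for target in note.links
        if target not in vault
    ]


def orphans(vault: dict) -> list:
    """Return notes with no inbound links and no outbound links."""
    inbound = {name: 0 for name in vault}
    for note in vault.values():
        for target in note.links:
            if target in inbound:  # broken targets have nothing to count toward
                inbound[target] += 1
    return [name for name, note in vault.items() if inbound[name] == 0 and not note.links]


def stale_drafts(vault: dict, now: datetime, max_age_days: int = 30) -> list:
    """Return notes still at status: raw that were last modified too long ago."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        name
        for name, note in vault.items()
        if note.metadata.get("status") == "raw" and note.modified < cutoff
    ]
```

Passing now in as an argument, rather than calling datetime.now() inside the check, keeps the result reproducible when you test it.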

That’s a useful linter already, and it’s maybe sixty lines of Python. It runs in a second over a few thousand notes.

Step two: the part grep can’t do

Broken-link detection is mechanical. The interesting failure is the link that should exist and doesn’t. Two notes that are clearly about the same idea, written six months apart, that have no idea the other one exists. No string match will find those, because they don’t share words. They share meaning.

This is where a local embedding model earns its place. Run every note body through something like sentence-transformers to get a vector per note, then compute pairwise cosine similarity. Any pair that scores above a threshold but has no link between them is a suggested cross-link. The linter doesn’t create the link; it just surfaces the candidate: “these two notes are conceptually close and unconnected, did you mean to link them?”
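A rough sketch of the comparison step follows. It assumes you already have one vector per note, for example from the encode() method of a sentence-transformers model, and that links maps each note name to its outbound targets. The 0.8 threshold is a placeholder to tune against your own vault, not a recommended value, and the function names are illustrative.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_links(vectors: dict, links: dict, threshold: float = 0.8) -> list:
    """Return (note_a, note_b, score) for similar pairs with no link either way.

    vectors: note name -> embedding (assumed to be computed elsewhere)
    links:   note name -> outbound link targets
    """
    suggestions = []
    for a, b in combinations(sorted(vectors), 2):
        if b in links.get(a, ()) or a in links.get(b, ()):
            continue  # already connected, nothing to suggest
        score = cosine(vectors[a], vectors[b])
        if score > threshold:
            suggestions.append((a, b, score))
    # Strongest candidates first, so the report leads with the likeliest matches.
    return sorted(suggestions, key=lambda item: item[2], reverse=True)
```

This is plain Python to keep the sketch self-contained. Pairwise comparison grows quadratically with the number of notes, so for a few thousand notes a vectorized version with NumPy would be the more practical choice.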

Keep it local. The whole point of a vault is that it’s yours, and you don’t want to ship every private note to an API to find out two of them rhyme. A small embedding model on your own machine handles a personal vault without breaking a sweat, and the vectors never leave the laptop. If you’ve read how vector similarity drives retrieval, this is the same trick pointed at your own notes instead of a document corpus.

Run it like a linter, not a project

The mistake would be to build this as a grand one-time cleanup: run it once, fix everything, and never touch it again. The vault will just rot back. Treat it like the linter it is. Wire it into a weekly job, or a git pre-commit hook if your vault is a repo, and have it print a short report: three broken links, eight orphans, five suggested cross-links. You spend ten minutes acting on the list and the vault stays healthy on its own.
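One way to shape that report, as a sketch: a summary line, the details underneath, and an exit code a hook can act on. The layout and the function names are illustrative. Failing only on broken links is a design assumption here; orphans and suggestions are treated as advisory so they never block a commit.

```python
def format_report(broken, orphan_names, suggestions) -> str:
    """Build the short text report: one summary line, then one line per finding.

    broken:       list of (source, target) pairs
    orphan_names: list of note names
    suggestions:  list of (note_a, note_b, score) tuples
    """
    lines = [
        f"broken links: {len(broken)}  orphans: {len(orphan_names)}  "
        f"suggested cross-links: {len(suggestions)}"
    ]
    lines += [f"  broken   {source} -> [[{target}]]" for source, target in broken]
    lines += [f"  orphan   {name}" for name in orphan_names]
    lines += [f"  suggest  {a} <-> {b} ({score:.2f})" for a, b, score in suggestions]
    return "\n".join(lines)


def exit_code(broken) -> int:
    """Nonzero only when there are broken links, so a hook blocks on those alone."""
    return 1 if broken else 0
```

A git pre-commit hook aborts the commit when its script exits nonzero, so a small entry point that prints format_report(...) and then calls sys.exit(exit_code(broken)) is enough to wire this in. The weekly job can run the same script and ignore the exit code.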

A note vault is a codebase. It accumulates the same entropy, drifts the same way, and benefits from the same discipline we already apply to every other pile of text files we own. We just never thought to point the tools at it. Point them.

Sources

  • Andy Matuschak, “Evergreen notes should be densely linked”: the argument that a note’s value comes from its links, and that organizing by hierarchy fights against that.
  • Chris Amico, “python-frontmatter” (GitHub): library for splitting YAML frontmatter from a markdown body, used for the metadata-parsing pass.
  • “markdown-it-py” (GitHub): a CommonMark parser that exposes a token stream, used here to extract links while skipping code blocks.
  • “Sentence Transformers”: framework for generating local sentence and document embeddings, used to flag conceptually similar but unlinked notes.
