Pandoc vs MarkItDown: Two Tools, Two Eras

Pandoc has been the gold standard for document conversion for nearly two decades. But there’s a newer tool from Microsoft called MarkItDown, and while the two sound like they do similar things, they were built for completely different reasons.

Pandoc is a universal document converter designed for human publishing. It converts almost any format into almost any other format while preserving complex typography, citations, and formatting. MarkItDown is a specialized extraction tool designed for AI. It converts various files strictly into Markdown so that LLMs and RAG pipelines can read and process the text.

Same input files, very different goals.

Pandoc: The Universal Translator

Pandoc has been around since 2006, written in Haskell, and it operates on an Abstract Syntax Tree. It reads a document, builds a complex internal model of its structure, and then translates that structure into your desired output. We’re talking 40+ output formats here. PDF, Word, HTML, LaTeX, EPUB, you name it.

Where it really shines is academic and technical writing. It natively understands LaTeX math, footnotes, and citations with bibliographies, and filters such as pandoc-crossref add cross-referencing. You can turn a Word doc into Markdown, edit it, and use Pandoc to turn it back into a Word doc or render it as a polished PDF (PDF output goes through an external engine such as LaTeX). Round-trip conversion that actually works.
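To make that Word-to-Markdown-to-PDF workflow concrete, here is a minimal sketch that builds the two Pandoc command lines from Python. The helper names (`pandoc_cmd`, `word_to_markdown_cmd`, `markdown_to_pdf_cmd`) and the file names are made up for illustration. The flags are standard Pandoc options, but the PDF step assumes a LaTeX engine is installed and `--citeproc` assumes a reasonably recent Pandoc.

```python
import subprocess
from pathlib import Path


def pandoc_cmd(src, dst, to=None, extra=()):
    """Build a pandoc command line as a list of arguments.

    Pandoc infers formats from the file extensions unless `to` is given.
    """
    cmd = ["pandoc", str(src), "-o", str(dst)]
    if to:
        cmd += ["-t", to]
    cmd += list(extra)
    return cmd


def word_to_markdown_cmd(docx, md):
    # --extract-media writes embedded images to a folder next to the Markdown file
    media_dir = Path(md).parent / "media"
    return pandoc_cmd(docx, md, to="markdown", extra=[f"--extract-media={media_dir}"])


def markdown_to_pdf_cmd(md, pdf, bibliography=None):
    extra = []
    if bibliography:
        # --citeproc resolves citations against the given bibliography file
        extra += ["--citeproc", f"--bibliography={bibliography}"]
    return pandoc_cmd(md, pdf, extra=extra)


def run(cmd):
    # Raises CalledProcessError if pandoc exits with a non-zero status
    subprocess.run(cmd, check=True)
```

With Pandoc on your PATH, `run(word_to_markdown_cmd("report.docx", "draft.md"))` followed by `run(markdown_to_pdf_cmd("draft.md", "report.pdf", bibliography="refs.bib"))` would cover the full trip. Building the argument list separately from running it keeps the commands easy to inspect or log.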

You can also write custom filters in Lua or Python to programmatically alter documents during conversion. Want to automatically downgrade all your H2s to H3s? Pandoc has you covered.
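The H2-to-H3 example can be sketched as a Python filter using only the standard library. Pandoc hands a JSON filter the document's AST on stdin and reads the modified AST back from stdout. The function name `demote_h2` is invented here, and the sketch assumes the usual JSON shape where a header looks like `{"t": "Header", "c": [level, attr, inlines]}`. Libraries such as panflute or pandocfilters wrap the same idea more conveniently.

```python
import json
import sys


def demote_h2(node):
    """Recursively walk a pandoc JSON AST and turn level-2 headers into level-3."""
    if isinstance(node, list):
        for child in node:
            demote_h2(child)
    elif isinstance(node, dict):
        # A Header node stores its level as the first element of "c"
        if node.get("t") == "Header" and node["c"][0] == 2:
            node["c"][0] = 3
        for child in node.values():
            demote_h2(child)
    return node


def main():
    raw = sys.stdin.read()
    if raw.strip():  # do nothing when no document is piped in
        json.dump(demote_h2(json.loads(raw)), sys.stdout)


if __name__ == "__main__":
    main()
```

Saved as, say, `demote_h2.py`, it would be invoked with something like `pandoc input.md --filter demote_h2.py -o output.md`. If all you need is to shift every heading by the same amount, Pandoc's built-in `--shift-heading-level-by` option is simpler. A filter earns its keep when the rule is selective, as it is here.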

MarkItDown: The LLM Feeder

MarkItDown was released by Microsoft in late 2024 to solve a very modern problem. LLMs need clean, structured text to “read” documents, but corporate data is locked inside messy formats like multi-tab Excel spreadsheets, image-heavy PowerPoints, and ZIP archives.

It’s a Python library first, CLI second. It drops into your scripts in a few lines of code, which makes it easy to wire up with LangChain, LlamaIndex, or raw API calls. The output is always Markdown. That’s it. No PDF generation, no Word docs, no EPUB. Just clean text that an AI can process.
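Here is roughly what "a few lines of code" looks like, as a hedged sketch. `MarkItDown().convert(path)` and the result's `text_content` attribute come from the library. The `ingest` and `default_converter` helpers are made up for this post, and the converter is passed in so the function also works with a stand-in object that has a `convert` method.

```python
def default_converter():
    # Imported lazily so the helper below can run without the package installed.
    # Assumes `pip install markitdown`; some formats may need optional extras.
    from markitdown import MarkItDown
    return MarkItDown()


def ingest(paths, converter=None):
    """Convert each file to Markdown and return a {path: markdown_text} mapping."""
    converter = converter or default_converter()
    return {str(p): converter.convert(str(p)).text_content for p in paths}
```

Calling `ingest(["budget.xlsx", "deck.pptx"])` (hypothetical file names) would give you a dictionary of Markdown strings ready to chunk and embed, or to paste straight into a prompt.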

The interesting trick is what it does with images and audio. Feed it a picture or a slide deck full of diagrams, and MarkItDown can be configured to call an LLM like GPT-4o to look at each image and write a Markdown description of what it sees. It can also transcribe audio files. That’s a fundamentally different approach from Pandoc, which preserves images as files rather than describing them.

Quick Comparison

| Feature | Pandoc | MarkItDown |
|---|---|---|
| Primary Goal | Universal document conversion | Document ingestion for AI |
| Output Formats | 40+ (PDF, Word, HTML, LaTeX, etc.) | Only Markdown |
| Language | Haskell (standalone CLI) | Python (library-first) |
| Image Handling | Preserves and extracts image files | Uses OCR/LLM Vision to describe images as text |
| Complex Formatting | Citations, bibliographies, LaTeX math, custom filters | Basic structural support (headings, tables, slides) |

So Which One Do You Want?

Pandoc if you’re writing a book, research paper, or blog and need polished output in multiple formats. If you need to maintain citations, complex formatting, or convert files out of Markdown into something else, Pandoc is your tool.

MarkItDown if you’re building an AI agent, chatbot, or search tool and need to extract text from a pile of PDFs, Excel files, and PowerPoints. If you only care about getting raw structured text and don’t care about the visual layout of the original document, MarkItDown is purpose-built for that.

They’re not competitors. Pandoc is for publishing. MarkItDown is for feeding AI. Pick the one that matches what you’re actually trying to do.
