What Is a Runbook and Why Should You Care?
If you’ve ever been woken up at 3 AM by a pager and stared at your screen trying to remember how the database failover works, you already know why runbooks matter. You just might not have had one yet.
A runbook is a step-by-step guide for handling a specific operational scenario. Database goes down? There’s a runbook for that. Failed deployment needs a rollback? Runbook. Routine certificate rotation? You get the idea. They range from simple markdown files to fully automated scripts where a human only needs to click “approve.”
That’s the idea, anyway. The difference between having good runbooks and having none can be massive.
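To make the "human only needs to click approve" end of that range concrete, here is a minimal sketch of an approval-gated runbook in Python. Everything in it is hypothetical: the `Step` class, the `run_runbook` function, and the failover steps are illustrations, and the lambdas stand in for whatever real commands your runbook would execute.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], str]      # performs the step and returns a short result
    needs_approval: bool = False   # pause for a human before running this step


def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order and stop at the first step a human declines."""
    log = []
    for number, step in enumerate(steps, start=1):
        if step.needs_approval and not approve(step):
            log.append(f"{number}. {step.description}: stopped, not approved")
            break
        log.append(f"{number}. {step.description}: {step.action()}")
    return log


# Hypothetical failover runbook; each lambda is a stand-in for a real command.
failover = [
    Step("Check replica lag", lambda: "lag is 0s"),
    Step("Promote replica to primary", lambda: "promoted", needs_approval=True),
    Step("Point application at new primary", lambda: "config updated"),
]

# Auto-approve for the demo; a real tool would prompt the on-call engineer.
for line in run_runbook(failover, approve=lambda step: True):
    print(line)
```

The risky step is the only one that waits for a person. The read-only check before it runs on its own, which is usually where you want the automation to start.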
Why They Matter
When something breaks in production, your brain is not at its best. Adrenaline kicks in, Slack is blowing up, and suddenly you can’t remember if you’re supposed to restart the service first or check the connection pool. A runbook takes the thinking out of the equation. You follow the steps. You restore the service. You go back to sleep.
This directly lowers your Mean Time To Recovery (MTTR). Instead of spending twenty minutes in a group call debating what to try next, you open the runbook and start executing.
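If you want to see whether your runbooks are actually moving that number, MTTR is just the average time from detection to recovery. Here is a rough sketch, assuming you can export a detected and a resolved timestamp for each incident. The function name and the three incidents below are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 3, 0), datetime(2024, 1, 5, 3, 40)),      # 40 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 20)),  # 20 minutes
    (datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 2, 23, 0)),     # 60 minutes
]

print(mean_time_to_recovery(incidents))  # 0:40:00
```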
Runbooks also solve the consistency problem. If five different engineers respond to the same alert five different ways, you’re rolling the dice every time. One of those approaches might cause a secondary outage. A runbook ensures everyone follows the same diagnostic and remediation path, which means fewer surprises.
And then there’s the tribal knowledge issue. Every team has that one senior engineer who knows exactly how to fix the weird thing that happens once a quarter. What happens when they’re on vacation? Or they leave the company? A runbook gets that knowledge out of their head and into a document the whole team can use.
It also makes onboarding way faster. New engineers can start handling on-call rotations with confidence instead of hoping nothing breaks on their watch.
Treat Them Like Code
This is the part a lot of teams get wrong. Runbooks shouldn’t live in a random Confluence page that hasn’t been updated since 2023. They should live in version control. Some teams keep them in the same repo as the code. Others keep them in a separate one. Where they live is up to your team.
If a developer changes how a service authenticates or connects to a database, the associated runbook needs to be updated in the same pull request. An outdated runbook is worse than no runbook at all. It sends engineers down the wrong path during an outage, which burns time and trust.
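One way to nudge people toward that habit is a CI check that flags a pull request when a service's code changed but its runbook didn't. This is a sketch under assumptions, not a drop-in tool: the `RUNBOOKS` mapping, the directory layout, and the `stale_runbooks` function are all hypothetical, and the list of changed files would come from something like `git diff --name-only` against your main branch.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from a service's source directory to its runbook file.
RUNBOOKS: Dict[str, str] = {
    "services/payments/": "runbooks/payments.md",
    "services/auth/": "runbooks/auth.md",
}


def stale_runbooks(
    changed_files: Iterable[str],
    runbooks: Dict[str, str] = RUNBOOKS,
) -> List[str]:
    """Return runbooks whose service code changed without the runbook changing."""
    changed = set(changed_files)
    missing = []
    for source_dir, runbook in runbooks.items():
        touched = any(path.startswith(source_dir) for path in changed)
        if touched and runbook not in changed:
            missing.append(runbook)
    return sorted(missing)


# Example pull request: auth code changed, but runbooks/auth.md did not.
pr_files = ["services/auth/login.py", "README.md"]
print(stale_runbooks(pr_files))  # ['runbooks/auth.md']
```

A check this coarse fires on every change under the service directory, including ones that don't affect operations. You may want it to post a reminder rather than fail the build, or point it only at the files that deal with authentication and connections.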
Share Early, Share Often
A runbook sitting in someone’s private folder is doing exactly nothing for your team.
Start during the draft phase. Have someone who didn’t write the runbook try to follow it. If they get confused or stuck, the runbook needs work. This is the cheapest way to find gaps.
When a new service is heading to production, the runbook should be part of the readiness review. I’d argue a service shouldn’t go live without one. And after an incident, if the runbook was wrong or didn’t exist, creating or fixing it should be a mandatory action item from the post-mortem.
One more thing. Practice them. Run game days where the team actually walks through runbooks before a real emergency happens. The worst time to discover your runbook has a missing step is when production is on fire.
So Here We Are
Runbooks aren’t glamorous. Nobody’s giving a conference talk about the beautiful runbook they wrote last quarter. But they’re the difference between a calm, methodical incident response and a panicked Slack thread full of guesses. Write them, version them, share them, and practice them. Your future self will thank you.