The Real Cost of Your Data Lake (It's Not the Storage)
If you’re sketching out a data platform on a whiteboard right now, I want you to do something. Stop calculating storage costs. They’re not the bill.
I pulled the public pricing for AWS, Azure, GCP, Databricks, and Snowflake and stacked them next to each other. Storage is the cheap part. The expensive part is everything that moves the data, and that’s the part you’re least likely to model correctly when you’re picking a vendor.
Let me walk through what actually shows up on the invoice.
Raw Object Storage Is Basically Free
For hot, frequently accessed data, the big three are within a rounding error of each other:
- Azure Blob (LRS, Hot): $0.018 per GB/month
- Google Cloud Standard: $0.020 per GB/month
- AWS S3 Standard: $0.023 per GB/month (first 50 TB)
Drop into the cool tiers and AWS S3 takes the lead at $0.0125 per GB. Drop into deep archive and you’re paying $0.00099 per GB on either AWS Glacier Deep Archive or Azure Archive. That’s a tenth of a cent per gigabyte, per month, for data you almost never touch.
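To put that spread in dollars, here’s a quick back-of-the-envelope in Python. The per-GB rates are the ones above; the 100 TB footprint is just an assumption to make the gap visible.

```python
# Rough monthly bill for a 100 TB footprint at the per-GB rates above.
# The footprint is an assumption; the point is the spread between tiers.
GB_PER_TB = 1024

rates = {
    "S3 Standard (hot)": 0.023,        # first 50 TB price, close enough here
    "S3 Standard-IA (cool)": 0.0125,
    "Glacier Deep Archive": 0.00099,
}

footprint_gb = 100 * GB_PER_TB
for tier, per_gb in rates.items():
    print(f"{tier}: ${footprint_gb * per_gb:,.2f}/month")

# Roughly $2,355 vs $1,280 vs $101 a month for the same 100 TB.
```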
Good for you, but I think anyone leading with “per-GB storage cost” in a procurement deck is selling you a story. Storage capacity is roughly five percent of a typical Databricks bill. Five. The other 95% is the part nobody wants to talk about.
The Egress Trap
Ingress is free. Always. The cloud providers want your data in.
Getting it back out is where they collect.
- Azure Blob: $0.087/GB external egress
- AWS S3: $0.090/GB
- Google Cloud: $0.120/GB (but free if you stay inside Google’s ecosystem, which is the whole point of that pricing)
Then layer on API operations. A million GET requests on S3 costs about $0.40. The same million GETs on Google Cloud Storage can run closer to $5.00 because they classify operations differently. If your analytics workload is hammering small files, those API calls add up faster than the storage they’re reading.
Storing 10 TB? Maybe $200 a month. Storing 500 TB? You’re at $10,000 a month before a single byte leaves the region or a single query fires.
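If you want to see how the movement costs stack against that, here’s a rough sketch. The egress and request prices are the ones above; the workload shape (5 TB pulled out per month, 200 million GETs) is assumed, and Azure reuses the S3 request rate as a stand-in because I didn’t quote one for it.

```python
# What the movement costs look like next to the storage line.
# Egress and request prices are the ones quoted above; the workload shape
# (5 TB out to the internet per month, 200M GET-style reads) is assumed,
# and Azure falls back to the S3 request rate as a stand-in.
egress_per_gb = {"azure": 0.087, "aws": 0.090, "gcp": 0.120}
gets_per_million = {"aws": 0.40, "gcp": 5.00}

egress_gb = 5 * 1024
gets_millions = 200

for cloud, rate in egress_per_gb.items():
    egress_cost = egress_gb * rate
    request_cost = gets_millions * gets_per_million.get(cloud, 0.40)
    print(f"{cloud}: egress ${egress_cost:,.0f} + requests ${request_cost:,.0f}")

# On this pattern, GCP's request charges alone outweigh its egress premium.
```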
Databricks: Two Bills, One Headache
Databricks uses what’s commonly called a Two-Bill Model. You get one invoice from your cloud provider for the actual VMs and storage, and a separate invoice from Databricks for the software, measured in DBUs (Databricks Units).
In a typical mid-sized deployment around $18,000/month, the breakdown looks like this:
- VM compute from the cloud provider: ~55%
- Databricks DBU fees: ~30%
- Storage: ~5%
- Network egress: ~5%
The DBU rate changes based on what you’re doing. Automated jobs start at $0.15/DBU. Interactive notebooks for analysts start at $0.40/DBU. That’s not an accident. Databricks wants you running production workloads on cheap job clusters, not on the expensive all-purpose clusters your data scientists love to leave running over a weekend.
If you’re not actively pushing teams toward job clusters and ARM-based instances, you’re leaving real money on the table.
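Here’s what that rate spread does to one recurring pipeline. The DBU rates are the list prices above; the 20-DBU-per-hour cluster and four-hour nightly run are assumptions.

```python
# The same nightly pipeline, costed on a job cluster vs. an all-purpose
# cluster someone left attached to a notebook. DBU rates are the list
# prices above; cluster size and runtime are assumptions.
JOB_RATE = 0.15          # $/DBU, jobs compute
ALL_PURPOSE_RATE = 0.40  # $/DBU, interactive / all-purpose compute

dbu_per_hour = 20        # assumed cluster consumption
hours_per_run = 4
runs_per_month = 30

dbus = dbu_per_hour * hours_per_run * runs_per_month
print(f"job cluster:         ${dbus * JOB_RATE:,.0f}/month in DBU fees")
print(f"all-purpose cluster: ${dbus * ALL_PURPOSE_RATE:,.0f}/month in DBU fees")

# Same work, same VM bill from the cloud provider; the Databricks line
# alone is roughly 2.7x higher on the interactive cluster.
```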
Snowflake: The Hidden Storage Multiplier
Snowflake’s pricing pitch sounds clean. Pass-through storage at $40/TB/month on-demand, dropping to $23/TB/month with a capacity commitment. Compute as Credits. Done.
Except it isn’t done. Snowflake stores data in immutable micro-partitions, roughly 16 MB apiece compressed. Immutable. You can’t change them in place. Update a single row in a 1 TB table and Snowflake rewrites the whole micro-partition that row lives in and keeps the old version around.
Why keep the old one? Two features:
- Time Travel: query historical states of your data for up to 90 days
- Fail-Safe: a 7-day disaster recovery window you cannot turn off
This is the part that gets people. A 1 TB table that’s getting updated multiple times a day can balloon to 25 TB of billed storage because Snowflake is retaining every prior version of every micro-partition you’ve touched. Your dashboard says “1 TB table.” Your invoice says otherwise.
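A crude steady-state model shows how the multiplier builds. The churn fraction and update frequency below are assumptions, not measurements; the mechanism is that every superseded micro-partition stays billable for the Time Travel window plus Fail-Safe.

```python
# Crude steady-state model of the storage multiplier: every update rewrites
# some fraction of the table's micro-partitions, and each superseded version
# stays billable for the Time Travel window plus the 7-day Fail-safe.
# Churn fraction and update frequency are assumptions, not measurements.
table_tb = 1.0
churn_fraction = 0.08      # share of partitions rewritten per update (assumed)
updates_per_day = 3        # assumed
retention_days = 90 + 7    # Time Travel + Fail-safe

historical_tb = table_tb * churn_fraction * updates_per_day * retention_days
print(f"logical table: {table_tb:.0f} TB, billed storage: ~{table_tb + historical_tb:.0f} TB")

# logical table: 1 TB, billed storage: ~24 TB
```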
And compute? Virtual Warehouses bill per second, but with a 60-second minimum every single time you resume or resize. Aggressive auto-suspend sounds like a pure cost optimization. It isn’t always. If you’re spinning a warehouse up and down every 30 seconds, you’re paying the 60-second minimum every time and quietly multiplying your bill.
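Here’s what those minimums cost on their own. The credit price is an assumed on-demand rate; the one-credit-per-hour figure is for an X-Small warehouse.

```python
# What the 60-second minimum costs when a warehouse is woken constantly
# by short queries. The credit price is an assumed on-demand rate; an
# X-Small warehouse burns 1 credit per hour of runtime.
CREDITS_PER_SECOND = 1 / 3600
CREDIT_PRICE = 3.00            # $/credit, assumed

resumes_per_hour = 60          # warehouse woken once a minute
query_seconds = 5              # each query actually runs for 5 seconds

billed_seconds = resumes_per_hour * max(query_seconds, 60)  # 60s minimum per resume
actual_seconds = resumes_per_hour * query_seconds

billed = billed_seconds * CREDITS_PER_SECOND * CREDIT_PRICE
actual = actual_seconds * CREDITS_PER_SECOND * CREDIT_PRICE
print(f"billed ${billed:.2f}/hour for ${actual:.2f}/hour of actual query time")

# billed $3.00/hour for $0.25/hour of actual query time: a 12x gap
# from the minimums alone.
```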
What I’d Actually Do
A few things I’d put on the wall before signing anything:
- Model egress, not storage. Run your worst-case query pattern through the calculator. Storage is noise.
- Lifecycle everything. Cool tier and archive pricing are 10x to 100x cheaper. If your data is older than 90 days and nobody’s queried it, it shouldn’t be in hot storage (there’s a small lifecycle sketch after this list).
- For Databricks: push every recurring workload to job compute. Audit interactive cluster usage monthly.
- For Snowflake: if you have high-frequency update patterns, profile your actual storage footprint, not your logical table size. The gap will surprise you.
- For multi-cloud: don’t. Egress will eat the savings before you finish the architecture diagram.
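On the lifecycle point: here’s a minimal boto3 sketch of an S3 rule that tiers objects down after 30 and 90 days. Bucket name, prefix, and thresholds are placeholders; Azure Blob and GCS have equivalent lifecycle policies.

```python
# Minimal lifecycle policy sketch: move objects to an infrequent-access
# tier after 30 days and to deep archive after 90, so cold data stops
# paying hot-tier rates. Bucket name, prefix, and thresholds are
# placeholder assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",           # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},   # placeholder prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```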
The vendors all have a story about why their model is the cheap one. Read past the per-GB number on the slide. The bill is somewhere else.
Happy modeling.
Sources
- Databricks Pricing Explained (Dawiso) — Two-Bill Model, DBU breakdown
- Snowflake Pricing Explained (SELECT.dev) — Time Travel storage multiplier, micro-partition behavior
- Cloud & AI Storage Pricing Comparison 2026 (Finout) — AWS / Azure / GCP per-GB and tier pricing
- S3 vs GCS vs Azure Blob Storage (ai-infra-link) — Egress and API operation pricing
- Snowflake Pricing in 2026 (CloudZero) — Virtual Warehouse 60-second minimum behavior