Data

Your Data Lake Has a Permissions Problem
Consolidating every business unit’s data into one giant lakehouse sounds like a win until you realize the security model from your old data warehouse can’t scale to it. You took ten silos, each with their own access rules, and merged them into one location. Now everyone wants in, and your security team is the bottleneck.

Let me walk through three places where the cracks usually show up.

RBAC Falls Over Faster Than You Think

Role-Based Access Control is the model most teams start with. Permissions are tied to a job function. Sales reps get read access to sales tables, data engineers get write access to staging, and so on. It works fine when you have ten roles.

It does not work when you have a thousand.

Say your sales reps should only see accounts in their territory, and only accounts they personally manage. Under pure RBAC, you need a unique role for every territory-by-account-owner combination. That’s role explosion, and it’s how compliance audits become impossible and legitimate access slows to a crawl. The roles list grows faster than anyone can review it, which means stale permissions sit there forever.

The answer is Attribute-Based Access Control. Instead of asking “what role is this user in,” the system asks “what attributes does this user have, what attributes does this data have, and what’s the policy at this exact moment.” Tag a column as PII. Tag a schema as HR. Write one policy that says anyone outside the HR compliance group sees masked data when they touch a PII column. Done. That single policy replaces hundreds of bespoke roles.

This is what Unity Catalog and Starburst Galaxy are built around, and it’s the model that will scale with the data.

Column and Row Security Should Be Boring

Once you have ABAC and a real metadata catalog, column-level masking and row-level filtering become a non-event. You write a SQL expression that masks the first five digits of an SSN for lower-privileged roles. You write a row filter that silently appends WHERE region = 'user_region' to every executive’s SELECT *.

The key word is silently. The user doesn’t see a different table. They don’t have a sanitized copy. The policy is enforced at the catalog layer, so it works the same whether they’re querying through Spark, Trino, a BI dashboard, or a pipeline. One source of truth, one policy, every engine.

If you’re still maintaining separate “sanitized” copies of tables for different audiences, you’re doing it the 2015 way and you’re going to drift.

The IAM Default Problem

Most cloud services ship with default IAM roles, and a surprising number of those defaults attach AmazonS3FullAccess or something equally permissive.

SageMaker does it. The Ray autoscaler role does it. There are more.

Picture the failure mode. An attacker compromises some peripheral app, maybe a forgotten Jupyter notebook, maybe a misconfigured Lambda. That workload has an IAM role attached because that’s how cloud workloads talk to S3 without hardcoded credentials. The attacker inherits the role. And because the role has full S3 access, they’re not constrained to the bucket the application actually uses. They can enumerate every bucket in the entire account.

That’s how a single compromised container becomes a full data lake breach. Researchers call it a bucket monopoly attack. I call it the most predictable incident in the industry.

The fix is not glamorous. Stop using s3:* in any policy. Write resource-scoped policies that name the exact buckets and prefixes a workload needs. Audit the default roles every cloud service hands you and replace them. Use Security Lake or Detective to flag cross-service API calls that don’t match normal patterns. None of this is fun. All of it is necessary.

And Then There’s the Agent Problem

The new wrinkle is that humans are no longer the primary consumers of your data. Autonomous agents are. They issue more queries, hit more tables, and move faster than any human team.

Long-lived credentials and static roles don’t fit that workload. The pattern emerging is Just-In-Time entitlements, where an agent gets a narrow, ephemeral permission for the duration of a single execution thread, then loses it. Pair that with declarative policy metadata baked into the data assets themselves, so the agent knows what it’s allowed to do with a dataset before it ever runs the query.

We’re early on this. Most organizations are still working through the basics, and that’s fine. But if you’re designing access controls today, design them assuming the next thing hitting your lake isn’t a person.

What to Actually Do

If you’re auditing your own data lake security, the order I’d work in:
1. Find every IAM role with a wildcard permission. Replace them.
2. Move from RBAC to ABAC at the catalog layer. Stop creating new roles.
3. Pull your data lake off the public internet. PrivateLink, private endpoints, IP allowlists for the legacy stuff that can’t move.
4. Then start thinking about agents.
The lakehouse pitch is unification. The lakehouse reality is that unification multiplies the cost of every bad permission. Get the basics right before you bolt on anything fancy.

Sources
- AWS Default IAM Roles Found to Enable Lateral Movement (The Hacker News) — SageMaker / Ray autoscaler default roles, bucket monopoly attacks
- What Is Fine-Grained Data Access Control? (TrustLogix) — RBAC role explosion, ABAC fundamentals
- Core concepts for ABAC (Databricks Unity Catalog docs) — Tag-driven policy enforcement
- Top 12 Data Governance Predictions for 2026 (Hyperight) — Just-in-time entitlements, declarative policy metadata
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
May 5, 2026 / DevOps / security / Cloud / Data

Enjoyed this?
The Real Cost of Your Data Lake (It's Not the Storage)
If you’re sketching out a data platform on a whiteboard right now, I want you to do something. Stop calculating storage costs. They’re not the bill.

I pulled the public pricing for AWS, Azure, GCP, Databricks, and Snowflake and stacked them next to each other. Storage is the cheap part. The expensive part is everything that moves the data, and the expensive part is the part you’re least likely to model correctly when you’re picking a vendor.

Let me walk through what actually shows up on the invoice.

Raw Object Storage Is Basically Free

For hot, frequently accessed data, the big three are within a rounding error of each other:
- Azure Blob (LRS, Hot): $0.018 per GB/month
- Google Cloud Standard: $0.020 per GB/month
- AWS S3 Standard: $0.023 per GB/month (first 50 TB)
Drop into the cool tiers and AWS S3 takes the lead at $0.0125 per GB. Drop into deep archive and you’re paying $0.00099 per GB on either AWS Glacier Deep Archive or Azure Archive. That’s a tenth of a cent per gigabyte, per month, for data you almost never touch.

Good for you, but I think anyone leading with “per-GB storage cost” in a procurement deck is selling you a story. Storage capacity is roughly five percent of a typical Databricks bill. Five. The other 95% is the part nobody wants to talk about.

The Egress Trap

Ingress is free. Always. The cloud providers want your data in.

Getting it back out is where they collect.
- Azure Blob: $0.087/GB external egress
- AWS S3: $0.090/GB
- Google Cloud: $0.120/GB (but free if you stay inside Google’s ecosystem, which is the whole point of that pricing)
Then layer on API operations. A million GET requests on S3 costs about $0.40. The same million GETs on Google Cloud Storage can run closer to $5.00 because they classify operations differently. If your analytics workload is hammering small files, those API calls add up faster than the storage they’re reading.

Storing 10 TB? Maybe $200 a month. Storing 500 TB? You’re at $10,000 a month before a single byte leaves the region or a single query fires.

Databricks: Two Bills, One Headache

Databricks uses what’s commonly called a Two-Bill Model. You get one invoice from your cloud provider for the actual VMs and storage, and a separate invoice from Databricks for the software, measured in DBUs (Databricks Units).

In a typical mid-sized deployment around $18,000/month, the breakdown looks like this:
- VM compute from the cloud provider: ~55%
- Databricks DBU fees: ~30%
- Storage: ~5%
- Network egress: ~5%
The DBU rate changes based on what you’re doing. Automated jobs start at $0.15/DBU. Interactive notebooks for analysts start at $0.40/DBU. That’s not an accident. Databricks wants you running production workloads on cheap job clusters, not on the expensive all-purpose clusters your data scientists love to leave running over a weekend.

If you’re not actively pushing teams toward job clusters and ARM-based instances, you’re leaving real money on the table.

Snowflake: The Hidden Storage Multiplier

Snowflake’s pricing pitch sounds clean. Pass-through storage at $40/TB/month on-demand, dropping to $23/TB/month with a capacity commitment. Compute as Credits. Done.

Except it isn’t done. Snowflake stores data in immutable 16MB micro-partitions. Immutable. You can’t change them in place. Update a single row in a 1 TB table and Snowflake writes a new file and keeps the old one around.

Why keep the old one? Two features:
- Time Travel: query historical states of your data for up to 90 days
- Fail-Safe: a 7-day disaster recovery window you cannot turn off
This is the part that gets people. A 1 TB table that’s getting updated multiple times a day can balloon to 25 TB of billed storage because Snowflake is retaining every prior version of every micro-partition you’ve touched. Your dashboard says “1 TB table.” Your invoice says otherwise.

And compute? Virtual Warehouses bill per second, but with a 60-second minimum every single time you resume or resize. Aggressive auto-suspend sounds like a cost optimization. It’s not. If you’re spinning a warehouse up and down every 30 seconds, you’re paying the 60-second minimum every time and quietly multiplying your bill.

What I’d Actually Do

A few things I’d put on the wall before signing anything:
- Model egress, not storage. Run your worst-case query pattern through the calculator. Storage is noise.
- Lifecycle everything. Cool tier and archive pricing are 10x to 100x cheaper. If your data is older than 90 days and nobody’s queried it, it shouldn’t be in hot storage.
- For Databricks: push every recurring workload to job compute. Audit interactive cluster usage monthly.
- For Snowflake: if you have high-frequency update patterns, profile your actual storage footprint, not your logical table size. The gap will surprise you.
- For multi-cloud: don’t. Egress will eat the savings before you finish the architecture diagram.
The vendors all have a story about why their model is the cheap one. Read past the per-GB number on the slide. The bill is somewhere else.

Happy modeling.

Sources
- Databricks Pricing Explained (Dawiso) — Two-Bill Model, DBU breakdown
- Snowflake Pricing Explained (SELECT.dev) — Time Travel storage multiplier, micro-partition behavior
- Cloud & AI Storage Pricing Comparison 2026 (Finout) — AWS / Azure / GCP per-GB and tier pricing
- S3 vs GCS vs Azure Blob Storage (ai-infra-link) — Egress and API operation pricing
- Snowflake Pricing in 2026 (CloudZero) — Virtual Warehouse 60-second minimum behavior
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
May 4, 2026 / DevOps / Cloud / Data / Snowflake / Databricks

Enjoyed this?
The Data Lakehouse Won. Now Pick a Table Format.
If you’ve been ignoring the data infrastructure conversation for the last few years, here’s where we landed in 2026: the data lakehouse won. The data warehouse vendors will fight about it for another decade, but the architectural argument is over.

Let me back up.

The Quick History

At the bottom of every modern data stack is a cloud storage bucket. S3, Azure Blob, GCS. Pick your hyperscaler. A bucket is dumb on purpose. It stores files cheaply and durably and doesn’t care what’s in them. No schemas, no transactions, no relational anything. Just objects.

When you dump raw logs, IoT telemetry, and CSV exports into a bucket without any organizing layer, congratulations, you have a data lake. Cheap, flexible, and almost completely useless for analytics until someone builds a pipeline to make sense of it.

The traditional answer to that mess was a data warehouse. Snowflake, Redshift, BigQuery, the whole gang. You force your data through ETL, conform it to a strict schema, and pay a premium to keep it sitting in the vendor’s proprietary storage format. You get fast SQL, ACID transactions, and a vendor lock-in problem so severe that exporting your data becomes a major friction point.

The lakehouse is what happens when someone finally says: what if we kept the cheap object storage, but added the warehouse features as a layer on top?

What a Lakehouse Actually Is

The trick is decoupling. Storage stays in your bucket. Compute is whatever engine you point at it. Metadata lives in an open table format that turns a pile of Parquet files into something that behaves like a real database table.

One copy of the data. Multiple engines can query it. Schema evolution, time travel, ACID transactions, all without copying everything into a proprietary system. From what I’ve read, teams that move from a pure warehouse to a lakehouse are able to cut storage costs noticeably in the process, and they stop fighting their ML team about getting access to the same data.

That’s the pitch, and it’s a good one. The hard part is picking your table format.

The Four Formats Worth Knowing

Apache Iceberg

Iceberg is the one to bet on if you care about not getting locked in. It came out of Netflix and even Snowflake and Databricks have been forced to support it. The metadata is hierarchical, which sounds boring but matters: it lets query engines skip enormous chunks of irrelevant data without listing directories one by one. Iceberg also handles partition evolution gracefully, so you can change your partitioning strategy without rewriting petabytes of history.

If I’m starting a new lakehouse in 2026 and I don’t have a strong reason to pick something else, it’s Iceberg.

Delta Lake

Delta is what Databricks ships and what everyone using Spark already knows. It uses an append-only transaction log in a _delta_log directory, and it’s beautifully integrated with the Databricks platform. Z-Ordering, native Spark performance, the whole ecosystem.

If your team lives inside Databricks, Delta is the obvious answer. If you don’t, the calculus is harder, because Delta’s openness has improved a lot but it still feels most at home in the Databricks world.

Apache Hudi

Hudi came out of Uber and it was built for one thing: high-frequency upserts. If your problem is Change Data Capture, streaming ingestion, or constant record-level updates, Hudi is probably your answer. It gives you two storage modes. Copy-on-Write rewrites files on update so reads stay fast. Merge-on-Read writes deltas and reconciles at query time, which is what you want when writes are heavy and reads are flexible.

Hudi is the right pick when your pipeline is full of UPSERT and you can’t afford to rewrite large files every time something changes.

Apache Paimon

Paimon is the newest of the four and it’s worth keeping an eye on. It came from the Flink world and uses an LSM-tree style organization, which is what databases like RocksDB use under the hood. The whole point is unifying batch and streaming in a single format. If you’re doing real-time event-driven work and don’t want to maintain a separate streaming and batch stack, Paimon is interesting.

It’s not the safe choice yet, but it’s the one I’d watch most closely over the next two years.

So Which One?

Honestly, the answer depends less on the format and more on which ecosystem you’re already in.
- Mostly Spark and Databricks? Delta.
- Streaming-heavy with constant upserts? Hudi.
- Real-time event-driven and willing to bet on newer tech? Paimon.
- Anything else, or you want to keep your options open? Iceberg.
The format wars have mostly converged. Most major engines support multiple formats now, and the gap between them on raw query performance has shrunk. The choice is more about operational fit than performance ceilings.

The lakehouse pattern itself is the real story. The format is just plumbing.

I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
May 3, 2026 / Data / Infrastructure / Lakehouse / Iceberg

Enjoyed this?
What Would Minimum Wage Be If It Kept Up With Housing?
I pulled the data and verified the math. The answer is… not great.

The Numbers We’re Working With

In 1950, the federal minimum wage was $0.75 per hour. A median owner-occupied single-family home cost $7,354.

In 2026, the federal minimum wage is $7.25 per hour — unchanged since 2009. The median U.S. family home price is $429,129.

These numbers come from multiple independent sources. Let’s see what happens when we put them side by side.

The Home-Labor Index

I’m using a simple metric here: how many hours of minimum-wage work does it take to buy a median home? No mortgages, no interest rates, no down payments; just raw labor hours versus home price.

1950:
- $7,354 ÷ $0.75/hr = 9,805 hours
- At 40 hrs/wk × 52 wks = 2,080 hrs/yr
- That’s 4.71 years of full-time minimum-wage work
2026:
- $429,129 ÷ $7.25/hr = 59,191 hours
- Same 2,080 hrs/yr
- That’s 28.46 years of full-time minimum-wage work
In 1950, a minimum-wage worker needed under 5 years of gross income to cover a median home. In 2026, that same worker needs over 28 years. The ratio has gotten roughly six times worse.

So What Should Minimum Wage Be?

If we wanted to preserve the same home-purchasing power that a minimum-wage worker had in 1950, we can work backwards:
- 2026 Median Home Price ÷ 1950 Home-Labor Index
- $429,129 ÷ 9,805 hours = $43.77 per hour
We can verify this another way. The 1950 ratio was 4.714 years of income to buy a home. To maintain that ratio in 2026:
- $429,129 ÷ 4.714 = $91,029/yr required income
- $91,029 ÷ 2,080 hours = $43.76/hr
Both methods land in the same place. To have the same relationship between minimum wage and housing that existed in 1950, the federal minimum wage would need to be roughly $43.77 per hour

You don’t need a PhD to look at these numbers and see the wage gap disparity. The gap between wages at the bottom and the cost of the most basic economic asset, a home, has grown dramatically. That the gap exists isn’t debatable.

The federal minimum wage has been $7.25 since 2009. That’s 17 years without an increase. Meanwhile, median home prices have roughly doubled in that same period.

If we cared about the citizens, we’d need to 6x the minmum wage while also working at more affordable housing for the middle class.
Apr 13, 2026 / Economics / Housing / Minimum wage / Data

Enjoyed this?

Data

Your Data Lake Has a Permissions Problem

RBAC Falls Over Faster Than You Think

Column and Row Security Should Be Boring

The IAM Default Problem

And Then There’s the Agent Problem

What to Actually Do

Sources

The Real Cost of Your Data Lake (It's Not the Storage)

Raw Object Storage Is Basically Free

The Egress Trap

Databricks: Two Bills, One Headache

Snowflake: The Hidden Storage Multiplier

What I’d Actually Do

Sources

The Data Lakehouse Won. Now Pick a Table Format.

The Quick History

What a Lakehouse Actually Is

The Four Formats Worth Knowing

Apache Iceberg

Delta Lake

Apache Hudi

Apache Paimon

So Which One?

What Would Minimum Wage Be If It Kept Up With Housing?

The Numbers We’re Working With

The Home-Labor Index

So What Should Minimum Wage Be?