Cloud
-
Your Data Lake's Vulnerability Problem Is Really an Identity Problem
I’ve been reading through the post-mortems on the last few years of data lake breaches, and the pattern is depressing. We keep blaming the platforms. We should be blaming ourselves.
Let me give you an example.
The Snowflake Breach Wasn’t a Snowflake Breach
In mid-2024, at least 165 organizations got hit through their Snowflake instances. AT&T lost over 50 billion call records. Ticketmaster, Santander, Advance Auto Parts. The headlines wrote themselves: Snowflake hacked.
Except Snowflake wasn’t hacked. Mandiant, CrowdStrike, and Snowflake all reached the same conclusion in their forensics. No zero-day. No flaw in the cryptographic platform. No internal compromise of Snowflake’s corporate network. No brute-force attacks against API limits.
What actually happened? UNC5537, a financially motivated group that some reporting has linked to Scattered Spider and ShinyHunters, walked through the front door with valid stolen credentials. Those credentials were harvested over years by commodity infostealer malware (VIDAR, LUMMA, REDLINE) running on the personal laptops of third-party contractors. The same laptops these contractors used for gaming and pirated software also held the keys to their clients' enterprise data lakes.
One contractor laptop. Multiple enterprise environments compromised. That’s the actual story.
79.7% of the accounts UNC5537 used had prior credential exposure. Some had been valid and un-rotated since November 2020.
The Two Doors They Walked Through
The first attack vector was the SSO side door. Plenty of victim organizations had a perfectly fine enterprise IdP enforcing strong passwords and MFA. They just forgot to make SSO mandatory. A local authentication pathway was left active alongside it. Attackers logged in directly with stolen local credentials, completely bypassing the IdP, and the MFA requirement never fired.
The second was credential stuffing against inactive, orphaned, and demo accounts belonging to former employees. Nobody audits those. Nobody enforces MFA on those. So they don’t get protected by the controls that exist on the production accounts.
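Both doors show up if you audit the account list. Here's a minimal sketch, assuming you've exported your accounts into simple records. The field names (`has_password`, `sso_enforced`, `mfa_enabled`, `last_login`) and the 90-day staleness cutoff are my own placeholders, not anything from a real Snowflake API.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Account:
    # Hypothetical export format; field names are illustrative only.
    name: str
    has_password: bool   # a local (non-SSO) login path exists
    sso_enforced: bool   # login is forced through the IdP
    mfa_enabled: bool
    last_login: date

def audit(accounts, today, stale_after_days=90):
    """Return {account name: [findings]} for accounts that leave a door open."""
    findings = {}
    for acct in accounts:
        issues = []
        if acct.has_password and not acct.sso_enforced:
            issues.append("local auth path bypasses SSO")
        if not acct.mfa_enabled:
            issues.append("no MFA")
        if today - acct.last_login > timedelta(days=stale_after_days):
            issues.append("inactive, candidate for removal")
        if issues:
            findings[acct.name] = issues
    return findings

# Made-up accounts for illustration.
accounts = [
    Account("analyst", False, True, True, date(2024, 5, 1)),
    Account("contractor", True, False, False, date(2024, 4, 20)),
    Account("demo_2021", True, False, False, date(2021, 3, 2)),
]
report = audit(accounts, today=date(2024, 5, 15))
for name, issues in report.items():
    print(name, "->", "; ".join(issues))
```

The point of the sketch is that the demo account and the contractor account fail the same checks. Neither one looks dangerous until you list them side by side.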
Once inside, the kill chain was almost boring.
SHOW TABLES to enumerate. CREATE TEMPORARY STAGE to make an ephemeral staging area that disappears when the session ends, erasing forensic evidence. COPY INTO with GZIP compression to keep the payload small enough that volumetric alarms didn’t trigger. GET to pull it down to a VPS in some offshore jurisdiction. Done.
No IP allowlisting was in place anywhere. The connections from Mullvad and PIA exit nodes were treated with the same trust as an employee on the corporate VPN.
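That four-step sequence is regular enough to look for. Below is a rough sketch that scans per-session statement lists for the steps in order. The session IDs, the statements, and the idea that you have logs in this shape are all assumptions for illustration. A real detection would read from your platform's query history and tolerate far more variation in the SQL.

```python
import re

# Ordered statement patterns for the kill chain described above.
KILL_CHAIN = [
    re.compile(r"^\s*SHOW\s+TABLES", re.I),
    re.compile(r"^\s*CREATE\s+(OR\s+REPLACE\s+)?TEMPORARY\s+STAGE", re.I),
    re.compile(r"^\s*COPY\s+INTO\s+@", re.I),
    re.compile(r"^\s*GET\s+@", re.I),
]

def matches_kill_chain(statements):
    """True if the statements contain all four steps in order (gaps allowed)."""
    step = 0
    for sql in statements:
        if KILL_CHAIN[step].search(sql):
            step += 1
            if step == len(KILL_CHAIN):
                return True
    return False

# Hypothetical per-session query logs; IDs and statements are made up.
sessions = {
    "s-001": [
        "SELECT count(*) FROM orders",
        "SHOW TABLES",
        "SELECT * FROM orders LIMIT 10",
    ],
    "s-002": [
        "SHOW TABLES",
        "CREATE TEMPORARY STAGE tmp_stage",
        "COPY INTO @tmp_stage/out FROM customers "
        "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)",
        "GET @tmp_stage file:///tmp/out/",
    ],
}
flagged = [sid for sid, stmts in sessions.items() if matches_kill_chain(stmts)]
print(flagged)
```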
The Bucket Problem Hasn’t Gone Away Either
Alongside the identity attacks, the boring stuff keeps working. Misconfigured S3 buckets are still the most reliable way to expose a data lake. In late 2024, an open bucket used as a shared network drive was found containing raw customer data, cryptographic keys, and secrets. In 2025, a US healthcare provider left millions of patient records readable for weeks before anyone noticed.
Then there’s Codefinger. In January 2025, that group used compromised AWS credentials to access S3 buckets and then weaponized AWS’s own Server-Side Encryption with Customer-Provided Keys (SSE-C) to ransomware the data in place. They didn’t even need to exfiltrate it. They just encrypted it with a key the victim didn’t have and demanded Bitcoin.
That’s a native cloud feature being turned against you because somebody granted too many permissions to a service account.
The Boring Conclusions Are the Important Ones
Identity is the perimeter now. The encryption-at-rest story we’ve been telling ourselves for a decade is irrelevant when the attacker authenticates as a real user. Stop treating SSO as optional. Stop leaving local auth paths open next to it. Enforce MFA on every account, including the demo and service accounts you forgot about.
Your data lake should not be reachable from the public internet. Route everything through PrivateLink or the equivalent in your cloud. Allowlist the IPs that should be touching analytical workloads, and don’t make exceptions for “just this one contractor.”
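The allowlist check itself is tiny. Here's a sketch using Python's standard `ipaddress` module. The networks are documentation and private ranges I picked as stand-ins, not anyone's real corporate ranges.

```python
import ipaddress

# Assumed corporate ranges (private and documentation prefixes, not real networks).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_allowed(client_ip: str) -> bool:
    """True only if the client address falls inside an allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)

print(is_allowed("10.20.4.7"))       # inside the first range
print(is_allowed("198.51.100.23"))   # outside every range, e.g. a VPN exit node
```

In practice you'd enforce this in the platform's network policy rather than in application code. The sketch only shows how little logic stands between "any VPN exit node" and "known networks only."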
And as you start handing access to AI agents, remember that static roles aren’t going to cut it. Just-in-time entitlements and contextual access control are the only way you’re going to keep up with autonomous systems making queries on your behalf.
The data lake industry spent years arguing about table formats, vendor lock-in, and egress fees. Meanwhile, attackers were just collecting passwords from gaming laptops and walking in.
Fix the doors first.
Sources
- UNC5537 Targets Snowflake Customer Instances (Mandiant / Google Cloud) — Forensic analysis, kill chain, infostealer attribution
- Snowflake Data Breach: Lessons Learned (AppOmni) — SSO side door, MFA bypass mechanics
- Major AWS S3 Bucket Breach Exposes Data (NHIMG) — Codefinger SSE-C ransomware tactic
- Misconfigured Cloud Assets: How Attackers Find Them (CybelAngel) — Recent open-bucket exposure incidents
- 5 Key Lessons from the Snowflake Data Breach (Tanium) — Defensive posture summary
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
-
Your Data Lake Has a Permissions Problem
Consolidating every business unit’s data into one giant lakehouse sounds like a win until you realize the security model from your old data warehouse can’t scale to it. You took ten silos, each with their own access rules, and merged them into one location. Now everyone wants in, and your security team is the bottleneck.
Let me walk through three places where the cracks usually show up.
RBAC Falls Over Faster Than You Think
Role-Based Access Control is the model most teams start with. Permissions are tied to a job function. Sales reps get read access to sales tables, data engineers get write access to staging, and so on. It works fine when you have ten roles.
It does not work when you have a thousand.
Say your sales reps should only see accounts in their territory, and only accounts they personally manage. Under pure RBAC, you need a unique role for every territory-by-account-owner combination. That’s role explosion, and it’s how compliance audits become impossible and legitimate access slows to a crawl. The roles list grows faster than anyone can review it, which means stale permissions sit there forever.
The answer is Attribute-Based Access Control. Instead of asking “what role is this user in,” the system asks “what attributes does this user have, what attributes does this data have, and what’s the policy at this exact moment.” Tag a column as PII. Tag a schema as HR. Write one policy that says anyone outside the HR compliance group sees masked data when they touch a PII column. Done. That single policy replaces hundreds of bespoke roles.
This is what Unity Catalog and Starburst Galaxy are built around, and it’s the model that will scale with the data.
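To make the contrast concrete, here's a small sketch. The territory and owner counts are invented, and `visible_value` is a toy stand-in for a catalog policy, not the Unity Catalog or Starburst API.

```python
# Role explosion: one role per territory-by-owner combination (counts assumed).
territories, account_owners = 40, 250
rbac_roles = territories * account_owners  # roles someone has to create and review

def visible_value(user_groups, column_tags, value):
    """One ABAC rule: PII columns are masked outside the HR compliance group."""
    if "PII" in column_tags and "hr_compliance" not in user_groups:
        return "****"
    return value

print(rbac_roles)
print(visible_value({"sales"}, {"PII"}, "jane@example.com"))
print(visible_value({"hr_compliance"}, {"PII"}, "jane@example.com"))
print(visible_value({"sales"}, set(), "Q3 pipeline"))
```

One function covers every user and every tagged column. The RBAC version needs a new role each time a territory or an owner is added.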
Column and Row Security Should Be Boring
Once you have ABAC and a real metadata catalog, column-level masking and row-level filtering become a non-event. You write a SQL expression that masks the first five digits of an SSN for lower-privileged roles. You write a row filter that silently appends WHERE region = 'user_region' to every executive’s SELECT *.
The key word is silently. The user doesn’t see a different table. They don’t have a sanitized copy. The policy is enforced at the catalog layer, so it works the same whether they’re querying through Spark, Trino, a BI dashboard, or a pipeline. One source of truth, one policy, every engine.
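Here's roughly what those two policies do to a result set, sketched in Python rather than in any catalog's policy language. The row shape and the `can_see_pii` flag are assumptions for the example.

```python
def mask_ssn(ssn: str) -> str:
    """Mask the first five digits and keep the last four."""
    digits_seen = 0
    out = []
    for ch in ssn:
        if ch.isdigit() and digits_seen < 5:
            out.append("*")
            digits_seen += 1
        else:
            out.append(ch)
    return "".join(out)

def apply_policies(rows, user):
    """Row filter on region, then column mask on ssn for lower-privileged users."""
    result = []
    for row in rows:
        if row["region"] != user["region"]:
            continue  # the silent WHERE clause
        row = dict(row)  # never mutate the source table
        if not user["can_see_pii"]:
            row["ssn"] = mask_ssn(row["ssn"])
        result.append(row)
    return result

# Made-up table and user for illustration.
table = [
    {"name": "Ada", "region": "EMEA", "ssn": "123-45-6789"},
    {"name": "Bo", "region": "APAC", "ssn": "987-65-4321"},
]
exec_user = {"region": "EMEA", "can_see_pii": False}
print(apply_policies(table, exec_user))
```

The caller runs the same query either way. What differs is what comes back.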
If you’re still maintaining separate “sanitized” copies of tables for different audiences, you’re doing it the 2015 way and you’re going to drift.
The IAM Default Problem
Most cloud services ship with default IAM roles, and a surprising number of those defaults attach AmazonS3FullAccess or something equally permissive. SageMaker does it. The Ray autoscaler role does it. There are more.
Picture the failure mode. An attacker compromises some peripheral app, maybe a forgotten Jupyter notebook, maybe a misconfigured Lambda. That workload has an IAM role attached because that’s how cloud workloads talk to S3 without hardcoded credentials. The attacker inherits the role. And because the role has full S3 access, they’re not constrained to the bucket the application actually uses. They can enumerate every bucket in the entire account.
That’s how a single compromised container becomes a full data lake breach. Researchers call it a bucket monopoly attack. I call it the most predictable incident in the industry.
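Spotting the over-broad grant is mechanical. Below is a sketch that walks an IAM policy document (the standard JSON shape, loaded as a dict) and flags Allow statements that pair a wildcard S3 action with a wildcard resource. The two sample policies are hand-written for the example, and the bucket name is hypothetical.

```python
def as_list(value):
    """IAM lets Action/Resource/Statement be a single item or a list."""
    return value if isinstance(value, list) else [value]

def wildcard_findings(policy):
    """Return the Allow statements granting s3:* (or *) on every resource."""
    findings = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        broad = [a for a in actions if a in ("*", "s3:*")]
        if broad and "*" in resources:
            findings.append({"actions": broad, "resources": resources})
    return findings

# Shaped like a full-access managed policy: every S3 action on every bucket.
full_access = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": ["s3:*"], "Resource": "*"}],
}
# Scoped alternative; the bucket and prefix are hypothetical.
scoped = {
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-ml-bucket/training/*",
    },
}
print(wildcard_findings(full_access))
print(wildcard_findings(scoped))
```

This only catches the most blatant case. Prefix wildcards like `s3:Get*` on `*` deserve a look too, but that's a judgment call the sketch doesn't try to make.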
The fix is not glamorous. Stop using s3:* in any policy. Write resource-scoped policies that name the exact buckets and prefixes a workload needs. Audit the default roles every cloud service hands you and replace them. Use Security Lake or Detective to flag cross-service API calls that don’t match normal patterns. None of this is fun. All of it is necessary.
And Then There’s the Agent Problem
The new wrinkle is that humans are no longer the primary consumers of your data. Autonomous agents are. They issue more queries, hit more tables, and move faster than any human team.
Long-lived credentials and static roles don’t fit that workload. The pattern emerging is Just-In-Time entitlements, where an agent gets a narrow, ephemeral permission for the duration of a single execution thread, then loses it. Pair that with declarative policy metadata baked into the data assets themselves, so the agent knows what it’s allowed to do with a dataset before it ever runs the query.
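A just-in-time grant can be as simple as a narrow permission with an expiry attached. Here's a minimal sketch. The `Grant` shape, the 60-second TTL, and the agent and table names are all my assumptions, and a real system would also sign and log each grant.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    agent_id: str
    table: str
    action: str
    expires_at: float  # seconds, on whatever clock the caller uses

def issue_grant(agent_id, table, action, now, ttl_seconds=60.0):
    """Issue one narrow permission that dies after ttl_seconds."""
    return Grant(agent_id, table, action, now + ttl_seconds)

def is_permitted(grant, agent_id, table, action, now):
    """Everything must match exactly, and the grant must not have expired."""
    return (
        grant.agent_id == agent_id
        and grant.table == table
        and grant.action == action
        and now < grant.expires_at
    )

# The clock is passed in explicitly so the example is deterministic.
g = issue_grant("agent-7", "sales.orders", "SELECT", now=1000.0)
print(is_permitted(g, "agent-7", "sales.orders", "SELECT", now=1030.0))  # inside the window
print(is_permitted(g, "agent-7", "sales.orders", "SELECT", now=1061.0))  # expired
print(is_permitted(g, "agent-7", "hr.salaries", "SELECT", now=1030.0))   # wrong table
```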
We’re early on this. Most organizations are still working through the basics, and that’s fine. But if you’re designing access controls today, design them assuming the next thing hitting your lake isn’t a person.
What to Actually Do
If you’re auditing your own data lake security, the order I’d work in:
- Find every IAM role with a wildcard permission. Replace them.
- Move from RBAC to ABAC at the catalog layer. Stop creating new roles.
- Pull your data lake off the public internet. PrivateLink, private endpoints, IP allowlists for the legacy stuff that can’t move.
- Then start thinking about agents.
The lakehouse pitch is unification. The lakehouse reality is that unification multiplies the cost of every bad permission. Get the basics right before you bolt on anything fancy.
Sources
- AWS Default IAM Roles Found to Enable Lateral Movement (The Hacker News) — SageMaker / Ray autoscaler default roles, bucket monopoly attacks
- What Is Fine-Grained Data Access Control? (TrustLogix) — RBAC role explosion, ABAC fundamentals
- Core concepts for ABAC (Databricks Unity Catalog docs) — Tag-driven policy enforcement
- Top 12 Data Governance Predictions for 2026 (Hyperight) — Just-in-time entitlements, declarative policy metadata
-
The Real Cost of Your Data Lake (It's Not the Storage)
If you’re sketching out a data platform on a whiteboard right now, I want you to do something. Stop calculating storage costs. They’re not the bill.
I pulled the public pricing for AWS, Azure, GCP, Databricks, and Snowflake and stacked them next to each other. Storage is the cheap part. The expensive part is everything that moves the data, and the expensive part is the part you’re least likely to model correctly when you’re picking a vendor.
Let me walk through what actually shows up on the invoice.
Raw Object Storage Is Basically Free
For hot, frequently accessed data, the big three are within a rounding error of each other:
- Azure Blob (LRS, Hot): $0.018 per GB/month
- Google Cloud Standard: $0.020 per GB/month
- AWS S3 Standard: $0.023 per GB/month (first 50 TB)
Drop into the cool tiers and AWS S3 takes the lead at $0.0125 per GB. Drop into deep archive and you’re paying $0.00099 per GB on either AWS Glacier Deep Archive or Azure Archive. That’s a tenth of a cent per gigabyte, per month, for data you almost never touch.
Those numbers are real, but anyone leading with “per-GB storage cost” in a procurement deck is selling you a story. Storage capacity is roughly five percent of a typical Databricks bill. Five. The other 95% is the part nobody wants to talk about.
The Egress Trap
Ingress is free. Always. The cloud providers want your data in.
Getting it back out is where they collect.
- Azure Blob: $0.087/GB external egress
- AWS S3: $0.090/GB
- Google Cloud: $0.120/GB (but free if you stay inside Google’s ecosystem, which is the whole point of that pricing)
Then layer on API operations. A million GET requests on S3 costs about $0.40. Reads on Google Cloud Storage land in a similar range, but operations in the pricier class (writes and object listings) run closer to $5.00 per million on either platform. If your analytics workload is hammering small files, those API calls add up faster than the storage they’re reading.
Storing 10 TB? Maybe $200 a month. Storing 500 TB? You’re at $10,000 a month before a single byte leaves the region or a single query fires.
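Here's that arithmetic as a sketch you can change. The rates are the ones quoted in this post, with storage at a round $0.020/GB because that is what the $200 and $10,000 figures imply. The workload (100 TB of egress, 2 billion GETs) is invented for illustration.

```python
# Rates quoted in this post (USD). Tiered discounts are ignored.
STORAGE_PER_GB = 0.020
EGRESS_PER_GB = 0.090     # S3 external egress
GET_PER_MILLION = 0.40    # S3 GET requests

def monthly_bill(stored_gb, egress_gb, get_requests):
    """Break a month's object-storage bill into its three line items."""
    return {
        "storage": stored_gb * STORAGE_PER_GB,
        "egress": egress_gb * EGRESS_PER_GB,
        "requests": get_requests / 1_000_000 * GET_PER_MILLION,
    }

# Assumed workload: 500 TB stored, 100 TB read out of the cloud, 2 billion GETs.
bill = monthly_bill(stored_gb=500_000, egress_gb=100_000, get_requests=2_000_000_000)
for item, usd in bill.items():
    print(f"{item:>9}: ${usd:,.0f}")
```

Under those assumptions, moving a fifth of the data out once costs almost as much as storing all of it for the month.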
Databricks: Two Bills, One Headache
Databricks uses what’s commonly called a Two-Bill Model. You get one invoice from your cloud provider for the actual VMs and storage, and a separate invoice from Databricks for the software, measured in DBUs (Databricks Units).
In a typical mid-sized deployment around $18,000/month, the breakdown looks like this:
- VM compute from the cloud provider: ~55%
- Databricks DBU fees: ~30%
- Storage: ~5%
- Network egress: ~5%
The DBU rate changes based on what you’re doing. Automated jobs start at $0.15/DBU. Interactive notebooks for analysts start at $0.40/DBU. That’s not an accident. Databricks wants you running production workloads on cheap job clusters, not on the expensive all-purpose clusters your data scientists love to leave running over a weekend.
If you’re not actively pushing teams toward job clusters and ARM-based instances, you’re leaving real money on the table.
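Turning the percentages into dollars makes the point harder to ignore. This sketch uses the $18,000 deployment and the shares listed above. The 10,000 DBU figure is an assumption I made up to compare the two starting rates.

```python
MONTHLY_TOTAL = 18_000  # USD, the mid-sized deployment above

shares = {"vm_compute": 0.55, "dbu_fees": 0.30, "storage": 0.05, "egress": 0.05}
dollars = {item: round(MONTHLY_TOTAL * share) for item, share in shares.items()}
# The listed shares add up to 95%; the remainder isn't itemized above.
unlisted = MONTHLY_TOTAL - sum(dollars.values())

JOB_RATE, INTERACTIVE_RATE = 0.15, 0.40  # USD per DBU, starting rates quoted above
dbus = 10_000  # assumed monthly DBU consumption, for illustration only
job_cost = dbus * JOB_RATE
interactive_cost = dbus * INTERACTIVE_RATE

for item, usd in dollars.items():
    print(f"{item:>12}: ${usd:,}")
print(f"{'unlisted':>12}: ${unlisted:,}")
print(f"same DBUs on job compute: ${job_cost:,.0f}, on interactive: ${interactive_cost:,.0f}")
```

Same work, same DBU count, and the interactive cluster costs well over twice as much before you count the VMs sitting underneath it.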
Snowflake: The Hidden Storage Multiplier
Snowflake’s pricing pitch sounds clean. Pass-through storage at $40/TB/month on-demand, dropping to $23/TB/month with a capacity commitment. Compute as Credits. Done.
Except it isn’t done. Snowflake stores data in immutable micro-partitions of roughly 16MB compressed. Immutable. You can’t change them in place. Update a single row in a 1 TB table and Snowflake writes a new micro-partition and keeps the old one around.
Why keep the old one? Two features:
- Time Travel: query historical states of your data for up to 90 days (Enterprise edition and up; the default is 1 day)
- Fail-Safe: a 7-day disaster recovery window you cannot turn off on permanent tables
This is the part that gets people. A 1 TB table that’s getting updated multiple times a day can balloon to 25 TB of billed storage because Snowflake is retaining every prior version of every micro-partition you’ve touched. Your dashboard says “1 TB table.” Your invoice says otherwise.
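A back-of-the-envelope model shows how you get there. This is a simplified steady-state sketch, assuming every superseded micro-partition is kept for the Time Travel window plus Fail-Safe. The 3 TB/day churn and 1-day Time Travel are my assumptions, picked because they reproduce the 25 TB figure, not because the source states them.

```python
def billed_storage_tb(active_tb, churn_tb_per_day, time_travel_days, fail_safe_days=7):
    """Active data plus every retired version still inside a retention window."""
    retained_days = time_travel_days + fail_safe_days
    return active_tb + churn_tb_per_day * retained_days

# 1 TB table rewritten about three times a day, 1 day of Time Travel.
print(billed_storage_tb(active_tb=1, churn_tb_per_day=3, time_travel_days=1))
# Same table with Time Travel turned up to the 90-day maximum.
print(billed_storage_tb(active_tb=1, churn_tb_per_day=3, time_travel_days=90))
```

The lever you control is the Time Travel window. Fail-Safe is fixed, so high-churn tables pay for at least a week of history no matter what.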
And compute? Virtual Warehouses bill per second, but with a 60-second minimum every single time you resume or resize. Aggressive auto-suspend sounds like a cost optimization. It’s not. If you’re spinning a warehouse up and down every 30 seconds, you’re paying the 60-second minimum every time and quietly multiplying your bill.
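The 60-second floor is easy to model. In this sketch the burst pattern (60 resumes in an hour, 10 seconds of work each) is an assumption chosen to show the effect.

```python
def billed_seconds(run_seconds, minimum=60):
    """Per-second billing with a minimum charge on every resume."""
    return max(run_seconds, minimum)

# Assumed pattern: 60 short bursts in an hour, each resuming a suspended
# warehouse for 10 seconds of real work.
bursts = [10] * 60
actual = sum(bursts)
billed = sum(billed_seconds(s) for s in bursts)
print(actual, billed, billed / actual)
```

In that pattern you're billed for a full hour of compute for ten minutes of work, which is exactly what leaving the warehouse running would have cost.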
What I’d Actually Do
A few things I’d put on the wall before signing anything:
- Model egress, not storage. Run your worst-case query pattern through the calculator. Storage is noise.
- Lifecycle everything. Going by the numbers above, cool tier is roughly half the price of hot storage and deep archive is more than 20x cheaper. If your data is older than 90 days and nobody’s queried it, it shouldn’t be in hot storage.
- For Databricks: push every recurring workload to job compute. Audit interactive cluster usage monthly.
- For Snowflake: if you have high-frequency update patterns, profile your actual storage footprint, not your logical table size. The gap will surprise you.
- For multi-cloud: don’t. Egress will eat the savings before you finish the architecture diagram.
The vendors all have a story about why their model is the cheap one. Read past the per-GB number on the slide. The bill is somewhere else.
Happy modeling.
Sources
- Databricks Pricing Explained (Dawiso) — Two-Bill Model, DBU breakdown
- Snowflake Pricing Explained (SELECT.dev) — Time Travel storage multiplier, micro-partition behavior
- Cloud & AI Storage Pricing Comparison 2026 (Finout) — AWS / Azure / GCP per-GB and tier pricing
- S3 vs GCS vs Azure Blob Storage (ai-infra-link) — Egress and API operation pricing
- Snowflake Pricing in 2026 (CloudZero) — Virtual Warehouse 60-second minimum behavior
-
Serverless and Edge Computing: A Practical Guide
Serverless and edge computing have transformed how we deploy and scale web applications. Instead of managing servers, you write functions that automatically scale from zero to millions of users.
Edge computing takes this further by running code geographically close to users for minimal latency. Let’s break down how these technologies work and when you’d actually want to use them.
What is Serverless?
Serverless doesn’t mean “no servers.” It means you don’t manage them. The provider handles infrastructure, scaling, and maintenance. You just write “functions”.
The functions are stateless and auto-scaling, and you only pay for execution time. So what’s the tradeoff? There are several:
- Cold starts. The first request after idle time is slower because the container needs to spin up.
- Lock-in. Serverless platforms are sticky, which makes it hard to move to another provider.
- No state. Each invocation is independent and retains nothing between invocations.
- Local development. Some platforms don’t run Node, so your code can behave very differently in production than it does on your machine, which complicates development and testing.
- Cost at scale. Every request is billed as its own invocation, so a site that gets a lot of traffic and makes a lot of requests can end up with expensive hosting bills, or with you playing the pauper on social media.
Traditional vs. Serverless vs. Edge
Think of it this way:
- Traditional servers are always running, always costing you money, and you handle all the scaling yourself. Great for predictable, high-traffic workloads. Lots of different options for hosting and scaling.
- Serverless (AWS Lambda, Vercel Functions, GCP Functions) spins up containers on demand and kills them when idle. Auto-scales from zero to infinity. Cold starts around 100-500ms.
- Edge (Cloudflare Workers, Vercel Edge) uses V8 Isolates instead of containers, running your code in 200+ locations worldwide. Cold starts under 1ms.
Cost Projections
Here’s how the costs break down at different scales:
| Requests / Month | ~RPS (Avg) | AWS Lambda | Cloudflare Workers | VPS / K8s Cluster | Winner |
| --- | --- | --- | --- | --- | --- |
| 1 Million | 0.4 | $0.00 (Free Tier) | $0.00 (Free Tier) | $40–$60 (Min HA Setup) | Serverless |
| 10 Million | 4.0 | ~$12 | ~$5 | $40–$100 | Serverless |
| 100 Million | 40 | ~$120 | ~$35 | $80–$150 | Tie / Workers |
| 500 Million | 200 | ~$600 | ~$155 | $150–$300 | VPS / Workers |
| 1 Billion | 400 | ~$1,200+ | ~$305 | $200–$400 | VPS / EC2 |

The Hub and Spoke Pattern
Also called the citadel pattern, this is where serverless and traditional infrastructure stop competing and start complementing each other. The idea is simple: keep a central hub (your main application running on containers or a VPS) and offload specific tasks to serverless “spokes” at the edge.
Your core API, database connections, and stateful logic stay on traditional infrastructure where they belong. But image resizing, auth token validation, A/B testing, geo-routing and rate limiting all move to edge functions that run close to the user.
When to Use Serverless
- Unpredictable or spiky traffic — APIs that go from 0 to 10,000 requests in minutes (webhooks, event-driven workflows)
- Lightweight, stateless tasks — image processing, PDF generation, sending emails, data transformation
- Low-traffic side projects — anything that sits idle most of the time and you don’t want to pay for an always-on server… and you don’t know how to set up a Coolify server.
- Edge logic — geolocation routing, header manipulation, request validation before it hits your origin
When to Use Containers / VPS
- Sustained high traffic — once you’re consistently above ~100M requests/month, a VPS is cheaper (see the table above)
- Stateful workloads — WebSocket connections, long-running processes, anything that needs to hold state between requests
- Database-heavy applications — connection pooling and persistent connections don’t play well with serverless cold starts
- Complex applications — monoliths or microservices that need shared memory, background workers, or cron jobs
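The ~100M crossover can be sanity-checked against the cost table. This sketch hard-codes the Lambda and VPS estimates from that table, assumes a 30-day month for the RPS conversion, and compares Lambda against the VPS price range at each scale.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

def avg_rps(requests_per_month):
    """Average requests per second for a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH

# (requests/month, Lambda USD, VPS low USD, VPS high USD), copied from the table.
ROWS = [
    (1_000_000, 0, 40, 60),
    (10_000_000, 12, 40, 100),
    (100_000_000, 120, 80, 150),
    (500_000_000, 600, 150, 300),
    (1_000_000_000, 1200, 200, 400),
]

def lambda_vs_vps(lambda_usd, vps_low, vps_high):
    """Classify where the Lambda estimate sits relative to the VPS range."""
    if lambda_usd < vps_low:
        return "lambda cheaper"
    if lambda_usd > vps_high:
        return "vps cheaper"
    return "overlap"

verdicts = {}
for requests, lam, low, high in ROWS:
    verdicts[requests] = lambda_vs_vps(lam, low, high)
    print(f"{requests:>13,} req/mo  ~{avg_rps(requests):6.1f} rps  {verdicts[requests]}")
```

Lambda wins clearly up to 10M, lands inside the VPS range at 100M, and loses from 500M up. These are estimates from one table, so treat the crossover as a region rather than a line.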
The Hybrid Approach
The best architectures often use both. Which mix is right depends on your use case, your team, your budget, and the complexity of your application.
Knowing the tradeoffs is the difference between a seasoned developer and a junior. Make your decisions based on your needs and constraints.
Good luck and godspeed!