Your Data Lake's Vulnerability Problem Is Really an Identity Problem

I’ve been reading through the post-mortems on the last few years of data lake breaches, and the pattern is depressing. We keep blaming the platforms. We should be blaming ourselves.

Let me give you an example.

The Snowflake Breach Wasn’t a Snowflake Breach

In mid-2024, at least 165 organizations got hit through their Snowflake instances. AT&T lost over 50 billion call and text records. Ticketmaster, Santander, Advance Auto Parts. The headlines wrote themselves: Snowflake hacked.

Except Snowflake wasn’t hacked. Mandiant, CrowdStrike, and Snowflake all reached the same conclusion in their forensics. No zero-day. No flaw in the platform itself. No compromise of Snowflake’s corporate network. No brute-force attacks against its APIs.

What actually happened? UNC5537, a financially motivated group linked in public reporting to the Scattered Spider and ShinyHunters crews, walked through the front door with valid stolen credentials. Those credentials had been harvested over several years by commodity infostealer malware (VIDAR, LUMMA, REDLINE) running on the personal laptops of third-party contractors. The same laptops these contractors used for gaming and pirated software also held the keys to their clients' enterprise data lakes.

One contractor laptop. Multiple enterprise environments compromised. That’s the actual story.

79.7% of the accounts UNC5537 used had prior credential exposure. Some had been valid and un-rotated since November 2020.

The Two Doors They Walked Through

The first attack vector was the SSO side door. Plenty of victim organizations had a perfectly fine enterprise IdP enforcing strong passwords and MFA. They just forgot to make SSO mandatory. A local authentication pathway was left active alongside it. Attackers logged in directly with stolen local credentials, completely bypassing the IdP, and the MFA requirement never fired.
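
If you run Snowflake, that bypass is easy to check for after the fact. Here's a minimal sketch using snowflake-connector-python and the SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY view; the account name, user, and 30-day window are placeholders, not recommendations.

```python
# Minimal sketch: list successful password logins, i.e. sessions the IdP never saw.
# Assumes SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY and snowflake-connector-python;
# account/user values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",           # placeholder
    user="security_auditor",          # placeholder
    authenticator="externalbrowser",  # go through the IdP, not a local password
)

LOCAL_LOGINS_SQL = """
SELECT event_timestamp,
       user_name,
       client_ip,
       reported_client_type,
       first_authentication_factor
FROM snowflake.account_usage.login_history
WHERE event_timestamp > DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND is_success = 'YES'
  AND first_authentication_factor = 'PASSWORD'   -- local auth path, SSO bypassed
ORDER BY event_timestamp DESC;
"""

for ts, user, ip, client, factor in conn.cursor().execute(LOCAL_LOGINS_SQL):
    print(f"{ts}  {user:<30} {ip:<18} {client}")
```

Anything that shows up in that list is a session your MFA policy never evaluated.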

The second was credential stuffing against inactive, orphaned, and demo accounts belonging to former employees. Nobody audits those. Nobody enforces MFA on those. So they don’t get protected by the controls that exist on the production accounts.
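
Those forgotten accounts are findable with one query, too. Another sketch, this time against the SNOWFLAKE.ACCOUNT_USAGE.USERS view; the 90-day staleness cutoff is arbitrary, and the column names should be double-checked against current docs before you rely on them.

```python
# Minimal sketch: surface accounts that haven't logged in for 90+ days but still
# exist -- prime credential-stuffing targets. Column names follow the documented
# SNOWFLAKE.ACCOUNT_USAGE.USERS view; verify before depending on this.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="security_auditor",  # placeholders
    authenticator="externalbrowser",
)

STALE_USERS_SQL = """
SELECT name,
       has_password,
       disabled,
       last_success_login
FROM snowflake.account_usage.users
WHERE deleted_on IS NULL
  AND (last_success_login IS NULL
       OR last_success_login < DATEADD('day', -90, CURRENT_TIMESTAMP()))
ORDER BY last_success_login NULLS FIRST;
"""

for name, has_pw, disabled, last_login in conn.cursor().execute(STALE_USERS_SQL):
    print(f"{name:<30} last_login={last_login}  has_password={has_pw}  disabled={disabled}")
```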

Once inside, the kill chain was almost boring. SHOW TABLES to enumerate. CREATE TEMPORARY STAGE to make an ephemeral staging area that disappears when the session ends, erasing forensic evidence. COPY INTO with GZIP compression to keep the payload small enough that volumetric alarms didn’t trigger. GET to pull it down to a VPS in some offshore jurisdiction. Done.
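
None of those commands is exotic on its own, which is what makes the pattern hard to spot in real time. Retroactively, though, you can hunt for the combination. A rough sketch over SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY; the string matches just mirror the sequence above and will be noisy in any environment with legitimate bulk unloads.

```python
# Rough hunting sketch: sessions that both created temporary stages and ran
# COPY INTO within the last week. Expect noise from legitimate ETL; this is a
# starting point, not a detection rule. Assumes SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="security_auditor",  # placeholders
    authenticator="externalbrowser",
)

STAGE_AND_UNLOAD_SQL = """
SELECT user_name,
       session_id,
       MIN(start_time) AS first_seen,
       COUNT_IF(query_text ILIKE '%TEMPORARY STAGE%') AS temp_stage_stmts,
       COUNT_IF(query_text ILIKE 'COPY INTO%')        AS copy_into_stmts
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY user_name, session_id
HAVING COUNT_IF(query_text ILIKE '%TEMPORARY STAGE%') > 0
   AND COUNT_IF(query_text ILIKE 'COPY INTO%') > 0
ORDER BY first_seen;
"""

for row in conn.cursor().execute(STAGE_AND_UNLOAD_SQL):
    print(row)
```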

No IP allowlisting was in place anywhere. A connection from a Mullvad or PIA exit node got the same trust as an employee signing in from the corporate VPN.
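
Closing that particular gap is a couple of statements, not a project. A sketch using Snowflake network policies; the CIDR ranges are placeholders for wherever your workloads actually originate, and you need a role that can create and attach network policies.

```python
# Sketch: restrict account access to known egress ranges with a network policy.
# CIDR blocks are placeholders; run as a role with the relevant privileges
# (e.g. SECURITYADMIN).
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="security_admin",  # placeholders
    authenticator="externalbrowser",
)
cur = conn.cursor()

cur.execute("""
CREATE NETWORK POLICY corp_and_etl_only
  ALLOWED_IP_LIST = ('203.0.113.0/24',   -- corporate egress (placeholder)
                     '198.51.100.8/32')  -- ETL runner (placeholder)
  COMMENT = 'Nothing outside corp or the ETL runner gets a session';
""")

# Attach it account-wide; stricter per-user policies can layer on top.
cur.execute("ALTER ACCOUNT SET NETWORK_POLICY = corp_and_etl_only;")
```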

The Bucket Problem Hasn’t Gone Away Either

Alongside the identity attacks, the boring stuff keeps working. Misconfigured S3 buckets are still the most reliable way to expose a data lake. In late 2024, an open bucket used as a shared network drive was found containing raw customer data, cryptographic keys, and secrets. In 2025, a US healthcare provider left millions of patient records readable for weeks before anyone noticed.
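
The depressing part is how cheap this is to check. A minimal sketch with boto3, assuming credentials that can list buckets and read their public-access settings; it only flags the obvious cases, not every possible path to exposure.

```python
# Sketch: flag buckets that are publicly readable via policy or missing a
# public access block. Assumes credentials with s3:ListAllMyBuckets,
# s3:GetBucketPolicyStatus, and s3:GetBucketPublicAccessBlock.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]

    try:
        block = s3.get_public_access_block(Bucket=name)["PublicAccessBlockConfiguration"]
        fully_blocked = all(block.values())
    except ClientError:
        fully_blocked = False   # no public access block configured at all

    try:
        is_public = s3.get_bucket_policy_status(Bucket=name)["PolicyStatus"]["IsPublic"]
    except ClientError:
        is_public = False       # no bucket policy; ACLs can still expose it

    if is_public or not fully_blocked:
        print(f"REVIEW: {name}  public_policy={is_public}  full_block={fully_blocked}")
```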

Then there’s Codefinger. In January 2025, that group used compromised AWS credentials to access S3 buckets and then weaponized AWS’s own Server-Side Encryption with Customer-Provided Keys (SSE-C) to ransomware the data in place. They didn’t even need to exfiltrate it. They just encrypted it with a key the victim didn’t have and demanded Bitcoin.

That’s a native cloud feature being turned against you because somebody granted too many permissions to a service account.
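
If you don't use SSE-C anywhere, one blunt mitigation is to refuse it at the bucket policy level, so a stolen key can't be used to re-encrypt your objects with someone else's secret. A sketch with boto3; the bucket name is a placeholder, and note that put_bucket_policy replaces the existing policy document, so in real life you'd merge this statement into what's already there.

```python
# Sketch: deny any object write to this bucket that supplies an SSE-C key.
# Bucket name is a placeholder. put_bucket_policy overwrites the current policy,
# so merge this statement into your existing document instead of replacing it.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-raw"   # placeholder

deny_ssec_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySSECUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                # Deny whenever the SSE-C algorithm header is present at all.
                "Null": {
                    "s3:x-amz-server-side-encryption-customer-algorithm": "false"
                }
            },
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(deny_ssec_policy))
```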

The Boring Conclusions Are the Important Ones

Identity is the perimeter now. The encryption-at-rest story we’ve been telling ourselves for a decade is irrelevant when the attacker authenticates as a real user. Stop treating SSO as optional. Stop leaving local auth paths open next to it. Enforce MFA on every account, including the demo and service accounts you forgot about.
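
On Snowflake, the way to make that stick account-wide is a policy object rather than user-by-user discipline. A sketch assuming Snowflake's authentication policy feature; verify the exact parameter names against current documentation before running anything like it.

```python
# Sketch: force human logins through SAML SSO at the account level.
# Assumes Snowflake authentication policies; check parameter names against
# current docs. Service accounts belong on key-pair auth with their own,
# stricter policy, not on an exemption list.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="security_admin",  # placeholders
    authenticator="externalbrowser",
)
cur = conn.cursor()

cur.execute("""
CREATE AUTHENTICATION POLICY sso_only
  AUTHENTICATION_METHODS = ('SAML')
  COMMENT = 'No local passwords; MFA is enforced at the IdP';
""")

cur.execute("ALTER ACCOUNT SET AUTHENTICATION POLICY sso_only;")
```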

Your data lake should not be reachable from the public internet. Route everything through PrivateLink or the equivalent in your cloud. Allowlist the IPs that should be touching analytical workloads, and don’t make exceptions for “just this one contractor.”

And as you start handing access to AI agents, remember that static roles aren’t going to cut it. Just-in-time entitlements and contextual access control are the only way you’re going to keep up with autonomous systems making queries on your behalf.

The data lake industry spent years arguing about table formats, vendor lock-in, and egress fees. Meanwhile, attackers were just collecting passwords from gaming laptops and walking in.

Fix the doors first.


/ DevOps / security / Cloud / Data-lake