Infrastructure
-
The Data Lakehouse Won. Now Pick a Table Format.
If you’ve been ignoring the data infrastructure conversation for the last few years, here’s where we landed in 2026: the data lakehouse won. The data warehouse vendors will fight about it for another decade, but the architectural argument is over.
Let me back up.
The Quick History
At the bottom of every modern data stack is a cloud storage bucket. S3, Azure Blob, GCS. Pick your hyperscaler. A bucket is dumb on purpose. It stores files cheaply and durably and doesn’t care what’s in them. No schemas, no transactions, no relational anything. Just objects.
When you dump raw logs, IoT telemetry, and CSV exports into a bucket without any organizing layer, congratulations, you have a data lake. Cheap, flexible, and almost completely useless for analytics until someone builds a pipeline to make sense of it.
The traditional answer to that mess was a data warehouse. Snowflake, Redshift, BigQuery, the whole gang. You force your data through ETL, conform it to a strict schema, and pay a premium to keep it sitting in the vendor’s proprietary storage format. You get fast SQL, ACID transactions, and a vendor lock-in problem so severe that exporting your data becomes a major friction point.
The lakehouse is what happens when someone finally says: what if we kept the cheap object storage, but added the warehouse features as a layer on top?
What a Lakehouse Actually Is
The trick is decoupling. Storage stays in your bucket. Compute is whatever engine you point at it. Metadata lives in an open table format that turns a pile of Parquet files into something that behaves like a real database table.
One copy of the data. Multiple engines can query it. Schema evolution, time travel, ACID transactions, all without copying everything into a proprietary system. From what I’ve read, teams that move from a pure warehouse to a lakehouse cut storage costs noticeably and stop fighting their ML team over access to the same data.
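To make the time-travel idea concrete, here’s a toy sketch (plain Python, no real table format) of the core trick: data files are immutable, and each commit just publishes a new snapshot listing which files are in the table. Reading an old snapshot is time travel.

```python
# Toy model of a table format's metadata layer: data files never
# change; a commit writes a new snapshot (a list of file names).
# Reading any historical snapshot is "time travel".

class ToyTable:
    def __init__(self):
        self.snapshots = []  # each snapshot = list of data file names

    def commit(self, files):
        """Atomically publish a new table version."""
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(files))

    def read(self, snapshot_id=None):
        """Read the latest version, or any historical one."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

table = ToyTable()
table.commit(["events-00.parquet"])
table.commit(["events-01.parquet"])

print(table.read())               # latest: both files
print(table.read(snapshot_id=0))  # time travel: first commit only
```

The real formats add a lot on top (schema tracking, statistics, concurrency control), but the snapshot-of-immutable-files idea is the common core.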
That’s the pitch, and it’s a good one. The hard part is picking your table format.
The Four Formats Worth Knowing
Apache Iceberg
Iceberg is the one to bet on if you care about not getting locked in. It came out of Netflix, and even Snowflake and Databricks have been forced to support it. The metadata is hierarchical, which sounds boring but matters: it lets query engines skip enormous chunks of irrelevant data without listing directories one by one. Iceberg also handles partition evolution gracefully, so you can change your partitioning strategy without rewriting petabytes of history.
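The data-skipping point is easier to see with a toy example. This sketch (invented field names, not Iceberg’s actual schema) shows file-level min/max stats doing the pruning that Iceberg’s manifest metadata enables, with no directory listing involved:

```python
# Toy metadata-based pruning: each data file carries min/max stats
# for a column, so a query can drop whole files without opening
# them. Iceberg keeps stats like these in its manifest files, one
# level of its metadata hierarchy.

files = [
    {"path": "part-0.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "part-1.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "part-2.parquet", "min_ts": 300, "max_ts": 399},
]

def prune(files, lo, hi):
    """Keep only files whose ts range overlaps [lo, hi]."""
    return [f["path"] for f in files
            if f["max_ts"] >= lo and f["min_ts"] <= hi]

print(prune(files, 250, 320))  # part-0 is skipped without being read
```

Scale that up to millions of files and the difference between “list everything, then filter” and “consult the metadata tree” is the difference between seconds and minutes of planning time.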
If I’m starting a new lakehouse in 2026 and I don’t have a strong reason to pick something else, it’s Iceberg.
Delta Lake
Delta is what Databricks ships and what everyone using Spark already knows. It uses an append-only transaction log in a `_delta_log` directory, and it’s beautifully integrated with the Databricks platform. Z-Ordering, native Spark performance, the whole ecosystem.

If your team lives inside Databricks, Delta is the obvious answer. If you don’t, the calculus is harder: Delta’s openness has improved a lot, but it still feels most at home in the Databricks world.
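The append-only log design is worth a quick sketch. This toy (simplified action names, not Delta’s real protocol) shows the essential move: current table state is just a fold over every commit in the log, in order:

```python
# Toy replay of an append-only transaction log, the rough shape of
# what lives in Delta's _delta_log: each commit is a JSON entry of
# "add"/"remove" actions, and table state is the fold of all commits.

import json

log = [
    json.dumps({"add": "part-0.parquet"}),
    json.dumps({"add": "part-1.parquet"}),
    json.dumps({"remove": "part-0.parquet", "add": "part-0b.parquet"}),
]

def replay(log):
    """Fold the log into the live set of data files."""
    live = set()
    for entry in log:
        action = json.loads(entry)
        live.discard(action.get("remove"))  # no-op if nothing removed
        if "add" in action:
            live.add(action["add"])
    return sorted(live)

print(replay(log))  # part-0 was replaced by part-0b
```

Because commits only ever append, writers never overwrite each other’s log entries, which is how Delta gets ACID semantics out of a dumb object store.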
Apache Hudi
Hudi came out of Uber and it was built for one thing: high-frequency upserts. If your problem is Change Data Capture, streaming ingestion, or constant record-level updates, Hudi is probably your answer. It gives you two storage modes. Copy-on-Write rewrites files on update so reads stay fast. Merge-on-Read writes deltas and reconciles at query time, which is what you want when writes are heavy and reads are flexible.
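Merge-on-Read is the less intuitive of the two modes, so here’s a toy sketch of the idea (made-up record shapes, not Hudi’s actual file layout): the base file stays as written, updates land in a small delta log, and the read path reconciles the two by record key.

```python
# Toy merge-on-read: upserts go into a delta log instead of
# rewriting the base file; a read applies the delta over the base,
# newest record per key winning. Rough shape of what Hudi's
# Merge-on-Read tables do at query time.

base = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
delta = [("u2", {"name": "Grace H."}), ("u3", {"name": "Barbara"})]

def read_merged(base, delta):
    """Apply delta records over the base file at read time."""
    merged = dict(base)
    for key, record in delta:
        merged[key] = record  # upsert: overwrite or insert
    return merged

print(read_merged(base, delta))
```

The trade-off is visible even in the toy: writes are cheap (append a tuple), but every read pays the merge cost, which is why Copy-on-Write is the default when reads dominate.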
Hudi is the right pick when your pipeline is full of `UPSERT` statements and you can’t afford to rewrite large files every time something changes.

Apache Paimon
Paimon is the newest of the four and it’s worth keeping an eye on. It came from the Flink world and uses an LSM-tree style organization, which is what databases like RocksDB use under the hood. The whole point is unifying batch and streaming in a single format. If you’re doing real-time event-driven work and don’t want to maintain a separate streaming and batch stack, Paimon is interesting.
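If you haven’t run into LSM trees before, here’s a minimal sketch of the organization (toy code, not Paimon’s implementation): writes land in an in-memory buffer that periodically flushes to immutable sorted runs, and reads check the newest data first.

```python
# Toy LSM-style store: writes hit a memtable, which flushes to
# immutable runs; reads consult newest data first. This is the
# organization Paimon borrows from stores like RocksDB to serve
# streaming writes and batch reads from one format.

class ToyLSM:
    def __init__(self, flush_at=2):
        self.memtable = {}
        self.runs = []          # newest flushed run first
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.runs.insert(0, dict(self.memtable))  # flush to a run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # freshest writes win
            return self.memtable[key]
        for run in self.runs:           # then newest runs
            if key in run:
                return run[key]
        return None

db = ToyLSM()
db.put("k1", "v1")
db.put("k2", "v2")   # triggers a flush
db.put("k1", "v1b")  # newer value, still in the memtable
print(db.get("k1"))
```

The appeal for streaming is that every write is cheap and sequential, while background compaction (omitted here) keeps reads from degrading as runs pile up.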
It’s not the safe choice yet, but it’s the one I’d watch most closely over the next two years.
So Which One?
Honestly, the answer depends less on the format and more on which ecosystem you’re already in.
- Mostly Spark and Databricks? Delta.
- Streaming-heavy with constant upserts? Hudi.
- Real-time event-driven and willing to bet on newer tech? Paimon.
- Anything else, or you want to keep your options open? Iceberg.
The format wars have mostly converged. Most major engines support multiple formats now, and the gap between them on raw query performance has shrunk. The choice is more about operational fit than performance ceilings.
The lakehouse pattern itself is the real story. The format is just plumbing.
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ Data / Infrastructure / Lakehouse / Iceberg
-
How Kong Actually Works in Kubernetes
At some point with microservices in Kubernetes, basic Ingress routing stops being enough. Kong is an interesting router that I’d like to try in the future.
It’s an API Gateway built on top of NGINX and OpenResty. It operates at the infrastructure layer, managing the actual HTTP traffic flowing into your cluster. Drop it into a Kubernetes environment and it acts as an Ingress Controller. It does that job really well.
The Ingress Controller Problem
We should review what an ingress controller is, in case you’re unfamiliar with its job in Kubernetes. An `Ingress` resource is just a set of routing rules: “Send traffic for `api.example.com/v1` to the `user-service` pod.” Kubernetes doesn’t actually route traffic itself. It needs a controller to read those rules and move the packets.

The Kong Ingress Controller (KIC) runs as a pod inside your cluster. It watches the Kubernetes API server for changes to Ingress resources, Services, and Endpoints. When someone deploys a new app and creates an Ingress rule, KIC picks it up, translates the Kubernetes config into Kong’s native format, and reloads the proxy. No manual intervention.
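The translation step is conceptually simple. Here’s a toy sketch of it (the field names are illustrative, not Kong’s real declarative schema): take an Ingress-style rule and flatten it into routing entries a proxy could match against.

```python
# Toy version of what an ingress controller does: read an
# Ingress-style rule and translate it into gateway routing entries.
# Field names are illustrative, not Kong's actual config format.

ingress = {
    "host": "api.example.com",
    "paths": [{"path": "/v1", "service": "user-service", "port": 80}],
}

def to_routes(ingress):
    """Flatten an Ingress-style rule into proxy routing entries."""
    return [
        {
            "hosts": [ingress["host"]],
            "paths": [p["path"]],
            "upstream": f'{p["service"]}:{p["port"]}',
        }
        for p in ingress["paths"]
    ]

print(to_routes(ingress))
```

The real controller does this continuously, diffing watch events from the API server rather than translating one rule at a time, but the shape of the work is the same.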
How Traffic Actually Flows
When external traffic hits your cluster, the path looks like this:
- External Load Balancer forwards traffic to the Kong proxy pods
- Kong evaluates the incoming request against its routing table (headers, paths, hostnames)
- Plugins execute before routing, handling cross-cutting concerns at the edge instead of inside your application code
- Upstream routing sends traffic directly to Pod IPs, bypassing `kube-proxy` for better performance
That plugin step is where Kong really earns its keep. Rate limiting, API key auth, mTLS, request transformation. All of that happens at the gateway layer so your services don’t have to think about it.
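To make the rate-limiting example concrete, here’s a toy token-bucket limiter in Python (a common algorithm for this job; Kong’s actual rate-limiting plugin is configurable and more sophisticated). This is the kind of check that runs at the gateway before a request ever reaches your service:

```python
# Toy gateway-side rate limiter (token bucket): each request spends
# a token; tokens refill over time. The kind of cross-cutting
# concern a gateway plugin handles so services don't have to.

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        """Refill based on elapsed time, then try to spend a token."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])
# burst of two passes, the third is throttled, refill admits the fourth
```

In a real deployment the counter state has to be shared across proxy replicas, which is exactly the kind of operational detail the plugin handles for you.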
CRDs Make It Actually Useful
Standard Kubernetes Ingress is pretty limited. Host-based routing, path-based routing, and that’s about it. Kong extends this with Custom Resource Definitions:
- KongPlugin lets you attach behaviors to routes or services. Deploy a manifest to enforce rate limits, require API keys, or add mTLS to a specific endpoint.
- KongConsumer manages user identities and credentials directly in Kubernetes, so you can tie routing rules or rate limits to specific clients.
This means your API gateway configuration lives right alongside your application manifests. Version controlled, reviewable, deployable through your normal CI/CD pipeline.
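For a sense of what those manifests look like, here’s the rough shape of a KongPlugin resource, built as a plain Python dict just to show the structure (in practice this is a YAML file applied with kubectl; check the Kong Ingress Controller docs for the current fields and plugin options):

```python
# The rough shape of a KongPlugin manifest, as a dict for
# illustration. In practice this is YAML applied with kubectl;
# consult the KIC docs for authoritative field names.

import json

kong_plugin = {
    "apiVersion": "configuration.konghq.com/v1",
    "kind": "KongPlugin",
    "metadata": {"name": "rate-limit-5-per-minute"},
    "plugin": "rate-limiting",   # which Kong plugin to enable
    "config": {"minute": 5},     # plugin-specific settings
}

print(json.dumps(kong_plugin, indent=2))
```

Attach it to a Service or Ingress with an annotation and the behavior follows the resource through every environment your CI/CD pipeline deploys to.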
Skip the Database
Kong used to require PostgreSQL or Cassandra to store its routing config. In modern Kubernetes deployments, you almost always run it in DB-less mode instead.
Why? Kubernetes already has `etcd` as its source of truth for cluster state. Running a second database just for the API gateway adds overhead and failure modes you don’t need. In DB-less mode, Kong stores its configuration entirely in memory. The Ingress Controller reads state from Kubernetes and pushes updates to the proxy dynamically.

This is one of those decisions that sounds minor but changes everything about how you operate Kong. No database backups to worry about. No schema migrations. Your gateway config is just Kubernetes manifests managed through GitOps.
Observability at the Edge
Sitting at the edge of the cluster, Kong is perfectly positioned to capture metrics, logs, and traces. With the right plugins, it exports traffic data (latency, status codes, request volumes) directly into whatever observability stack you’re running.
You get visibility across your entire microservice architecture without instrumenting every individual service.
Kong isn’t the only Ingress controller out there, but the combination of plugin architecture, DB-less mode, and CRD-based configuration makes it a solid choice if you need more than basic routing. If you’re already running Kubernetes and find yourself writing the same auth and rate-limiting logic across multiple services, moving that to the gateway layer is worth your time.
/ DevOps / Kubernetes / Kong / Infrastructure