Running Terraform in Your Existing CI Pipeline
The previous post made the case that HCP Terraform’s per-resource pricing model has gotten structurally hostile to modern infrastructure patterns. (The earlier posts in this series argued that OpenTofu is the no-regrets default for new infrastructure, and walked through when to skip Terraform entirely in favor of cloud-native tooling.) The natural follow-up: if you don’t want to pay the commercial orchestration tax, can you run Terraform or OpenTofu properly inside your existing CI/CD? The answer is yes, but the gap between “it works” and “it works well” requires some deliberate architecture. This post is about how to close that gap.
There are three pieces: where the state lives, how the pipeline authenticates to your cloud, and what handles the orchestration concerns (locking, PR commentary, drift detection) that TACOs (Terraform automation and collaboration platforms such as HCP Terraform and Spacelift) sell as their core value. Each one has a sensible 2026 answer that doesn’t involve paying anyone.
State Management in GitLab
If you’re on GitLab, the entire state-management problem is solved natively. GitLab ships an HTTP backend for Terraform and OpenTofu state on every tier including Free. You don’t need to provision an S3 bucket. You don’t need a DynamoDB lock table. You don’t need to figure out KMS. The state file is encrypted in transit and at rest, locking is built into the backend, access is governed by GitLab’s project-scoped role-based access control, and there’s a native UI under Operate > Terraform states that shows you version history and lets you roll back if something corrupts.
The pipeline pattern uses a backend block like this:
    terraform {
      backend "http" {}
    }
Combine it with the gitlab-tofu (or gitlab-terraform) CLI wrapper in your .gitlab-ci.yml, which dynamically configures the HTTP backend at runtime using the per-job ${CI_JOB_TOKEN}. The wrapper avoids passing backend credentials via -backend-config arguments (which can end up in pipeline logs) and handles authentication automatically.
The RBAC story is also worth pointing out, because it’s exactly what TACOs charge thousands of dollars to replicate: the GitLab project’s role model becomes the IaC permissions model. Developers can read state and run tofu plan -lock=false. Maintainers and Owners can lock state and run tofu apply. The audit log is the GitLab activity feed. No additional configuration, no additional vendor.
For GitLab shops, this is the single highest-leverage decision in the entire IaC stack: stop paying for state management when your VCS gives it to you for free.
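For runs outside the wrapper (a laptop, for instance), the settings it injects at runtime can be spelled out by hand. The block below is a minimal sketch of that explicit form, modeled on the manual init arguments GitLab documents. The host, the project ID (12345), and the state name (production) are hypothetical placeholders, not values from this post.

```hcl
terraform {
  backend "http" {
    # Hypothetical project ID (12345) and state name (production).
    address        = "https://gitlab.com/api/v4/projects/12345/terraform/state/production"
    lock_address   = "https://gitlab.com/api/v4/projects/12345/terraform/state/production/lock"
    unlock_address = "https://gitlab.com/api/v4/projects/12345/terraform/state/production/lock"

    # GitLab locks with POST and unlocks with DELETE on the same URL.
    lock_method   = "POST"
    unlock_method = "DELETE"

    # Seconds to wait before retrying a failed request.
    retry_wait_min = 5

    # Credentials are deliberately absent. Supply them through the
    # TF_HTTP_USERNAME and TF_HTTP_PASSWORD environment variables
    # so they never land in the repository.
  }
}
```

Keeping the credentials in environment variables preserves the property the wrapper is there to protect: no token in the code and none on the command line.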
Secretless Authentication on GitHub Actions
On GitHub Actions, the equivalent problem is authentication. Historically, every Terraform-on-Actions tutorial told you to put a long-lived AWS access key in GitHub Secrets. That’s the worst possible pattern. A compromised repository, a malicious third-party action, or a leaked log line gives the attacker standing access to your cloud with every permission that key carries, until someone notices and rotates it.
The 2026 answer is OpenID Connect with cloud-side trust policies. The pipeline gets ephemeral, short-lived credentials per job, scoped to the specific repository and branch that initiated the run. Nothing persists.
For AWS: configure GitHub’s OIDC provider (token.actions.githubusercontent.com) as an identity provider in IAM. Create an IAM role with a trust policy that conditionally allows assumption based on JWT claims like sub (subject) and aud (audience). The workflow uses aws-actions/configure-aws-credentials to exchange a GitHub-issued JWT for temporary AWS credentials via AssumeRoleWithWebIdentity (the job needs the id-token: write permission to request that JWT). The trust policy can be scoped to a specific repository, a specific branch (main only), or even a specific environment (production).
For GCP: the equivalent is Workload Identity Federation. You create a Workload Identity Pool that trusts GitHub’s OIDC provider, configure attribute mapping that validates the token claims (e.g., requiring assertion.repository == "company/infra-prod"), and grant the pool’s principal the ability to impersonate a specific GCP service account. The official google-github-actions/auth action handles the token exchange.
Both patterns produce credentials that expire on their own within a short, configurable window, are useless to an attacker once they expire, and leave a clean audit trail in your cloud’s IAM logs. There is no good reason to use long-lived cloud credentials in CI in 2026.
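The AWS side of this can itself be managed as code. The following is a minimal sketch of the trust relationship. It assumes the GitHub OIDC provider is already registered in the account (hence the data source lookup) and reuses company/infra-prod from the GCP example as the repository. The role name ci-deploy is hypothetical, and no permissions policy is attached here.

```hcl
# Look up the GitHub OIDC provider already registered in IAM.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only jobs from one repository and one branch may assume the role.
data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    # aud: the audience the official AWS credentials action requests by default.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # sub: pin the repository and branch (hypothetical repo name).
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:company/infra-prod:ref:refs/heads/main"]
    }
  }
}

# Hypothetical role name; attach permissions policies separately.
resource "aws_iam_role" "ci_deploy" {
  name               = "ci-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

Using StringEquals on sub rather than a wildcard match is the conservative choice: a pattern like repo:company/* would let any repository in the organization assume the role.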
What CI Doesn’t Give You for Free
Native CI/CD solves the cost problem. It does not, by itself, solve every operational problem that commercial TACOs address. There are three real gaps worth knowing about:
State locking and race conditions. Standard CI/CD systems are designed for concurrent runs because that’s what application code wants. Infrastructure code wants the opposite. If two PRs merge at the same time and both trigger tofu apply, you have two concurrent processes racing to mutate the same state file. With GitLab’s HTTP backend or an external lock backend like DynamoDB, the lock will prevent corruption but one job will fail with a confusing error. Without it, you get state corruption. You need some queuing logic, either custom or via an orchestrator.
PR plan commentary. TACOs post the output of terraform plan directly into the PR so reviewers can see what’s about to change before merging. In raw CI/CD this requires a third-party action (terraform-plan-pr-commenter and similar), parsing of the CLI output, handling of PR comment character limits, and securely passing the binary plan file as a workflow artifact from the plan stage to the apply stage. None of this is hard, but it’s a real chunk of YAML you have to maintain.
Cost estimation. TACOs include built-in cost estimation on every plan. Adding this to your own pipeline means picking up a third-party FinOps or IaC cost-analysis tool (there are several worth comparing), running it against your plan output, parsing the JSON, comparing against budget thresholds, and posting deltas into PRs. None of that is hard, but it’s another bit of integration to own.
You can build all of this yourself. Plenty of teams do. The question is whether maintaining the bash and YAML is cheaper than using an open-source orchestrator designed for exactly this problem.
Open-Source Tools to Layer In
None of these are drop-in replacements for HCP Terraform or Spacelift. They solve specific problems CI/CD doesn’t handle on its own, and you compose them based on what’s actually missing from your setup.
| Tool | What It Solves | Best For |
| --- | --- | --- |
| Atlantis | PR-based workflow automation, plan/apply via PR comments, PR-level locking | Teams that want TACO-style PR workflow but on their own server |
| Digger | Same PR workflow + locking, but the IaC actually runs inside your existing CI runners | Teams with secretless OIDC pipelines who don’t want to maintain a separate server |
| Terramate | Multi-stack monorepo orchestration, git-based change detection, parallel execution | Teams whose Terraform has grown into hundreds of stacks |
Atlantis is the original PR-automation tool, accepted into the CNCF Sandbox in June 2024. It deploys as a Go binary or container, listens for VCS webhooks, and runs Terraform on its own server. The architecture is showing its age. It’s stateful and single-threaded, granting it persistent privileged cloud access creates a high-value target, and maintenance velocity has slowed. If you’re already running it and it works, fine. For new setups, the case for Digger is usually stronger.
Digger is a thinner orchestration layer. It coordinates Terraform jobs but runs them inside your existing GitHub Actions or GitLab runners, using OIDC for cloud authentication. The orchestrator backend itself never sees state, plan output, or cloud credentials. This is the right pattern if you’ve already built secretless OIDC pipelines and want PR-workflow automation without introducing another long-lived privileged component.
Terramate solves a different problem: scaling Terraform across many stacks in a monorepo. It parses your Git history to determine which stacks changed, then runs plan and apply only on those, in parallel. For a repo with 200 stacks and a PR that touches one, you skip the 199 unnecessary plans. It also has a code-generation system that reduces HCL boilerplate. Terramate Cloud adds dashboards and drift detection without requiring access to cloud credentials. If your IaC repo has gotten unwieldy, Terramate is the tool for it. It’s a complement to Atlantis or Digger, not a substitute.
The Recommendation
The full picture for escaping commercial TACOs in 2026:
- State: GitLab’s native HTTP backend if you’re on GitLab. S3 + DynamoDB (or OpenTofu state encryption + S3) if you’re on GitHub.
- Auth: OIDC for AWS, Workload Identity Federation for GCP. Never long-lived secrets.
- PR workflow: Digger if you want PR automation that runs inside your existing CI. Atlantis if you’re already running it. Skip this layer entirely if your team is small enough that PRs serialize naturally.
- Stack management: Terramate if you have a large monorepo. Otherwise, not needed.
- Cost estimation: Pick a third-party FinOps or IaC cost-analysis tool and wire it into your plan stage.
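For the GitHub-side state option in the list above, a minimal S3 backend sketch might look like the following. The bucket, key, and table names are hypothetical placeholders, and both the bucket and the lock table are assumed to exist already.

```hcl
terraform {
  backend "s3" {
    # Hypothetical names; replace with your own bucket and state path.
    bucket = "example-tofu-state"
    key    = "infra-prod/terraform.tfstate"
    region = "us-east-1"

    # Server-side encryption for the state object at rest.
    encrypt = true

    # Lock table; it needs a string partition key named LockID.
    dynamodb_table = "example-tofu-locks"
  }
}
```

The lock table only prevents two applies from writing at once. It does not queue them, so the serialization concern described earlier still needs handling in the pipeline.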
The total monetary cost of this stack is the price of your existing CI/CD minutes, which you’re already paying. The total time cost is on the order of one to two weeks of platform-engineering time to set up properly, plus ongoing maintenance proportional to how much you customize.
For most organizations under 300 engineers, that’s cheaper than HCP Terraform Standard or Premium. For larger organizations, the calculus depends on how much custom platform work you’re willing to absorb versus how much you want a vendor to handle.
This wraps the series. Four posts in: OpenTofu as the no-regrets default engine, the scenarios where cloud-native tools beat Terraform entirely, the HCP pricing model that’s pushing teams to find alternatives, and now the CI-native path that lets you skip commercial orchestration. The throughline is the same as every post in this blog about platform engineering: there isn’t a single open-source tool that drops in for HCP Terraform or Spacelift. You’re assembling a stack from focused pieces (state backend + auth + maybe PR automation + maybe stack management), accepting some operational tax in exchange for not paying the SaaS premium. For most teams under 300 engineers, that tradeoff is worth it.
Sources
- GitLab-managed Terraform/OpenTofu state — GitLab Docs
- How to Manage Terraform State with GitLab — Spacelift
- Using Terraform to connect GitHub Actions and AWS with OIDC — Thiago Salvatore
- Deploy Terraform resources to AWS using GitHub Actions via OIDC
- Configure Workload Identity Federation with deployment pipelines — GCP Docs
- Terraform Deployment to GCP Using GitHub Actions and Workload Identity Federation
- Atlantis vs. Terraform Cloud / Terraform Enterprise — Spacelift
- Digger and Atlantis: key differences
- Terramate: Turn Your IaC into a Lightning-Fast Platform
- How to Implement Cost Checks in Terraform CI/CD Pipelines — OneUptime
- Terraform Plan PR Commenter (GitHub Action)
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
HCP Terraform's Per-Resource Pricing Is a Trap
The first post in this series argued OpenTofu is the no-regrets default for new infrastructure. The previous post mapped out when to skip cloud-agnostic IaC entirely. This one is about what happens to organizations that picked Terraform years ago, built their orchestration around HCP Terraform (formerly Terraform Cloud), and are now opening renewal quotes that have doubled or tripled year-over-year.
The short version: HashiCorp’s 2024 pivot to Resource Under Management (RUM) billing penalizes the architectural patterns the DevOps community spent a decade adopting. Modular code, ephemeral environments, and granular resources are all things you were supposed to do with Terraform. They now cost real money under the new pricing model. And the legacy free tier that grandfathered teams into a more sustainable cost structure hit end-of-life on March 31, 2026.
If you’re still on HCP Terraform in 2026, you need to understand the math.
How the New Pricing Works
The 2024 RUM model bills based on the peak number of resources tracked in your terraform.tfstate files, measured hourly. The Free tier covers up to 500 resources with a single concurrent run. Above that, you’re on Pay-As-You-Go tiers:
| Tier | Per-resource cost | Concurrency | What you get |
| --- | --- | --- | --- |
| Free | $0 (first 500) | 1 | Basic VCS, remote state |
| Essentials | ~$0.10/month | 1 | Basic provisioning, no SSO |
| Standard | ~$0.47/month | 3 | Up to 5 policy checks, cost estimation, limited RBAC |
| Premium | ~$0.99/month | 10 | Full governance, unlimited policies, SSO, audit logs |
On paper, $0.47 per resource per month looks negligible. The math goes sideways quickly because of three things.
Why “Resources” Is a Footgun
1. Granularity inflation. A single logical Terraform module produces dozens of underlying resources. An AWS VPC module isn’t one billable resource. It’s the VPC plus every subnet, every route table, every route table association, every IAM policy attachment, every security group rule, every DNS record. A widely-shared Reddit post by user notoriousbpg describes a team whose HCP Terraform bill was about to jump from $0 to over $15,000 a year, because 80% of the resources under management were GraphQL operation mappings to data sources, while the actual AWS infrastructure they cared about cost only $8,000. They were paying more for orchestration than for the infrastructure being orchestrated.
2. Idle workspaces. RUM billing doesn’t distinguish between active and inactive infrastructure. The proof-of-concept workspace someone spun up last quarter and never destroyed is still on your bill. The staging environment that was deprecated in favor of ephemeral environments is still on your bill. Industry telemetry suggests 30–40% of an average organization’s RUM cost is for completely idle infrastructure nobody has bothered to terraform destroy.
3. Hourly peak billing on ephemeral resources. HCP Terraform bills based on peak hourly resource count. If your integration test pipeline spins up infrastructure that exists for five minutes and is then torn down, you’re billed as if it existed for the full hour. This is a direct tax on the modern GitOps workflow patterns Terraform itself spent years promoting. The more ephemeral environments you use, the more punitive the billing becomes.
The compounding effect is severe. Another account describes cloning a 600-resource production workspace to create a pre-production environment. The resource count doubles to 1,200. The annual cost goes from ~$122 to ~$858, a 7x increase for what’s architecturally a trivial change. Multiply that across every environment, every test fixture, every modular abstraction, and the renewal quote stops being theoretical.
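The 7x figure falls out of the free allowance: doubling the resource count takes the billable count from 100 to 700. The sketch below reproduces the arithmetic under two assumptions that the post does not state, namely that the first 500 resources stay free on the paid tier and that the rate is a flat $0.00014 per resource-hour (roughly the Essentials price in the table). Under those assumptions the results are about $122.64 and $858.48 per year.

```hcl
# Sketch only: the rate and the free-allowance rule are assumptions.
locals {
  free_resources  = 500     # free allowance from the tier table
  hourly_rate_usd = 0.00014 # assumed; roughly $0.10 per resource-month
  hours_per_year  = 8760

  resources_before = 600  # production workspace only
  resources_after  = 1200 # production plus the cloned pre-production workspace

  # Only resources above the free allowance are billable.
  billable_before = max(0, local.resources_before - local.free_resources)
  billable_after  = max(0, local.resources_after - local.free_resources)

  annual_before_usd = local.billable_before * local.hourly_rate_usd * local.hours_per_year
  annual_after_usd  = local.billable_after * local.hourly_rate_usd * local.hours_per_year
}

output "annual_before_usd" {
  value = local.annual_before_usd
}

output "annual_after_usd" {
  value = local.annual_after_usd
}

output "increase_factor" {
  value = local.annual_after_usd / local.annual_before_usd
}
```

Because the configuration declares no providers or resources, tofu apply on an empty directory containing this file just prints the three outputs.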
The Alternative TACOs
Once organizations work through the RUM math and realize the bill is structurally unsustainable, the obvious move is to look at alternative orchestration platforms. The serious contenders, with very different pricing models:
| Platform | Pricing Model | Entry / Mid-Tier Cost | What It Does Differently |
| --- | --- | --- | --- |
| Spacelift | Resources + runs + seats | $1,500–$3,500/mo | Multi-tool (Pulumi, K8s manifests, Terragrunt). OPA policies. Custom runners. Cross-stack dependencies. |
| env0 | Per-user | ~$50/user/mo | Predictable user-based pricing. Strong TTL/ephemeral environment story. |
| Scalr | Per-user | ~$50/user/mo | Familiar Terraform Cloud UI replacement. Lower entry price than Spacelift. |
| ControlMonkey | Fixed plan (users + assets) | $800/mo (Startup: 10 users, 5k assets, 500 deploys) | One-click Terraform import, automatic drift remediation, daily cloud-config backups, built-in compliance. |
Spacelift is the choice for complex platform engineering teams. It supports Terraform, OpenTofu, Pulumi, Terragrunt, and Kubernetes manifests in one platform, handles cross-stack dependencies, and bakes OPA policy enforcement into the runtime. The catch is that its pricing still factors in managed resources, so the bill scales with infrastructure size, just less aggressively than HCP.
env0 and Scalr both flipped to user-based pricing specifically as a response to RUM. A 15-engineer team managing 3,000 resources pays roughly the same on env0 as a 15-engineer team managing 500. The price is bounded by headcount, not infrastructure complexity. This is the right model for teams whose resource counts have ballooned because they followed the “do everything as code” advice and now have hundreds of granular Terraform-managed entities they don’t want to pay per-unit fees on.
When to Pay for Any Commercial TACO
The harder question is whether the commercial orchestration layer is worth its multi-thousand-dollar monthly bill at all. The features TACOs sell (state locking, PR-level plan output, policy enforcement, drift detection, audit logging) are all things you can build into your own CI/CD pipeline. The question is whether building and maintaining that pipeline is cheaper than paying the SaaS fee.
For most teams under ~50 engineers, the commercial layer isn’t worth it. The SaaS fee buys polish and convenience, but the underlying capabilities are available in GitLab’s native state management or in GitHub Actions with the right open-source orchestrator. For larger teams, the calculus shifts: the cost of a dedicated platform engineer maintaining a custom CI/CD pipeline starts to approach the cost of a commercial license, and the operational predictability of a managed platform becomes valuable.
But the days of HCP Terraform being the obvious default for everyone above the free tier are over. The RUM model made the math too punishing for too many real-world architectures.
The next and final post in this series gets into the actual mechanics of running Terraform/OpenTofu inside your existing CI/CD: GitLab’s native state backend, GitHub Actions with OIDC/Workload Identity Federation for secretless deploys, and the open-source orchestrators (Atlantis, Digger, Terramate) that close the gap between raw YAML and a real platform.
Sources
- Terraform Cloud / Enterprise Pricing — Tiers Overview 2026 — Spacelift
- Terraform Cloud Pricing Guide: Tiers, Costs, and Optimization Tips — ControlMonkey
- 10 Best Terraform Cloud Alternatives & Competitors In 2026 — ControlMonkey
- Continuing HCP Terraform’s enhanced free tier experience — HashiCorp
- Terraform Cloud Pricing Explained: Resource-Based Guide (2026) — Firefly
- Spacelift Software Pricing & Plans 2026 — Vendr
- Terraform Cloud Pricing: A Complete Guide (2026) — env0
When You Should Skip Terraform Entirely
The last post in this series made the case that OpenTofu is the no-regrets default for new infrastructure projects. That’s true for the broad case of cloud-agnostic or multi-cloud setups where HCL parity, provider breadth, and a Linux Foundation governance model matter.
It’s also not the whole story. There are at least three common scenarios where the right answer in 2026 isn’t Terraform or OpenTofu. It’s the cloud-native tool the hyperscaler ships with its platform. AWS has CloudFormation and the CDK. Azure has Bicep. GCP has Config Connector. Each one is technically superior to Terraform inside its own ecosystem, and each one removes a category of operational pain that Terraform inflicts.
If you reflexively reach for Terraform every time, you’re probably overpaying in complexity for a multi-cloud option you’ll never exercise.
The Small AWS-Native Startup: Use CDK
If your engineering team is small, you’re shipping a SaaS product, and you’re 100% on AWS, you should probably ignore Terraform entirely. The right tool is the AWS Cloud Development Kit, layered on top of CloudFormation.
The fundamental win is that CloudFormation eliminates state management. There is no terraform.tfstate file. No S3 bucket to provision. No DynamoDB lock table. No state-encryption configuration to figure out. The state lives in the AWS control plane, AWS manages locking and consistency, and your CI pipeline doesn’t need to know about any of that. For a small team, that’s a meaningful operational tax you don’t pay.
The CDK is the part that makes this pleasant. It lets you define infrastructure in TypeScript, Python, Java, C#, or Go, the languages your application engineers already know. There’s no HCL learning curve, no Sentinel policy DSL, no jq-in-bash to manipulate plan output. You write code, the CDK synthesizes CloudFormation templates, CloudFormation provisions the infrastructure.
The objection people raise is “what if you go multi-cloud later?” In practice, most SaaS startups don’t. They get acquired, they pivot, or they grow large enough to have a dedicated platform team that does the migration deliberately. Optimizing for a hypothetical multi-cloud future that 90% of teams will never need is the textbook definition of premature abstraction. If you’re an AWS-native startup with fewer than 50 engineers and no concrete plans to leave AWS, the cost of running Terraform-as-multi-cloud-insurance is higher than the cost of a future migration that probably won’t happen.
The Azure Enterprise: Bicep, Unless You Need More
For organizations heavily invested in Microsoft’s stack (Azure for compute, Azure DevOps for CI/CD, and Entra ID for identity), Bicep is the technically correct choice for most workloads.
Bicep is Azure’s domain-specific language for infrastructure, designed as a replacement for the verbose ARM JSON templates everyone hated. Like CloudFormation, it’s stateless. You submit a desired-state Bicep file to the ARM control plane and ARM reconciles. No state file, no remote backend, no risk of corruption. Authentication is whatever RBAC permissions the deploying identity already has, with no provider credential configuration required.
Bicep also gets day-zero feature support for new Azure capabilities. When Microsoft ships a new service, you can use it in Bicep the same day. The Terraform AzureRM provider has historically lagged by weeks or months, occasionally longer.
The catch is scope. Bicep manages Azure. That’s the entire surface area. Larger organizations tend to need management of things outside Azure too: GitHub repositories and branch protection, Entra ID groups, Datadog monitors, PagerDuty escalation policies, whatever SaaS services your platform touches. Bicep has no answer for any of that.
That leaves two paths. The first is a hybrid: Bicep for Azure, separate tools for everything else, accept the cost of context-switching and the inability to express cross-domain dependencies in a single deployment. The second is Terraform or OpenTofu for everything, accepting the heavier operational tax of stateful IaC, in exchange for one tool that can do all of it. Neither is wrong; they’re different tradeoffs against the same constraint.
The decision rule: if you’re managing only Azure resources, use Bicep. If you have cross-domain provisioning needs and you’d rather not maintain two parallel IaC stacks, Terraform (or OpenTofu) earns its keep.
The GCP/Kubernetes Shop: Hybrid by Design
For organizations heavily committed to Google Cloud and running most workloads on GKE, the right architecture isn’t either/or. It’s a hybrid that uses Terraform for the foundation and Config Connector for the application layer.
Config Connector is a GCP-shipped Kubernetes add-on. It lets you manage GCP resources (Cloud SQL instances, Pub/Sub topics, storage buckets, service accounts) as standard Kubernetes Custom Resources. You write a YAML manifest, you kubectl apply, and a controller in the cluster reconciles the real-world GCP resource to match.
The differentiator is continuous reconciliation. Terraform is episodic: it checks state at plan and apply time, and the rest of the time your infrastructure is unmonitored. If someone clicks around in the GCP console and manually changes a setting, Terraform won’t notice until the next pipeline run. Config Connector runs a controller loop that polls continuously. Manual drift gets reverted in real time.
The right architectural boundary:
- Platform layer (Terraform/OpenTofu): VPCs, subnets, foundational IAM, the GKE clusters themselves. These are slow-moving, security-critical, and you want a deliberate pipeline approval flow for them.
- Application layer (Config Connector): Application-specific buckets, databases, service accounts, Pub/Sub topics. Application teams own these via the same YAML manifests they use for their pods, with the same GitOps workflow they already understand.
This pattern gives platform teams strict guardrails on the foundation while letting application developers self-serve the resources their services need, without filing a Terraform PR every time they want a new bucket.
The Decision Rule
The honest version of all of this: Terraform/OpenTofu is the right answer when you need cross-domain or cross-cloud governance. For everything else, the cloud-native tool is usually less work, more current with the platform, and avoids the operational tax of state management.
A reasonable decision tree:
- Single-cloud, small team, AWS: AWS CDK + CloudFormation.
- Single-cloud, single-domain, Azure: Bicep.
- GCP with heavy Kubernetes use: Hybrid — Terraform/OpenTofu for foundation, Config Connector for application resources.
- Multi-cloud, or cross-domain platform engineering (GitHub + cloud + identity + monitoring): OpenTofu.
The mistake I think most teams are making is to default to Terraform because it’s the tool the senior engineer learned in their last job. The platform-engineering pitch, “we’ll standardize on Terraform so we can move to any cloud later,” is correct in theory but almost never exercised in practice. If your team isn’t using the cross-cloud capability today, you’re paying for an insurance policy you’ll never collect on.
Next post in this series digs into the other side of that calculation: what HCP Terraform actually costs in 2026, and why even teams that need cloud-agnostic IaC are looking for the exit from the commercial orchestration platforms.
Sources
- Bicep Vs Terraform: Choosing The Best IaC Tool For Azure — Synextra
- Terraform vs Bicep vs ARM Templates 2026 Compared — Exodata
- Comparing Terraform and Bicep — Microsoft Learn
- Terraform vs Bicep vs ARM: Lessons from the Trenches — Vaibhav Gujral
- How to Use the GCP Config Connector with Terraform — OneUptime
- How Config Connector compares for infrastructure management — Google Cloud Blog
- Are Terraform’s days numbered? — Alistair Grew
OpenTofu Is the No-Regrets Default for 2026 Infrastructure
HashiCorp’s adoption of the Business Source License in August 2023 was a defensive business decision. Companies like Spacelift, env0, and Scalr were building paid commercial platforms on top of MPL-licensed Terraform, capturing significant revenue from an ecosystem HashiCorp was largely funding. The same pattern played out with Redis Labs facing AWS ElastiCache, Elastic facing Amazon OpenSearch, and MongoDB facing the cloud hyperscalers before its move to the SSPL. The BSL is a rational corporate play: keep the core open enough to preserve mindshare, restrict the terms enough that pure resellers can’t extract value without engaging commercially. From the standpoint of a publicly traded company with a board to answer to, it made sense.
But it also broke a tacit contract. HashiCorp had spent a decade positioning Terraform as infrastructure’s git. Neutral, ubiquitous, irreplaceable. A license that lets a single vendor change the terms when the shareholder math demands it is not neutral, and a large portion of the community decided they weren’t comfortable with that risk. The community forked the last MPL-licensed Terraform release, and the Linux Foundation adopted the fork as OpenTofu. Two years later, OpenTofu has crossed 10 million downloads, holds HCL parity with Terraform, supports the same provider ecosystem (AWS, Azure, GCP, Kubernetes, everything), and ships features Terraform itself doesn’t have.
For greenfield infrastructure in 2026, OpenTofu is the no-regrets default. For existing Terraform codebases, the migration is mostly a binary swap. The reasons to still pay for Terraform are mostly inertia. Let me explain.
The Migration Is Mostly Free
The technical case for “stay on Terraform” essentially doesn’t exist. OpenTofu reads the same HCL. It produces the same execution plans. It maintains the same state file format. It interfaces with the same providers, including the ones HashiCorp wrote, because the provider API was never the part HashiCorp tried to lock down.
To migrate a non-trivial Terraform codebase to OpenTofu, you do roughly this:
- Swap terraform for tofu in your CI binary install step.
- Update any pipeline scripts that hardcoded the binary name.
- Run tofu init -migrate-state once.
- Run tofu plan and confirm it produces an empty diff against the existing state.
There are edge cases, like modules pinned to specific Terraform-version constraints or providers that gated features on the HashiCorp-only registry. But for the vast majority of codebases, the migration is a one-afternoon job, including the PR review and the team announcement.
What you get in exchange is governance under the Linux Foundation, an active multi-vendor contributor base, no future license surprises, and a genuinely useful feature Terraform currently lacks: native state encryption.
State Encryption Is the Real Reason
Terraform state files have a property nobody enjoys discussing. They contain everything sensitive about your infrastructure, and they store it in plaintext.
That’s not a misconfiguration. That’s the design. The terraform.tfstate JSON file holds resource IDs, ARNs, network topology, credentials surfaced as outputs, RDS connection strings, and any sensitive value a module decided to track. When you use S3 or Azure Blob as a remote backend, you get encryption at rest, meaning the cloud provider’s storage layer is encrypted. The state itself, the thing your CI pipeline downloads and uploads on every run, is plaintext JSON. Anyone with read access to the bucket (your CI runner, your laptop, anything assuming the role) gets the cleartext.
OpenTofu solves this with native, client-side state encryption introduced as a first-class feature. The state is encrypted by the engine before it leaves the machine. The remote backend never sees plaintext at all. The configuration looks like this:
    terraform {
      encryption {
        key_provider "aws_kms" "primary" {
          kms_key_id = "arn:aws:kms:us-east-1:..."
          region     = "us-east-1"
          key_spec   = "AES_256"
        }
        method "aes_gcm" "primary" {
          keys = key_provider.aws_kms.primary
        }
        state {
          method = method.aes_gcm.primary
        }
        plan {
          method = method.aes_gcm.primary
        }
      }
    }
Three pieces. A key provider (AWS KMS, GCP KMS, OpenBao, or a local passphrase via pbkdf2), an encryption method (AES-GCM is the standard pick), and explicit targets for state, plan, or both.
The migration path from existing plaintext state requires a fallback block. OpenTofu refuses to read plaintext once encryption is enabled, which is the right default, but it means you need to tell it “this one time, read the legacy state and re-encrypt it.” After one successful apply, you remove the fallback and you’re done.
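A minimal sketch of that one-time migration, assuming the same AWS KMS key provider as in the configuration above: the unencrypted method is declared only so that the fallback block can read the legacy plaintext state. The label migrate is an arbitrary name chosen here, and the key ARN stays elided as in the original.

```hcl
terraform {
  encryption {
    # Temporary: allows reading the existing plaintext state during migration.
    method "unencrypted" "migrate" {}

    key_provider "aws_kms" "primary" {
      kms_key_id = "arn:aws:kms:us-east-1:..." # elided; supply your own key ARN
      region     = "us-east-1"
      key_spec   = "AES_256"
    }

    method "aes_gcm" "primary" {
      keys = key_provider.aws_kms.primary
    }

    state {
      # New writes are always encrypted with AES-GCM.
      method = method.aes_gcm.primary

      # Reads fall back to plaintext if decryption fails.
      # Remove this block after the first successful apply.
      fallback {
        method = method.unencrypted.migrate
      }
    }
  }
}
```

Once the fallback block and the unencrypted method are removed, any remaining plaintext state would fail to load, which is the enforcement behavior described above.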
Terraform doesn’t have this. HashiCorp’s official answer is still “use a backend that encrypts at rest and audit your IAM policies carefully.” Which is fine, until your CI logs the state diff into a third-party observability tool, or someone runs terraform show over a Slack screenshare, or an attacker gets a transient role to your backend bucket. The threat model OpenTofu’s encryption closes is the threat model that matters.
The AI Wrinkle
There’s a meta-argument unfolding alongside all of this: AI is making the choice of execution engine less important.
Industry telemetry says 71% of cloud teams have seen an exponential increase in IaC volume from generative AI. The thing AI is generating, in most cases, is HCL, which is the lingua franca for both Terraform and OpenTofu. As the volume of AI-authored infrastructure grows, the role of HCL shifts from “the language engineers write” toward “the intermediate representation an agent emits.” Manual HCL authoring is on track to become a niche skill in the same way hand-tuning compiler output is a niche skill.
In that world, the execution engine is plumbing. The valuable layer is everything around it: state management, drift detection, policy enforcement, cost guardrails, audit trails. Which is exactly the layer where vendor lock-in does the most damage and where open governance matters most. The AI argument doesn’t undercut the OpenTofu case. It reinforces it.
What To Do
If you’re starting a new infrastructure project, use OpenTofu. There is no good reason to start a 2026 greenfield project on a single-vendor BSL-licensed engine when the Linux Foundation-governed open-source alternative is right there, with full HCL parity, the same provider ecosystem, and features Terraform doesn’t have.
If you have an existing Terraform codebase, schedule the migration. It’s a one-afternoon job per repo. Get state encryption while you’re at it.
If you’re heavily integrated with HCP Terraform, this is the harder case. The migration off the proprietary HCP features (Sentinel policies, the registry, the integrated dashboards) is real work. But it’s also the case where you have the most to lose. HCP Terraform’s pricing model has gotten aggressively worse, and OpenTofu’s existence means you have actual leverage in the next renewal conversation. The next post in this series digs into exactly what HCP pricing looks like in 2026 and why so many organizations are getting six-figure renewal quotes for infrastructure they were paying $20K for two years ago.
This is the first of a four-part series on the 2026 IaC landscape. Up next: cloud-native vs cloud-agnostic tooling, and when to use AWS CDK, Bicep, or Config Connector instead of Terraform/OpenTofu at all.
Sources
- 2026 IaC Predictions: What Cloud Leaders Must Prepare For ControlMonkey
- Terraform vs OpenTofu in 2026: Should You Stay or Switch?
- Terraform or OpenTofu in 2026? Here’s What I Actually Think Jae Wook Kim
- OpenTofu vs Terraform in 2026: Is the Fork Finally Worth It? Mechcloud Academy
- OpenTofu vs. Terraform: A Practical Guide for Enterprise Infrastructure Teams env0
- State and Plan Encryption OpenTofu docs
- How to Use OpenTofu State Encryption OneUptime
- State Encryption with OpenTofu Ned in the Cloud
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
/ DevOps / Infrastructure / Opentofu / Terraform
-
A Dotfiles Manager That Snapshots Every Change
Managing dotfiles in 2026 is a solved problem in the same way that managing your own backups is a solved problem: there are five tools for it, all of them work, all of them require you to set up some plumbing first, and once you’re set up you still don’t have a great answer to “I just broke my shell config, get me back to yesterday.”
The conventional answer is some combination of: a git repo for your ~/.zshrc and friends, a symlink script (or stow, or chezmoi, or yadm), and the discipline to remember to commit after every change. The setup is a one-time hassle. The “wait, what did I change?” recovery story is not great. And if you want to sync across machines, you’ve now got opinions about remote repos, SSH keys on a fresh box, and which order things have to happen in.
I wanted something different: not a configuration framework, but a record of every change to the files I care about, in a place I can roll back from, with the lowest possible setup cost.
That’s what dfm is.
What It Does
dfm is a single static Go binary. You point it at the files you want to track (~/.zshrc, anything under ~/.config/, whatever), and every time one of them changes it takes a content-addressed snapshot. The snapshots live on disk in ~/.local/share/dotfiles/backups/. A small state database (SQLite locally, or libSQL via Turso if you want cross-machine sync) records which file maps to which snapshot at which point in time.
You can roll back. You can diff against an old snapshot. You can see when you last touched a file. And because every snapshot is content-addressed, you never re-store the same bytes twice: switching between two themes in ~/.zshrc ten times costs the size of two configs, not ten.
The other half is the backup story. dfm init walks you through cloning (or creating, via gh) a private GitHub repo that mirrors your tracked files plus their history. The point isn’t to make you adopt a new git workflow. It’s that pulling your config onto a fresh machine should be one command, and recovering from rm -rf should never have a “well, hopefully my last commit was recent” caveat.
Why Setup Is the Hard Part
The reason people don’t audit their dotfiles is the same reason people don’t back up their laptops: the setup is annoying, and the payoff is theoretical until it isn’t.
dfm init is a six-step interactive wizard. It detects a TURSO_DATABASE_URL env var if you’ve got one, offers sensible defaults for everything else, lets you opt in to tracking ~/.zshrc immediately, and writes a single config file with the right permissions. Re-run it on an existing config and it pre-fills every prompt with your current value, so the cost of changing your mind later is also low. Passing --yes accepts every default for scripted setup.
If that sounds boring, that’s the point. Boring is what makes a tool actually get used.
The AI Bit
There’s an optional AI integration.
dfm suggest <file> asks a local AI CLI (Claude Code by default, configurable) to propose an improvement to one of your tracked files, returns the proposal as a unified diff, and stores it as a pending suggestion. dfm apply <id> reviews the diff and applies it, taking a fresh snapshot first, so you can roll back if the suggestion turns out to be wrong.
I’m excited to try this feature out, because I’m sure there is something I’m doing wrong. “Look at my ~/.zshrc and tell me what I could clean up” is a useful feature that doesn’t require me to copy and paste anything or grant read or write access to my entire home directory.
Where to Get It
github.com/llbbl/dotfiles-manager. Pre-built binaries for darwin and linux on arm64/amd64. Current version, as of writing, is v1.4.0.
If you’ve been meaning to actually back up your dotfiles and the friction has stopped you, this is the post where I tell you the friction is solvable.
/ DevOps / AI / Programming / Go / Dotfiles
-
Your AI Coding Agent Can Read Every Secret on Your Machine
Every developer running an AI coding agent has handed that agent the keys to their machine. Not metaphorically. Literally. The agent runs as your user. It can read every file you can read, execute every command you can execute, and hit every API your stored credentials authorize.
For most workflows, that’s the point. You want the agent to read your code, modify your project, ship your work. But there’s a quieter implication: the agent can also read your .env files. It can invoke your secret-management tooling. It can grep for API_KEY= across your home directory. And nothing in the agent stack says “wait, you didn’t ask for this.”
Same-UID isolation isn’t isolation. It’s the absence of isolation labeled politely.
The usual answer to “keep secrets safe from your coding agent” is: don’t store them where the agent can find them. Use a cloud secret manager. Rotate aggressively. These are good practices, and for local development, they’re often impractical. The agent is going to encounter secrets whether or not your security-best-practices doc approves.
So over the last week, I built an audit subsystem into lsm, my Local Secrets Manager. The whole thing is designed to answer one forensic question: did anything weird touch my secrets last night?
The Threat Model
A defense without a threat model is theater, so let me be specific.
The threat isn’t a sophisticated remote attacker. lsm is public, open-source code. The threat isn’t a buggy lsm either; bugs happen, and the user can read the source.
The threat is the agent layer running adjacent to lsm. Coding agents have legitimate access to a wide swath of your filesystem. They’re imperfect at intent inference. They sometimes get prompt-injected. They sometimes run in the background while you’re asleep. When an agent calls lsm get prod DATABASE_URL, the action is indistinguishable from you doing the same thing. The audit log’s job is to make those calls retrospectively distinguishable.
A secondary threat is an agent covering its tracks. If something reads a secret and then edits the audit log to erase the evidence, the log is worse than useless.
What Got Built
The audit subsystem records every access as a structured event: a sequence number, a timestamp, the action, the app and environment, an Actor block describing the calling process, and two cryptographic fields linking each event to the previous one.
The Actor block was the interesting design problem. It captures parent process ID, parent process name, TTY device path (or empty if there’s no terminal), current working directory, an agent marker derived from environment variables that tools like Claude Code, Cursor, Aider, and Continue set, and the calling user ID. Every field is captured every time. No omitempty. UID zero is a real, meaningful value, and silently dropping it would be a footgun.
Events land in a hash-chained JSONL file at ~/.lsm/audit.jsonl. Each row carries the SHA-256 of the previous row plus its own body. If anyone edits, inserts, or deletes a row in the middle, the next row’s prev no longer matches and lsm audit verify surfaces the break.
The chain doesn’t catch tail truncation. If you chop off the end of the file, what’s left is internally consistent. A sidecar file storing the last expected hash is the obvious fix, and I deliberately rejected it. lsm is public code. Any local attacker who knows about the sidecar can rewrite both files in lockstep. Tail-truncation detection is deferred to the off-machine path: when events ship to a remote stack, the last hash naturally lives somewhere the local attacker doesn’t control.
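Here is a minimal Python sketch of the hash-chain idea, including the tail-truncation blind spot. It is not lsm’s implementation: the row layout, the all-zero genesis value, and the exact hash input (previous hash concatenated with canonical JSON of the body) are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "prev" value for the first row


def row_hash(prev: str, body: dict) -> str:
    """Hash the previous row's hash together with this row's canonical body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + canonical).encode("utf-8")).hexdigest()


def append_event(chain: list, body: dict) -> dict:
    """Append a row whose prev points at the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    row = {"prev": prev, "body": body, "hash": row_hash(prev, body)}
    chain.append(row)
    return row


def verify(chain: list):
    """Return the index of the first broken row, or None if the chain is intact."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev"] != prev or row["hash"] != row_hash(row["prev"], row["body"]):
            return i
        prev = row["hash"]
    return None


chain = []
for seq, action in enumerate(["get", "set", "get"], start=1):
    append_event(chain, {"seq": seq, "action": action})

intact = verify(chain)          # untouched chain verifies cleanly
truncated = verify(chain[:-1])  # chopping the tail still verifies cleanly

del chain[1]                    # delete a row in the middle
broken_at = verify(chain)       # the row after the gap no longer links up
```

The `truncated` case is the reason the last hash has to live somewhere the local attacker cannot rewrite: nothing inside the remaining file records that a row is missing.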
Reading the Log
Three commands cover the read side.
lsm audit tail does what you’d expect. lsm audit show <seq> prints a single event. lsm audit query is the workhorse, with every field as a filterable dimension: --app, --env, --event, --parent-comm, --agent-marker, --tty present|absent, --since, --until. Output is JSONL when piped and columnar text when interactive.
Then there’s lsm audit suspicious, which runs four hard-coded detectors in one pass:
- Outside hours. Events whose timestamps fall outside 07:00–23:00. The 3 a.m. canary.
- Burst. More than N events from a single parent process within a sliding window. The runaway-agent canary.
- New parent_comm. Process names not seen in the prior 30 days. The “what is this new thing” canary.
- Non-interactive, no agent. No TTY, no recognized agent marker. The “what is even running this” canary.
A single event can stack reasons. A 3 a.m. burst from an unknown parent is unambiguously interesting.
The detector doesn’t learn baselines, doesn’t call out to an ML model, doesn’t require a service. High-signal patterns are obvious patterns, and obvious patterns are well-served by hard-coded predicates.
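The four detectors above are simple enough to sketch as plain predicates. This Python version is illustrative only: the event field names (`seq`, `ts`, `parent_pid`, `parent_comm`, `tty`, `agent_marker`), the burst threshold, and the window size are assumptions, not lsm’s actual schema or defaults.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def suspicious(events, burst_n=5, burst_window=timedelta(seconds=60),
               lookback=timedelta(days=30)):
    """Return {seq: [reasons]} for events that trip at least one detector.

    Events must be sorted by timestamp.
    """
    reasons = defaultdict(list)
    recent_by_parent = defaultdict(deque)  # parent_pid -> timestamps inside the window
    last_seen = {}                         # parent_comm -> most recent timestamp

    for ev in events:
        ts = ev["ts"]

        # Outside hours: before 07:00 or at/after 23:00.
        if not (7 <= ts.hour < 23):
            reasons[ev["seq"]].append("outside_hours")

        # Burst: more than burst_n events from one parent inside the sliding window.
        window = recent_by_parent[ev["parent_pid"]]
        window.append(ts)
        while ts - window[0] > burst_window:
            window.popleft()
        if len(window) > burst_n:
            reasons[ev["seq"]].append("burst")

        # New parent_comm: process name not seen within the lookback period.
        seen = last_seen.get(ev["parent_comm"])
        if seen is None or ts - seen > lookback:
            reasons[ev["seq"]].append("new_parent_comm")
        last_seen[ev["parent_comm"]] = ts

        # Non-interactive, no agent: no TTY and no recognized agent marker.
        if not ev["tty"] and not ev["agent_marker"]:
            reasons[ev["seq"]].append("non_interactive_no_agent")

    return dict(reasons)


events = [
    {"seq": 1, "ts": datetime(2026, 1, 10, 10, 0), "parent_pid": 100,
     "parent_comm": "zsh", "tty": "/dev/ttys001", "agent_marker": ""},
    {"seq": 2, "ts": datetime(2026, 1, 10, 10, 5), "parent_pid": 100,
     "parent_comm": "zsh", "tty": "/dev/ttys001", "agent_marker": ""},
    {"seq": 3, "ts": datetime(2026, 1, 11, 3, 0), "parent_pid": 200,
     "parent_comm": "node", "tty": "", "agent_marker": ""},
]
flagged = suspicious(events)
```

Event 3 stacks three reasons at once, which is the “unambiguously interesting” case. One artifact of this sketch: on an empty history, every process name counts as new the first time it shows up.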
Shipping Events Off the Box
If you already run an observability stack, lsm can ship audit events over OTLP (the OpenTelemetry wire protocol). Three design choices matter here.
The local file sink is always authoritative. The remote sink is a mirror, not a replacement. An lsm operation never fails because the remote endpoint is down.
Redaction is allowlist-based. App and environment names are HMAC-hashed with a per-host salt before becoming labels. The TTY device path is dropped and replaced with a tty_present: true/false boolean. Secret values, cwd, hash, prev, and the schema version never leave the host. Secret names are replaced with key_present: true markers; the remote observer can see that a key was accessed, never which key.
Events whose name starts with audit. (chain failures, suspicious matches, sink drops) are always local. Telling a remote attacker that local integrity has been compromised is counterproductive.
What’s Still Open
The most important non-feature: no command in lsm emits events yet.
set doesn’t log. get doesn’t log. delete doesn’t log. The plumbing is complete; the calls are not wired in. Each emit site needs careful thought about which fields are appropriate, whether the event should be local-only, and how it interacts with sensitive operations. That’s the next chunk of work.
The agent-coding era is normalizing a model where AI tools have wide-ranging access to developer machines. The premise that the agent operates as a fully-trusted local user is unlikely to change soon. Managing the risk means visibility. It means being able to answer “what touched my secrets last night” with a record the agent couldn’t silently rewrite.
The code is at github.com/llbbl/lsm. The full design lives in
docs/observability.md.
/ DevOps / AI / Programming / security
-
Buying Supply Chain Security in 2026: A Vendor Map
The last post was for solo developers and people without a security budget. This one is for everyone else: the platform engineers, the security leads, and the directors who are getting pitched by four different supply chain security vendors a week and need to figure out which, if any, of them are worth signing a contract with.
The honest answer is that the vendor landscape in 2026 is overheated. Every SCA company is now also a malicious-package firewall company. Every malicious-package firewall company is also pitching AI-native remediation. The pricing pages are mostly “Contact Sales.” And underneath all of it, the actual problem these tools solve splits cleanly into three layers, and you should know which layer you’re buying.
The Three Layers
Layer 1: Update automation. Dependabot (free, GitHub-native) and Renovate (free, more configurable) generate pull requests when new versions of your dependencies are released. They don’t find vulnerabilities. They just shrink the window where you’re running outdated code. Dependabot is the right answer for most teams under 50 engineers. Renovate is what you reach for when you’re tired of triaging 80 individual PRs a week and want grouped updates with auto-merge based on community confidence signals. Neither costs anything. Both should be on.
Layer 2: Software Composition Analysis (SCA). Parses your lockfiles, matches dependencies against CVE databases, tells you what’s vulnerable. The open-source side of this is fully mature: Trivy, Grype, OWASP Dependency-Check, and OWASP Dependency-Track collectively cover most of what you’d have paid Snyk for ten years ago. Dependency-Track in particular is a serious tool. It ingests CycloneDX and SPDX SBOMs, tracks portfolio-wide risk, and integrates EPSS scoring. If you self-host it, the license bill is zero.
The thing the commercial vendors actually sell at this layer is reachability analysis. A vulnerability in a transitive dependency that you import but never actually call is technically a CVE in your inventory. Realistically it’s noise. Snyk, Endor Labs, and Mend.io all build call-graph analysis that determines whether a vulnerable code path is actually invoked by your application. Endor Labs claims their reachability reduces actionable alerts by 90 to 95%. That number is marketing, but the underlying capability is real, and it’s the single biggest differentiator between commercial SCA and the open-source stack.
Layer 3: Malicious package firewalls. This is the layer that didn’t exist five years ago. Tools like Socket, Phylum, Endor Labs, and Sonatype Repository Firewall sit between your developers and the public registries and analyze package behavior before installation. Socket evaluates 70+ behavioral indicators: does the package read OAuth tokens from disk, does it use marshal.loads to self-deobfuscate, does it inject into HTTP headers. This is the only layer that defends against zero-day malicious packages, because SCA fundamentally can’t. There’s no CVE for “this package was uploaded ten minutes ago and steals AWS keys.”
What This Actually Costs
The pricing pages tell you most of what you need to know about who each vendor is for.
| Vendor | Pricing | Who it’s for |
| --- | --- | --- |
| Dependabot | Free | Everyone on GitHub |
| Socket | Free up to 1000 scans/mo, Team $25/dev/mo, Business $50/dev/mo | Developers who want low-friction zero-day protection |
| Snyk | Free tier (100-300 tests/mo per product), Team $25/dev/mo (5-10 dev cap), Ignite ~$105/dev/mo, Enterprise custom | Teams that want SCA + SAST + IDE integration in one bundle |
| Endor Labs | Custom (free tier for small OSS teams) | Orgs drowning in CVE noise; multi-language including C/C++ and Rust |
| Mend.io | $300-$1000/dev/year | Enterprise environments that want consolidated dashboards |
| Sonatype | $6K-$150K+ in bundled tiers | Large regulated enterprises that need a centralized artifact gateway |
| Phylum | Custom enterprise | Teams that want programmatic policy via Open Policy Agent |

Two patterns stand out. Socket and Snyk are product-led growth plays with transparent per-developer pricing, predictable as you scale, accessible at the lower end. Sonatype, Mend.io, and Phylum are enterprise sales motions with significant minimums and multi-month implementation cycles. Endor Labs sits awkwardly in the middle (mid-market and enterprise deals) with credible reachability claims that are hard to replicate with open source.
The Real Cost of “Free”
The argument for going all-in on open source, Dependabot plus Trivy plus Dependency-Track plus maybe Socket’s free tier, looks compelling on the spreadsheet. The honest math is more complicated.
Running this stack at a 100-engineer organization requires somebody to maintain the Dependency-Track server, tune the rulesets to keep false positives from drowning your security team, manually triage alerts that have no reachability context, and respond to the inevitable “is this critical CVE actually exploitable in our environment?” questions from leadership. Realistic estimates put that workload around 20 to 30 hours per week — call it half an FTE of senior engineering time, which fully-loaded lands in the low six figures per year. That’s not zero, and it’s the line item that “we’ll just use open source” plans consistently leave out of the spreadsheet.
The flip side is the Endor Labs ROI pitch: 90% noise reduction means 9 fewer FTEs needed for triage in a 300-dev org, which they price at roughly $1.5M in saved salary against a five-figure license. That’s a vendor calculation, so take it with the appropriate salt. But the underlying logic, that alert noise has a real labor cost, is correct.
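The arithmetic behind both numbers is easy to check. A rough Python sketch follows; the 40-hour week and the $200K fully-loaded cost per senior engineer are my assumptions for illustration, not figures from any vendor.

```python
HOURS_PER_FTE_WEEK = 40        # assumption: a standard 40-hour work week
LOADED_COST_PER_FTE = 200_000  # assumption: fully-loaded senior engineer, USD per year


def annual_labor_cost(hours_per_week: float) -> float:
    """Convert a weekly maintenance workload into a yearly dollar figure."""
    return hours_per_week / HOURS_PER_FTE_WEEK * LOADED_COST_PER_FTE


# The "free" open-source stack: 20 to 30 hours per week of upkeep and triage.
low = annual_labor_cost(20)
high = annual_labor_cost(30)

# The vendor pitch: 9 fewer triage FTEs valued at roughly $1.5M.
implied_cost_per_fte = 1_500_000 / 9

print(f"open-source upkeep: ${low:,.0f} to ${high:,.0f} per year")
print(f"vendor-implied cost per FTE: ${implied_cost_per_fte:,.0f}")
```

Under those assumptions the self-hosted stack costs $100K to $150K a year in labor, and the vendor pitch implies roughly $167K per triage FTE. Swap in your own loaded cost to see where the break-even lands for your org.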
What I’d Actually Recommend
For a team of 5 to 50 engineers: Dependabot or Renovate on, Socket’s free tier or paid Team plan for firewall coverage, and npm audit / pip-audit / cargo-audit running in CI. Total spend: $0 to roughly $1,500/month at the high end. This is the configuration that covers 80% of the threat for a small fraction of what a Snyk or Mend contract costs.
For 50 to 300 engineers: the math starts favoring a paid SCA platform with reachability. Snyk if you also want SAST in the same tool. Endor Labs if you have a polyglot codebase (especially anything with C++ or Rust) and severe alert fatigue. Keep Socket or Phylum as a separate firewall layer. The firewall vendors are still meaningfully better at malicious-package detection than the SCA vendors who bolted it on.
For 300+ engineers in a regulated industry: you probably need Sonatype or JFrog as a centralized proxy whether you want them or not, because compliance demands a single audited path from developer to registry. Bundle it with Endor Labs or Mend for the reachability layer.
What I would not do is buy the platform pitch, the “one tool for SCA + SAST + secrets + container scanning + firewall + AI remediation.” Those bundles exist because the vendors want a bigger contract, not because the unified product is actually best-of-breed at any single thing. The companies winning each individual layer (Socket for firewalls, Endor Labs for reachability, Trivy for open-source SCA) are doing so by being focused.
Closing the Series
Four posts in: the threat model, the per-ecosystem mitigations, local isolation for the budget-constrained, and now the commercial landscape for everyone else. The unifying thesis across all of them is that supply chain security is not solved by a single tool or a single layer. It’s a stack. Lockfiles at the bottom, audit tooling above that, behavioral analysis on top, isolation as the last line of defense. The right composition depends on who you are and how much risk you can afford to absorb. If your stack right now is “we trust the registry,” you are the threat model.
Sources
- Supply Chain Security Tool Selection Framework - SoftwareSeni
- Endor Labs vs Snyk: SCA, SAST, and Containers Compared
- Malware Package Firewall: Block Threats Before They Hit Your Code
- Socket Pricing
- Introducing Socket Firewall
- Snyk Software Pricing & Plans 2026 - Vendr
- Endor Labs Pricing
- Mend.io Pricing
- Sonatype Nexus Pricing Guide 2026 - CloudRepo
- Open Source vs Commercial SCA Tools Comparison - Safeguard
- OWASP Dependency-Track
/ DevOps / security / Tooling / Supply-chain
-
Sandboxing AI Agents Without Buying Anything
The previous post (and the one before it) covered the threat model and the per-ecosystem mitigations: lockfiles, --ignore-scripts, cargo-audit, Trusted Publishing. All of that helps. None of it answers the question that keeps me up at night, which is: what happens when an AI agent on my laptop installs a malicious package, and the malicious package was the literal point of the operation?
This is the new shape of the threat. You’re not getting compromised because you typed npm install wrong. You’re getting compromised because Claude or Cursor confidently invented a package name that didn’t exist, an attacker registered it five hours ago, and the agent ran pip install hallucinated-thing on your behalf without asking. The agent has shell access. Your SSH keys are right there. Your ~/.aws/credentials file is right there. The entire premise of giving an AI agent the ability to just figure it out depends on it being able to execute untrusted code at the speed of conversation, which is also the worst possible threat model.
If you’re a solo developer, an open-source maintainer, or a startup with no budget for Socket or Endor Labs licenses (more on those next post), the answer isn’t a commercial firewall. The answer is local isolation, and the tools have gotten dramatically better in the last 18 months.
Containers as the Baseline
The minimum viable isolation in 2026 is don’t run untrusted code as your user on your host OS. The cleanest way to do that on macOS or Linux is a devcontainer, a fully described, reproducible Linux environment that VS Code, Cursor, and the Claude Code CLI all natively support. You give the agent the container as its sandbox. Project files mount in. SSH keys, AWS credentials, and the rest of your home directory don’t.
The container runtime matters. Docker Desktop on macOS is a memory pig, 3 to 4 GB resident at idle, with sluggish startup times that make iterative work miserable. OrbStack is the obvious replacement: free for personal use, native Apple Silicon, dynamically allocates memory instead of reserving fixed blocks, and benchmarks show container startup times around 0.2 seconds versus Docker Desktop’s multi-second cold starts. If Docker Desktop is eating half your RAM before you even start Claude Code, OrbStack will give you that memory back.
The thing to internalize, though, is that a container is not a security boundary by default. It’s a deployment mechanism that happens to have isolation properties when configured correctly. Misconfigured developer containers have been implicated in some of the largest crypto-industry breaches of the last few years. The pattern: a container running with privileged flags, or mounting the wrong host directory, turns into a path straight to the host. Containers help. They don’t save you from yourself.
The configuration mistakes that void the isolation:
- Mounting ~/.ssh into the container so the agent can git push. Now any process inside the container can read your SSH keys.
- Mounting your entire home directory as a convenience. Now everything is accessible.
- Running with --privileged or sharing the host’s Docker socket. Container escape becomes trivial.
- Letting the agent run sudo inside the container. The container’s root can chain to host kernel exploits.
Least privilege, applied seriously. The agent gets the project directory and nothing else. If it needs to commit, it pushes through a credential helper that lives on the host, not by mounting your SSH keys.
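Most of those mistakes are visible in the container config itself, so they can be checked mechanically. A minimal Python sketch, assuming the config has already been parsed into a dict and uses the devcontainer.json keys `mounts`, `runArgs`, and `privileged`. The function names and finding strings are hypothetical, and this does not detect sudo inside the container.

```python
HOME_SOURCES = {"~", "$HOME", "${localEnv:HOME}"}  # common spellings of the host home directory


def mount_sources(config: dict) -> list:
    """Collect host-side source paths from `mounts` and from -v/--volume run arguments."""
    sources = []
    for mount in config.get("mounts", []):
        if isinstance(mount, dict):
            sources.append(mount.get("source", ""))
        else:
            # String form looks like "source=...,target=...,type=bind".
            fields = dict(part.split("=", 1) for part in mount.split(",") if "=" in part)
            sources.append(fields.get("source", ""))
    run_args = config.get("runArgs", [])
    for i, arg in enumerate(run_args[:-1]):
        if arg in ("-v", "--volume"):
            # "host_path:container_path[:options]"
            sources.append(run_args[i + 1].split(":", 1)[0])
    return sources


def audit_devcontainer(config: dict) -> list:
    """Return a list of findings for the isolation-voiding mistakes listed above."""
    findings = []
    if config.get("privileged") or "--privileged" in config.get("runArgs", []):
        findings.append("runs privileged")
    for src in mount_sources(config):
        path = src.rstrip("/")
        if path.endswith("/.ssh"):
            findings.append("mounts ~/.ssh")
        elif path.endswith("docker.sock"):
            findings.append("shares the host Docker socket")
        elif path in HOME_SOURCES:
            findings.append("mounts the entire home directory")
    return findings


risky = {
    "runArgs": ["--privileged", "-v", "/var/run/docker.sock:/var/run/docker.sock"],
    "mounts": [
        "source=${localEnv:HOME}/.ssh,target=/home/dev/.ssh,type=bind",
        {"source": "${localEnv:HOME}", "target": "/host-home", "type": "bind"},
    ],
}
findings = audit_devcontainer(risky)
print(findings)
```

A check like this could run in CI against the repo’s devcontainer config. It is string matching, not a security boundary, so treat a clean result as “no obvious mistakes” and nothing more.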
Lighter-Weight Sandboxes
Spinning up a full container for every “test this snippet the LLM wrote” interaction is too heavy. There’s a middle layer worth knowing about.
Python. Pyodide compiles CPython to WebAssembly, which means Python code runs in a deny-by-default memory sandbox with no filesystem or network access unless you explicitly grant it. Works great for evaluating LLM-generated snippets, struggles with C extensions and heavy dependencies. safe-py-runner is the pragmatic alternative: it runs Python in a restricted subprocess with timeouts, memory limits, and I/O marshaling. No container needed. For code that absolutely cannot touch your machine, remote V8-isolate services like Deno Sandbox boot pre-snapshotted Python environments in the cloud and air-gap execution entirely.
Rust. The build.rs problem from the last post has no first-class solution yet, but on Linux you can wrap cargo build in Landlock, a kernel feature available on 5.13+ that lets unprivileged processes restrict their own filesystem access. Combined with seccomp-bpf for syscall filtering and cgroups v2 for resource limits, you can run a build script that genuinely cannot read your SSH keys or open arbitrary network sockets. Projects like sandbox-rs wrap these primitives into something usable without writing your own seccomp filters. None of this works on macOS without a Linux VM in the way, which is another reason OrbStack plus a devcontainer is the path of least resistance for most people.
The Mindset Shift
The honest version of all of this: if you’re running AI agents locally, you have to assume they will eventually install something malicious. Not might. Will. The question is whether the blast radius is the contents of one project directory inside a container, or every credential on your machine plus your entire git history. That gap is what isolation buys you.
Containers, Landlock, WASM sandboxes: none of these are particularly hard to set up. They’re just things most developers haven’t bothered with because the threat model didn’t feel real. After Shai-Hulud, faster_log, and a year of watching AI agents pip install whatever they invent, the threat model is real.
Next post I’ll wrap up the series with the commercial side: Socket, Snyk, Endor Labs, Mend, Sonatype, the pricing comparison, and the actual ROI math for whether any of it makes sense for teams below 50 developers.
Sources
- State of Dependency Management 2025 — Endor Labs
- Securing AI Coding Assistants: A Total Cost Analysis — Endor Labs
- A step closer to isolation — devcontainer-wizard — The Red Guild
- OrbStack vs Docker Desktop: Performance Facts for Mac
- Apple Containers vs Docker Desktop vs OrbStack benchmark
- How to Safely Run AI Agents Like Cursor and Claude Code Inside a DevContainer
- DevContainers for Secure AI: Isolated & Scalable
- safe-py-runner: Secure Python execution for LLM Agents
- mcp-run-python — Pydantic
- How to Run Rust Binaries Without Root Using Sandboxing — OneUptime
- sandbox-rs
- Explore sandboxed build scripts — Rust Project Goals
/ DevOps / AI / security / Supply-chain / Containers
-
Python and Rust Have the Same Supply Chain Problem as NPM
Last post I walked through the threat model for supply chain attacks and dug into the NPM ecosystem specifically: postinstall scripts, npm ci, pnpm’s release-age cooldown. The same structural problems exist in Python and Rust, but the failure modes are different and the tooling has evolved in some surprising directions. Worth understanding both, because if you write any backend code in 2026 you’re probably touching at least one of these ecosystems.
Python: setup.py Is a Remote Code Execution Primitive
The thing most Python developers don’t appreciate is that
pip install runs arbitrary code by default. Not after install. During install. If a package ships a setup.py (that is, as a source distribution rather than a prebuilt wheel), that file is executed in a Python interpreter when pip builds the dependency. Whatever the author wrote, including reading ~/.aws/credentials, scraping environment variables, or opening a reverse shell, runs as your user with full filesystem access.
This is the part that confuses people coming from other ecosystems: venv and virtualenv don’t help. They isolate Python package versions to avoid conflicts. They are not a security boundary. A package installed inside a virtualenv has the exact same privileges as the user who ran pip install. None of this is a bug, exactly. It’s just an artifact of setup.py being a regular Python script that pip has always been willing to execute.
The defense-in-depth stack for Python looks like this:
Stop using pip. I mean it. pip is the worst package manager in mainstream use today and it is the single biggest reason Python’s supply chain story is a disaster. It has no native lockfile.
requirements.txt is a shopping list, not a lockfile; it tells pip what to fetch, not what you actually got last time. Run pip install -r requirements.txt twice on two different days and you can get two different dependency trees, because pip resolves transitive deps fresh every time against whatever happens to be on PyPI in that moment. Builds aren’t reproducible. Hashes aren’t verified by default. There’s no separation between “what I asked for” and “what was actually resolved.”
Every other ecosystem solved this a decade ago. npm has package-lock.json. Cargo has Cargo.lock. Bundler has Gemfile.lock. pip has vibes.
The --require-hashes flag exists, technically, but it’s duct tape on a broken design. You have to generate the hashes with a separate tool (pip-tools), maintain them by hand, and remember to pass the flag on every install. Nobody does this in practice. The Python Packaging Authority spent fifteen years insisting pip was fine while every other community built proper lockfile-based managers.
Use uv or Poetry. Both produce real lockfiles with SHA-256 hashes for every direct and transitive dependency, both make installs reproducible by default, both are dramatically faster than pip. uv in particular is the obvious default for new projects in 2026: it’s a drop-in replacement that’s roughly 10-100x faster and treats the lockfile as a first-class artifact instead of an afterthought. Hash verification isn’t a flag you have to remember. It’s how the tool works.
This doesn’t protect you from a malicious package you pinned on day one. But it does slam the door on silent registry tampering, makes “what’s actually deployed?” a question with an answer, and gets you out of the pip swamp.
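To make the "right bytes" idea concrete, here is a minimal sketch of what a lockfile hash check boils down to. It is not how uv or Poetry implement it internally; the function names are mine, and the artifact bytes are a stand-in for a downloaded wheel.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """True only if the bytes match the digest recorded at lock time."""
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.lower())


if __name__ == "__main__":
    original = b"example wheel contents"   # stand-in for a real download
    pinned = sha256_hex(original)          # what the lockfile recorded
    print(verify_artifact(original, pinned))          # True
    print(verify_artifact(original + b"!", pinned))   # False
```

A single changed byte fails the check, which is the whole point. What the check cannot do is tell you whether the bytes you pinned were trustworthy in the first place.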
pip-audit for known vulnerabilities. Scans your environment or requirements file against the OSV database, PyPA advisories, and GitHub advisories. Run it in CI. Combined with a real lockfile you get a tight loop: pin exact versions, scan those versions for CVEs, fail the build if anything critical shows up.
Trusted Publishing (OIDC). If you maintain a package on PyPI, get rid of your long-lived API token and switch to OIDC-based publishing. Your CI runner generates ephemeral, short-lived tokens scoped to a specific repository, branch, and workflow. Leaked PyPI tokens have been the source of multiple high-profile compromises. Trusted Publishing makes the credential effectively un-leakable because it doesn’t exist as a persistent secret.
The thing I’d actually call out, though, is that none of the Python tooling addresses the setup.py execution problem at install time. Hash pinning verifies you got the right bytes. It doesn’t tell you those bytes aren’t malicious. For that you’re back to either sandboxing the install (Docker, devcontainers) or trusting the registry’s malware detection, which lags by hours to days.
Rust: The Safety Guarantees Stop at the Compiler
Rust’s reputation for safety is real, but it’s a property of the compiled language, not the supply chain. The borrow checker doesn’t help you when the crate you’re depending on exfiltrates your SSH key during cargo build.
The mechanism is build.rs. Crates can include a build script that runs before the compiler, with full user privileges. Procedural macros do the same thing at compile time. In both cases, the code can read files, open network sockets, do whatever it wants. A malicious build.rs is effectively an unsandboxed unsafe block that bypasses code review because nobody reads build scripts. The Rust core team has been discussing sandboxing for years, but nothing has shipped.
This isn’t theoretical. Two examples from the last six months:
- September 2025: faster_log and async_println were caught scraping Ethereum and Solana private keys at runtime and exfiltrating them to Cloudflare workers.
- March 2026: chrono_anchor, dnp3times, and time-sync, all masquerading as time utilities, were transmitting .env file contents to threat actors.
Both clusters used compromised GitHub OAuth credentials to push under legitimate-looking namespaces. crates.io authenticates via GitHub, so a phished GitHub account is a phished crates.io account.
The defensive tooling is actually better than what most ecosystems have:
| Tool | What it does |
| cargo-audit | Scans Cargo.lock against the RustSec Advisory Database. Run in CI. |
| cargo-deny | Lints the dependency graph. Block specific crates, enforce license policies, restrict registries. |
| cargo-crev | Decentralized “web of trust” where developers cryptographically sign crate reviews. Elegant, but heavy lift in practice. |
| cargo-vet | Mozilla’s pragmatic answer to crev. Centralized audit records per org, with the ability to import audits from peer orgs (Google, Mozilla, Embark) instead of re-auditing every transitive dep yourself. |
If you’re picking one to start with, cargo-audit is the easy baseline. It’s npm audit for Rust and you should be running it in CI yesterday.
cargo-deny is the next step up. It lets you actually enforce policy, which is what you want once you’ve used cargo-audit long enough to be tired of triaging the same warnings.
cargo-vet is the interesting one for any team beyond about five engineers. The insight is that you don’t actually need to audit every crate. You just need to know that someone you trust did. By importing audit records from Mozilla and Google, a small team can effectively delegate the audit work for hundreds of common dependencies without running anything themselves. It’s the closest thing the Rust ecosystem has to a working trust network, and it works because the cryptographic overhead lives at the org level instead of being pushed onto individual developers.
The Pattern Across All Three Ecosystems
NPM, PyPI, and crates.io all share the same fundamental design flaw: package installation executes attacker-controlled code by default. NPM has postinstall. Python has setup.py. Rust has build.rs and proc macros. Different files, same problem.
The mitigations also rhyme. Lock your versions to specific hashes. Run an audit tool in CI. Where possible, prevent install-time execution entirely (--ignore-scripts, pre-built wheels, sandboxed build scripts when they finally land in Cargo). Where you can’t, isolate the install with devcontainers, ephemeral CI runners, anything that contains the blast radius when a dependency turns out to be hostile.
Next post I’ll get into the isolation side specifically: devcontainers, OrbStack, Landlock, and the practical question of how a solo developer with no security budget actually keeps their laptop from getting owned by an AI agent that just pip installed a hallucinated package name.
Sources
- Securing Package Managers: Why NPM, PyPI, and Cargo Are High-Value Targets
- Defense in Depth: A Practical Guide to Python Supply Chain
- PyPI Security: How to Safely Install Python Packages
- Rust Supply Chain Security — Managing crates.io Risk
- crates.io: Malicious crates faster_log and async_println
- Five Malicious Rust Crates and AI Bot Exploit CI/CD Pipelines
- About RustSec Advisory Database
- cargo-vet FAQ
- Auditing Rust Crates Effectively (arXiv)
- Explore sandboxed build scripts — Rust Project Goals
I’d appreciate a follow. You can subscribe with your email below. The emails go out once a week, or you can find me on Mastodon at @[email protected].
-
Your Software Is Mostly Strangers' Code
Modern applications aren’t really written anymore. They’re assembled. Seventy to ninety percent of a typical proprietary codebase is open-source code pulled from public registries, NPM, PyPI, crates.io, maintained by thousands of people you’ve never met. Every npm install is an act of implicit trust extended to strangers, and that trust model has quietly become the weakest link in most security architectures.
Attackers figured this out a long time ago. Compromising one popular package gives you a blast radius that phishing campaigns can only dream about: CI pipelines, developer laptops, production workloads, client devices, all simultaneously. SolarWinds. The XZ Utils backdoor. The Shai-Hulud worm, which self-propagated through 170+ npm and PyPI packages by hijacking GitHub Actions OIDC tokens and quietly minted new publish credentials as it spread. The ByBit developer compromise. These aren’t outliers anymore. They’re the shape of the threat.
I want to dig into the mechanics of how this actually happens, and then look at the ecosystem most developers touch every day: NPM.
How Supply Chain Attacks Actually Work
The first thing to understand is that supply chain attacks aren’t really “vulnerabilities” in the classic sense. A buffer overflow is an accidental weakness. A malicious package is intentional code, written to steal credentials, drop a reverse shell, or exfiltrate environment variables the moment it lands on your machine. Traditional appsec tools were built to find the former. They are largely blind to the latter.
The attack patterns cluster into a few categories.
Typosquatting. Publish axois and wait for someone to fat-finger axios. Sounds trivial, but it works constantly because developers install packages at high velocity and rarely double-check spelling.
Dependency confusion. If your company has an internal package called corp-auth, an attacker publishes a public package with the same name and a higher version number. Many package managers default to “highest version wins,” and your build pulls the public one instead of your internal one.
Maintainer hijacking. Compromise a real maintainer through phishing, credential stuffing, or a missing 2FA setup, and push a poisoned update to a package that already has millions of weekly downloads. The Axios compromise in March 2026 followed exactly this pattern. The XZ Utils backdoor was a slower variant. The attacker spent months building trust as a “helpful” co-maintainer before slipping a backdoor into the build.
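Typosquatting is the one pattern here you can partially check mechanically. A rough sketch, assuming you keep a list of the packages you actually intend to depend on: flag any requested name that is close to, but not exactly, a known name. The allowlist, the function name, and the 0.75 similarity cutoff are all illustrative choices of mine, not anything a registry provides.

```python
import difflib

# Hypothetical allowlist of packages the team really uses.
KNOWN_PACKAGES = ["axios", "express", "lodash", "react"]


def typosquat_suspect(name, known=KNOWN_PACKAGES, cutoff=0.75):
    """Return the known package that `name` closely resembles, or None.

    An exact match is fine. A near-miss is what we want to catch.
    """
    if name in known:
        return None
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for candidate in ("axios", "axois", "left-pad"):
        print(candidate, "->", typosquat_suspect(candidate))
```

This will also flag legitimate packages with similar names, so treat a hit as a prompt to look twice, not as a verdict.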
The thing that makes all of this so effective is the automation downstream. Unpinned versions, auto-merging update bots, transitive dependencies five layers deep. Once a malicious version hits the registry, it propagates fast.
Why NPM Is the Highest-Stakes Ecosystem
NPM serves tens of billions of downloads a week. A typical JavaScript project today pulls in well over a thousand transitive dependencies. Ten years ago that number was in the dozens. The dependency graph is just structurally enormous, and it’s getting worse.
The specific architectural problem in NPM is the lifecycle script, specifically postinstall. NPM lets package authors define scripts in package.json that run automatically when the package is installed. This was designed for legitimate reasons: compiling native bindings, configuring environments. But it also means arbitrary shell commands execute on your machine the moment you type npm install. No code review. No second thought. Just immediate execution as your user.
There are a few practical mitigations, and they’re worth knowing whether you’re a solo developer or running platform security at a large org.
Disable lifecycle scripts. Either pass --ignore-scripts ad hoc, or set it globally:
npm config set ignore-scripts true
This breaks some legitimate packages (esbuild, bcrypt, anything compiling native code). To manage that, tools like can-i-ignore-scripts scan your node_modules and generate an allowlist of packages that genuinely need scripts to run. Frameworks like @lavamoat/allow-scripts formalize this with a deterministic config you can check into the repo.
Use npm ci in CI, not npm install. This is non-negotiable for production builds. npm install will happily resolve newer minor versions inside your semver ranges and rewrite package-lock.json. npm ci refuses to do that. If the lockfile doesn’t match exactly, the install fails. That’s the behavior you want when the question is “did anything change that I didn’t approve.”
Consider switching to pnpm. pnpm 10+ has been quietly building some of the best structural defenses in the ecosystem. Postinstall scripts are off by default and require an explicit allowBuilds list. blockExoticSubdeps prevents transitive deps from resolving via random Git URLs or tarballs.
The killer feature, though, is minimumReleaseAge. As of pnpm v11 (May 2026), the default is 1440 minutes, so pnpm simply refuses to resolve any package version less than 24 hours old. Most malicious packages get pulled from the registry within hours of being detected. A 24-hour cooldown turns the community into your early warning system, with no behavioral analysis or commercial tooling needed.
That last one is the single highest-leverage change you can make as an individual developer. It costs nothing, it doesn’t break your workflow, and it neutralizes most day-zero registry malware before it ever reaches you.
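The cooldown logic is simple enough to sketch in a few lines. This is my own illustration of the idea, not pnpm's resolver: the function names are hypothetical, and it picks by publish time instead of doing real semver resolution.

```python
from datetime import datetime, timedelta, timezone

# The 24-hour figure quoted above, expressed in minutes.
MINIMUM_RELEASE_AGE = timedelta(minutes=1440)


def is_old_enough(published_at, now, minimum_age=MINIMUM_RELEASE_AGE):
    """True if the version has been public for at least the cooldown."""
    return now - published_at >= minimum_age


def newest_allowed(versions, now):
    """Return the most recently published version past the cooldown.

    `versions` maps a version string to its publish time. Returns None
    when every version is still inside the cooldown window.
    """
    eligible = {v: t for v, t in versions.items() if is_old_enough(t, now)}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


if __name__ == "__main__":
    now = datetime(2026, 1, 10, tzinfo=timezone.utc)
    versions = {
        "1.0.0": now - timedelta(days=30),
        "1.0.1": now - timedelta(hours=2),   # too fresh, gets skipped
    }
    print(newest_allowed(versions, now))     # 1.0.0
```

The trade-off is that you also wait a day for legitimate fixes, which is usually an easy price to pay.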
Next post I’ll dig into the Python and Rust sides of this. pip’s setup.py execution problem, Rust’s build.rs issue, and the surprisingly mature auditing toolchain the Rust community has built around cargo-audit, cargo-deny, and cargo-vet.
Sources
- Securing Package Managers: Why NPM, PyPI, and Cargo Are High-Value Targets
- Defending Against NPM Supply Chain Attacks: A Practical Guide
- NPM Ignore Scripts Best Practices
- Mitigating supply chain attacks
- Get safe and remain productive with can-i-ignore-scripts
- The Landscape of Malicious Open Source Packages: 2025 Mid-Year Threat Report
- The Evolving Software Supply Chain Attack Surface
- Introducing OpenSSF’s Malicious Packages Repository
-
Semantic Docs Spring Update: Astro 6, Auto-Releases, npm
The last two months on Semantic Docs have mostly been maintenance work, but there are a few things I want to talk about. I pushed through a major framework upgrade, swapped out a vendored library for a real published package, and finally automated the release pipeline. Five tagged releases later, here’s where we are.
The Headlines
- Upgraded to Astro 6
- Switched from a vendored logger to the published logan-logger npm package
- Shipped an auto-release workflow driven by Conventional Commits
- Three rounds of dependency updates plus a security-focused sweep
- Five tagged releases, v1.3.3 through v1.5.0
Astro 6
The Astro 6 upgrade was easy. Semantic Docs runs a hybrid setup, static article pages plus a server-rendered search endpoint, and that part barely needed any attention. Most of the work was in the dependency layout, not the application code.
One note if you’re forking or syncing this theme: if you’re upgrading from v1.3.5 or earlier (anything pre-Astro-6, which landed in v1.4.0), delete your node_modules and your lockfile and do a clean install. Skip that step and you’ll get weird errors that look like your code is broken when it’s really just leftover state.
A Real npm Package Instead of a Vendored Logger
For a while, the project was using a logger I wrote to experiment with publishing to both npm and JSR. It was a useful exercise. I wanted to see what a clean foundational package looked like across both registries, and I think it turned out well.
But for this repo, I wanted consistency over experimentation. So I swapped the vendored copy for the published logan-logger npm package. Behavior is the same, the surface area is the same, it’s just back on the npm registry.
Automated Releases
I’ve liked using Conventional Commits to drive automated releases. When a PR merges to main, the workflow figures out the next version from the commit messages, tags it, and publishes a GitHub release with a generated changelog.
The commit type determines the version bump. feat: bumps the minor, fix: bumps the patch, breaking changes bump the major. The changelog falls out of the same metadata. The more automation here, the better.
If you’ve been on the fence about Conventional Commits, this is the use case that sold me.
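The bump rule is mechanical enough to sketch. This is not the workflow Semantic Docs actually runs, just a minimal illustration of the mapping: the regex, the function names, and the treatment of a trailing ! or a BREAKING CHANGE: footer as a major bump are my assumptions based on the Conventional Commits convention.

```python
import re

# Matches headers like "feat: x", "fix(parser): y", "feat(api)!: z".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_level(messages):
    """Return 'major', 'minor', 'patch', or None for a list of commits."""
    level = None
    for msg in messages:
        match = HEADER.match(msg)
        if not match:
            continue  # not a conventional commit, ignore it
        if match.group("bang") or "BREAKING CHANGE:" in msg:
            return "major"
        if match.group("type") == "feat":
            level = "minor"
        elif match.group("type") == "fix" and level is None:
            level = "patch"
    return level


def next_version(current, messages):
    """Apply the highest bump found in `messages` to a vX.Y.Z string."""
    major, minor, patch = (int(p) for p in current.lstrip("v").split("."))
    level = bump_level(messages)
    if level == "major":
        major, minor, patch = major + 1, 0, 0
    elif level == "minor":
        minor, patch = minor + 1, 0
    elif level == "patch":
        patch += 1
    return f"v{major}.{minor}.{patch}"


if __name__ == "__main__":
    print(next_version("v1.4.2", ["fix: typo", "feat: add search"]))  # v1.5.0
```

Commits that don't match the convention (or types like docs: and chore:) leave the version untouched, which is why the discipline in commit messages matters.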
What’s Next: Embedding Quality
The reference implementation uses TEI for search embeddings, and that’s been fine. But “fine” is not the same as “good,” and I want to actually compare quality across providers before I commit to anything long term.
Two I want to test:
- Jina (now owned by Elastic)
- Mistral, which has been putting out genuinely strong embedding models
The goal is to run the same corpus through each, evaluate the search results, and figure out which one earns a highlight. Whatever I learn from that work will get folded back into the open source Semantic Docs repo so anyone running their own instance can make an informed choice instead of just trusting my defaults.
-
SAST vs AI PR Review: Two Tools, Different Jobs
If you have worked in DevSecOps, you might be wondering if AI pull request review tools are going to replace traditional SAST scanners. Short answer: no. Longer answer: they’re solving different problems, and if you’re picking one over the other, you might be making a mistake.
Here is how I think about it.
SAST is the Compliance Gatekeeper
Static Application Security Testing tools, think Semgrep, SonarQube, Checkmarx, Fortify, parse your source code (usually into an Abstract Syntax Tree) and hunt for known vulnerability patterns. They don’t run the code. They just read it and “pattern-match” against rules.
The focus here is security, compliance, and strict rule enforcement. SAST is the automated gatekeeper that makes sure your code clears the OWASP Top 10 bar before it merges.
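To show what "parse into an AST and pattern-match" means in practice, here is a toy rule written against Python's own ast module. It is nowhere near what Semgrep or the enterprise tools do (no taint tracking, no data flow), and the rule set is something I made up for the example.

```python
import ast

# A toy rule set, not any real scanner's.
DANGEROUS_CALLS = {"eval", "exec"}


def find_dangerous_calls(source):
    """Return sorted (line, name) pairs for direct calls to flagged names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


if __name__ == "__main__":
    sample = "x = input()\nresult = eval(x)\nprint(result)\n"
    print(find_dangerous_calls(sample))   # [(2, 'eval')]
```

Run it a hundred times on the same input and you get the same finding a hundred times. Notice also that it fires on any eval call, safe or not, which previews the false positive problem.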
What SAST does well:
- It’s deterministic. If a rule matches a pattern, the engine flags it every single time. Run it twice on the same code, get the same result.
- It satisfies auditors. Frameworks like PCI-DSS, SOC 2, and HIPAA expect documented secure-development practices, and a formal SAST scanner is the easiest way to produce that evidence. AI agents don’t count here, at least not yet.
- It can do real taint analysis. Enterprise tools can track untrusted input from the moment it enters your app to the moment it hits a dangerous sink.
Where SAST falls down:
- The false positive rate is brutal. Rigid rules with no context means a lot of noise. Developer fatigue is real, and once your team starts ignoring scanner output, you’ve lost the game.
- It can’t see your business logic. A SAST tool has no idea what your application is supposed to do, so it can’t tell you when the logic itself is broken.
- Comprehensive scans are slow. Hours on large codebases isn’t unusual, though Semgrep has been doing good work on this front.
AI PR Agents are the Peer Reviewer
Tools like CodeRabbit, Qodo, Greptile, GitHub Copilot Code Review, Cursor Bugbot, and Claude Code (set up as a review skill) plug into your version control and read the PR diff with the surrounding code context. They behave less like a scanner and more like a colleague who actually read your changes.
The focus is developer productivity, code quality, logic bugs, and contextual feedback.
What they do well:
- They understand intent. LLMs can reason about why the code is changing, not just whether it matches a rule. That’s a different category of feedback.
- The signal-to-noise ratio is good. When an AI flags something, it usually comes with an explanation that makes sense. Less noise, more useful comments.
- They suggest fixes. Not just “this is wrong” but “here’s a diff you can apply.” That’s huge for actually closing the loop on review feedback.
- The scope is broader. Architecture, performance, style, security, all in one pass.
Where they fall down:
- They’re non-deterministic. Same vulnerability, two PRs, two different outcomes. That’s not a bug, that’s how LLMs work, and it’s why auditors don’t trust them.
- They don’t satisfy compliance. No auditor is going to accept “the AI looked at it” as a substitute for a formal scanner.
- Hallucinations happen. Invented issues, misread intent, suggestions that refactor things that didn’t need refactoring. You still need a human filtering the output.
The Quick Comparison
| Feature | SAST | AI PR Review |
| Primary Goal | Security & Compliance | Code Quality & Productivity |
| Analysis Method | Deterministic rules & AST | Non-deterministic LLMs |
| Business Logic | Blind | Context-aware |
| False Positives | Often high | Usually low |
| Compliance Proof | Accepted as evidence | Not accepted |
| Feedback Loop | Dashboard / CI output | PR comments / chat |
The Lines Are Starting to Blur
The interesting thing happening right now is convergence from both directions.
On the SAST side, tools like DryRun Security are pitching themselves as “AI-native SAST,” trying to keep the deterministic backbone while using LLMs to filter out the false positives that make traditional scanners painful to live with.
On the AI agent side, CodeRabbit and Greptile keep getting better at catching real security vulnerabilities, not just style issues. They’re slowly creeping into territory that used to belong exclusively to SAST.
This is going somewhere, but it’s not there yet.
Where to Start Your Evaluation
Treat them as complementary, not competitive.
For SAST, evaluate against your audit footprint, the languages in your codebase, and how much false-positive triage your team can absorb. Semgrep, SonarQube, Checkmarx, and Fortify all sit in different price-and-friction zones, and the right one depends on what your business actually needs to prove.
For AI PR review, evaluate based on how it fits your existing review workflow, what languages and frameworks it understands well, and the signal-to-noise ratio in practice on your codebase. CodeRabbit, Qodo, Greptile, Copilot Code Review, Bugbot, and a Claude Code review skill all approach the problem differently.
If you pick one category and skip the other, you’re either passing compliance with mediocre code review, or getting great review feedback while failing your next audit. Neither is a win.
The AI tools aren’t replacing SAST. They’re filling in the gap SAST was never designed to cover.
-
Your Data Lake's Vulnerability Problem Is Really an Identity Problem
I’ve been reading through the post-mortems on the last few years of data lake breaches, and the pattern is depressing. We keep blaming the platforms. We should be blaming ourselves.
Let me give you an example.
The Snowflake Breach Wasn’t a Snowflake Breach
In mid-2024, at least 165 organizations got hit through their Snowflake instances. AT&T lost over 50 billion call records. Ticketmaster, Santander, Advance Auto Parts. The headlines wrote themselves: Snowflake hacked.
Except Snowflake wasn’t hacked. Mandiant, CrowdStrike, and Snowflake all reached the same conclusion in their forensics. No zero-day. No flaw in the cryptographic platform. No internal compromise of Snowflake’s corporate network. No brute-force attacks against API limits.
What actually happened? UNC5537, a financially motivated group also tracked as Scattered Spider and ShinyHunters, walked through the front door with valid stolen credentials. Those credentials were harvested over years by commodity infostealer malware (VIDAR, LUMMA, REDLINE) running on the personal laptops of third-party contractors. The same laptops these contractors used for gaming and pirated software also held the keys to their clients' enterprise data lakes.
One contractor laptop. Multiple enterprise environments compromised. That’s the actual story.
79.7% of the accounts UNC5537 used had prior credential exposure. Some had been valid and un-rotated since November 2020.
The Two Doors They Walked Through
The first attack vector was the SSO side door. Plenty of victim organizations had a perfectly fine enterprise IdP enforcing strong passwords and MFA. They just forgot to make SSO mandatory. A local authentication pathway was left active alongside it. Attackers logged in directly with stolen local credentials, completely bypassing the IdP, and the MFA requirement never fired.
The second was credential stuffing against inactive, orphaned, and demo accounts belonging to former employees. Nobody audits those. Nobody enforces MFA on those. So they don’t get protected by the controls that exist on the production accounts.
Once inside, the kill chain was almost boring. SHOW TABLES to enumerate. CREATE TEMPORARY STAGE to make an ephemeral staging area that disappears when the session ends, erasing forensic evidence. COPY INTO with GZIP compression to keep the payload small enough that volumetric alarms didn’t trigger. GET to pull it down to a VPS in some offshore jurisdiction. Done.
No IP allowlisting was in place anywhere. The connections from Mullvad and PIA exit nodes were treated with the same trust as an employee on the corporate VPN.
The Bucket Problem Hasn’t Gone Away Either
Alongside the identity attacks, the boring stuff keeps working. Misconfigured S3 buckets are still the most reliable way to expose a data lake. In late 2024, an open bucket used as a shared network drive was found containing raw customer data, cryptographic keys, and secrets. In 2025, a US healthcare provider left millions of patient records readable for weeks before anyone noticed.
Then there’s Codefinger. In January 2025, that group used compromised AWS credentials to access S3 buckets and then weaponized AWS’s own Server-Side Encryption with Customer-Provided Keys (SSE-C) to ransomware the data in place. They didn’t even need to exfiltrate it. They just encrypted it with a key the victim didn’t have and demanded Bitcoin.
That’s a native cloud feature being turned against you because somebody granted too many permissions to a service account.
The Boring Conclusions Are the Important Ones
Identity is the perimeter now. The encryption-at-rest story we’ve been telling ourselves for a decade is irrelevant when the attacker authenticates as a real user. Stop treating SSO as optional. Stop leaving local auth paths open next to it. Enforce MFA on every account, including the demo and service accounts you forgot about.
Your data lake should not be reachable from the public internet. Route everything through PrivateLink or the equivalent in your cloud. Allowlist the IPs that should be touching analytical workloads, and don’t make exceptions for “just this one contractor.”
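The allowlist check itself is not complicated. In a real deployment you would configure it in the platform's network policy, not in application code, but a sketch makes the logic plain. The CIDR ranges below are placeholders (a private range and a documentation range), not anyone's real network.

```python
import ipaddress

# Hypothetical ranges: a corporate VPN block and one office egress block.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.20.0.0/16", "203.0.113.0/24")
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """True if the client address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    print(is_allowed("10.20.5.9"))      # True
    print(is_allowed("198.51.100.7"))   # False
```

With a rule like this in front of the data lake, a valid stolen credential used from a commercial VPN exit node gets rejected before authentication even matters.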
And as you start handing access to AI agents, remember that static roles aren’t going to cut it. Just-in-time entitlements and contextual access control are the only way you’re going to keep up with autonomous systems making queries on your behalf.
The data lake industry spent years arguing about table formats, vendor lock-in, and egress fees. Meanwhile, attackers were just collecting passwords from gaming laptops and walking in.
Fix the doors first.
Sources
- UNC5537 Targets Snowflake Customer Instances (Mandiant / Google Cloud) — Forensic analysis, kill chain, infostealer attribution
- Snowflake Data Breach: Lessons Learned (AppOmni) — SSO side door, MFA bypass mechanics
- Major AWS S3 Bucket Breach Exposes Data (NHIMG) — Codefinger SSE-C ransomware tactic
- Misconfigured Cloud Assets: How Attackers Find Them (CybelAngel) — Recent open-bucket exposure incidents
- 5 Key Lessons from the Snowflake Data Breach (Tanium) — Defensive posture summary
-
Your Data Lake Has a Permissions Problem
Consolidating every business unit’s data into one giant lakehouse sounds like a win until you realize the security model from your old data warehouse can’t scale to it. You took ten silos, each with their own access rules, and merged them into one location. Now everyone wants in, and your security team is the bottleneck.
Let me walk through three places where the cracks usually show up.
RBAC Falls Over Faster Than You Think
Role-Based Access Control is the model most teams start with. Permissions are tied to a job function. Sales reps get read access to sales tables, data engineers get write access to staging, and so on. It works fine when you have ten roles.
It does not work when you have a thousand.
Say your sales reps should only see accounts in their territory, and only accounts they personally manage. Under pure RBAC, you need a unique role for every territory-by-account-owner combination. That’s role explosion, and it’s how compliance audits become impossible and legitimate access slows to a crawl. The roles list grows faster than anyone can review it, which means stale permissions sit there forever.
The answer is Attribute-Based Access Control. Instead of asking “what role is this user in,” the system asks “what attributes does this user have, what attributes does this data have, and what’s the policy at this exact moment.” Tag a column as PII. Tag a schema as HR. Write one policy that says anyone outside the HR compliance group sees masked data when they touch a PII column. Done. That single policy replaces hundreds of bespoke roles.
This is what Unity Catalog and Starburst Galaxy are built around, and it’s the model that will scale with the data.
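Here is the PII policy from above as a toy evaluator, to show why one attribute-driven rule can stand in for a pile of roles. This is not Unity Catalog's or Starburst's API. The classes, the hr_compliance group name, and the mask string are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Column:
    name: str
    tags: frozenset = field(default_factory=frozenset)


def read_value(user, column, value):
    """One policy: PII columns are masked unless the user is in hr_compliance."""
    if "PII" in column.tags and "hr_compliance" not in user.groups:
        return "****"
    return value


if __name__ == "__main__":
    email = Column("email", frozenset({"PII"}))
    analyst = User("sam", frozenset({"analytics"}))
    auditor = User("kim", frozenset({"hr_compliance"}))
    print(read_value(analyst, email, "a@example.com"))   # ****
    print(read_value(auditor, email, "a@example.com"))   # a@example.com
```

Adding a new PII column means adding a tag, not writing a new role. The policy function never changes.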
Column and Row Security Should Be Boring
Once you have ABAC and a real metadata catalog, column-level masking and row-level filtering become a non-event. You write a SQL expression that masks the first five digits of an SSN for lower-privileged roles. You write a row filter that silently appends WHERE region = 'user_region' to every executive’s SELECT *.
The key word is silently. The user doesn’t see a different table. They don’t have a sanitized copy. The policy is enforced at the catalog layer, so it works the same whether they’re querying through Spark, Trino, a BI dashboard, or a pipeline. One source of truth, one policy, every engine.
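In a real catalog these would be SQL expressions attached to the table. The same two ideas in plain Python, purely as an illustration, with an assumed NNN-NN-NNNN SSN format and made-up row data:

```python
def mask_ssn(ssn):
    """Mask the first five digits of an SSN formatted as NNN-NN-NNNN."""
    return "***-**-" + ssn[-4:]


def visible_rows(rows, user_region):
    """Row filter: the equivalent of an appended WHERE region = <user's region>."""
    return [row for row in rows if row["region"] == user_region]


def query(rows, user_region, privileged):
    """Apply the row filter to everyone and the column mask to non-privileged users."""
    result = []
    for row in visible_rows(rows, user_region):
        row = dict(row)  # copy so the underlying table is never modified
        if not privileged:
            row["ssn"] = mask_ssn(row["ssn"])
        result.append(row)
    return result


if __name__ == "__main__":
    table = [
        {"name": "Ada", "region": "EMEA", "ssn": "123-45-6789"},
        {"name": "Bo", "region": "APAC", "ssn": "987-65-4321"},
    ]
    print(query(table, "EMEA", privileged=False))
```

The caller asks for the whole table and gets back only what the policy allows. There is one table and one policy, and no second copy to drift.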
If you’re still maintaining separate “sanitized” copies of tables for different audiences, you’re doing it the 2015 way and you’re going to drift.
The IAM Default Problem
Most cloud services ship with default IAM roles, and a surprising number of those defaults attach AmazonS3FullAccess or something equally permissive.
SageMaker does it. The Ray autoscaler role does it. There are more.
Picture the failure mode. An attacker compromises some peripheral app, maybe a forgotten Jupyter notebook, maybe a misconfigured Lambda. That workload has an IAM role attached because that’s how cloud workloads talk to S3 without hardcoded credentials. The attacker inherits the role. And because the role has full S3 access, they’re not constrained to the bucket the application actually uses. They can enumerate every bucket in the entire account.
That’s how a single compromised container becomes a full data lake breach. Researchers call it a bucket monopoly attack. I call it the most predictable incident in the industry.
The fix is not glamorous. Stop using s3:* in any policy. Write resource-scoped policies that name the exact buckets and prefixes a workload needs. Audit the default roles every cloud service hands you and replace them. Use Security Lake or Detective to flag cross-service API calls that don’t match normal patterns. None of this is fun. All of it is necessary.
And Then There’s the Agent Problem
The new wrinkle is that humans are no longer the primary consumers of your data. Autonomous agents are. They issue more queries, hit more tables, and move faster than any human team.
Long-lived credentials and static roles don’t fit that workload. The pattern emerging is Just-In-Time entitlements, where an agent gets a narrow, ephemeral permission for the duration of a single execution thread, then loses it. Pair that with declarative policy metadata baked into the data assets themselves, so the agent knows what it’s allowed to do with a dataset before it ever runs the query.
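A just-in-time entitlement is, at its core, a narrow grant with an expiry. A minimal sketch of that shape follows. Nothing here is a real product's API: the Grant fields, the function names, and the use of a monotonic clock are my own choices for the example.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    agent_id: str
    table: str
    action: str
    expires_at: float   # seconds on a monotonic clock


def issue_grant(agent_id, table, action, ttl_seconds, now=None):
    """Issue a grant for exactly one agent, one table, one action."""
    now = time.monotonic() if now is None else now
    return Grant(agent_id, table, action, now + ttl_seconds)


def is_permitted(grant, agent_id, table, action, now=None):
    """True only if the request matches the grant and it has not expired."""
    now = time.monotonic() if now is None else now
    same_scope = (grant.agent_id, grant.table, grant.action) == (agent_id, table, action)
    return same_scope and now < grant.expires_at


if __name__ == "__main__":
    grant = issue_grant("agent-7", "sales.orders", "SELECT", ttl_seconds=30)
    print(is_permitted(grant, "agent-7", "sales.orders", "SELECT"))   # True
    print(is_permitted(grant, "agent-7", "hr.salaries", "SELECT"))    # False
```

The useful property is that a leaked grant is worth very little: it covers one table, one action, and stops working on its own.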
We’re early on this. Most organizations are still working through the basics, and that’s fine. But if you’re designing access controls today, design them assuming the next thing hitting your lake isn’t a person.
What to Actually Do
If you’re auditing your own data lake security, the order I’d work in:
- Find every IAM role with a wildcard permission. Replace them.
- Move from RBAC to ABAC at the catalog layer. Stop creating new roles.
- Pull your data lake off the public internet. PrivateLink, private endpoints, IP allowlists for the legacy stuff that can’t move.
- Then start thinking about agents.
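The first item on that list is easy to start on with a script. A rough sketch, assuming you have exported your policies as standard IAM JSON documents: it flags Allow statements whose Action or Resource is a bare * or ends in :* (which catches s3:*). It deliberately ignores partial wildcards like s3:Get* and conditions, so treat it as a first pass and not an audit tool.

```python
import json


def _as_list(value):
    """IAM allows a single value or a list. Normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy_json):
    """Return (statement_index, field, value) for wildcard Allow entries."""
    policy = json.loads(policy_json)
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            for value in _as_list(stmt.get(key, [])):
                if value == "*" or value.endswith(":*"):
                    findings.append((index, key, value))
    return findings


if __name__ == "__main__":
    policy = json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Hypothetical bucket and prefix, scoped the way the text recommends.
                "Resource": "arn:aws:s3:::analytics-raw/reports/*",
            },
        ],
    })
    for finding in wildcard_findings(policy):
        print(finding)
```

The second statement in the example is what you want every policy to look like: named actions, a named bucket, a named prefix.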
The lakehouse pitch is unification. The lakehouse reality is that unification multiplies the cost of every bad permission. Get the basics right before you bolt on anything fancy.
Sources
- AWS Default IAM Roles Found to Enable Lateral Movement (The Hacker News) — SageMaker / Ray autoscaler default roles, bucket monopoly attacks
- What Is Fine-Grained Data Access Control? (TrustLogix) — RBAC role explosion, ABAC fundamentals
- Core concepts for ABAC (Databricks Unity Catalog docs) — Tag-driven policy enforcement
- Top 12 Data Governance Predictions for 2026 (Hyperight) — Just-in-time entitlements, declarative policy metadata
-
The Real Cost of Your Data Lake (It's Not the Storage)
If you’re sketching out a data platform on a whiteboard right now, I want you to do something. Stop calculating storage costs. They’re not the bill.
I pulled the public pricing for AWS, Azure, GCP, Databricks, and Snowflake and stacked them next to each other. Storage is the cheap part. The expensive part is everything that moves the data, and the expensive part is the part you’re least likely to model correctly when you’re picking a vendor.
Let me walk through what actually shows up on the invoice.
Raw Object Storage Is Basically Free
For hot, frequently accessed data, the big three are within a rounding error of each other:
- Azure Blob (LRS, Hot): $0.018 per GB/month
- Google Cloud Standard: $0.020 per GB/month
- AWS S3 Standard: $0.023 per GB/month (first 50 TB)
Drop into the cool tiers and AWS S3 takes the lead at $0.0125 per GB. Drop into deep archive and you’re paying $0.00099 per GB on either AWS Glacier Deep Archive or Azure Archive. That’s a tenth of a cent per gigabyte, per month, for data you almost never touch.
That’s nice, but anyone leading with “per-GB storage cost” in a procurement deck is selling you a story. Storage capacity is roughly five percent of a typical Databricks bill. Five. The other 95% is the part nobody wants to talk about.
The Egress Trap
Ingress is free. Always. The cloud providers want your data in.
Getting it back out is where they collect.
- Azure Blob: $0.087/GB external egress
- AWS S3: $0.090/GB
- Google Cloud: $0.120/GB (but free if you stay inside Google’s ecosystem, which is the whole point of that pricing)
Then layer on API operations. A million GET requests on S3 costs about $0.40. The same million GETs on Google Cloud Storage can run closer to $5.00 because they classify operations differently. If your analytics workload is hammering small files, those API calls add up faster than the storage they’re reading.
Storing 10 TB? Maybe $200 a month. Storing 500 TB? You’re at $10,000 a month before a single byte leaves the region or a single query fires.
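A quick way to see how egress compares with storage is to put the list prices quoted above into a small calculator. This is a rough sketch: it assumes 1 TB = 1,000 GB, ignores volume tiers and API request charges, and the function name is my own.

```python
# Per-GB list prices quoted in this post (hot storage and external egress).
STORAGE_PER_GB = {"azure": 0.018, "gcp": 0.020, "aws": 0.023}
EGRESS_PER_GB = {"azure": 0.087, "aws": 0.090, "gcp": 0.120}

GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_cost(provider: str, stored_tb: float, egress_tb: float) -> dict:
    """Estimate one month of hot storage plus external egress, in USD."""
    storage = stored_tb * GB_PER_TB * STORAGE_PER_GB[provider]
    egress = egress_tb * GB_PER_TB * EGRESS_PER_GB[provider]
    return {
        "storage": round(storage, 2),
        "egress": round(egress, 2),
        "total": round(storage + egress, 2),
    }


small_lake = monthly_cost("gcp", stored_tb=10, egress_tb=0)
busy_lake = monthly_cost("aws", stored_tb=500, egress_tb=100)
```

With these inputs, moving 100 TB out of a 500 TB AWS lake costs about $9,000, which is in the same range as storing the whole thing for a month.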
Databricks: Two Bills, One Headache
Databricks uses what’s commonly called a Two-Bill Model. You get one invoice from your cloud provider for the actual VMs and storage, and a separate invoice from Databricks for the software, measured in DBUs (Databricks Units).
In a typical mid-sized deployment around $18,000/month, the breakdown looks like this:
- VM compute from the cloud provider: ~55%
- Databricks DBU fees: ~30%
- Storage: ~5%
- Network egress: ~5%
The DBU rate changes based on what you’re doing. Automated jobs start at $0.15/DBU. Interactive notebooks for analysts start at $0.40/DBU. That’s not an accident. Databricks wants you running production workloads on cheap job clusters, not on the expensive all-purpose clusters your data scientists love to leave running over a weekend.
If you’re not actively pushing teams toward job clusters and ARM-based instances, you’re leaving real money on the table.
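To put a number on that, here is a small sketch comparing the DBU fee for the same workload on the two cluster types. It uses the starting rates quoted above; the 10,000 DBU monthly volume is a made-up figure, and the sketch ignores the cloud provider’s VM bill entirely.

```python
JOB_RATE_USD_PER_DBU = 0.15          # automated jobs (starting rate quoted above)
ALL_PURPOSE_RATE_USD_PER_DBU = 0.40  # interactive notebooks (starting rate quoted above)


def dbu_fee(dbus: float, rate: float) -> float:
    """Databricks software fee only; VM costs are billed separately by the cloud provider."""
    return dbus * rate


def savings_from_job_clusters(dbus: float) -> float:
    """Fee difference if a workload moves from all-purpose to job compute."""
    return dbu_fee(dbus, ALL_PURPOSE_RATE_USD_PER_DBU) - dbu_fee(dbus, JOB_RATE_USD_PER_DBU)


monthly_dbus = 10_000  # hypothetical workload size
monthly_savings = savings_from_job_clusters(monthly_dbus)
```

For that hypothetical 10,000 DBUs, the fee drops from about $4,000 to about $1,500 a month.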
Snowflake: The Hidden Storage Multiplier
Snowflake’s pricing pitch sounds clean. Pass-through storage at $40/TB/month on-demand, dropping to $23/TB/month with a capacity commitment. Compute as Credits. Done.
Except it isn’t done. Snowflake stores data in immutable 16MB micro-partitions. Immutable. You can’t change them in place. Update a single row in a 1 TB table and Snowflake writes a new file and keeps the old one around.
Why keep the old one? Two features:
- Time Travel: query historical states of your data for up to 90 days
- Fail-Safe: a 7-day disaster recovery window you cannot turn off
This is the part that gets people. A 1 TB table that’s getting updated multiple times a day can balloon to 25 TB of billed storage because Snowflake is retaining every prior version of every micro-partition you’ve touched. Your dashboard says “1 TB table.” Your invoice says otherwise.
And compute? Virtual Warehouses bill per second, but with a 60-second minimum every single time you resume or resize. Aggressive auto-suspend sounds like a cost optimization. It’s not. If you’re spinning a warehouse up and down every 30 seconds, you’re paying the 60-second minimum every time and quietly multiplying your bill.
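The 60-second minimum is easy to model. The sketch below assumes each entry in the list is one separate resume of the warehouse; the function name and the sample workload are illustrative only.

```python
def billed_seconds(run_seconds: list[float], minimum: float = 60.0) -> float:
    """Each resume is billed per second, but never for less than `minimum` seconds."""
    return sum(max(seconds, minimum) for seconds in run_seconds)


# Hypothetical pattern: the warehouse wakes up 20 times for 10-second queries.
bursts = [10.0] * 20
actual = sum(bursts)
billed = billed_seconds(bursts)
multiplier = billed / actual
```

In this example the warehouse does 200 seconds of work and is billed for 1,200, a 6x multiplier. A slightly longer auto-suspend that lets those queries share one resume would bill far less.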
What I’d Actually Do
A few things I’d put on the wall before signing anything:
- Model egress, not storage. Run your worst-case query pattern through the calculator. Storage is noise.
- Lifecycle everything. Going by the prices above, cool tiers run about half the cost of hot storage and deep archive is roughly 20x cheaper. If your data is older than 90 days and nobody’s queried it, it shouldn’t be in hot storage.
- For Databricks: push every recurring workload to job compute. Audit interactive cluster usage monthly.
- For Snowflake: if you have high-frequency update patterns, profile your actual storage footprint, not your logical table size. The gap will surprise you.
- For multi-cloud: don’t. Egress will eat the savings before you finish the architecture diagram.
The vendors all have a story about why their model is the cheap one. Read past the per-GB number on the slide. The bill is somewhere else.
Happy modeling.
Sources
- Databricks Pricing Explained (Dawiso) — Two-Bill Model, DBU breakdown
- Snowflake Pricing Explained (SELECT.dev) — Time Travel storage multiplier, micro-partition behavior
- Cloud & AI Storage Pricing Comparison 2026 (Finout) — AWS / Azure / GCP per-GB and tier pricing
- S3 vs GCS vs Azure Blob Storage (ai-infra-link) — Egress and API operation pricing
- Snowflake Pricing in 2026 (CloudZero) — Virtual Warehouse 60-second minimum behavior
/ DevOps / Cloud / Data / Snowflake / Databricks
-
AI Code Reviewers Won't Save You
Dropping an AI reviewer into your pull request pipeline is just a band-aid. Tools like CodeRabbit or Greptile are great for catching syntax errors or basic anti-patterns, but they can’t assess architectural intent or domain-specific business logic. They’re spell-checkers for code. Useful, sure. But nobody ever said “our codebase is solid because we run spell check.”
AI doesn’t change your engineering baseline. It just accelerates it. If your foundational guardrails are weak, agentic tools will help your team generate technical debt at unprecedented speeds. So the real question isn’t “how do we review AI code?” It’s “how do we build systems that prevent slop from ever reaching production?”
Shift Left, Hard
When engineers use agents to scaffold a new Go service or spin up a SvelteKit frontend, they’re inevitably pulling in generated dependencies or utilizing unfamiliar libraries. Models hallucinate packages. They suggest insecure patterns with total confidence.
Your CI pipeline needs to be ruthless before a human ever looks at the code. Aggressive SAST and SCA should automatically block PRs that introduce vulnerable dependencies or hardcoded secrets. If the agent generates slop, the pipeline rejects it instantly. No discussion.
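A real pipeline would lean on dedicated SAST, SCA, and secret-scanning tools, but the gating logic itself is simple. The toy below is a sketch of that logic only: the two patterns, the function names, and the sample lines are assumptions for illustration, and it is nowhere near a complete secret scanner.

```python
import re

# Two illustrative patterns; real scanners ship hundreds of rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def scan_added_lines(added_lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every added line that matches a rule."""
    findings = []
    for lineno, line in enumerate(added_lines, start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def gate(added_lines: list[str]) -> int:
    """Exit code for the CI job: non-zero blocks the PR."""
    return 1 if scan_added_lines(added_lines) else 0


sample_diff = [
    'region = "us-east-1"',
    'access_key = "AKIAIOSFODNN7EXAMPLE"',  # AWS's documented placeholder key
]
exit_code = gate(sample_diff)
```

The important property is that the job fails on its own, before a reviewer is ever asked to look.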
Make the Agents Write the Tests
Agents are incredibly eager to generate feature code, but humans are historically lazy about writing the tests for it. The influx of AI-generated code means human reviewers can’t possibly step through every logic branch manually.
So flip the script. Use the agentic tools to build the guardrails themselves. Mandate that any generated feature code must be accompanied by generated, human-verified unit tests. If an agent writes a sprawling TypeScript function, the build should fail if the test coverage doesn’t meet a strict threshold. You’re already using AI to write the code. Use it to prove the code works, too.
Context Boundaries Matter
Bloated AI output often happens because the model is given too much context or allowed to generate too much at once. Heavyweight IDEs with aggressive multi-file auto-completion can easily create cascading messes across a codebase.
Define strict architectural boundaries and API contracts upfront. Agents should be tasked with solving small, well-defined, modular problems. “Write a function that parses this specific JSON schema” is a good prompt. “Build the backend” is not. The tighter the scope, the less room for generated nonsense.
Observability Is Your Safety Net
You can’t catch all generated slop at the PR level. Some of it only reveals itself under load. An agent might write a technically correct query that causes an N+1 database issue, or introduce a subtle memory leak that passes all unit tests.
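For anyone who hasn’t run into the N+1 pattern, here is a minimal illustration using an in-memory SQLite database. The schema and data are invented for the example. Both functions return the same result, which is exactly why unit tests won’t flag the slow one.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ada'), (2, 'grace'), (3, 'linus');
        INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
    """)
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """Same answer from a single joined query."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result, 1


conn = setup()
slow_result, slow_queries = titles_n_plus_one(conn)
fast_result, fast_queries = titles_single_join(conn)
```

With three authors the difference is four queries versus one. With thirty thousand authors under load, it is the kind of regression only runtime telemetry will show you.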
Your ultimate safety net is what happens at runtime. You need an airtight observability stack to trust the velocity AI brings. Logs, distributed tracing, metrics, all feeding into dashboards your team actually watches. When generated code hits staging, you need the immediate telemetry to spot performance regressions before they reach production.
Redefine the Human Review
Because AI makes the “typing” part of coding trivial, the human code review needs to fundamentally shift. Reviewers should no longer be looking for missing semicolons. They should be asking: “Does this component fit our architecture?” and “Did the agent over-engineer this solution?”
Train your senior engineers to review for intent and systemic impact. That’s the stuff AI genuinely can’t do yet. Leave the syntax checking to the robots.
/ DevOps / AI / Software-development / Code-review
-
What Is a Runbook and Why Should You Care?
If you’ve ever been woken up at 3 AM by a pager and stared at your screen trying to remember how the database failover works, you already know why runbooks matter. You just might not have had one yet.
A runbook is a step-by-step guide for handling a specific operational scenario. Database goes down? There’s a runbook for that. Failed deployment needs a rollback? Runbook. Routine certificate rotation? You get the idea. They range from simple markdown files to fully automated scripts where a human only needs to click “approve.”
That’s the idea, anyway. The difference between having good runbooks and having none can be massive.
Why They Matter
When something breaks in production, your brain is not at its best. Adrenaline kicks in, Slack is blowing up, and suddenly you can’t remember if you’re supposed to restart the service first or check the connection pool. A runbook takes the thinking out of the equation. You follow the steps. You restore the service. You go back to sleep.
This directly lowers your Mean Time To Recovery (MTTR). Instead of spending twenty minutes in a group call debating what to try next, you open the runbook and start executing.
Runbooks also solve the consistency problem. If five different engineers respond to the same alert five different ways, you’re rolling the dice every time. One of those approaches might cause a secondary outage. A runbook ensures everyone follows the same diagnostic and remediation path, which means fewer surprises.
And then there’s the tribal knowledge issue. Every team has that one senior engineer who knows exactly how to fix the weird thing that happens once a quarter. What happens when they’re on vacation? Or they leave the company? A runbook gets that knowledge out of their head and into a document the whole team can use.
It also makes onboarding way faster. New engineers can start handling on-call rotations with confidence instead of hoping nothing breaks on their watch.
Treat Them Like Code
This is the part a lot of teams get wrong. Runbooks shouldn’t live in a random Confluence page that hasn’t been updated since 2023. They should live in version control. Sometimes they’re kept in the repo with the code. Other times they’re kept in a separate repo. Where they live is up to you and your team.
If a developer changes how a service authenticates or connects to a database, the associated runbook needs to be updated in the same pull request. An outdated runbook is worse than no runbook at all. It sends engineers down the wrong path during an outage, which burns time and trust.
Share Early, Share Often
A runbook sitting in someone’s private folder is doing exactly nothing for your team.
Start during the draft phase. Have someone who didn’t write the runbook try to follow it. If they get confused or stuck, the runbook needs work. This is the cheapest way to find gaps.
When a new service is heading to production, the runbook should be part of the readiness review. I’d argue a service shouldn’t go live without one. And after an incident, if the runbook was wrong or didn’t exist, creating or fixing it should be a mandatory action item from the post-mortem.
One more thing. Practice them. Run game days where the team actually walks through runbooks before a real emergency happens. The worst time to discover your runbook has a missing step is when production is on fire.
So Here We Are
Runbooks aren’t glamorous. Nobody’s giving a conference talk about the beautiful runbook they wrote last quarter. But they’re the difference between a calm, methodical incident response and a panicked Slack thread full of guesses. Write them, version them, share them, and practice them. Your future self will thank you.
/ DevOps / Sre / Runbooks / Incident-response
-
What Temporal Actually Does (And Why You'd Want It)
Building a multi-step process across microservices usually goes something like this. You wire up a message queue, add retry logic, build a state machine backed by a Postgres status column, throw in some cron jobs, and pray. It sounds complicated because it is.
Temporal is an open-source “durable execution” system that replaces all of that duct tape with a single, opinionated framework. Let’s break it down.
Workflows and Activities
Temporal splits your application into two concepts:
- Workflows are your business logic, written in standard code (Go, Python, TypeScript) using a Temporal SDK. They must be deterministic. They define the order of operations, branching, loops, and error handling.
- Activities are the actual tasks your services perform. HTTP requests, database writes, external API calls. Activities are where the non-deterministic, real-world work happens.
When a workflow runs, it executes on your own worker services. Every time it schedules an activity, starts a timer, or completes a step, the Temporal Server records that event internally. If the worker crashes, another worker picks it up, replays the workflow’s event history to the exact point of failure, and resumes. No data loss. No half-finished state.
Everything you would otherwise have to build yourself is folded into a framework that handles it for you.
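The replay idea is easier to see in a toy. The sketch below is not the Temporal SDK and does not reflect its internals; `ToyWorkflowRunner`, `order_workflow`, and the activity names are all invented to show why recorded history plus deterministic workflow code lets a new worker resume where the old one stopped.

```python
class ToyWorkflowRunner:
    """Toy replay loop: recorded results are reused, new steps are executed and recorded."""

    def __init__(self, history=None):
        self.history = list(history or [])  # (activity_name, result) pairs, in order
        self._cursor = 0
        self.executed = []                  # activities actually run by this worker

    def run_activity(self, name, fn, *args):
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                # The workflow took a different path than the one recorded.
                raise RuntimeError(f"non-deterministic workflow: expected {recorded_name}, got {name}")
        else:
            result = fn(*args)
            self.history.append((name, result))
            self.executed.append(name)
        self._cursor += 1
        return result


def order_workflow(runner, order_id):
    # Deterministic orchestration; the side effects live in the activities.
    charge = runner.run_activity("charge_card", lambda oid: f"charge-{oid}", order_id)
    shipment = runner.run_activity("ship_order", lambda oid: f"shipment-{oid}", order_id)
    return charge, shipment


# Pretend the first worker crashed right after charging the card.
history_before_crash = [("charge_card", "charge-42")]
second_worker = ToyWorkflowRunner(history_before_crash)
result = order_workflow(second_worker, 42)
```

The second worker replays the recorded charge instead of charging the card again, then runs only the shipping step. The name check in `run_activity` is also a hint at why workflow code has to be deterministic.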
What It Replaces
Without something like Temporal, teams generally land in one of two camps:
- Choreography (event-driven): Services emit and listen to events through a message broker like Kafka or RabbitMQ. Highly decoupled, sure. But in practice it turns into a pinball machine. There’s no single place to understand the flow of a business transaction. Debugging becomes detective work across dozens of services and topics.
- Ad-hoc orchestration: You build a custom state machine with a database, message queues, background workers, and cron jobs. Then you write a ton of boilerplate for retries, dead-letter queues, and idempotency. Every team ends up building a slightly different version of this, and none of them are great.
Temporal gives you the reliability of a custom state machine without making you build and maintain one.
Why It’s Worth Looking At
A few things stand out:
- Durable sleep. A workflow can execute sleep(30_DAYS). Temporal suspends the execution, frees the worker’s resources, and wakes it back up a month later exactly where it left off. Hard to do with a cron job.
- Built-in resiliency. Exponential backoffs, timeouts, and retry policies are configured on the activity invocation. You’re not writing custom while loops and try/catch blocks to handle network jitter.
- Centralized observability. Instead of piecing together distributed traces or searching through logs to figure out why step 4 of 7 failed, the Temporal UI shows the exact execution state of every workflow. Inputs, outputs, errors, all in one place.
- Code over configuration. Unlike AWS Step Functions, where workflows are declared in a JSON state language, or other configuration-heavy tools, you write workflows in a real programming language. You can unit test them, store them in version control, and run them through your normal CI/CD pipeline.
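To make the built-in resiliency point concrete, this is roughly the loop a retry policy saves you from writing by hand. It is a plain-Python sketch, not Temporal code; the parameter names are my own, loosely modeled on the usual retry-policy knobs (initial interval, backoff coefficient, maximum interval, maximum attempts).

```python
import time


def call_with_retries(fn, max_attempts=5, initial_interval=1.0,
                      backoff=2.0, max_interval=30.0, sleep=time.sleep):
    """Call `fn`, retrying on any exception with exponentially growing delays."""
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * backoff, max_interval)


# A fake flaky dependency that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network jitter")
    return "ok"


recorded_sleeps = []
outcome = call_with_retries(flaky, sleep=recorded_sleeps.append)
```

Here the call succeeds on the third attempt after waiting 1 and then 2 seconds. Every team that hand-rolls this also has to decide where the attempt counter lives when the process itself dies, which is the part durable execution takes off your plate.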
That last point is worth reading and thinking through again. If your orchestration logic lives in code, it gets all the benefits code gets. Reviews, tests, refactoring, IDE support. Visual workflow builders look great in demos, but they don’t scale the way code does.
Should You Use It?
Temporal isn’t free in terms of operational complexity. You’re running the Temporal Server (or paying for Temporal Cloud), and your team needs to understand the replay model and determinism constraints. It’s not something you bolt on to a simple CRUD app.
But if you’re managing distributed transactions with queues, cron jobs, and hand-rolled state machines, Temporal is worth a serious look. It takes the hardest parts of that problem and makes them someone else’s. Durability, retries, observability. All handled.
-
pgvector vs Pinecone: You Probably Don't Need a Separate Vector Database
Every time someone starts building a RAG pipeline, the same question comes up: do I need a “real” vector database like Pinecone, or can I just use pgvector with the Postgres I already have?
I can imagine teams agonizing over this decision for weeks. Maybe this will save you some time.
The Case for Staying Put
If you already have a PostgreSQL instance in your stack, adding pgvector is almost always the right first move.
You manage one stateful service instead of two. Your existing backup strategy, monitoring, and security all stay the same. Your vector embeddings live next to your metadata, so you get ACID compliance and standard SQL joins. No syncing between two data stores. No eventual consistency headaches.
Performance? From what I found, for datasets under a few million vectors, pgvector with HNSW indexes is fast. Really fast. It satisfies the latency requirements of most applications without breaking a sweat.
And you’re not paying for another SaaS subscription…
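If the similarity search itself feels like magic, it helps to see the exact version. The sketch below does a brute-force cosine-distance search in plain Python over rows that carry their metadata alongside the embedding. The rows and the two-dimensional vectors are made up; in pgvector the same ordering is expressed with the cosine distance operator (`<=>`), and an HNSW index approximates this result without scanning every row.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def nearest(query: list[float], rows: list[dict], k: int = 2) -> list[dict]:
    """Exact k-nearest-neighbour search: score every row, keep the k closest."""
    return sorted(rows, key=lambda row: cosine_distance(query, row["embedding"]))[:k]


# Hypothetical documents with embeddings stored next to their metadata.
rows = [
    {"id": 1, "title": "refund policy", "embedding": [1.0, 0.0]},
    {"id": 2, "title": "office hours", "embedding": [0.0, 1.0]},
    {"id": 3, "title": "returns and refunds", "embedding": [1.0, 1.0]},
]

top = nearest([1.0, 0.0], rows, k=2)
top_ids = [row["id"] for row in top]
```

The query lands on rows 1 and 3, in that order. The scale question in the rest of this post is really about how long you can afford an index that approximates this scan inside the database you already run.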
When Pinecone Actually Makes Sense
Pinecone is a purpose-built vector database designed for high-dimensional data at massive scale. It’s serverless and fully managed.
If you’re dealing with hundreds of millions or billions of vectors, a specialized engine handles memory and disk I/O for similarity searches more efficiently than Postgres can. Pinecone also gives you native namespace support, metadata filtering optimized for vector search, and live index updates that are faster than re-indexing a large Postgres table.
Those are real advantages. At a certain scale.
The Decision Is Simpler Than You Think
Stay with Postgres + pgvector if:
- You want to minimize infra sprawl and moving parts
- Your vector dataset is under 5 to 10 million records
- You rely on relational joins between vectors and other business data
- You have existing observability and DBA expertise for Postgres
Consider Pinecone if:
- Your Postgres instance needs massive, expensive vertical scaling just to keep the vector index in memory
- You don’t want to tune HNSW parameters, mmap settings, or vacuuming schedules for large vector tables
- You need sub-millisecond similarity search at a scale where Postgres starts to struggle
That is what I would use to make that decision.
Most teams are probably nowhere near the scale where Pinecone becomes necessary. They have a few hundred thousand vectors, maybe a million or two. Postgres handles that without flinching. Adding a separate managed vector database at that point is just adding operational complexity for no measurable benefit.
The trap is thinking you need to “plan ahead” for scale you don’t have yet. You can always migrate later if you actually hit the ceiling. Moving from pgvector to Pinecone is a well-documented path. But moving from two services back to one because you overengineered your stack? That’s a conversation nobody wants to have.
Start with what you have. Add complexity when the numbers force you to, not when a vendor’s marketing page makes you nervous.
/ DevOps / AI / Programming / Databases
-
How Kong Actually Works in Kubernetes
At some point with microservices in Kubernetes, basic Ingress routing stops being enough. Kong is an interesting gateway that I would like to try in the future.
It’s an API Gateway built on top of NGINX and OpenResty. It operates at the infrastructure layer, managing the actual HTTP traffic flowing into your cluster. Drop it into a Kubernetes environment and it acts as an Ingress Controller. It does that job really well.
The Ingress Controller Problem
First, a quick review of what an ingress controller does, in case you’re unfamiliar with its job in Kubernetes. An Ingress resource is just a set of routing rules. “Send traffic for api.example.com/v1 to the user-service pod.” Kubernetes doesn’t actually route traffic itself. It needs a controller to read those rules and move the packets.
The Kong Ingress Controller (KIC) runs as a pod inside your cluster. It watches the Kubernetes API server for changes to Ingress resources, Services, and Endpoints. When someone deploys a new app and creates an Ingress rule, KIC picks it up, translates the Kubernetes config into Kong’s native format, and reloads the proxy. No manual intervention.
How Traffic Actually Flows
When external traffic hits your cluster, the path looks like this:
- External Load Balancer forwards traffic to the Kong proxy pods
- Kong evaluates the incoming request against its routing table (headers, paths, hostnames)
- Plugins execute before routing, handling cross-cutting concerns at the edge instead of inside your application code
- Upstream routing sends traffic directly to Pod IPs, bypassing kube-proxy for better performance
That plugin step is where Kong really earns its keep. Rate limiting, API key auth, mTLS, request transformation. All of that happens at the gateway layer so your services don’t have to think about it.
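The flow above can be sketched as a toy gateway in a few lines of Python. Nothing here is Kong’s actual code or configuration format: the `Route` shape, the `require_api_key` plugin, the `apikey` header name, and the pod IP are assumptions chosen to mirror the steps in the list.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Request = dict
Plugin = Callable[[Request], Optional[int]]  # returns an HTTP status to reject, or None


@dataclass
class Route:
    host: str
    path_prefix: str
    pod_ips: list[str]
    plugins: list[Plugin] = field(default_factory=list)


def require_api_key(request: Request) -> Optional[int]:
    """Toy auth plugin: reject requests that carry no API key header."""
    return None if "apikey" in request.get("headers", {}) else 401


def handle(request: Request, routes: list[Route]) -> tuple[int, Optional[str]]:
    # 1. Match the request against the routing table (host + path prefix).
    route = next(
        (r for r in routes
         if r.host == request["host"] and request["path"].startswith(r.path_prefix)),
        None,
    )
    if route is None:
        return 404, None
    # 2. Run plugins at the edge; any plugin can short-circuit the request.
    for plugin in route.plugins:
        status = plugin(request)
        if status is not None:
            return status, None
    # 3. Forward straight to a pod IP (always the first one here, no load balancing).
    return 200, route.pod_ips[0]


routes = [Route("api.example.com", "/v1", ["10.0.0.12"], [require_api_key])]
anonymous = handle({"host": "api.example.com", "path": "/v1/users", "headers": {}}, routes)
with_key = handle({"host": "api.example.com", "path": "/v1/users",
                   "headers": {"apikey": "demo"}}, routes)
```

The anonymous request is rejected with a 401 before it ever reaches the service, which is the whole appeal: the upstream code never sees unauthenticated traffic.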
CRDs Make It Actually Useful
Standard Kubernetes Ingress is pretty limited. Host-based routing, path-based routing, and that’s about it. Kong extends this with Custom Resource Definitions:
- KongPlugin lets you attach behaviors to routes or services. Deploy a manifest to enforce rate limits, require API keys, or add mTLS to a specific endpoint.
- KongConsumer manages user identities and credentials directly in Kubernetes, so you can tie routing rules or rate limits to specific clients.
This means your API gateway configuration lives right alongside your application manifests. Version controlled, reviewable, deployable through your normal CI/CD pipeline.
Skip the Database
Kong used to require PostgreSQL or Cassandra to store its routing config. In modern Kubernetes deployments, you almost always run it in DB-less mode instead.
Why? Kubernetes already has etcd as its source of truth for cluster state. Running a second database just for the API gateway adds overhead and failure modes you don’t need. In DB-less mode, Kong stores its configuration entirely in memory. The Ingress Controller reads state from Kubernetes and pushes updates to the proxy dynamically.
This is one of those decisions that sounds minor but changes everything about how you operate Kong. No database backups to worry about. No schema migrations. Your gateway config is just Kubernetes manifests managed through GitOps.
Observability at the Edge
Sitting at the edge of the cluster, Kong is perfectly positioned to capture metrics, logs, and traces. With the right plugins, it exports traffic data (latency, status codes, request volumes) directly into whatever observability stack you’re running.
You get visibility across your entire microservice architecture without instrumenting every individual service.
Kong isn’t the only Ingress controller out there, but the combination of plugin architecture, DB-less mode, and CRD-based configuration makes it a solid choice if you need more than basic routing. If you’re already running Kubernetes and find yourself writing the same auth and rate-limiting logic across multiple services, moving that to the gateway layer is worth your time.
/ DevOps / Kubernetes / Kong / Infrastructure
-
Agentic Development Trends: What's Changed in Early 2026
I’ve been following the agentic development space around Claude Code and similar tools, and the last couple of months have been interesting. Here’s what I’m seeing as we move through March and April 2026.
From Solo Agents to Coordinated Teams
The biggest shift is that more people are moving away from trying to build one agent that does everything. Instead, we’re seeing coordinated teams of specialized agents managed by an orchestrator, often running tasks in parallel. I think this is the more proper use of these systems, and it’s great to see the community arriving here.
If you’re curious about the different levels of working with agentic software development, I created an agentic maturity model on GitHub that goes into more detail on this progression.
Long-Running Autonomous Workflows
Early on, agents handled what were essentially one-shot tasks. Now in 2026, agents can be configured to work for days at a time, requiring only strategic oversight at key decision points. Doesn’t that sound fun? You’re still the bottleneck, but at least now you’re a strategic bottleneck.
Graph-Based Orchestration
Frameworks like LangGraph and AutoGen are converging on graph-based state management to handle the complex logic of multi-agent workflows. I think this makes sense: the branching and conditional logic of real-world tasks maps naturally onto graphs.
MCP Is Everywhere
MCP (Model Context Protocol) has become the de facto standard for tool integration. The major vendors support it, and adoption shows no sign of slowing down. Every week there are new MCP servers popping up for connecting agents to different services and tools.
Unified Agentic Stacks
The developer tooling is becoming more consistent. Cursor is becoming more like Claude Code, and Codex is becoming more like Claude Code. Maybe you see a pattern there… might tell you something about who’s setting the pace.
Also notable: people are experimenting with using different tools for different parts of the workflow. You might use Cursor to build the interface, Claude Code for the reasoning and main logic, and Codex for specific isolated tasks. Mix and match based on strengths.
Scheduled Agents and Routines
Claude Code recently released routines: scheduled or trigger-based automations that can run 24/7 on cloud infrastructure without needing your laptop. Microsoft appears to be working on similar capabilities with GitHub Copilot, and Cursor had something like this a while back too.
Security Gets Serious
Two things happening here. First, people are getting better at leveraging agents for security reviews and monitoring. Tasks that previously required highly specialized InfoSec expertise. You no longer need to be a hacker to find vulnerabilities; you can let your AI try to hack you.
However, the same capabilities that harden defenses can also be used for offensive attacks. We’re seeing a major push for security-first architecture as a requirement for all new applications, specifically to defend against the rise of agentic offensive attacks. Red team and blue team are both getting AI-pilled.
FinOps: Watching the Bill
Last on the list is financial operations. Inference costs now account for over half of AI cloud spending according to recent estimates. Organizations are prioritizing frameworks that offer explicit cost monitoring and cost-per-task alerts. Getting granular about how much you’re spending to solve specific problems and optimizing at the task level. I think that’s pretty interesting and something we’ll see a lot more tooling around.
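Cost-per-task tracking does not need much machinery to start with. The sketch below is one way to do it, with loud assumptions: the per-million-token prices, the token counts, the task names, and the budget are all hypothetical numbers, and the function names are my own.

```python
def task_cost(input_tokens: int, output_tokens: int,
              usd_per_million_input: float, usd_per_million_output: float) -> float:
    """Inference cost of one task, given per-million-token prices."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


def over_budget(costs_by_task: dict[str, float], budget_usd: float) -> list[str]:
    """Names of tasks whose cost exceeded the per-task budget, in input order."""
    return [name for name, cost in costs_by_task.items() if cost > budget_usd]


# Hypothetical prices and usage, purely for illustration.
costs = {
    "triage-ticket": task_cost(20_000, 2_000, 3.0, 15.0),
    "refactor-module": task_cost(200_000, 50_000, 3.0, 15.0),
}
alerts = over_budget(costs, budget_usd=1.00)
```

With these made-up numbers the refactor costs about $1.35 and trips the alert, while the triage task costs about nine cents. The useful part is the unit: dollars per solved problem rather than dollars per month.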
The common thread across all of these trends is maturity. We’re past the “wow, an AI wrote code” phase and into “how do we make this reliable, secure, and cost-effective at scale.” That’s a good place to be.
/ DevOps / AI / Development / Claude
-
What Companies Are Actually Paying for Application Security
In the Application Security Testing (AST) market, Static Application Security Testing (SAST) and Software Composition Analysis (SCA) represent the two most critical pillars of preventative cyber defense.
So we should talk about the thing people normally can’t or don’t talk about: cost. Vendors like to hide their pricing behind “contact sales” buttons, and buyers end up negotiating based on hard-to-find information.
So here’s an unofficial look at what companies are actually paying, pulled from a Deep Research report provided by Gemini. At the very end there is a list of resources where you can learn more about these subjects. One caveat worth mentioning: there are not a lot of viable options for the home/hobby market.
What the Market Looks Like
Vendor Average Mid-Market / SMB Spend (Annual) Average Large Enterprise Spend (Annual) Economic Dynamics and Negotiation Factors Snyk ~$47,428 ~$222,516 Costs scale rapidly with developer headcount. Highly susceptible to volume discounting. Total cost includes separate quoting for onboarding and services. Black Duck (Coverity) $60,000 – $120,000 (50-100 devs) $150,000 – $300,000+ (150+ devs) Full platform deployments (SAST + SCA) often range from $300k to $600k+. Volume discounts and custom enterprise agreements are typical. Premium support adds 20-30%. Checkmarx $35,000 – $75,000 $100,000 – $250,000+ Pricing is considered complex. Hidden costs include mandatory professional services, premium support, and infrastructure overhead, adding 15-35% to year-one totals. Veracode $40,000 – $80,000 $100,000 – $250,000+ Application-based pricing feels predictable until microservice architectures cause application counts to explode. Discounts are heavily available for SAST+DAST+SCA bundles. SonarQube $30,000 – $50,000 (up to 5M LOC) $80,000 – $180,000 (5M - 20M+ LOC) Highly predictable LOC model. However, self-managed deployments incur separate infrastructure and administrative overhead costs not reflected in the software license. HCL AppScan $50,000+ $100,000 – $500,000+ Unified platform pricing for large deployments can easily exceed $1M. Implementations often require months of setup and heavy professional service fees. Official Licensing Models and Published Structures
| Vendor / Platform | Primary Pricing Metric | Published Entry-Level / Standard Tier Pricing | Enterprise Pricing Status | Key Inclusions & Pricing Caveats |
| --- | --- | --- | --- | --- |
| Snyk | Per Contributing Developer | Team Tier: ~$52–$98 per developer/month ($624–$1,176/year). | Custom / Unpublished | Includes Snyk Code (SAST) and Open Source (SCA). Enterprise plans drop per-seat costs at high volume but require minimum seat counts. |
| SonarQube | Lines of Code (LOC) Analyzed | Developer Edition: ~$15,000 for 1M LOC. Smaller tiers available (e.g., ~$2,500 for 100k LOC). | Annual Pricing; Talk to Sales | Prices scale strictly by the largest branch of private projects. Enterprise Edition adds legacy languages. Advanced Security is an add-on. |
| GitHub Advanced Security | Per Active Committer | $19/user/month (Secrets) + $30/user/month (Code) = $49/user/month. | Custom / Add-on to Enterprise ($21/user base) | GHAS is strictly an add-on to the GitHub Enterprise plan. Tied directly to commit activity within a 30-day window. |
| Mend.io | Per Contributing Developer | AppSec Platform: Up to $1,000 per developer/year. | Included in upper bound limit | Includes SAST, SCA, Renovate, and AI Inventory. No limits on LOC, scans, or applications. AI Premium is an extra $300/dev. |
| Checkmarx | Custom (Historically Per App or Node) | Team Plans: ~$1,188/year base. Enterprise base starts ~$6,850/year. | Custom / Unpublished | Highly modular pricing based on developer count, module selection (SAST, SCA, DAST), and deployment model. |
| Veracode | Per Application or Per Scan | Basic plans start at ~$15,000/year for up to 100 applications. | Custom / Unpublished | Pricing heavily depends on application count, scan frequency, and support levels. SCA alone starts around $12,000/year. |
| Black Duck (Coverity) | Per Team Member / Custom | Coverity SAST: $800–$1,500 per team member annually. | Custom / Unpublished | Pricing scales with user access. Often bundled. Perpetual licenses with 18-22% annual maintenance fees exist for legacy deployments. |
| Contrast Security | Custom (GiB hour / usage) | Essential tier: $119/mo. Advanced: $359/mo. Enterprise base ~$6,850/yr. | Custom / Unpublished | Pricing varies by package (AST vs. Contrast One managed service) and workload throughput. |
| HCL AppScan | Per Scan / Enterprise License | SaaS: ~$313 per scan (min 5 scans). Basic Codesweep: $29.99/scan. | Custom / Unpublished | Enterprise suite pricing is highly customized, often requiring significant upfront capital expenditure. |

Feature Comparison
| Feature / Capability | Snyk | Veracode | Black Duck | Checkmarx | Mend.io | GitHub (GHAS) | SonarQube | Endor Labs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Primary Strength | Developer Adoption & Speed | Enterprise Governance & Low FPs | License Compliance & Deep SAST | Unified ASPM & Repo Scanning | Automated Remediation | Native Ecosystem Integration | Code Quality & Baseline Security | Noise Reduction & Reachability |
| Reachability Analysis | Basic | No | No | No | Advanced | No | No | Full-Stack (95% reduction) |
| Automated AI Fixes | Yes (DeepCode) | Yes (Proprietary Data) | No | Yes (Limited IDE) | Yes | Yes (Copilot) | Yes (CodeFix) | Yes (Without upgrades) |
| Compilation Required | No | Yes (Binary) | Yes (Coverity) | No | No | No | No | No |
| Broad Language Support | High (14+) | Very High (100+) | High (22+) | High (35+) | Very High (200+) | Moderate | High (40) | Moderate |
| License Compliance | Moderate | Moderate | Enterprise-Grade | Moderate | Enterprise-Grade | Basic | Basic | Moderate |

Learning
-
Is There Something Better Than JSON?
Have you ever looked at a JSON file and thought, “There has to be something better than this”? I have.
JSON has served us well. It works with everything, and it’s human readable. It’s a decent default, don’t get me wrong, but the more you use it, the more painful its limitations become. So before we answer the question of whether there’s anything better, we should describe what’s actually wrong with JSON.
The Problems with JSON
First, there’s no type system. No datetimes, no real integers, no structs, no unions, no tuples. If you need types, and you almost always do, you’re on your own.
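Here’s a minimal sketch of what “you’re on your own” looks like in practice, using only Python’s standard library. The record and its field names are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical record: the field names are illustrative only.
record = {
    "created": datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc),
    "point": (3, 4),  # a tuple
}

# datetime is not a JSON type, so the stdlib encoder refuses it outright.
try:
    json.dumps(record)
    datetime_error = None
except TypeError as exc:
    datetime_error = str(exc)

# The usual workaround: stringify by hand, and remember to parse it back later.
record["created"] = record["created"].isoformat()
round_tripped = json.loads(json.dumps(record))

print(datetime_error)
print(type(round_tripped["created"]).__name__)  # the datetime is now a plain string
print(type(round_tripped["point"]).__name__)    # the tuple is now a list

# JSON has a single number type. A parser that maps it to a 64-bit float
# cannot tell these two integers apart.
print(float(2**53 + 1) == float(2**53))
```

Nothing in the document itself says that the string was once a datetime or that the list was once a tuple. That knowledge has to live in your code or in a schema you maintain separately.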
Second, JSON is simple, which sounds like a feature until you try to store anything complicated in it. You end up inventing your own schema, and the schema tooling out there (JSON Schema, etc.) gets verbose fast. Because the spec is so loose, validation can be inconsistent across implementations.
There’s more: fields can be reordered, you have to receive the entire document before you can start verifying it, and there are no comments. You can’t leave a note for the next person explaining why a config value is set a certain way. That’s a real problem for anything that lives in version control.
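A quick sketch of the comment and ordering problems, again with Python’s standard library. The config keys are hypothetical.

```python
import json

# Hypothetical config with a note for the next person.
config_with_note = """
{
  // keep the timeout high: the upstream API is slow on cold start
  "timeout_seconds": 30
}
"""

# A strict JSON parser rejects the comment.
try:
    json.loads(config_with_note)
    parse_error = None
except json.JSONDecodeError as exc:
    parse_error = exc

print(parse_error)

# Key order carries no meaning: these two documents parse to equal objects,
# so a tool is free to rewrite one as the other and churn your diffs.
a = json.loads('{"host": "db", "port": 5432}')
b = json.loads('{"port": 5432, "host": "db"}')
same_after_parsing = a == b
print(same_after_parsing)
```

Some tools accept comments through looser dialects, but a file written that way is no longer portable JSON.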
The Machine-Readable Alternatives
Now, there are plenty of binary serialization formats that solve some of these issues. Protobuf, Cap’n Proto, CBOR, MessagePack, BSON. They’re all interesting and have their place. But they’re machine readable, not human readable. You can’t just open one up in your editor and make sense of it. So let’s set those aside.
The question I’m more interested in is: is there something better than JSON that you can still read and edit as a text file?
It turns out there are two solid options.
Dhall
Dhall is a programmable configuration language. Think of it as JSON with all the things you wish JSON had: functions, types, and imports. You can convert JSON to Dhall and back, and it’s just a text file you can open in any editor. The name comes from a character in an old video game, and the language itself is interesting enough that it’s worth your time to explore.
CUE
CUE stands for Configure, Unify, and Execute. It’s similar to Dhall in that it fills the gaps JSON leaves behind, like types, validation, and constraints, while staying human readable. Where CUE really pulls ahead is in its feature set. You can import Protobuf definitions, generate JSON Schema, validate existing configs, and a lot more. In terms of raw capabilities, CUE has more going on than Dhall.
JSON isn’t going anywhere. But if you’re looking for something interesting to explore, check out both of these. They make for fun little side projects.
/ DevOps / Programming / Json / Configuration
-
Multi-Repos Are Underrated
If you’re considering a monorepo, I’d like you to stop and reconsider. Monorepos cause more problems than they solve, and I think multi-repos deserve way more love than they get.
The “Shared Libraries” Argument
The pitch usually goes something like this: “If we put everything in a monorepo, we can have shared libraries across multiple applications.” Okay, sure. But let’s talk about what’s actually happening here.
This is almost always closed-source, internal code. You don’t have a public package registry to lean on. And maybe your org hasn’t approved a private package hosting service. So the monorepo becomes the path of least resistance, not because it’s the best solution, but because nobody wants to fight for the budget to host private packages.
But private package hosting for most languages doesn’t actually cost a lot. You can host private packages in GCP pretty easily, and there are several other affordable options, though the details depend somewhat on the language.
Monorepos often exist because nobody fought for the right infrastructure, not because it was the right call.
Coupling Will Eat You Alive
Probably the biggest problem with monorepos is coupling. You can very easily introduce tightly coupled dependencies across several applications. Now you can’t update your libraries safely because two completely different applications are using the same one, and nobody wants to touch it.
You know that feeling within a single application where there’s tightly coupled code without proper abstractions? Congratulations, now you have that problem across several different applications.
This is why we have packages with versions. Would you make breaking changes to an API and not version it?
Are we not engineers dedicated to a craft? Version your packages. Version your APIs.
Let the applications that pull in dependencies manage their own upgrades. If something worked on version 1.2 and breaks on 1.3, either fix your application or stay on the old version. That’s the whole point of versioning.
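To make that concrete, here’s a rough sketch of per-application pinning. The app names, version numbers, and the `resolve` helper are all hypothetical, and real package managers have much richer constraint syntax. The idea is the same, though: each application picks up only the releases its own pin allows.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn "1.2.4" into (1, 2, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(pin: str, published: list[str]) -> str:
    """Return the highest published version that starts with the pinned prefix."""
    prefix = parse(pin)
    matches = [v for v in published if parse(v)[: len(prefix)] == prefix]
    if not matches:
        raise LookupError(f"no published version matches pin {pin!r}")
    return max(matches, key=parse)


# Hypothetical releases of one shared library.
published = ["1.2.0", "1.2.4", "1.3.0"]

# Each application owns its pin and upgrades on its own schedule.
pins = {"billing": "1.2", "reports": "1.3"}

resolved = {app: resolve(pin, published) for app, pin in pins.items()}
print(resolved)
```

Publishing 1.3.0 changes nothing for the app pinned to 1.2 until its owners decide to move the pin. In a monorepo with a single shared copy of the library, that choice doesn’t exist.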
CI/CD Becomes a Nightmare
Monorepos make your CI/CD pipelines absolutely terrible to work on. Not only do they make it harder for everyone on the team to work with their applications day-to-day, but your build and deploy pipelines also turn into a tangled mess.
There are going to be undocumented parts of the monorepo tooling, like little hidden landmines waiting to kneecap you when you least expect it.
What About NX?
Yes, I’ve used NX. I don’t want to get into it and re-traumatize myself, but chances are most of your team secretly hates it. I’ll use it if I’m forced to. But if it’s my decision? No-thX.
A Multi-Repo Example
From my own work: api2spec has fixture repos for Hono, Express, chi, gin, Fastify, and many more, all in separate repositories.
They test the same tool against different frameworks across many different programming languages. Putting them in a monorepo would’ve complicated things significantly. Instead, we have separate repos under the same GitHub org with a consistent naming convention. Simple, not stupid.
For The Love of All that Is Holy Do Yourself A Favor and stick with Multi-Repos
Multi-repos give you clear boundaries, independent versioning, simpler CI/CD, and teams that can move without stepping on each other.
Yes, the overhead of managing separate repositories is real, but it’s manageable and, with good hygiene, the much-preferred path over a never-ending battle with your own tooling.
The monorepo pitch sounds great in a meeting. The reality is coupling, pipeline complexity, and a team that’s afraid to merge.
-
I switched to mise for version management a month ago. No regrets. No more brew upgrade breaking Python. The built-in task runner replaced the Makefiles in some of my projects. Still juggling nvm + pyenv + rbenv?
/ DevOps / Programming / Tools