Foundation: Production Infrastructure Case Studies

Foundation

Production infrastructure work that grounds the AI.

Before the agents, the platform. These case studies are the EKS, Spark, and release-engineering work that informs every design decision in my AI systems today.

Result: 100% uptime

Zero downtime across 20 EKS clusters

How a Fortune 500 platform team upgraded 1,000+ nodes through multiple Kubernetes minor versions without a single unplanned outage, using phased rollouts, rehearsed rollbacks, and policy gates.

Read the case study →

Result: $1M+ saved / −49%

$1M+ saved on Spark-on-Kubernetes

Moving a data team from a managed Spark service to a tuned Spark-on-EKS setup — Karpenter-driven provisioning, checkpoint-aware spot handling, and per-job cost attribution.

Read the case study →

Result: −50% TTM

Halving time-to-merge with release guardrails

GitOps, semantic versioning, policy-as-code, and image provenance — turning multi-day releases into a reversible, boring non-event.

Read the case study →

From this foundation, the AI work.

These platform lessons shape the agentic systems I build today. The articles are where I write those lessons down.