Foundation
Production infrastructure work that grounds the AI.
Before the agents, the platform. These case studies are the EKS, Spark, and release-engineering work that informs every design decision in my AI systems today.
Result: 100% uptime
Zero-downtime across 20 EKS clusters
How a Fortune 500 platform team kept 1,000+ nodes upgraded through multiple Kubernetes minor versions without a single unplanned outage — phased rollouts, rehearsed rollbacks, and policy gates.
Result: $1M+ saved / −49%
$1M saved on Spark-on-Kubernetes
Moving a data team from a managed Spark service to a tuned Spark-on-EKS setup — Karpenter-driven provisioning, checkpoint-aware spot handling, and per-job cost attribution.
Result: −50% TTM
Halving time-to-merge with release guardrails
GitOps, semantic versioning, policy-as-code, and image provenance — turning multi-day releases into a reversible, boring non-event.
From this foundation, the AI work.
These platform lessons shape the agentic systems I build today. The articles are where I write that down.