Principal Engineer

About Ryan Flynn.

Principal Engineer building agentic AI for platform engineering.

What I’m focused on now

Production AI for the parts of engineering where context matters.

I build production-grade AI systems for the parts of engineering where context matters: Kubernetes, AWS, ArgoCD, logs, secrets, service topology, and platform operations.

My current focus is Atlas, a Slack-native multi-agent AI platform that helps engineers move from symptom to root cause in minutes instead of hours. As Technical Lead, I designed the routing-planning-orchestration pipeline, wired LLM-as-judge testing into CI on a corpus of 170+ cases, and shipped an autonomous multi-level investigation engine that compresses 30–60 minute log dives into sub-5-minute Slack interactions.

Why this frontier

Why platform engineering is the right place for agentic AI.

Most AI work is racing toward general chat. Platform engineering is the opposite shape of problem: dense context, narrow but consequential surfaces, expensive human time, and an answer that can be graded against reality. That combination is rare, and it’s exactly where agents earn their keep.

An engineer staring at a Slack alert isn’t typing “write me a poem.” They need to know which service broke, which deploy preceded it, which dependent is degraded, which secret is rotating, and which log line tells the truth. That investigation is structured, repeatable, and gradeable — the three properties an agent system can actually be engineered against.

It’s also a place where wrong answers are visible. There’s no faking a root cause. That keeps the work honest.

Background

Years of production Kubernetes before any of the AI work.

Before the agents, the platform. I came to AI engineering through years of fleet-scale EKS operations, GitOps, release engineering, Spark-on-Kubernetes cost work, and the unglamorous Day-2 work that keeps clusters boring. That foundation shapes everything about how I build AI now — because I know what a real production system looks like when it’s failing.

It’s also why my AI work is opinionated. An agent that hasn’t seen a real cluster degrade can’t help you debug one. The infrastructure background is the unfair advantage.

How I think about AI engineering

Five principles I keep coming back to.

01 · Eval-driven, not vibes-driven.

A gold dataset, LLM-as-judge graders, and a CI gate are non-negotiable. If you can’t measure regressions, you don’t have an AI system — you have a demo.
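
A minimal sketch of what such a gate can look like, assuming a tiny in-memory gold set and a keyword check standing in for the LLM-as-judge call. Every name here (GoldCase, keyword_judge, ci_gate, the sample cases) is hypothetical and illustrative, not Atlas code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldCase:
    """One entry in the gold dataset (hypothetical shape)."""
    case_id: str
    question: str
    expected: str


# A judge returns True when the answer is acceptable for the case.
Judge = Callable[[GoldCase, str], bool]


def keyword_judge(case: GoldCase, answer: str) -> bool:
    # Stand-in for an LLM-as-judge grader: a real grader would call a model.
    return case.expected.lower() in answer.lower()


def pass_rate(cases: Sequence[GoldCase],
              answer_fn: Callable[[str], str],
              judge: Judge) -> float:
    """Fraction of gold cases whose answer the judge accepts."""
    if not cases:
        raise ValueError("gold dataset is empty")
    passed = sum(1 for c in cases if judge(c, answer_fn(c.question)))
    return passed / len(cases)


def ci_gate(cases: Sequence[GoldCase],
            answer_fn: Callable[[str], str],
            judge: Judge,
            threshold: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(cases, answer_fn, judge) >= threshold else 1


# Illustrative data only.
GOLD = [
    GoldCase("c1", "Why is checkout failing?", "expired secret"),
    GoldCase("c2", "Why is search slow?", "node pressure"),
]


def fake_agent(question: str) -> str:
    # A deliberately weak agent that always gives the same answer.
    return "Root cause: expired secret on the payments client."
```

In a CI job, the return value of ci_gate would be passed to sys.exit so that a drop below the threshold fails the build.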

02 · Modular agents over one giant prompt.

Routing, planning, and sub-agent orchestration let each stage do one thing well and fail visibly. Monolithic prompts hide where the failure came from.
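
One way to picture "fail visibly" is a pipeline where every error carries the name of the stage that raised it. The sketch below assumes keyword routing and a static plan table purely for illustration; StageError, PLANS, and the step names are hypothetical, and a real system would put model calls behind each stage.

```python
from typing import Callable, Dict, List


class StageError(Exception):
    """An error tagged with the pipeline stage that produced it."""

    def __init__(self, stage: str, detail: str) -> None:
        super().__init__(f"{stage}: {detail}")
        self.stage = stage


def route(message: str) -> str:
    """Stage 1: pick a domain for the request (keyword match as a stand-in)."""
    text = message.lower()
    if "log" in text or "error" in text:
        return "logs"
    if "deploy" in text or "rollout" in text:
        return "deploys"
    raise StageError("routing", f"no domain matched {message!r}")


# Hypothetical static plans; a planner model would produce these in practice.
PLANS: Dict[str, List[str]] = {
    "logs": ["fetch_logs", "summarize"],
    "deploys": ["list_deploys", "summarize"],
}


def plan(domain: str) -> List[str]:
    """Stage 2: turn a domain into an ordered list of steps."""
    if domain not in PLANS:
        raise StageError("planning", f"no plan for domain {domain!r}")
    return PLANS[domain]


def orchestrate(steps: List[str],
                sub_agents: Dict[str, Callable[[str], str]],
                message: str) -> List[str]:
    """Stage 3: run each step with its sub-agent, in order."""
    results = []
    for step in steps:
        agent = sub_agents.get(step)
        if agent is None:
            raise StageError("orchestration", f"no sub-agent for step {step!r}")
        results.append(agent(message))
    return results


def handle(message: str, sub_agents: Dict[str, Callable[[str], str]]) -> List[str]:
    return orchestrate(plan(route(message)), sub_agents, message)
```

Because each stage raises its own tagged error, a failed request says whether routing, planning, or orchestration was at fault, which a single monolithic prompt cannot do.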

03 · Topology beats text.

Infrastructure agents need a model of relationships — accounts, clusters, apps, traffic paths, secrets — not just a wall of log lines. A graph is the right primitive.
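
To make the graph idea concrete, here is a small sketch that treats topology as a dependency graph and walks it breadth-first to find everything downstream of a resource. The resource names and the DEPENDENTS table are invented for illustration, and a real topology model would be far richer than an adjacency dict.

```python
from collections import deque
from typing import Dict, List

# Edges point from a resource to the things that depend on it.
# All names below are hypothetical examples.
DEPENDENTS: Dict[str, List[str]] = {
    "secret/db-password": ["app/orders"],
    "app/orders": ["app/checkout"],
    "app/checkout": ["ingress/storefront"],
    "cluster/prod-1": ["app/orders", "app/search"],
}


def blast_radius(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Return every resource downstream of `start`, in breadth-first order."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order
```

With this table, a rotating database secret resolves to the orders app, then checkout, then the storefront ingress: a chain that is hard to recover from log text alone.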

04 · Slack is the interface.

Engineers already debug in Slack. Meeting them there beats forcing adoption of yet another dashboard.

05 · Investigations are autonomous, multi-level, and explainable.

An agent shouldn’t need a human in the middle to keep going from symptom to root cause — and it should be able to show its work when it’s done.

Selected outcomes

The numbers behind the work.

$1M+

Annualized engineering productivity reclaimed.

4.5 hrs

Reclaimed per engineer per week from manual debugging.

99.7%+

Accuracy on the gold evaluation set.

170+

Case corpus for LLM-as-judge regression gating.

Foundation: production infrastructure and EKS

The platform work that grounds everything else.

Before the AI work, I spent years on EKS fleet operations, GitOps, release engineering, Spark-on-Kubernetes cost tuning, and the Day-2 work that keeps clusters boring. The themes were familiar: make upgrades boring, prove it with SLOs, prefer guardrails over heroics, and design for reversibility.

That experience is what makes the AI work credible. An agent designed by someone who has never been paged at 3 a.m. can’t help someone who has. Detailed case studies of the foundation:

100% uptime across 20 EKS clusters and 1,000+ nodes — phased rollouts, GitOps, rehearsed rollbacks, SLO-driven promotion.
$1M+ saved on Spark-on-Kubernetes (~49% spend reduction) — Spark Operator, Karpenter, spot decommission, per-job cost attribution.
Halved time-to-merge with release guardrails — semantic versioning as contract, policy-as-code, image provenance, reversibility by design.

Where to start reading

If you’re new here, start with these.

The Start Here guide walks you through the five articles that frame everything else on the site — positioning, architecture, pipelines, evaluation, and investigation.

Read the Start Here Guide →

Browse Articles

Want practical notes on agentic AI for platform engineering?

Get the newsletter, or find me on LinkedIn.

Join the Newsletter →

Connect on LinkedIn
