
What Production Infrastructure Agents Need to Know

The agent loop is half of what it takes. The other half is request context, retries, identity, observability, graceful degradation — the boring infrastructure that determines whether an agent runs in prod or stays a demo.

When you read articles about agentic systems, almost all of them are about the agent loop — planner, tools, reflection, evaluation. Those are the interesting parts intellectually. They’re also less than half of what it takes to run an agent in production.

The other half is boring. Request context, retries, timeouts, secret management, identity propagation, observability, failure modes, graceful degradation. Skipping these doesn’t show up in a demo. It shows up at 2am when one MCP server is flapping and your agent is silently failing every other request.

Here’s what I’ve learned shipping a real one.

Request context is the spine

Every request that enters the system needs a request ID, a user identity, a transport (Slack, REST, gRPC), and enough context to trace through every layer. Without this, debugging a misbehaving request means correlating timestamps across half a dozen logs by eye.

The way I’d put it: the request context is the spine. Every log line, every tool call, every sub-agent invocation should carry the request ID. When something goes wrong, you grep one ID and get the entire story.

The instrumentation isn’t fancy — a RequestContext dataclass that gets built at the entry point and threaded through every async call. The win is that once it’s in place, the cost of new observability is near zero. Trace logging for sub-agents, audit logs for tool calls, structured error logs — they all just include the context, and Splunk does the rest.

The one detail that matters: propagate it through every entry point, including the ones you don’t think need it. REST, Slack, gRPC, scheduled jobs, MCP servers. The day a scheduled background task malfunctions, you’ll want the same correlation ID structure as the user-driven path.
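A minimal sketch of that spine, assuming Python and the standard library only. The RequestContext name comes from the text above; its fields, the bind_context and current_context helpers, and the ContextFilter class are illustrative assumptions. The sketch also swaps explicit threading of the object for contextvars, which asyncio tasks copy when they are created, so the effect is similar.

```python
import contextvars
import logging
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequestContext:
    """Built once at the entry point (hypothetical field set)."""
    user_id: str
    transport: str  # e.g. "slack", "rest", "grpc", "scheduler"
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Holds the context for the current request; None outside a request.
_current = contextvars.ContextVar("request_context", default=None)


def bind_context(ctx):
    """Call at every entry point, including scheduled jobs."""
    return _current.set(ctx)


def current_context():
    return _current.get()


class ContextFilter(logging.Filter):
    """Stamps every log record with the request ID and transport."""

    def filter(self, record):
        ctx = _current.get()
        record.request_id = ctx.request_id if ctx else "-"
        record.transport = ctx.transport if ctx else "-"
        return True
```

With the filter attached to a handler, a format string such as "%(request_id)s %(transport)s %(message)s" puts the ID on every line without each call site having to remember it.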

Identity and approval are not the model’s job

Agents that take action — write a file, post a comment, create a Jira ticket, deploy code — need an identity model that the LLM doesn’t get to decide. Specifically: the actor (who’s calling), the subject (what they’re acting on), and the delegation (who they’re acting on behalf of) need to be enforced outside the LLM call.

We use three modes for this — compat, strict, pending. The strict mode rejects any approval whose action fingerprint doesn’t match the originally requested action, even if the LLM rewrites the request mid-turn. The compat mode permits drift for backwards compatibility with older Slack workflows. The pending mode requires human approval before any side-effecting action.

The reason this can’t live in the prompt: the LLM is the thing being checked. A prompt that says “only take action X if approved” is honored by the model most of the time and ignored just often enough to be unsafe for anything that touches production. The identity and approval checks happen at the tool-call boundary, in code, regardless of what the model decided.

This is one of those infrastructure investments that looks like overengineering until the first time you don’t have it. We had a Jira-ticket creation flow where the LLM, mid-conversation, decided to file the ticket against the wrong project. Without the action-fingerprint check, that would have shipped. With it, the approval check rejected the action and surfaced the discrepancy to the user.
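Here is a rough sketch of what a fingerprint check at the tool-call boundary could look like. The mode names come from the text; everything else (action_fingerprint, Approval, enforce, the SHA-256 over canonical JSON) is a hypothetical illustration, not the implementation described here. How each mode treats a missing approval is not spelled out above, so the sketch assumes only pending requires one, and that pending also requires a matching fingerprint.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ApprovalMode(Enum):
    COMPAT = "compat"
    STRICT = "strict"
    PENDING = "pending"


class ApprovalRejected(Exception):
    pass


def action_fingerprint(tool: str, args: dict) -> str:
    """Stable hash of the requested action; key order does not matter."""
    canonical = json.dumps({"tool": tool, "args": args},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    fingerprint: str  # fingerprint of the action the human saw
    approved_by: str


def enforce(mode: ApprovalMode, approval: Optional[Approval],
            tool: str, args: dict) -> None:
    """Runs in code, right before the tool executes. Raises on rejection."""
    if approval is None:
        if mode is ApprovalMode.PENDING:
            raise ApprovalRejected("human approval required before this action")
        return  # assumption: strict/compat have nothing to compare against
    if mode is ApprovalMode.COMPAT:
        return  # drift permitted for older workflows
    if approval.fingerprint != action_fingerprint(tool, args):
        raise ApprovalRejected("action differs from what was approved")
```

In the Jira case, the approval would carry the fingerprint of the ticket against the intended project, so a mid-turn rewrite to a different project produces a different hash and fails the strict check.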

Tool calls fail. Plan for it.

This is the single most underrated piece of infrastructure-agent design. Production tools fail in five distinct ways, and your agent has to handle each:

  1. Hard timeout. The tool didn’t respond within the budget. Cancel the call, surface “tool unresponsive” to the model, let it pivot or report.
  2. Soft timeout. The tool responded slowly but did respond. The result is valid; the latency is the issue.
  3. Empty result. The tool worked and returned no data. This is often the correct answer (“no errors in the time range”) but the model has to distinguish it from a failure mode.
  4. Auth failure. The credentials are wrong, expired, or missing. The agent must not retry indefinitely; it has to report the auth issue clearly.
  5. Service degradation. The backend returned a 5xx or rejected the request. Distinct from a no-data response; needs distinct handling.

Each of these should produce a structurally different response envelope. For Loki we have a LokiQueryTimeout envelope on hard timeout, an empty-results indicator on case 3, and a fallback_to_splunk: true flag when the empty result might be due to the cutover date. The agent’s prompt teaches it how to react to each.
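One way to make the five cases structurally distinct is a single envelope type with an explicit status. The sketch below is an assumption-heavy illustration: ToolStatus, ToolEnvelope, classify, the budget defaults, and the retryable flag are all hypothetical names and choices, and the real envelopes (such as LokiQueryTimeout) may look quite different.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class ToolStatus(str, Enum):
    OK = "ok"
    HARD_TIMEOUT = "hard_timeout"
    SLOW = "slow"                  # soft timeout: valid result, high latency
    EMPTY = "empty"                # tool worked, no data
    AUTH_FAILURE = "auth_failure"
    DEGRADED = "degraded"          # backend 5xx or rejected request


@dataclass
class ToolEnvelope:
    status: ToolStatus
    data: Any = None
    retryable: bool = False        # assumption: only 5xx is worth a bounded retry
    detail: str = ""
    elapsed_s: float = 0.0


def classify(http_status: Optional[int], rows: Optional[list], elapsed_s: float,
             soft_budget_s: float = 2.0, hard_budget_s: float = 10.0) -> ToolEnvelope:
    """Map a raw tool outcome to an envelope. http_status=None means no response."""
    if http_status is None or elapsed_s >= hard_budget_s:
        return ToolEnvelope(ToolStatus.HARD_TIMEOUT, detail="tool unresponsive",
                            elapsed_s=elapsed_s)
    if http_status in (401, 403):
        return ToolEnvelope(ToolStatus.AUTH_FAILURE,
                            detail="credentials wrong, expired, or missing",
                            elapsed_s=elapsed_s)
    if http_status >= 400:
        return ToolEnvelope(ToolStatus.DEGRADED, retryable=http_status >= 500,
                            detail=f"backend returned {http_status}",
                            elapsed_s=elapsed_s)
    if not rows:
        return ToolEnvelope(ToolStatus.EMPTY, data=[], detail="query ran, no rows",
                            elapsed_s=elapsed_s)
    status = ToolStatus.SLOW if elapsed_s > soft_budget_s else ToolStatus.OK
    return ToolEnvelope(status, data=rows, elapsed_s=elapsed_s)
```

The point of the explicit status is that the model never has to infer "empty" versus "broken" from the shape of the payload.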

The default behavior (“try the tool; if it fails, retry”) is fine for transient errors and disastrous for auth errors or chronic upstream issues. We had a period where flaky MCP server startup would leave one agent with zero tools, and the request-time tool calls would silently fail without surfacing anything useful. The fix was a per-agent summary at startup: which agents loaded successfully, which have degraded (missing) tools, and a clear warning that a degraded agent will run with function tools only. Loud failures are better than silent degradation.
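A startup summary along those lines can be very small. The following is only a sketch: startup_summary, the AGENT_READY and AGENT_DEGRADED tokens, and the shape of the inputs (expected versus actually loaded tool names per agent) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def startup_summary(expected: Dict[str, Set[str]],
                    loaded: Dict[str, Set[str]]) -> List[str]:
    """Return one log line per agent, loud about anything missing."""
    lines = []
    for agent in sorted(expected):
        missing = sorted(expected[agent] - loaded.get(agent, set()))
        if missing:
            lines.append(
                f"AGENT_DEGRADED agent={agent} missing_tools={','.join(missing)} "
                "warning=running_with_function_tools_only"
            )
        else:
            lines.append(f"AGENT_READY agent={agent} tools={len(expected[agent])}")
    return lines
```

Emitting these lines at warning level on every boot means a flaky MCP startup is visible before the first user request, not after.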

Secrets and the boundary problem

Agents call tools. Tools need credentials. Credentials come from your secret store. So far, so simple — until you realize that your agent system is now a high-value target for the same reason it’s useful. It knows the topology, it has access to the tools, and it can stitch them together in ways a human would have to manually orchestrate.

A few rules I’ve come to:

  • Secrets never touch the LLM. The agent calls a tool. The tool reads the credential at its own layer and uses it. The credential value is never in the prompt, never in the response, never in the context passed to the model.
  • Per-environment secrets, period. No prod credential should be reachable from a non-prod agent run. We had a near-incident early on where a dev environment had access to a prod token through a misconfigured fallback. The fix was to make secret lookups environment-scoped at the lowest layer, with no implicit fallback.
  • Tool rejections that leak credentials are security incidents, not infra incidents. We hit a case where one MCP server dumped the upstream Authorization header in plaintext on every backend rejection. Treat these as P1 and patch immediately. The convenience of debugging is not worth the credential exposure.

Most of this is generic security guidance. The reason it matters specifically for agentic systems is that the agent’s ability to chain operations multiplies the blast radius of a single credential leak. A leaked token used to mean one attacker action; in an agent context, it means N attacker actions because the agent itself is the orchestrator.
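Two of the rules above lend themselves to a short sketch: environment-scoped lookups with no fallback, and scrubbing Authorization headers out of error text before it goes anywhere near a log or the model. ScopedSecrets, SecretNotFound, redact_auth, and the in-memory dict standing in for a real secret store are all hypothetical, and the regex is deliberately simple, so treat it as a starting point only.

```python
import re


class SecretNotFound(KeyError):
    pass


class ScopedSecrets:
    """Looks up secrets for exactly one environment; never falls back."""

    def __init__(self, environment: str, store: dict):
        self._env = environment
        self._store = store  # stand-in for a real store: {(env, name): value}

    def get(self, name: str) -> str:
        try:
            return self._store[(self._env, name)]
        except KeyError:
            # Fail loudly instead of trying another environment.
            raise SecretNotFound(
                f"{name!r} is not defined for environment {self._env!r}"
            ) from None


# Matches "Authorization: ..." or "authorization=..." up to the end of the line.
_AUTH_RE = re.compile(r"(authorization[\"']?[ \t]*[:=][ \t]*)[^\r\n]*", re.IGNORECASE)


def redact_auth(text: str) -> str:
    """Replace any Authorization header value in error text with a marker."""
    return _AUTH_RE.sub(r"\1[REDACTED]", text)
```

Running redact_auth over every tool error before it is logged or returned to the model is cheap insurance against the header-dump case described above.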

Observability that survives compaction

Agent runs are long. The LLM context gets compacted as it grows. The sub-agent results get truncated to fit the synthesizer’s budget. If you’re not careful, the observability story degrades right when you most need it — long, complicated runs.

A few patterns that helped:

  • Trace logging is structured, not free-form. Every sub-agent invocation logs structured fields (agent name, query, tool calls, result summary, duration) — not a free-text paragraph. Splunk queries against the structured fields work even when individual messages get long.
  • Head + tail truncation, not head-only. When a sub-agent result gets too long for the synthesizer’s budget, we keep the first 65% and last 25% with a clear [TRUNCATED] marker. The pattern preserves both the opening summary and the closing recommendation. Head-only truncation loses the conclusion, which is usually the load-bearing part.
  • Phase-level heartbeat ticks. Long operations emit phase transitions to the trace, so when a run takes 90 seconds you can see where the time went. Without phase ticks, a 90-second run is opaque.
  • Greppable tokens for every state transition. SLACK_ACK_PRIMARY_POSTED, HEARTBEAT_TICK, ROUTER_DECISION, SYNTH_FALLBACK_TRIGGERED. Plain English in logs is fine for reading; tokens are what you grep when you’re debugging at 2am.
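The head + tail pattern from the list above fits in a few lines. This sketch assumes the 65% and 25% are fractions of a character budget (the text does not say whether the budget is counted in characters or tokens); truncate_head_tail and MARKER are hypothetical names.

```python
MARKER = "\n[TRUNCATED]\n"


def truncate_head_tail(text: str, budget: int,
                       head_frac: float = 0.65, tail_frac: float = 0.25) -> str:
    """Keep the opening and the closing of an over-long result.

    The remaining 10% of the budget is slack for the marker, which holds
    as long as the budget is much larger than the marker itself.
    """
    if len(text) <= budget:
        return text
    head = int(budget * head_frac)
    tail = int(budget * tail_frac)
    # Slice from an explicit index so tail == 0 yields an empty tail.
    return text[:head] + MARKER + text[len(text) - tail:]
```

A token-based budget would follow the same shape, with the slicing done on a token list instead of a string.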

The lesson: invest in observability for the long-running, complicated case. The short-and-simple case is easy to debug from the response alone. The case where you need observability is the case where everything else has already failed.

Health probes that don’t lie

Kubernetes health probes are a place where infrastructure agents quietly screw up. The naive setup: /health/live returns 200 if the process is running, /health/ready returns 200 if the FastAPI server is up.

Both of those probes can return 200 while the agent is unusable. The Slack bot thread can have died. The MCP servers might not have connected. The LLM gateway might be unreachable. The probe says healthy; users get error replies.

The fix is for the probes to reflect what the agent actually needs to function. /health/ready should return 200 only when FastAPI is up, the Slack bot thread is alive, and the critical MCP connections are established. We use a shared state module (health_state.py) with module-level globals updated by the components as they come up. The probe reads the globals.

The principle: a health probe that returns 200 when the system is broken is worse than no probe at all, because it actively misleads the orchestrator. Make them honest.
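A shared state module of this kind might look roughly like the sketch below. The health_state.py name is from the text; the variable names, the setter functions, the readiness function, and the CRITICAL_MCP list are assumptions. The FastAPI route is left out so the sketch stays standard-library only; a /health/ready handler would simply return the status code and body that readiness produces.

```python
# health_state.py (sketch)
import threading
from typing import Dict, Optional, Tuple

CRITICAL_MCP = ("loki", "jira")  # hypothetical list of must-have servers

_api_up = False
_slack_thread: Optional[threading.Thread] = None
_mcp_connected: Dict[str, bool] = {}


def mark_api_up() -> None:
    global _api_up
    _api_up = True


def register_slack_thread(thread: threading.Thread) -> None:
    global _slack_thread
    _slack_thread = thread


def set_mcp_connected(name: str, ok: bool) -> None:
    _mcp_connected[name] = ok


def readiness() -> Tuple[int, Dict[str, bool]]:
    """Return (HTTP status, per-component checks) for /health/ready."""
    checks = {
        "api": _api_up,
        # is_alive() is evaluated on every probe, so a dead thread shows up.
        "slack_thread": _slack_thread is not None and _slack_thread.is_alive(),
        "mcp": all(_mcp_connected.get(name, False) for name in CRITICAL_MCP),
    }
    return (200 if all(checks.values()) else 503), checks
```

Returning the per-component checks in the body makes a 503 self-explanatory when someone curls the probe during an incident.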

Graceful shutdown

The last piece is the one I forgot for the longest. A process that doesn’t shut down cleanly will, sooner or later, corrupt some piece of state: drop an event mid-flight, lose an investigation journal entry, or leak a Redis lock. The Kubernetes pod lifecycle gives you a window to clean up; use it.

Practical: handle SIGTERM, drain in-flight requests with a deadline, mark the bot as offline before exiting, flush any structured logging buffers. We also handle SIGINT with a double-signal pattern — first Ctrl+C asks for graceful shutdown, second forces immediate exit. The double-signal pattern is for local dev where developers expect Ctrl+C to actually quit, and it’s a small piece of code that saves a lot of “why won’t this process die” debugging.
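The double-signal pattern and the drain deadline can be sketched as follows. ShutdownController, drain, the exit code 130, and the injectable force_exit, clock, and sleep parameters are illustrative assumptions; marking the bot offline and flushing log buffers would happen after drain returns and are not shown.

```python
import os
import signal
import threading
import time
from typing import Callable


class ShutdownController:
    def __init__(self, force_exit: Callable[[int], None] = os._exit):
        self.stop_requested = threading.Event()
        self._force_exit = force_exit

    def handle_sigterm(self, signum, frame) -> None:
        # Kubernetes sends SIGTERM first; start a graceful shutdown.
        self.stop_requested.set()

    def handle_sigint(self, signum, frame) -> None:
        if self.stop_requested.is_set():
            # Second signal: the developer really wants out.
            self._force_exit(130)  # 128 + SIGINT, the conventional code
        self.stop_requested.set()

    def install(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)
        signal.signal(signal.SIGINT, self.handle_sigint)


def drain(in_flight: Callable[[], int], deadline_s: float,
          poll_s: float = 0.05,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait for in-flight requests to finish; False if the deadline passed."""
    end = clock() + deadline_s
    while in_flight() > 0:
        if clock() >= end:
            return False
        sleep(poll_s)
    return True
```

The drain deadline should sit comfortably inside the pod's termination grace period, so the process exits on its own terms before Kubernetes sends SIGKILL.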

What I’d build first if I were starting over

The order I’d ship infra in if I were doing it again:

  1. Request context infrastructure. Tag everything with an ID from minute one.
  2. Structured trace logging with greppable tokens. You’ll thank yourself at the first incident.
  3. Health probes that reflect real readiness. Don’t ship to Kubernetes without honest probes.
  4. Tool-call failure modes (timeout, empty, auth, degradation). Each needs a distinct response envelope.
  5. Identity / approval / action fingerprinting before any side-effecting tools ship.
  6. Per-environment secrets boundary. Defaults must boot against the right environment with no implicit fallback.
  7. Graceful shutdown. Wire it in early; harder to add later.

These aren’t the parts of an agent that get talked about, because they aren’t the interesting parts. They are the parts that determine whether the agent is something you can run in production or a demo you have to babysit.

The agentic-AI hype cycle is about the LLM. The production-AI work is about everything else.

