I Deleted 1,300 Lines of Code. Atlas Got Smarter.

Modern LLMs are reliable enough to format their own output. The scaffolding I wrote in 2023 had quietly become the thing holding the model back.

I made Atlas — the multi-agent platform I lead — noticeably smarter last week. I did it by deleting ~1,300 lines of code.

TL;DR: Modern LLMs are reliable enough to format their own output when you give them clear instructions.

The trace that set me off

A teammate asked Atlas a simple ops question. First it returned an 8-row table demanding they pick an environment group before it would do anything. Then a backend tool timed out and it gave up: “I couldn’t get the details, here are some queries you can run yourself.” A dead end, handed back to the human. When it finally worked, the answer came back in the exact same five-section template it stamps onto every investigation. Three disclaimers and three feedback prompts in one thread.

It read like filing paperwork, not talking to a colleague.

What nagged me, though: the underlying model is genuinely good. The intelligence was there. So why did every answer come out sounding like a form?

The culprit

I went looking and found it: a 441-line module whose entire job was to take the model’s prose and restructure it into a rigid taxonomy. Thirteen response types. Per-type builders. Section-slicing. Label normalization.

That code wasn’t a mistake. It was written in 2023, when you genuinely couldn’t trust an LLM to format its own output. It was a safety scaffold, and it was load-bearing.

The model had quietly outgrown it.

The scaffold had flipped from a safety net into the source of the bugs we kept chasing. The robotic templates, the wordiness, the “why does it always use those headings?” None of it was the model failing. It was the model succeeding, and then our own code wrapping the answer in a uniform.
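To make that failure mode concrete, here is a toy sketch of this kind of post-processing layer. It is not the Atlas module: the three section names and both function names are hypothetical, and the real code handled thirteen response types. The only point is that a fixed renderer produces the same shape no matter what the model wrote.

```python
import re

# Hypothetical fixed template; the post does not list the real section names.
SECTIONS = ("Summary", "Details", "Next steps")


def split_sentences(prose: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", prose.strip())
    return [p for p in parts if p]


def scaffolded_render(prose: str) -> str:
    """Slice the prose into the fixed sections, whatever it actually says."""
    sentences = split_sentences(prose)
    chunk = max(1, -(-len(sentences) // len(SECTIONS)))  # ceiling division
    lines = []
    for i, title in enumerate(SECTIONS):
        body = " ".join(sentences[i * chunk:(i + 1) * chunk]) or "N/A"
        lines.append(f"{title}: {body}")
    return "\n".join(lines)


def passthrough_render(prose: str) -> str:
    """The post-deletion behavior: the model's prose is the answer."""
    return prose.strip()
```

Even a one-word answer comes out of `scaffolded_render` wearing all three headings, two of them filled with "N/A". That is the uniform.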

The few-shot trap

The template was even self-inflicted: we’d written five example headings into the prompt, and the model dutifully used those exact five, every time. It’s the classic trap. Your few-shot examples don’t teach a style. They become the style.
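One way to check whether your examples have hardened into a template is to measure how often their headings reappear in real responses. The sketch below assumes you can pull a sample of response texts from your traces. `EXAMPLE_HEADINGS`, `headings_in`, and `heading_reuse_rate` are hypothetical names, and the headings are placeholders, since the real five are not listed here.

```python
# Placeholder headings; substitute the ones written into your own prompt.
EXAMPLE_HEADINGS = ["Summary", "Root cause", "Evidence", "Impact", "Next steps"]


def headings_in(response: str, headings: list[str]) -> set[str]:
    """Return the example headings that appear alone on a line in the response."""
    found = set()
    for line in response.splitlines():
        # Drop markdown markers (#, *) and a trailing colon around the label.
        label = line.strip().lstrip("#").strip().strip("*").rstrip(":").strip().lower()
        for heading in headings:
            if label == heading.lower():
                found.add(heading)
    return found


def heading_reuse_rate(responses: list[str], headings: list[str]) -> float:
    """Fraction of responses that reuse every one of the example headings."""
    if not responses:
        return 0.0
    full = sum(1 for r in responses if headings_in(r, headings) == set(headings))
    return full / len(responses)
```

A rate near 1.0 across unrelated questions suggests the examples are being copied rather than generalized.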

The call

So I made the call that feels backwards until you’ve lived it: I deleted the scaffold.

Killed the 441-line module. The model’s prose is the answer now.
Killed the dataclass layer above it (~770 more lines).
Rewrote the prompt to forbid the old labels, so the model has to actually think instead of filling in blanks.
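The prompt change can be paired with a cheap regression check. What follows is a minimal sketch, not the actual Atlas prompt: the wording, the label list, and the names `build_system_prompt` and `forbidden_labels_used` are all assumptions made for illustration.

```python
import re

# Placeholder labels; the post does not name the ones that were forbidden.
FORBIDDEN_LABELS = ["Summary", "Root cause", "Next steps"]


def build_system_prompt(forbidden: list[str]) -> str:
    """Describe the goal in prose and name the labels to avoid, with no example headings."""
    banned = ", ".join(f'"{label}"' for label in forbidden)
    return (
        "Answer the way a senior engineer would answer a colleague. "
        "Choose whatever structure fits the question, or none at all. "
        f"Do not use these section labels: {banned}."
    )


def forbidden_labels_used(answer: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden labels that appear as a heading-like line in the answer."""
    used = []
    for label in forbidden:
        # The label alone on a line, optionally wrapped in # or * and followed by a colon.
        pattern = rf"^[#*\s]*{re.escape(label)}[*\s]*:?[*\s]*$"
        if re.search(pattern, answer, flags=re.IGNORECASE | re.MULTILINE):
            used.append(label)
    return used
```

Running `forbidden_labels_used` over sampled traces gives a quick signal if the old template starts creeping back. Because it only matches a label standing alone on a line, ordinary prose such as "in summary" does not trip it.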

Net result

−1,281 lines · +141 lines

A single commit. The assistant got better, not worse.

Three things I’d hand anyone building on LLMs right now

Audit your scaffolding against the model you have today, not the one you had when you wrote it. “Our AI feels dumb” is usually old defensive code fighting a model that has moved past it.

Your few-shot examples don’t suggest a style. They install one. Want range? Sometimes you have to explicitly forbid the obvious answer.

Deletion is a feature. A net 1,140 fewer lines (1,281 removed, 141 added) means 1,140 fewer places for bugs to live.

The best instinct I exercised that week wasn’t writing something clever. It was recognizing the cleverness was already in the model, and getting the code out of its way.
