Here's something that took me a while to fully internalize: when people talk about "AI agents", they often picture a chatbot on steroids — a smarter, more capable version of the classic prompt-response loop. But that mental model leads to some pretty avoidable design mistakes. The ones I see most often aren't about picking the wrong model or writing bad prompts. They're architectural. And they all trace back to the same confusion: the agent and the LLM are not the same thing.
The LLM is a service. The agent is the system that calls it.
Once that clicks, a lot of things become clearer.
- The layers most people design for — and the ones they treat as optional but aren't
- Context rot — what goes wrong when the LLM becomes your database
- What the right design looks like: toolchain, lean calls, persistent state outside the window
- The "OK" problem — routing, cheap classifiers, and keeping the expensive model clean
- The code generation flywheel — strategy once, execution at scale
- The short version: which layer should handle this?
Structure Beyond the Chat Loop
A well-designed agentic system has several distinct layers:
1. The human: types something, speaks something, clicks something
2. The agent: receives the input, decides what to do with it, orchestrates everything else
3. The LLM: one of many tools the agent can call, good at language and reasoning
4. Code execution: the agent can run arbitrary Python (or whatever), quickly and at negligible cost
5. Memory / storage: databases, vector stores, key-value stores; persistent state that lives outside the LLM's context
6. External tools: APIs, MCPs, web search, anything the agent can call
Most people design for layers 1–3 and treat 4–6 as optional extras. In practice, layers 4–6 are where the real leverage is.
The LLM is expensive, slow, and has a finite context window. Code execution is fast and essentially free. Storage is persistent and reliable. The whole point of the agent layer is to route tasks to the right layer — not to route everything through the LLM.
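To make the "agent calls the LLM" idea concrete, here is a minimal sketch of an agent that owns the layers and picks one per input. Everything in it is an assumption for illustration: `fake_llm` stands in for a real model call, and the `remember` and `sum` commands are made-up triggers for the storage and code-execution layers.

```python
from dataclasses import dataclass, field
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return f"[LLM reply to: {prompt}]"


@dataclass
class Agent:
    """The agent owns the layers; the LLM is just one callable it can use."""
    llm: Callable[[str], str] = fake_llm
    memory: dict[str, str] = field(default_factory=dict)  # storage layer
    calls: list[str] = field(default_factory=list)        # which layer handled each input

    def handle(self, user_input: str) -> str:
        text = user_input.strip()
        # Storage layer: "remember key=value" never touches the LLM.
        if text.startswith("remember "):
            key, _, value = text[len("remember "):].partition("=")
            self.memory[key.strip()] = value.strip()
            self.calls.append("storage")
            return "saved"
        # Code-execution layer: plain arithmetic is cheaper in Python.
        if text.startswith("sum "):
            self.calls.append("code")
            return str(sum(float(x) for x in text[4:].split()))
        # Everything else is treated as a language task and goes to the LLM.
        self.calls.append("llm")
        return self.llm(text)
```

A real agent would decide with something smarter than `startswith`, but the shape is the point: the LLM is one branch among several, not the loop itself.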
Context Rot
Let me give a concrete example of how this goes wrong.
A while back I decided to ditch my nutrition tracking app and just tell ChatGPT what I was eating. It worked pretty well at first — I'd describe a meal, it would give me the macros, and at the end of the day I could ask for a breakdown. Simple enough.
But then I started asking questions like "what did I have yesterday?" or "how's my protein been this week?" And things got... shaky. It would occasionally hallucinate meals I hadn't mentioned. After two or three weeks of conversation history, it started losing track of things entirely. The context had grown so large that the model was essentially drowning in its own memory.
The problem wasn't the model. The problem was the design. I was using the LLM as a database, relying on its context window to store and retrieve structured historical data. That's a job it was never designed to do.
This pattern has a name: context rot. The longer the conversation goes, the more the LLM struggles to accurately retrieve earlier information, the more it hallucinates, the more it loses focus on the current task. It's not a bug in the model. It's a misuse of the tool.
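Some back-of-the-envelope arithmetic shows how quickly the window grows when every call resends the whole conversation. This sketch only counts tokens; it says nothing about retrieval accuracy. The turn count, tokens per turn, and window size are assumed numbers, not measurements.

```python
def tokens_sent_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent over a conversation when every call resends all prior turns."""
    # Call k carries k turns of history, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def tokens_sent_windowed(turns: int, tokens_per_turn: int, window: int) -> int:
    """Total tokens sent when each call carries at most `window` recent turns."""
    return sum(tokens_per_turn * min(k, window) for k in range(1, turns + 1))


# Assumed workload: 300 turns of roughly 100 tokens each.
full = tokens_sent_full_history(turns=300, tokens_per_turn=100)
lean = tokens_sent_windowed(turns=300, tokens_per_turn=100, window=3)
print(full, lean)
```

Under those assumptions the full-history design sends 4,515,000 tokens in total against 89,700 for a three-turn window, and the final call alone carries 30,000 tokens instead of 300.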
Toolchain, Not Monologue
The fix is straightforward once you think in layers.
The LLM should only touch the natural language parts — parsing what I said into structured data, and formatting stored data back into readable output. Everything in between should bypass the LLM entirely.
Concretely, the agent should have a toolchain:
- parse_meal_description: LLM call, converts natural language to structured entry
- get_macros: database lookup, no LLM needed
- store_entry: writes to database, no LLM needed
- get_previous_entries: database query, no LLM needed
- store_favorite_meal: saves a named shortcut, no LLM needed
- update_preferences: modifies user settings, no LLM needed
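Here is a rough sketch of what part of that toolchain could look like, using SQLite from the standard library. The table schema, the `MACROS` values, and the keyword-matching stub inside `parse_meal_description` are all placeholders I made up for illustration; in a real system that one function would be the LLM call. `store_favorite_meal` and `update_preferences` are left out to keep it short.

```python
import sqlite3

# Placeholder macro table (protein, carbs, fat in grams); not real nutrition data.
MACROS = {"egg": (6, 1, 5), "oatmeal": (5, 27, 3)}


def parse_meal_description(text: str) -> list[str]:
    """The only step that would call an LLM; stubbed here with keyword matching."""
    return [food for food in MACROS if food in text.lower()]


def get_macros(food: str) -> tuple[int, int, int]:
    """Plain lookup, no LLM needed."""
    return MACROS[food]


def store_entry(db: sqlite3.Connection, day: str, food: str) -> None:
    """Write one structured entry to the database, no LLM needed."""
    protein, carbs, fat = get_macros(food)
    db.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?)",
               (day, food, protein, carbs, fat))


def get_previous_entries(db: sqlite3.Connection, day: str) -> list[tuple]:
    """Answer 'what did I have on <day>?' with a query, no LLM needed."""
    return db.execute(
        "SELECT food, protein, carbs, fat FROM entries WHERE day = ? ORDER BY rowid",
        (day,)).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (day TEXT, food TEXT, protein INT, carbs INT, fat INT)")
for food in parse_meal_description("Two things: oatmeal and an egg"):
    store_entry(db, "2024-05-01", food)
print(get_previous_entries(db, "2024-05-01"))
```

The history question is now answered by a query over stored rows, so it cannot drift or hallucinate the way a long chat transcript can.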
The active context window for any given LLM call? Probably just the last two or three exchanges, plus whatever specific data was retrieved for that query. The LLM doesn't need to know that you asked for your macros four times yesterday, or that you changed your calorie target last Tuesday. That's noise. The agent knows it, the database knows it, the LLM doesn't need to.
This design is faster, cheaper, more reliable, and doesn't degrade over time. The conversation can go on for months and the LLM call for each interaction stays lean.
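Keeping each call lean mostly comes down to how the prompt is assembled. A minimal sketch, assuming the history is a list of strings and `retrieved` holds whatever the database returned for this query (the function name and the prompt layout are my own invention):

```python
def build_prompt(system: str, history: list[str], retrieved: list[str],
                 user_msg: str, window: int = 3) -> str:
    """Assemble one LLM call from recent turns plus query-specific data only."""
    recent = history[-window:] if window > 0 else []
    parts = [system, *recent]
    if retrieved:
        parts.append("Relevant records:\n" + "\n".join(retrieved))
    parts.append(f"User: {user_msg}")
    return "\n".join(parts)


history = [f"turn-{i:03d}" for i in range(100)]  # stand-in for months of chat
prompt = build_prompt("You are a nutrition assistant.", history,
                      ["2024-05-01 egg protein=6"], "How was my protein yesterday?")
print(prompt)
```

However long `history` gets, the prompt stays the same size: the system line, the last three turns, the retrieved records, and the new message.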
Router Pattern
Here's a smaller but equally instructive example.
Imagine someone is interacting with an agent and types "ok" to acknowledge something it just said. In a naive implementation, that "ok" gets appended to the context and the whole thing — full conversation history, system prompt, everything — gets sent to GPT-4 or Claude Opus to figure out what to do next.
That's like hiring a senior consultant to read a one-word confirmation email.
The right design: route the "ok" to a fast, cheap classifier model first. It recognizes the message as an acknowledgment, advances the conversation state, and moves on. The expensive model never sees it. You've just saved cost, reduced latency, and kept the expensive model's context clean for when it actually matters.
This is the router pattern — using a cheap fast model as a triage layer that decides what even needs to go to the capable model. Most agentic systems don't have this. They route everything to the best model available, which is wasteful in the same way that using a sledgehammer for everything is wasteful.
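A minimal sketch of the triage step. To keep it runnable I've used a regex as the "cheap classifier"; in practice that slot would more likely hold a small model. The label names and the `state` keys are assumptions.

```python
import re

ACK_PATTERN = re.compile(r"^(ok(ay)?|thanks|thank you|got it|yes|sure)[.!]*$",
                         re.IGNORECASE)


def cheap_classify(message: str) -> str:
    """Stand-in for a small, fast classifier model (here: just a regex)."""
    return "ack" if ACK_PATTERN.match(message.strip()) else "needs_reasoning"


def route(message: str, state: dict, expensive_model) -> str:
    """Send a message to the expensive model only if triage says it needs it."""
    if cheap_classify(message) == "ack":
        state["acks"] = state.get("acks", 0) + 1  # advance state, skip the big model
        return ""
    state["expensive_calls"] = state.get("expensive_calls", 0) + 1
    return expensive_model(message)


state: dict = {}
for msg in ["ok", "OK!", "ok but what about fiber?"]:
    route(msg, state, expensive_model=lambda m: f"[big model handles: {m}]")
print(state)
```

Only the third message reaches the expensive model; the two bare acknowledgments are absorbed by the triage layer and never enter its context.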
Strategy Once, Runtime at Scale
One more pattern worth knowing about, because it's genuinely underused.
Say your agent needs to do something repetitive — filter a list of stocks by some criteria, transform a data format, run a calculation across many rows. The naive approach: send all the data to the LLM and ask it to process everything. This burns tokens proportional to your data size, is slow, and doesn't get cheaper over time.
The better approach: ask the LLM to write a function that does the thing. Store the function. Call it directly every subsequent time. The LLM pays its cost once. After that, the Python runtime executes the function locally across any volume of data, spending no further tokens and costing a tiny fraction of what an LLM call would.
This is the code generation flywheel: the LLM writes the strategy, the runtime executes it at scale. The mistake is using the LLM for both. Once you see this pattern you start finding opportunities for it everywhere.
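Here's a small sketch of the flywheel. `fake_llm_write_code` stands in for the one-time LLM call and returns hard-coded source; the stock rows, tickers, and filter thresholds are invented for the example. It also assumes the generated code is trusted, which a real system should not assume without sandboxing and review.

```python
from typing import Callable

_FUNCTION_CACHE: dict[str, Callable] = {}
LLM_CALLS = {"count": 0}  # tracks how often the (fake) LLM is invoked


def fake_llm_write_code(task: str) -> str:
    """Hypothetical stand-in for the one-time LLM call; returns Python source."""
    LLM_CALLS["count"] += 1
    return (
        "def strategy(row):\n"
        "    return row['pe'] < 15 and row['dividend_yield'] > 0.03\n"
    )


def get_strategy(task: str) -> Callable:
    """Generate the function once, then reuse the compiled version."""
    if task not in _FUNCTION_CACHE:
        namespace: dict = {}
        # Assumption: generated code is trusted; real systems should sandbox it.
        exec(fake_llm_write_code(task), namespace)
        _FUNCTION_CACHE[task] = namespace["strategy"]
    return _FUNCTION_CACHE[task]


# Made-up rows purely for illustration.
stocks = [
    {"ticker": "AAA", "pe": 12, "dividend_yield": 0.04},
    {"ticker": "BBB", "pe": 30, "dividend_yield": 0.01},
    {"ticker": "CCC", "pe": 9, "dividend_yield": 0.05},
]
task = "cheap stocks with a decent dividend"
for _ in range(1000):  # many runs, one LLM call
    picks = [s["ticker"] for s in stocks if get_strategy(task)(s)]
print(picks, LLM_CALLS["count"])
```

A thousand runs over the data cost exactly one model call; everything after the first iteration is ordinary Python.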
Which Layer Handles This?
If there's one thing to take from all of this, it's that the design question for an agentic system is never just "which model should I use?" It's "which layer should handle this?"
- Natural language in, structured data out → LLM
- Repetitive computation at scale → code execution
- Anything that needs to persist → storage
- Cheap acknowledgments and routing → fast classifier
- Judgment calls and ambiguous reasoning → capable model with surgical context
The LLM is the most expensive, slowest, and most context-sensitive part of your system. Treat it accordingly — use it for the things only it can do, and get everything else out of its way.