There is a moment in every engineering era when the people doing the work are ahead of the vocabulary used to describe it. They have intuitions that are correct, patterns that work, scars from things that didn't — but the theoretical framework hasn't caught up yet. I think that's where we are with applied AI, and I think the people building closest to the metal are accumulating something valuable that the conference talks and research papers haven't fully named.
This is an attempt to name some of it.
- Why LLMs have structural limits analogous to mechanical calculators — and why that's not a counsel of despair
- What a harness is: the four dimensions that turn raw LLM output into reliable infrastructure
- Why context is a quality problem, not just a cost problem — and the architecture that follows
- Where we are in the technology cycle: the transistor moment, hand-soldering, and what comes next
- How mathematicians use analogy as a load-bearing tool — what Thales and Bachelier have in common
- What builders at the hand-soldering stage accumulate, and why it will matter when the abstractions break
Why the Analogy Matters
Mechanical calculators — the beautiful, clicking, gear-driven machines of the early 20th century — were genuinely remarkable. They could add, subtract, multiply, divide. The best of them could do it faster than any human. But they were never going to show you a photograph of a kitten or compute a Fourier transform. Not because the engineers weren't brilliant. Because the substrate had hard physical limits baked into its nature. Gears have friction. Springs have fatigue. The architecture of the mechanism constrained the space of possible computation.
It took electronics — a categorically different substrate — to make computation scalable in the ways that mattered.
Large language models have structural limits that feel similar in character. Context windows. Token-by-token generation. No persistent state between calls. Confident hallucination under uncertainty. These aren't bugs to be patched in the next release. They're architectural properties, the way gear ratios were architectural properties. You can work around them. You can design systems that are robust to them. But you cannot engineer them away from within the paradigm.
The question isn't whether LLMs have limits. Everything has limits. The question is: what do you build given that those limits are the ones you have to work with?
The answer — the one that emerges when you actually build real systems rather than demos — is: harnesses.
Structure That Makes Output Buildable
A harness is a structure that constrains and directs AI output so it stays useful, inspectable, and composable. The key insight is that raw LLM output is unpredictable enough to be a liability in production. A harness makes it predictable enough to be infrastructure.
There are four dimensions to a harness:
Input Harness
Controls what the model sees. Skill documents and system prompts that define role and constraints. Feature extraction that compresses raw data before it enters the context. Minimal context patterns that strip irrelevant information so worker agents stay focused. The input harness is the discipline of asking: what does this agent actually need to know, and what is just noise?
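As a rough illustration of the input harness, the sketch below compresses a raw price series into a handful of features and hands the worker only those. The names `WorkerContext`, `extract_features`, and `build_context`, the role text, and the choice of statistics are all assumptions made for this example, not part of any particular framework.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class WorkerContext:
    """Everything a worker agent is allowed to see (hypothetical shape)."""
    role: str
    task: str
    features: dict


def extract_features(prices: list[float]) -> dict:
    """Compress a raw price series into a few summary statistics."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return {
        "n_points": len(prices),
        "mean_return": round(mean(returns), 6),
        "volatility": round(pstdev(returns), 6),
        "last_return": round(returns[-1], 6),
    }


def build_context(task: str, prices: list[float]) -> WorkerContext:
    """Input harness: the worker gets features, never the raw series."""
    return WorkerContext(
        role="You are an anomaly reviewer. Answer only about the given features.",
        task=task,
        features=extract_features(prices),
    )


ctx = build_context("Is the latest move unusual?", [100.0, 101.0, 100.5, 110.0])
print(ctx.features)
```

The point of the design is that the size of the context depends on the number of features, not on the length of the raw series.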
Output Harness
Controls what comes out. Forcing structured JSON via a schema in the prompt. Pydantic validation on responses. A feedback schema with explicit fields like score, issues, and revision_focus. The output harness converts open-ended text generation into something with a defined interface.
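A minimal sketch of that interface, using only the standard library as a stand-in for Pydantic so it runs without extra dependencies. The field names `score`, `issues`, and `revision_focus` come from the text; the 0 to 10 score range and the function name `parse_feedback` are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: int
    issues: list[str]
    revision_focus: str


def parse_feedback(raw: str) -> Feedback:
    """Turn raw model text into a validated Feedback, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"score", "issues", "revision_focus"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    score = data["score"]
    # Assumed range: 0 to 10. bool is excluded because it is a subclass of int.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer from 0 to 10")
    issues = data["issues"]
    if not isinstance(issues, list) or not all(isinstance(i, str) for i in issues):
        raise ValueError("issues must be a list of strings")
    if not isinstance(data["revision_focus"], str):
        raise ValueError("revision_focus must be a string")
    return Feedback(score, issues, data["revision_focus"])


good = parse_feedback(
    '{"score": 7, "issues": ["weak rhyme in line 3"], "revision_focus": "line 3"}'
)
print(good)
```

Anything that fails validation raises before it reaches downstream code, which is what gives the component a known failure mode rather than a surprising one.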
Execution Harness
Controls how output is used. Self-healing loops with caps on iterations. Human approval gates before consequential actions. Panel registries that map finding types to UI components. Artifact patterns where agents write scripts before executing them. The execution harness is where you decide what the agent is allowed to do autonomously versus what requires a human in the loop.
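Two of these controls, the capped self-healing loop and the approval gate, can be sketched in a few lines. Everything here is hypothetical: `generate` and `check` stand in for a model call and a validator, the cap of three attempts is an arbitrary assumption, and `approve` stands in for whatever human review step a real system would use.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumed cap; tune per task


def self_healing_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Retry generation, feeding the last error back, up to max_attempts.

    Returns the first candidate that passes check, or None if the cap is hit.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(error)
        error = check(candidate)
        if error is None:
            return candidate
    return None


def run_with_approval(action: str, approve: Callable[[str], bool]) -> str:
    """Approval gate: nothing consequential runs without an explicit yes."""
    return f"executed: {action}" if approve(action) else f"blocked: {action}"


# Stubs standing in for a model call and a validator.
def fake_generate(error: Optional[str]) -> str:
    return "SELECT * FROM trades" if error else "SELEC * FROM trades"


def fake_check(candidate: str) -> Optional[str]:
    return None if candidate.startswith("SELECT") else "syntax error near SELEC"


query = self_healing_loop(fake_generate, fake_check)
print(query)
print(run_with_approval(query, approve=lambda action: False))
```

The cap matters as much as the retry: a loop that can fail and return `None` is a component, while a loop that can spin forever is an incident.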
Evaluation Harness
Controls how you know if it worked. Ground truth tables for comparing agent findings against known answers. Objective validators like pronunciation dictionaries for rhyme checking. Reproducible test scenarios for debuggers. The evaluation harness is what separates systems that feel like they're working from systems you can actually prove are working.
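A ground truth table can be as simple as a set of known answers that agent findings are scored against. The sketch below assumes findings are ticker symbols; the tickers, the function name `score_findings`, and the choice of precision and recall as metrics are illustrative assumptions.

```python
def score_findings(found: set[str], truth: set[str]) -> dict:
    """Compare agent findings against a ground truth set."""
    true_positives = found & truth
    precision = len(true_positives) / len(found) if found else 0.0
    recall = len(true_positives) / len(truth) if truth else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "missed": sorted(truth - found),    # known answers the agent did not report
        "spurious": sorted(found - truth),  # reports with no known answer behind them
    }


# Hypothetical ground truth table and one agent run.
GROUND_TRUTH = {"ACME", "GLOBX", "INITECH"}
report = score_findings({"ACME", "GLOBX", "HOOLI"}, GROUND_TRUTH)
print(report)
```

Because the scenario is fixed, the same run can be repeated after every prompt or harness change and the numbers compared directly.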
A harness converts the LLM from an open-ended text generator into a component with a defined interface — known inputs, known output shape, known failure modes. Without a harness, you have a demo. With a harness, you have infrastructure you can build on.
The Insight That Changes Everything
One of the most practically valuable intuitions that emerges from building real agentic systems is this: context is not just a cost problem, it's a quality problem. The epistemic version of the same claim — what a subproblem actually needs to know — is what the essay on local reasoning calls minimum sufficient context.
A bloated context window doesn't just burn tokens. It increases the probability that the model gets distracted, loses focus, or anchors on irrelevant earlier information. The minimal-context worker agent is often more accurate precisely because there's less noise.
This leads to a design principle that sounds obvious once you've been burned by ignoring it: any task that is repetitive, parallelizable, or requires only a slice of the available information should be delegated to a stateless worker with a surgical context. The orchestrator's job is to decide what to delegate, construct the minimal context for the worker, and aggregate results.
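The orchestrator's three jobs (decide, construct minimal context, aggregate) might look something like the following. This is a sketch under stated assumptions: `worker` is a hypothetical stand-in for a stateless model call, the prompt layout is invented for the example, and a thread pool is only one of several reasonable ways to parallelize.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def orchestrate(
    documents: dict[str, str],
    question: str,
    worker: Callable[[str], str],
) -> dict[str, str]:
    """Fan one question out to stateless workers, one document each."""
    # Each worker sees the question and exactly one document, nothing else.
    contexts = {
        name: f"Question: {question}\n\nText:\n{text}"
        for name, text in documents.items()
    }
    with ThreadPoolExecutor(max_workers=4) as pool:
        answers = pool.map(worker, contexts.values())  # preserves input order
        return dict(zip(contexts.keys(), answers))


def stub_worker(context: str) -> str:
    """Stand-in for a model call; a real worker would send context to an LLM."""
    return "yes" if "restated" in context else "no"


docs = {
    "a.txt": "Earnings were restated in Q3.",
    "b.txt": "No changes this quarter.",
}
results = orchestrate(docs, "Were earnings revised?", stub_worker)
print(results)
```

No worker ever sees another worker's document or the orchestrator's history, so a confusing document can only affect its own answer.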
Consider the difference between asking an agent to scan 1,000 stocks for anomalies versus asking it to write a Python filter function that the runtime then applies to all 1,000 stocks in milliseconds. The first approach burns tokens in proportion to the data size. The second approach burns tokens once to produce a reusable artifact.
The LLM writes the strategy. Python executes it at scale. The mistake is using the LLM for both.
This is the pattern that separates thoughtful agentic architecture from naive LLM pipelines. The model's comparative advantage is reasoning about strategy and writing code that encodes that strategy. The runtime's comparative advantage is applying that strategy at machine speed across arbitrary data volumes, with no further model calls. Conflating these two is like hiring a senior engineer to manually check every row in a database.
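The stock-scanning example can be sketched as follows. The string `GENERATED_SOURCE` is a hypothetical stand-in for what a single model call might return; the three-times-volatility rule, the field names, and the synthetic data are assumptions for illustration. Running generated code with `exec` is shown only to keep the sketch short, and a real system would review or sandbox the artifact first.

```python
import random

# Stand-in for text returned by ONE model call (hypothetical artifact).
GENERATED_SOURCE = '''
def is_anomalous(stock):
    """Flag a stock whose latest move exceeds three times its typical volatility."""
    return abs(stock["last_return"]) > 3 * stock["volatility"]
'''


def load_filter(source: str):
    """Compile the generated artifact into a callable."""
    namespace: dict = {}
    exec(source, namespace)  # assumption: source was reviewed or sandboxed first
    return namespace["is_anomalous"]


# Synthetic universe of 1,000 stocks with one planted anomaly.
rng = random.Random(0)
stocks = [
    {"ticker": f"S{i:04d}", "last_return": rng.gauss(0, 0.01), "volatility": 0.01}
    for i in range(1000)
]
stocks[42]["last_return"] = 0.08

is_anomalous = load_filter(GENERATED_SOURCE)
flagged = [s["ticker"] for s in stocks if is_anomalous(s)]  # zero model calls here
print(len(flagged), "flagged, including S0042:", "S0042" in flagged)
```

Token cost is paid once, when the artifact is written; scanning ten times as many stocks changes only the runtime of the list comprehension.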
Where We Actually Are
A useful analogy for the current moment: we are at the transistor stage.
The mechanical calculator analogy captures something true about the structural limits of LLMs. But the implication of that analogy, that we need a categorically different substrate to proceed, may be wrong. What happened between the first discrete transistors and the microprocessor wasn't a substrate change. It was the same semiconductor physics, radically better organized. The transistor made electronic computation vastly more scalable than vacuum tubes had allowed. The integrated circuit made transistor computation vastly more composable. The microprocessor made integrated circuit computation vastly more accessible.
Each step wasn't a new kind of physics. It was a new abstraction layer that hid the complexity of the previous layer and made the next layer of capability possible.
Right now, building reliable systems with LLMs requires enormous engineering effort. Harnesses. Evals. Multi-agent patterns. Context engineering. Artifact patterns. Human-in-the-loop gates. This is the circuit design work. We are hand-soldering. Every connection made by hand, every failure debugged manually, every reliability property achieved through careful architecture rather than inherent robustness.
The question the transistor analogy raises is: what is the integrated circuit? What abstraction layer makes all this hand-soldering disappear into reliable, composable components that the next layer of builders can take for granted?
Nobody knows yet. The Model Context Protocol (MCP) is a candidate for part of it: a standard interface for tool use, analogous to how a standard chip footprint let you swap components without redesigning the board. Eval frameworks are a candidate for another part: automated quality measurement that replaces manual inspection. The harness patterns we're developing now may simply be the schematics that future abstractions compile from.
From what I have seen, most production AI at banks, hedge funds, and mid-size companies is still glorified RAG pipelines and single-agent chatbots. True multi-agent coordination with feedback loops is genuinely uncommon outside of well-funded AI-native startups. The gap between what's demoed at conferences and what's actually running in production is enormous. The people building closest to the metal are ahead.
What Mathematicians Know That Others Don't
The transistor analogy didn't arrive from nowhere. It emerged from a particular habit of mind — one that treats analogy not as a rhetorical device but as a cognitive tool with load-bearing potential.
Most people who use analogies stop at the felt similarity. Two things remind you of each other, you note the resemblance, you move on. The mathematical instinct is different. It asks: what exactly is preserved? Not "these feel similar" but "what is the invariant that makes them similar, and how far does it extend?"
Thales didn't just observe that a shadow looks like a smaller version of the building casting it. He asked what the precise structure was. The answer is the ratio — the invariant that holds across any similar triangle regardless of scale. Once you've identified the invariant, the analogy stops being a metaphor and becomes a theorem. It becomes load-bearing. You can measure the height of a building you cannot climb by measuring its shadow and a stick you can hold in your hand. Proportionality is the bridge between the tractable and the intractable.
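Written out, the invariant is a single proportion. In the worked line below, H is the building's height, S the length of its shadow, h the height of the stick, and s the length of the stick's shadow; the specific measurements are made up for illustration.

```latex
\[
\frac{H}{S} = \frac{h}{s}
\quad\Longrightarrow\quad
H = S \cdot \frac{h}{s}
\]
\[
h = 1\,\text{m},\quad s = 1.5\,\text{m},\quad S = 45\,\text{m}
\quad\Longrightarrow\quad
H = 45 \cdot \frac{1}{1.5} = 30\,\text{m}
\]
```

Three measurements you can take from the ground determine the one you cannot.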
Mathematics is the discipline that takes an analogy — which starts as a felt similarity — and asks what exactly is preserved. When you find the invariant, the analogy stops being poetry and becomes a tool.
Brownian motion and stock prices are the canonical case. In his 1900 thesis, Bachelier modeled price movements as a random walk, the same mathematics that describes particles suspended in fluid, and he did so five years before Einstein's paper on Brownian motion. The profitable move was not noticing that prices jiggle like particles. It was identifying that both obey the same diffusion equation. Once you have that structural equivalence, you can import decades of physics intuition into finance. Black-Scholes is downstream of that one observation. The analogy didn't just describe something; it transferred a mathematical apparatus wholesale from one domain to another.
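For readers who want the shared structure spelled out, here is one standard textbook way to write it, in modern notation rather than Bachelier's own. W_t is a standard Brownian motion, S_t the price, and μ and σ are the usual drift and volatility parameters; none of these symbols appear in the essay itself.

```latex
% Bachelier-style model: the price itself diffuses
\[
dS_t = \sigma \, dW_t
\]
% Model underlying Black-Scholes: the log-price diffuses
\[
dS_t = \mu S_t \, dt + \sigma S_t \, dW_t
\]
% Shared structure: the density p(x, t) of W_t satisfies the heat equation
\[
\frac{\partial p}{\partial t} = \frac{1}{2} \, \frac{\partial^2 p}{\partial x^2}
\]
```

The last line is the invariant: the same equation governs heat spreading through a rod, a particle wandering in fluid, and the uncertainty about a future price.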
The pattern recurs everywhere once you see it. Spectroscopy uses the light a star emits to infer its temperature, composition, and velocity: something tractable standing in for something unreachable. Carbon dating uses the isotope ratio measured in a sample today to infer how long ago the organism died. Feature extraction in a stock anomaly pipeline uses computable statistics to stand in for latent structure the model can then reason about. In each case, something measurable stands in for something that cannot be directly observed, and an invariant relationship (proportionality, or something more complex) is what makes the substitution valid.
This is why the mechanical calculator analogy was worth taking seriously rather than dismissing as a vague metaphor. The felt similarity was: both have structural limits baked into their substrate. The invariant question was: what exactly is the constraint, and does it require a new substrate to overcome, or just a new abstraction layer? Pushing on that question produced the transistor analogy — a sharper, more load-bearing claim.
The reason analogy-as-cognitive-tool annoys some people is that analogies are often used loosely, as decoration. Someone says "stock prices behave like feathers in the wind" and it sounds like poetry. The listener doesn't see that the speaker may be one step away from a stochastic differential equation. The mathematician's superpower isn't coming up with more analogies. It's knowing which ones are worth formalizing, and having the tools to do it.
Use a known, measurable small thing to infer an unknown, unmeasurable large thing. The shadow is tractable. The building is not. Proportionality is the bridge. This pattern — instrumental analogy — recurs across spectroscopy, carbon dating, statistical inference, and agentic AI architecture. The substrate changes. The move stays the same.
The Value of Hand-Soldering Experience
There is a specific kind of knowledge you accumulate when you build at the hand-soldering stage of a technology. You learn the failure modes that the abstractions will eventually hide. You develop intuitions about what the substrate will and won't do. You understand the tradeoffs that the next generation of tools will make invisible.
The engineers who built systems with vacuum tubes understood electricity in a way that integrated circuit designers could afford not to. That understanding became relevant again when the abstractions broke down — when a chip ran hot, or a timing issue caused intermittent failures, or a power supply problem manifested in mysterious ways. The people who knew what was underneath were the ones who could debug what the abstraction couldn't explain.
The same will be true for applied AI. The patterns being developed now — minimal context workers, self-healing loops with critic agents, artifact-generating autonomous agents, evaluation harnesses with ground truth tables — these are the foundations that the next generation of tools will be built on. Understanding them at the implementation level, not just as conceptual patterns, is what will make the difference when the abstractions break.
Which they will. They always do.
Programming languages were invented to make it easier for humans to tell machines what to do. The argument sometimes made today is that since LLMs write the code, maybe we should optimize for what the machine understands rather than what the human reads. Generate LLVM IR directly. Skip the human-readable layer.
This misunderstands what source code is for. Source code isn't just input to a compiler. It's the artifact that humans maintain, audit, review, and reason about. It's how teams communicate intent across time. It's the ground truth for debugging when production behaves unexpectedly. Removing the human-readable layer would be like navigating by raw GPS coordinates instead of street names — technically more precise, practically catastrophic.
What's actually happening is that the level of abstraction is moving upward, not downward. Humans write intent. LLMs write code. The human-readable representation remains essential — it just moves one layer up the stack.
We are not replacing the abstraction layers. We are adding one more.
And we are, right now, hand-soldering the components that layer will be built from.
That seems like a good place to be.