Field notes / engineering

The 20-step problem — why production AI agents fail, and the architecture that fixes it

Your agent demos beautifully. Five steps, clean output, the room nods. Then it meets production, the task is twenty steps long, and it reaches a correct end-state about a third of the time. Nothing got worse — the model is the same one that demoed. What changed is the arithmetic. A 95% per-step success rate compounds to 36% over twenty steps, and real chains behave worse than that because the errors are correlated. Gartner puts the 2026 agentic-pilot failure rate at 78%. This is the mechanism behind that number, and the architecture that survives it.

12 min read 18 June 2026

Nitish, Founder, Indica Tech

The compounding-error table. A per-step success rate that looks production-grade in isolation collapses across a long chain. 95% per step — the number most agent demos quietly run at — is a 36% end-to-end success rate by step 20.

Executive summary

The single most expensive misunderstanding in production AI right now is that an agent’s reliability is a property of the model. It is not. It is a property of the chain. An individual step that succeeds 95% of the time looks production-ready when you inspect it on its own. String twenty of those steps together into an autonomous task and the chain reaches a correct end-state about 36% of the time, because success multiplies: 0.95 ^ 20 ≈ 0.36. Push the per-step number to a heroic 99% and a hundred-step task still fails 63% of the time. Gartner’s Q1 2026 data put the agentic-pilot failure rate at 78%, and analysts expect over 40% of agentic projects to be cancelled by 2027. The 2026 International AI Safety Report found the same shape from the model side: frontier models succeed reliably on tasks that take minutes and fall apart as tasks stretch into the hours that real work requires. None of this is a reason not to build agents. It is a reason to stop building them as single long chains and start building them the way every other reliable distributed system has been built for decades: bounded units of work, verification at the seams, isolation between components, idempotent retries, and a human gate in front of anything irreversible. This post is the math, then the architecture.

The math nobody runs before the demo

Take a single agent step — read a record, call a tool, parse a result, decide the next action. Suppose it does the right thing 95% of the time. That is a genuinely good step. In a traditional service you would ship it. The mistake is assuming the chain inherits the step’s reliability. It does not. The chain inherits the product of every step’s reliability, because the whole task only succeeds if every step succeeds.

So the arithmetic is p ^ n, where p is per-step success and n is the number of steps. At 95% per step: five steps succeed 77% of the time, ten steps 60%, and twenty steps just 36%. By the time the agent is doing the kind of multi-tool, multi-decision work that justified building an agent in the first place, it is failing the majority of the time — not because any step is bad, but because there are a lot of them. The same model that earned applause at five steps is now below a coin flip.
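The figures above are easy to reproduce. The sketch below computes p ^ n for a few per-step rates and chain lengths, assuming every step fails independently. The function names and the choice of rows and columns are illustrative, not taken from any library.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    return p ** n


def reliability_table(rates, lengths):
    """Return {rate: {length: end-to-end success}} for every combination."""
    return {p: {n: chain_success(p, n) for n in lengths} for p in rates}


if __name__ == "__main__":
    rates = (0.90, 0.95, 0.99)       # per-step success rates (illustrative)
    lengths = (5, 10, 20, 50, 100)   # chain lengths in steps (illustrative)
    table = reliability_table(rates, lengths)

    # Header row, then one row per per-step rate.
    print("per-step  " + "".join(f"{n:>7}" for n in lengths))
    for p in rates:
        print(f"{p:>8.0%}  " + "".join(f"{table[p][n]:>7.0%}" for n in lengths))
```

Rounded to two decimals, `chain_success(0.95, 20)` is 0.36 and `chain_success(0.99, 100)` is 0.37, matching the numbers quoted in the text.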

Two numbers from this table are worth committing to memory. First, 95% per step is a 36% task at 20 steps — the headline of this post. Second, 99% per step is a 37% task at 100 steps — meaning even a near-perfect step rate does not save you once chains get long; it just moves the cliff further out. There is no per-step accuracy that makes a long unguarded chain reliable. The structure has to change.

Why real chains are worse than the table

The p ^ n model is the optimistic case, because it assumes every step fails independently. Production chains do not fail independently. They fail in correlated bursts, and that makes the real-world numbers worse than the arithmetic suggests.

The mechanism is context. When an agent makes a small error in step three — misreads a field, picks a slightly wrong tool, hallucinates a plausible value — that error does not stay in step three. It is written into the context that step four reads. Now step four is reasoning from a premise that is already wrong, and it is more likely to produce its own error, which step five then inherits. DeepMind’s Demis Hassabis described this as “compound interest in reverse,” and it is the right image: the chain does not just accumulate errors, it accelerates into them. A misunderstanding in step one is not one bad step out of twenty — it is a poisoned premise that raises the failure probability of every step after it.
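One way to picture the poisoned premise is a toy two-state model: a step errs with a small probability while the context is clean, and with a much larger one once any earlier step has erred. The two error rates below are assumptions chosen for illustration, not measurements. In this simplified model the chance of a fully clean run is unchanged; what changes is how many wrong steps a failed run piles up.

```python
def expected_bad_steps(n: int, q_clean: float, q_poisoned: float) -> float:
    """Expected number of erroneous steps in an n-step chain.

    Assumed toy model: a step errs with probability q_clean while the context
    is still clean, and with probability q_poisoned once any earlier step has
    erred. The context never recovers on its own.
    """
    p_clean = 1.0   # probability that the context is still clean before this step
    expected = 0.0
    for _ in range(n):
        # This step errs at the clean rate or the poisoned rate,
        # weighted by how likely each context state is.
        expected += p_clean * q_clean + (1.0 - p_clean) * q_poisoned
        p_clean *= 1.0 - q_clean
    return expected


if __name__ == "__main__":
    independent = expected_bad_steps(20, q_clean=0.05, q_poisoned=0.05)
    correlated = expected_bad_steps(20, q_clean=0.05, q_poisoned=0.50)
    print(f"independent errors: {independent:.1f} bad steps on average")
    print(f"poisoned context:   {correlated:.1f} bad steps on average")
```

With independent errors a twenty-step chain averages one bad step. With the assumed 50% error rate after poisoning it averages a little over four, which is the "elaborate structure on top of the early mistake" in numerical form.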

This is why agents in production fail in a recognisable way: they do not stop at the broken step and raise their hand. They confidently keep going, building an increasingly elaborate structure on top of the early mistake, and they end with a fluent, well-formatted, completely wrong result. The fluency is the danger. A traditional system throws an exception; an agent narrates its way to the wrong answer. That is the failure mode the architecture below is designed to interrupt.

Fix 1 — Bounded autonomy

The first move is the cheapest and the one teams resist most, because it feels like admitting the agent is not as autonomous as the pitch deck said. Cap the chain. An agent should have a hard ceiling on how many steps it may take before it must either finish or hand back, and that ceiling should be set deliberately against the reliability table, not left unbounded because “the model will figure it out.”
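Setting the ceiling "against the reliability table" can be done mechanically. The sketch below, with a hypothetical helper name, finds the longest unguarded chain that still meets a target end-to-end success rate, under the same independence assumption as p ^ n.

```python
def max_steps(p: float, target: float) -> int:
    """Largest n such that p ** n >= target.

    p is the per-step success rate (0 < p < 1) and target is the minimum
    acceptable end-to-end success rate (0 < target <= 1).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    n = 0
    # Keep extending the chain while one more step still meets the target.
    while p ** (n + 1) >= target:
        n += 1
    return n


if __name__ == "__main__":
    print(max_steps(0.95, 0.75))  # ceiling for a 95% step and a 75% target
    print(max_steps(0.99, 0.90))  # ceiling for a 99% step and a 90% target
```

At 95% per step a 75% target allows five steps; at 99% per step a 90% target allows ten. Measured per-step rates from your own traces should replace these placeholder inputs.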

Concretely: decompose the big task into sub-tasks that are each short enough to sit in the high-reliability part of the curve, typically a handful of steps, not dozens. A twenty-step task becomes four bounded sub-tasks of five steps each, with a checkpoint between them. Decomposition alone changes nothing: four 77% sub-tasks chained blindly still multiply out to 36%. The gain comes from the seams, where a failure is caught and contained instead of silently propagating to the end. If each sub-task gets one retry after a failed check, the check is reliable, and the retry is an independent attempt, each sub-task succeeds about 95% of the time and the four together about 81%, against 36% for the unguarded chain. Bounded autonomy is not the opposite of agency. It is the thing that makes agency survive contact with a real task.
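The effect of checkpoints with retries can be worked out directly. This sketch assumes an idealised gate that never passes a bad result and retries that are independent of the failed attempt; both assumptions are optimistic, so treat the outputs as an upper bound. The function names are illustrative.

```python
def unit_success(p_step: float, steps: int, attempts: int) -> float:
    """Chance that a bounded unit passes its gate within the allowed attempts."""
    p_unit = p_step ** steps                  # one clean pass through the unit
    return 1.0 - (1.0 - p_unit) ** attempts   # at least one attempt succeeds


def gated_chain_success(p_step: float, steps_per_unit: int,
                        units: int, attempts: int) -> float:
    """End-to-end success when every unit must pass its gate in sequence."""
    return unit_success(p_step, steps_per_unit, attempts) ** units


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        rate = gated_chain_success(0.95, steps_per_unit=5, units=4, attempts=attempts)
        print(f"{attempts} attempt(s) per unit: {rate:.0%}")
```

With a single attempt per unit the result is the familiar 36%. Allowing two attempts gives roughly 81%, and three attempts roughly 95%, for the same model and the same twenty steps.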

Fix 2 — Verification gates at the seams

A checkpoint is only useful if something checks it. The second fix is a verification gate between bounded units — a deliberate step whose only job is to ask “is the state still correct?” before the next unit is allowed to build on it. This is the single highest-leverage pattern in agent reliability, and it is the one most builds skip entirely.

Verification does not have to be expensive. The gate can be a deterministic check where one exists (does the SQL parse, does the JSON match the schema, does the total reconcile, does the referenced file exist), an LLM-as-judge rubric scored against the sub-task’s acceptance criteria where it does not, or a cross-check by a second model with a different prompt. The point is that the gate either passes the state forward, triggers a bounded retry of the failed unit, or escalates — and crucially, it stops the poisoned premise from reaching the next unit. Every gate you add converts a silent compounding failure into a caught, local one. That is the entire game: catch errors at the seam, where they are cheap, instead of at the end, where they are a wrong answer with a customer attached.
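A minimal sketch of such a gate, using a deterministic JSON check as the verifier. The `Verdict` enum, `json_has_fields`, and `run_gated` are hypothetical names for illustration; an LLM-as-judge or cross-model check would slot in as a different `check` callable.

```python
import json
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    PASS = "pass"
    RETRY = "retry"
    ESCALATE = "escalate"


def json_has_fields(raw: str, required: dict) -> bool:
    """Deterministic check: raw parses as a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def gate(output: str, check: Callable[[str], bool],
         attempt: int, max_attempts: int) -> Verdict:
    """Pass good state forward, retry locally while attempts remain, else escalate."""
    if check(output):
        return Verdict.PASS
    return Verdict.RETRY if attempt < max_attempts else Verdict.ESCALATE


def run_gated(unit: Callable[[], str], check: Callable[[str], bool],
              max_attempts: int = 3) -> Tuple[Verdict, Optional[str]]:
    """Run one bounded unit behind a gate; unverified output is never returned."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    verdict = Verdict.ESCALATE
    for attempt in range(1, max_attempts + 1):
        output = unit()
        verdict = gate(output, check, attempt, max_attempts)
        if verdict is Verdict.PASS:
            return verdict, output
        if verdict is Verdict.ESCALATE:
            break
    return verdict, None
```

The design choice worth noting is that `run_gated` returns `None` on escalation rather than the last bad output, so the next unit cannot accidentally build on it.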

Fix 3 — Subagent isolation

When you do decompose, isolate. A subagent handed one bounded sub-task should receive only the context it needs and return only the result the parent needs — not the entire running transcript of everything every other subagent has done. This matters for reliability, not just tidiness: a shared, ever-growing context is exactly the channel through which one subagent’s early error contaminates another’s reasoning. Isolation breaks the correlation.

The pattern is an orchestrator that owns the plan and the verified state, and subagents that are stateless workers against a clean brief. The orchestrator passes a subagent a tight, validated input; the subagent does its bounded work; the orchestrator verifies the output before merging it back into the canonical state. Errors are contained inside the subagent that made them, surfaced at the merge gate, and never silently inherited by a sibling. This is the same separation of concerns that makes microservices debuggable — applied to the place agents actually break.
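The orchestrator and clean-brief pattern can be sketched in a few lines. Everything here (`Brief`, `Orchestrator`, `delegate`) is a hypothetical shape for illustration, with plain callables standing in for the subagent and the merge gate.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass(frozen=True)
class Brief:
    """Everything a subagent is allowed to see: one task and its validated inputs."""
    task: str
    inputs: dict


@dataclass
class Orchestrator:
    state: dict = field(default_factory=dict)     # canonical, verified state only
    rejected: list = field(default_factory=list)  # outputs that failed the merge gate

    def delegate(self, task: str, needs: List[str],
                 worker: Callable[[Brief], Any],
                 verify: Callable[[Any], bool], out_key: str) -> bool:
        """Run one subagent against a clean brief and merge only verified output."""
        # Copy just the named keys so the worker never sees, or mutates,
        # the rest of the canonical state.
        inputs = {key: copy.deepcopy(self.state[key]) for key in needs}
        result = worker(Brief(task=task, inputs=inputs))
        if verify(result):
            self.state[out_key] = result
            return True
        # Failed output is kept for inspection but never enters the state.
        self.rejected.append((task, result))
        return False
```

A worker called this way receives no transcript and no sibling output, so an error it makes can only reach the rest of the system through the `verify` step.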

Fix 4 — Idempotent steps and replay

Bounded retries only help if retrying is safe. The fourth fix is borrowed wholesale from distributed systems: make each step idempotent, and make the chain replayable. If a step calls a tool that sends an email, charges a card, or writes a row, then a naive retry after a transient failure will send the email twice, charge the card twice, write the row twice. The reliability table already told you retries will be frequent; without idempotency keys, every retry is a new way to corrupt state.

Give every side-effecting step an idempotency key so a repeat is a no-op rather than a duplicate. Persist the verified state after each bounded unit so a failure resumes from the last good checkpoint instead of restarting the whole chain — restarting a twenty-step task from scratch on every hiccup is how a 36% task becomes a 0% one. The combination — idempotent side effects plus checkpointed, replayable state — is what lets you retry aggressively at the seams without the retries themselves becoming the incident. It is the same discipline behind the queue layer in the fallback ladder and the idempotency section of production-grade retry logic, pointed at agent steps.
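A minimal sketch of both halves, using in-memory dictionaries as stand-ins for durable storage. `SideEffectLog`, `idempotency_key`, and `run_chain` are hypothetical names; a real deployment would persist both stores and, where the downstream API supports it, pass the key to the provider as well.

```python
import hashlib
import json


class SideEffectLog:
    """In-memory stand-in for a durable store keyed by idempotency key."""

    def __init__(self):
        self._results = {}

    def run_once(self, key, action):
        """Run action at most once per key; a repeat returns the stored result."""
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


def idempotency_key(run_id, step_name, payload):
    """Derive a stable key from the run, the step, and the step's payload."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{run_id}:{step_name}:{body}".encode()).hexdigest()


def run_chain(units, checkpoint):
    """Run (name, fn) units in order, skipping any already recorded as done.

    checkpoint is a dict with 'done' (list of unit names) and 'state' (dict).
    If a unit raises, the checkpoint still holds the last verified state, so
    calling run_chain again resumes from there instead of starting over.
    """
    for name, fn in units:
        if name in checkpoint["done"]:
            continue
        new_state = fn(dict(checkpoint["state"]))  # work on a copy of the state
        checkpoint["state"] = new_state
        checkpoint["done"].append(name)
    return checkpoint["state"]
```

Calling `run_once` three times with the same key performs the side effect once. Calling `run_chain` again after a mid-chain exception re-runs only the unit that failed and the ones after it.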

Fix 5 — Human gates on high-blast-radius actions

Not every step is equal. Reading a record and sending a refund are not the same risk, and the architecture should not treat them the same. The fifth fix is to classify actions by blast radius and require a human approval gate in front of the irreversible, high-consequence ones — the same triage logic behind capping hallucination exposure: you do not need a human on 100% of steps, you need a human on the 100% of steps that can actually hurt you.

In a regulated deployment this is non-negotiable, and it is also where the trust comes from. A FinTech agent that drafts the customer remediation and routes it to a human before it sends is both safer and more sellable than one that sends autonomously and is right 36% of the time. The pattern is: low-blast-radius steps run autonomously, high-blast-radius steps pause for an approval with full context attached, and the agent is designed from the start to operate in a world where some of its actions require a human yes. The teams that get this right do not bolt the human on after an incident — they decide the blast-radius map before the first line of agent code, the same way they decide the identity and privilege the agent runs under.
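The blast-radius map can be as simple as a lookup table consulted before every action. The action names and the two-level classification below are assumptions for illustration; the `approve` callable stands in for whatever approval workflow a team actually uses.

```python
from enum import Enum
from typing import Callable


class BlastRadius(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical map, decided before the build. The action names are examples only.
BLAST_RADIUS = {
    "read_record": BlastRadius.LOW,
    "draft_remediation": BlastRadius.LOW,
    "send_customer_email": BlastRadius.HIGH,
    "send_refund": BlastRadius.HIGH,
}


def classify(action: str) -> BlastRadius:
    """Unknown actions default to HIGH so that new tools fail safe."""
    return BLAST_RADIUS.get(action, BlastRadius.HIGH)


def execute(action: str, context: dict,
            do: Callable[[], None],
            approve: Callable[[str, dict], bool]) -> str:
    """Run low-risk actions directly; pause high-risk ones for a human decision."""
    if classify(action) is BlastRadius.HIGH and not approve(action, context):
        return "blocked"
    do()
    return "executed"
```

Defaulting unmapped actions to `HIGH` is a deliberate choice in this sketch: a tool added later without a classification pauses for approval instead of running unattended.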

The pattern, in one place

Run this as a design review before you ship an agent, and as an audit on one that is already misbehaving. Score your current build against each of the five items.

1 — Bounded autonomy. Hard ceiling on chain length. Big tasks decomposed into short, high-reliability sub-tasks with checkpoints between them. No unbounded “keep going until done” loops.

2 — Verification gates. A deterministic check, judge rubric, or cross-model check between every bounded unit. The gate passes forward, retries locally, or escalates — it never lets unverified state propagate.

3 — Subagent isolation. An orchestrator owns the plan and the verified state. Subagents get a tight brief and return a tight result. No shared, ever-growing transcript.

4 — Idempotent steps and replay. Idempotency keys on every side-effecting step. Verified state checkpointed after each unit so failures resume from the last good point.

5 — Human gates on blast radius. Actions classified by consequence. Irreversible, high-consequence steps pause for a human approval with full context. Decided before the build, not after the incident.

Risks and what to avoid

Don’t try to fix it with a better model. The most common reaction to a flaky agent is to swap in the newest, highest-benchmark model and hope. It moves the cliff out by a few steps and costs more per call. Chain success decays exponentially with length, so an incremental gain in per-step accuracy cannot keep up: lifting the per-step rate from 95% to 97% takes a twenty-step chain from 36% to about 54%, still close to a coin flip. On long chains, structure beats model.

Don’t mistake a longer prompt for a verification gate. Adding “think carefully and double-check your work” to the system prompt is not a gate — it is the same fallible step marking its own homework inside the same context that already contains the error. A real gate is a separate step, ideally with different logic or a different model, that can fail the work and stop the chain.

Don’t let the agent self-report success. Agents are fluent, and fluency reads as competence. An agent that says “task complete” has told you nothing about whether the task is correct. Measure end-to-end success against an external acceptance check, not against the agent’s own summary of what it did.
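A sketch of what measuring against an external check might look like. The run records and the acceptance predicate are hypothetical; the point is only that the claimed rate and the verified rate are computed separately and the second one is the number that counts.

```python
def success_rates(runs, acceptance_check):
    """Return (claimed_rate, verified_rate) over a batch of agent runs.

    runs is an iterable of (claimed_done, end_state) pairs, where claimed_done
    is the agent's own report and end_state is whatever the run left behind.
    acceptance_check is an external predicate over end_state.
    """
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    claimed = sum(1 for said_done, _ in runs if said_done)
    verified = sum(1 for _, end_state in runs if acceptance_check(end_state))
    return claimed / len(runs), verified / len(runs)


if __name__ == "__main__":
    # Illustrative data: the agent reports success on three of four runs.
    runs = [
        (True, {"total": 100}),
        (True, {"total": 90}),
        (True, {"total": 100}),
        (False, {"total": 0}),
    ]
    claimed, verified = success_rates(runs, lambda state: state["total"] == 100)
    print(f"claimed {claimed:.0%}, verified {verified:.0%}")
```

On this made-up batch the agent claims 75% and the external check confirms 50%. The gap between the two is the quantity to track over time.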

Don’t skip idempotency because “it usually works the first time.” The reliability table guarantees retries. The first time a retry double-charges a customer or double-files a ticket, you will wish the idempotency key had been five lines of work up front. It is cheaper than the incident, always.

What good looks like — one quarter from now

The big autonomous task is gone, replaced by an orchestrator running a sequence of bounded sub-tasks. Each sub-task sits in the high-reliability part of the curve and is verified at its seam before the next one begins. Subagents work from clean briefs and return clean results; no early error silently contaminates a later step. Every side-effecting action is idempotent, and a failure resumes from the last verified checkpoint instead of restarting from zero. The handful of genuinely irreversible actions pause for a human, with full context, every time. End-to-end success is measured against an external check, not the agent’s own optimism — and it reads in the high nineties, not the mid thirties, because the architecture stopped the compounding instead of hoping the model would.

That agent is slower than the demo. It is also one you can put in front of a regulated customer, because when it fails it fails at a seam, locally and visibly, instead of narrating its way to a confident wrong answer twenty steps deep. That difference — failing safely at the seam versus failing silently at the end — is the entire distance between a demo and a production system.

Final thought

2026 is being sold as the year of the agent, and the failure-rate numbers are being read as a verdict on the technology. They are not. They are a verdict on the architecture. The agents failing 78% of the time are mostly single long chains run at a per-step accuracy that the compounding math was always going to defeat. The fix is not a better model or a more clever prompt. It is the boring, decades-old engineering discipline of bounding the work, verifying at the seams, isolating the components, making retries safe, and keeping a human in front of the irreversible. Do that, and the same model that failed at twenty steps succeeds — because you stopped asking it to be reliable across twenty steps and started building a system that is reliable around it.

Is your agent a single long chain — or an architecture?

Indica Tech’s two-week agent reliability audit maps your agent’s chain length against the compounding-error curve, scores it on the five-part pattern above, identifies the steps where errors compound silently and the actions that need a human gate, and gives you a 90-day implementation roadmap with named owners. Fixed price £3,500. Written report. Whether you hire us for the remediation or not.
