Tell me if this sounds familiar: you’ve set up a comprehensive suite of benchmarks, and your agent is sitting at a triumphant 95% accuracy on your eval set; then you wire it into a ten-step workflow, only to find that your actual success rate in production is 60%.
That is the unforgiving mathematics of sequential composition, and most teams building AI agent workflows are not accounting for it.
Shaun Moran has a name for the gap between how reliable teams think their agents are and how reliable the end-to-end workflow actually is: the "17x Error Trap". Carnegie Mellon's 2025 study on agent performance in realistic office tasks found failure rates that bear this out: systems that look credible on individual benchmarks collapse when chained into anything resembling real work.
The arithmetic lurking behind these results is simple, stark, and sobering, and it’s what we will cover in this post.
When agents are chained together into a sequence, each agent conditions its output on the input it receives, which is the previous agent's output. When step two misinterprets something ambiguous, step three does not get a chance to correct it. Step three inherits the misinterpretation as its premise and reasons forward from there. Errors are not independent events that might offset each other across a pipeline; they are inherited, and for that reason they compound.
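To make the inheritance concrete, here is a toy simulation, ours rather than anything from the studies cited below: a chain succeeds only if every step succeeds, because the first bad step poisons everything downstream.

```python
import random

def run_chain(per_step_reliability: float, n_steps: int) -> bool:
    """Simulate one pass through a sequential agent chain."""
    for _ in range(n_steps):
        if random.random() > per_step_reliability:
            # A bad output becomes the next step's premise: no recovery.
            return False
    return True

# A "95% reliable" agent, chained ten times.
trials = 100_000
successes = sum(run_chain(0.95, 10) for _ in range(trials))
print(f"end-to-end success rate: {successes / trials:.1%}")  # about 60%, i.e. 0.95 ** 10
```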
A recent paper from Google DeepMind on scaling agent systems (Kim et al., 2025) formalizes this: as you compose agents, reliability degrades structurally. The failure is not necessarily contained in any individual component; it emerges from the act of composition itself. More agents, more handoffs, more steps: each step is another opportunity for error to enter the system, with no guarantee of a corresponding opportunity for it to leave.
Here is what that product looks like across the configurations teams actually deploy:
| Per-agent reliability | 5 steps | 10 steps | 20 steps |
| --- | --- | --- | --- |
| 90% | 59% | 35% | 12% |
| 95% | 77% | 60% | 36% |
| 99% | 95% | 90% | 82% |
| 99.9% | 99.5% | 99% | 98% |
To hit 90% end-to-end reliability across a ten-step workflow, each agent needs to clear roughly 99%. As distressing as it might be to hear, most production agents are nowhere near that. Worse, the collapse is steepest exactly where teams are most ambitious about chain length, which is exactly where agent systems look like the most attractive targets for robust, powerful automation.
The 95% column is worth staring at. A 95% agent is one that most teams would ship, but a 60% end-to-end workflow is one that most teams would not.
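If you want to reproduce the table, or run the numbers for your own chain length, the arithmetic fits in a few lines. The function names below are just illustrative.

```python
def end_to_end(per_step: float, n_steps: int) -> float:
    """End-to-end reliability of a sequential chain with no correction."""
    return per_step ** n_steps

def required_per_step(target: float, n_steps: int) -> float:
    """Per-step reliability needed to hit an end-to-end target."""
    return target ** (1 / n_steps)

print(f"{end_to_end(0.95, 10):.1%}")         # 59.9%: the 60% workflow in the table
print(f"{required_per_step(0.90, 10):.2%}")  # 98.95%: why each agent needs ~99%
```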
Thus far, reliability has been our focus, but it is not the only thing that compounds with chain length, and all of it shows up somewhere you notice. A non-exhaustive list of compound effects to keep an eye out for:

- Cost. Every step is another model call, and every retry or re-run multiplies the token spend, which is how a workflow ends up costing a multiple of its projection.
- Latency. Sequential steps wait on one another, so each handoff you add is wall-clock time the end user sits through.
- Debugging time. A wrong answer at the end of a long chain means walking back through every handoff to find the step that went wrong, which is how one incident eats a day.
A workflow that is 60% reliable, three times more expensive than projected, and takes a day to debug per incident is a liability masquerading as automation when viewed from a distance.
Gartner's June 2025 forecast that over 40% of agentic AI projects will be canceled by the end of 2027 is this math playing out at the portfolio level. Teams build, deploy, watch the numbers come in, and eventually conclude that the economics do not work — not because agents cannot do the task, but because the composition of agents doing the task is structurally too expensive and too unreliable to sustain.
The most uncomfortable part of the compound reliability problem is that it does not announce itself. A structural failure is consistent with each agent passing its eval in isolation and each span in the production trace looking clean. No single component appears to be broken, because it’s not. Per-component testing will not surface these problems, and neither will basic distributed tracing. There is nothing for a conventional observability platform to grab onto, just a wrong answer at the end of a pipeline that looks, from every individual vantage point, like it worked.
This is the class of problem the MAST taxonomy (Cemri et al., 2025) was built to name. Multi-agent failures are systematically missed by conventional tooling because conventional tooling was built to catch execution failure, not reasoning failure. When everything executed and the answer is still wrong, the instrumentation has no opinion.
We went deeper on this in a recent post on agentic debugging, where compound reliability decay is one of five failure modes that standard observability misses entirely.
The compound reliability problem is real. It is not, however, inevitable.
A ten-step workflow composed one way degrades to 60%. Composed another way, the same task does not produce the same decay curve, because the curve is a function of topology as much as of model quality. Some architectural patterns concentrate reasoning in fewer, higher-stakes decisions; others distribute it through handoffs that can be verified and bounded; still others introduce feedback loops that let the system recover from a bad intermediate step instead of inheriting it. The math in the table above assumes the simplest case: independent sequential steps, no correction, and no structural resistance to error propagation. That case is common, but it is not the only option.
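As one illustration of how topology bends the curve, here is a back-of-the-envelope sketch. It assumes something generous that the table does not: a verifier at each handoff that reliably catches a bad output, plus one retry whose success is independent of the first attempt. Neither assumption is free in practice, but the shape of the improvement is the point.

```python
def effective_step_reliability(p: float, retries: int = 1) -> float:
    """A step fails only if the original attempt and every retry fail,
    assuming a verifier that always catches a bad output."""
    return 1 - (1 - p) ** (retries + 1)

p, n = 0.95, 10
unchecked = p ** n                             # about 0.60
verified = effective_step_reliability(p) ** n  # about 0.975
print(f"unchecked chain: {unchecked:.1%}, verified handoffs: {verified:.1%}")
```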
Picking the right composition pattern is the difference between an agent workflow that looks like automation and one that looks like a 60% reliability problem you are paying for three times over.
We walk through the architectural patterns that change this calculus — and the failure modes (including this one) that standard tooling misses — in our upcoming webinar, “Navigating the Multi-Agent Trap”. It drops on May 7, 2026, at 11:00 AM PT / 2:00 PM ET; register here if you’re interested in joining us.