pith. machine review for the scientific record.

arxiv: 2605.08563 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Why Retrying Fails: Context Contamination in LLM Agent Pipelines

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords context contamination · LLM agents · retry mechanisms · error propagation · budget allocation · tool-augmented tasks · success probability · cascade overhead

The pith

Retrying LLM agents after failure contaminates context and raises per-step error rates from a base level ε0 to a higher fixed ε1, requiring new formulas for success probability and budget splits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when an LLM agent fails a multi-step tool task and retries, the failed attempt stays in the context window and increases the chance of errors on the next try. It formalizes this as the Context-Contaminated Restart Model and derives exact expressions for overall success within K attempts plus the extra attempts caused by the contamination. A reader would care because common retry strategies in agent systems perform worse than expected under this effect, and the model supplies concrete ways to choose pipeline depth versus number of retries for a fixed total budget.

Core claim

Under the Context-Contaminated Restart Model a task consists of T tool-call steps that each fail with probability ε0 on a clean attempt but with elevated probability ε1 > ε0 on every subsequent attempt because prior failures remain in context. The model supplies a closed-form probability of succeeding in at most K attempts, a theorem for the additional attempts ΔK caused by contamination, an optimal depth T* = sqrt(B * log(1/(1-ε1)) / log(1/(1-ε0))) that maximizes success for fixed budget B = K T, an information-theoretic lower bound showing this K is tight up to a constant, and a theorem quantifying the gain from clearing context before each retry. Validation on SWE-bench Verified data shows the IID model overestimates pass@3 by 17.4 percentage points, while CCRM fits with error below 0.001, implying a cascade ratio ε1/ε0 = 7.1.

What carries the argument

The Context-Contaminated Restart Model (CCRM): a sequence of T tool-call steps in which any failure leaves the context contaminated, raising the per-step failure probability from base rate ε0 to a constant higher rate ε1 for all later attempts.

If this is right

  • Exact closed-form formula for the probability of succeeding within at most K attempts under contamination.
  • Cascade-overhead theorem that states the precise number of extra attempts required compared with a clean-restart baseline.
  • Optimal budget-allocation theorem that identifies the pipeline depth T* maximizing success for any fixed total budget B = K T.
  • Information-theoretic lower bound via Le Cam's method showing the required number of attempts is within a constant factor of the best possible.
  • Clean-restart dominance theorem that quantifies the exact improvement obtained by clearing context before each retry.
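The first and last items can be sketched directly from the model definition. Below is a minimal numeric check of the closed form against simulation, with illustrative parameter values rather than the paper's fitted ones:

```python
import random

def p_success_ccrm(eps0, eps1, T, K):
    """Closed-form P(succeed within K attempts) under CCRM:
    attempt 1 runs in clean context, every later attempt contaminated."""
    p0 = (1 - eps0) ** T          # clean attempt succeeds
    p1 = (1 - eps1) ** T          # contaminated attempt succeeds
    return 1 - (1 - p0) * (1 - p1) ** (K - 1)

def p_success_clean(eps0, T, K):
    """Clean-restart baseline: every attempt runs at the base rate."""
    return 1 - (1 - (1 - eps0) ** T) ** K

def simulate(eps0, eps1, T, K, trials=200_000, seed=0):
    """Monte Carlo estimate of the same quantity, attempt by attempt."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        contaminated = False
        for _ in range(K):
            eps = eps1 if contaminated else eps0
            if all(rng.random() >= eps for _ in range(T)):
                wins += 1
                break
            contaminated = True   # a failed attempt stays in context
    return wins / trials

eps0, eps1, T, K = 0.03, 0.20, 5, 4   # illustrative rates, not the paper's fit
exact = p_success_ccrm(eps0, eps1, T, K)
mc = simulate(eps0, eps1, T, K)
print(f"closed form {exact:.4f}  monte carlo {mc:.4f}  "
      f"clean restart {p_success_clean(eps0, T, K):.4f}")
```

The gap between the contaminated and clean-restart numbers for the same budget is exactly what the clean-restart dominance theorem quantifies.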

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent implementations should add context-clearing or summarization steps before retries to avoid paying the contamination penalty.
  • The optimal T* formula can be used directly once practitioners estimate ε0 and ε1 from their own logs or benchmarks.
  • The same contamination pattern may appear in other sequential LLM workflows such as long chain-of-thought reasoning or multi-turn planning.
  • Testing variable ε1 that depends on failure type or history length would be a direct extension of the constant-rate assumption.
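The second point above can be made concrete. A sketch of the stated closed form, where the ε values and budget are hypothetical stand-ins for log-derived estimates:

```python
import math

def optimal_depth(eps0, eps1, B):
    """Paper's stated closed form for fixed budget B = K*T:
    T* = sqrt(B * log(1/(1-eps1)) / log(1/(1-eps0))), K* = B / T*.
    Round to a feasible integer split in practice."""
    t_star = math.sqrt(B * math.log(1 / (1 - eps1)) / math.log(1 / (1 - eps0)))
    return t_star, B / t_star

# hypothetical rates estimated from a practitioner's own agent logs
eps0, eps1 = 0.03, 0.20
T_star, K_star = optimal_depth(eps0, eps1, B=100)
print(f"T* = {T_star:.1f} steps per attempt, K* = {K_star:.1f} attempts")
```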

Load-bearing premise

The elevated error rate after contamination is a fixed constant ε1 that does not depend on the details of the failure or on how many prior attempts have occurred.

What would settle it

Measure the actual per-step success rate on the first attempt versus on immediate retries for the same agent and tasks on SWE-bench or a similar benchmark; the ratio of those rates should equal the model's ε1/ε0 if the central claim holds.
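As a rough consistency check of this test against the numbers the paper reports, one can invert the closed form on the published Verdent pass@1/pass@3 figures. The pipeline depth T below is an assumed value, not one stated here; other depths shift the implied ratio modestly:

```python
import math

def implied_rates(pass1, pass3, T):
    """Invert the CCRM closed form: pass@1 = (1-eps0)^T, and
    pass@3 = 1 - (1-pass1)*(1-p1)^2 with p1 = (1-eps1)^T."""
    p1 = 1 - math.sqrt((1 - pass3) / (1 - pass1))  # contaminated per-attempt success
    eps0 = 1 - pass1 ** (1 / T)
    eps1 = 1 - p1 ** (1 / T)
    return eps0, eps1

# Verdent SWE-bench Verified numbers quoted in the paper; T = 8 is assumed
eps0, eps1 = implied_rates(0.761, 0.812, T=8)
print(f"eps0 = {eps0:.3f}, eps1 = {eps1:.3f}, cascade ratio = {eps1 / eps0:.1f}")
```

With the assumed depth of eight steps the implied ratio lands near the paper's reported 7.1, which is the kind of agreement the proposed measurement would either confirm or break.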

Figures

Figures reproduced from arXiv: 2605.08563 by Zhanfu Yang.

Figure 1
Figure 1: CCRM: attempt k has T steps with per-step error rate εi, where i = Zk ∈ {0, 1}. A failed attempt flips Zk+1 = 1, elevating the error rate.
Figure 2
Figure 2: Synthetic validation (n = 30,000 MC trials, diamonds). (a) Formula matches simulation for all cascade strengths. (b) Cascade overhead diverges (phase transition) as ε1/ε0 increases. (c) Optimal T* agrees with simulated optimum.
Figure 3
Figure 3: Real-data validation. (a) CCRM fit to Verdent [14] SWE-bench Verified: pass@1 = 0.761, pass@3 = 0.812 (consecutive attempts, same context). IID overestimates by 0.174 at K = 3. (b) For independent fresh-start retries (USEagent [19], OpenHands), IID overestimates due to correlated task difficulty, a distinct phenomenon.
Original abstract

When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Context-Contaminated Restart Model (CCRM) for LLM agent pipelines. In this model, a chain of T tool-call steps each fail with base rate ε₀, but after failure the next attempt has elevated error rate ε₁ > ε₀ due to context contamination. The authors derive (R1) closed-form P(success in ≤K attempts), (R2) cascade-overhead theorem, (R3) optimal T* = sqrt(B log(1/(1-ε₁))/log(1/(1-ε₀))) for budget B=KT, (R4) Le Cam lower bound, (R5) clean-restart dominance. Validation on SWE-bench Verified shows CCRM error <0.001 vs IID overestimating pass@3 by 17.4pp, with cascade ratio 7.1.

Significance. The results provide a formal treatment of a common practical issue in LLM agents. The closed-form theorems and tight empirical validation (error <0.001) are notable strengths, as is the explicit optimal-allocation result. If the key assumptions hold, the framework could guide retry strategies in agent systems.

major comments (2)
  1. [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.
  2. [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.
minor comments (1)
  1. [Abstract] Add explicit equation numbers when referring to the closed forms in the abstract and results summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the concerns are valid, and outline specific revisions that will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.

    Authors: We agree that the constancy of ε₁ is a modeling assumption required for the closed-form results, including the optimal T* derivation. The current validation relies on an aggregate fit to the full SWE-bench Verified trajectories. In the revised manuscript we will add a new subsection that stratifies the observed per-step error rates by retry index (first attempt, second attempt, etc.) and by error category where possible. This will provide a direct empirical check on whether ε₁ remains approximately constant. The results of this stratification will be reported, and if material deviations appear we will qualify the applicability of the optimality theorem accordingly. revision: yes

  2. Referee: [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.

    Authors: We acknowledge that fitting and evaluating on the identical dataset limits the strength of the predictive claims. While the primary empirical contribution is the demonstration that CCRM explains the observed data far better than the IID baseline, we agree that an independent test is needed to support the optimal-allocation result. In the revision we will perform a hold-out experiment: parameters will be fit on 70% of the SWE-bench tasks and the predicted success probabilities, cascade overhead, and optimal T* will be evaluated on the remaining 30%. The results of this out-of-sample check will be added to the empirical section. revision: yes
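A toy version of the proposed hold-out check, simulated at the attempt level since the real trajectories are not available here. The per-attempt rates stand in for fitted values (p1 ≈ 0.113 is a value consistent with the reported pass@1/pass@3); the split and sample size are illustrative:

```python
import random

def run_task(p0, p1, K, rng):
    """One task under CCRM at the attempt level: attempt 1 succeeds w.p. p0,
    later (contaminated) attempts w.p. p1. Returns attempt index or None."""
    if rng.random() < p0:
        return 1
    for k in range(2, K + 1):
        if rng.random() < p1:
            return k
    return None

def fit_and_holdout(p0_true=0.761, p1_true=0.113, K=3, n=5000, seed=1):
    rng = random.Random(seed)
    tasks = [run_task(p0_true, p1_true, K, rng) for _ in range(n)]
    split = int(0.7 * n)                     # 70/30 split as in the rebuttal
    train, test = tasks[:split], tasks[split:]
    # fit attempt-level success rates on the training split only
    p0_hat = sum(t == 1 for t in train) / len(train)
    retried = [t for t in train if t != 1]
    p1_hat = sum(t == 2 for t in retried) / len(retried)
    # predict pass@K out of sample and compare with the held-out rate
    pred = 1 - (1 - p0_hat) * (1 - p1_hat) ** (K - 1)
    obs = sum(t is not None for t in test) / len(test)
    return pred, obs

pred, obs = fit_and_holdout()
print(f"predicted pass@3 {pred:.3f} vs held-out {obs:.3f}")
```

On real logs the same comparison would be the out-of-sample evidence the referee asks for; here it merely shows the bookkeeping of the proposed experiment.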

Circularity Check

0 steps flagged

No circularity: mathematical derivation of T* is independent of the data fit

full rationale

The paper defines the CCRM with fixed parameters ε0 and ε1, then derives the closed-form T* by optimizing the explicit success-probability expression P(success within K) = 1 − [1 − (1−ε0)^T] · [1 − (1−ε1)^T]^(K−1) under the budget constraint B = K T. This is a standard calculus exercise on the model equations and does not reduce to the SWE-bench fit. The fit is performed afterward solely for model validation and to obtain numerical ε values; it is not an input to the proof of the closed form. No self-citation, self-definition, or renaming of fitted quantities as predictions occurs in the derivation chain.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The central claims rest on two fitted error rates and the assumption of constant state-dependent failure probabilities; the optimal allocation theorem is derived from these quantities.

free parameters (3)
  • epsilon_0
    Base per-step failure probability in clean context; estimated from data to fit observed success rates.
  • epsilon_1
    Elevated per-step failure probability in contaminated context; estimated from data to fit observed success rates.
  • cascade_ratio
    Ratio epsilon_1 / epsilon_0 reported as 7.1 after fitting to SWE-bench Verified data.
axioms (2)
  • domain assumption Each tool-call step fails independently with a constant probability that depends only on whether the context is clean or contaminated.
    Core modeling choice stated in the CCRM definition.
  • domain assumption Contamination persists uniformly for the entire retry attempt once a failure has occurred.
    Assumed when defining the elevated error rate ε1 for subsequent attempts.
invented entities (1)
  • Contaminated context state no independent evidence
    purpose: To represent the mechanism that elevates error rate after a failed attempt.
    New state introduced to explain the difference between clean and retry performance.

pith-pipeline@v0.9.0 · 5625 in / 1888 out tokens · 99247 ms · 2026-05-12T01:53:48.597038+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley (2006)
  2. [2] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.R.: SWE-bench: Can language models resolve real-world GitHub issues? In: Proc. ICLR (2024)
  3. [3] Le Cam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1(1), 38–53 (1973)
  4. [4] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., et al.: ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv:2307.16789 (2023)
  5. [5] Rausand, M., Barros, A., Hoyland, A.: System Reliability Theory, 3rd edn. Wiley (2020)
  6. [6] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. In: Proc. NeurIPS (2023)
  7. [7] Trivedi, K.S.: Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edn. Wiley (2002)
  8. [8] Wang, L., Ma, C., Feng, X., Zhang, Z., et al.: A survey on large language model based autonomous agents. Front. Comput. Sci. 18(6), 186345 (2024)
  9. [9] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proc. ICLR (2023)
  10. [10] Wang, F., Liu, H., Dai, Z., Zeng, J., et al.: AgentTTS: LLM agent for test-time compute-optimal scaling in complex tasks. arXiv:2508.00890 (2025)
  11. [11] Cemri, M., Pan, M.Z., Yang, S., Agrawal, L.A., et al.: Why do multi-agent LLM systems fail? A systematic study. In: Proc. NeurIPS D&B (2025)
  12. [12] Chen, Z., Zheng, Y., Lai, Z., Yang, Z., Li, C., Liu, Y., Lin, L.: Quadratic coreset selection: certifying and reconciling sequence and token mining for efficient instruction tuning. In: Proc. NeurIPS (2025)
  13. [13] Noël, V.: Catching contamination before generation: spectral kill switches for agents. arXiv:2511.05804 (2025)
  14. [14] Verdent AI: SWE-bench Verified Technical Report: 76.1% pass@1 and 81.2% pass@3. https://www.verdent.ai/blog/swe-bench-verified-technical-report (2025)
  15. [15] Liu, T., Wang, Z., Miao, J., Hsu, I., Yan, J., Chen, J., Han, R., Xu, F., Chen, Y., Jiang, K., Daruki, S., Liang, Y., Wang, W.Y., Pfister, T., Lee, C.Y.: Budget-aware tool-use enables effective agent scaling. arXiv:2511.17006 (2025)
  16. [16] Yang, Z., et al.: Language models as implicit tree search. In: Proc. ICML (2025)
  17. [17] Zhu, K., Liu, Z., Li, B., Tian, M., Yang, Y., Zhang, J., Han, P., Xie, Q., Cui, F., Zhang, W., et al.: Where LLM agents fail and how they can learn from failures. arXiv:2509.25370 (2025)
  18. [18] Lee, N., Erdogan, L.E., John, C.J., Krishnapillai, S., Mahoney, M.W., Keutzer, K., Gholami, A.: Agentic test-time scaling for web agents. arXiv:2602.12276 (2026)
  19. [19] Applis, L., et al.: USEagent: Unified software engineering agent. In: Proc. ICSE (2026)
  20. [20] Li, Y., Deng, W., Li, J., Li, X.: Spend less, reason better: budget-aware value tree search for LLM agents. arXiv:2603.12634 (2026)
  21. [21] Li, X., Ming, R., Setlur, P., Paladugu, A., Tang, A., Kang, H., Shao, S., Jin, R., Xiong, C.: Benchmark test-time scaling of general LLM agents. arXiv:2602.18998 (2026)
  22. [22] Datadog: State of AI Engineering 2026. https://www.datadoghq.com/state-of-ai-engineering/ (2026)
  23. [23] Khanal, A., Tao, Y., Zhou, J.: Beyond pass@1: A reliability science framework for long-horizon LLM agents. arXiv:2603.29231 (2026)
  24. [24] Wang, X.J., Bai, H., Sun, Y., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., Nowak, R.D.: The long-horizon task mirage: diagnosing where and why agentic systems break. arXiv:2604.11978 (2026)
  25. [25] LogRocket Blog: The LLM context problem in 2026. https://blog.logrocket.com/llm-context-problem-strategies-2026/ (2026)
  26. [26] Fan, F.X., Tan, C., Wattenhofer, R., Ong, Y.S.: Information fidelity in tool-using LLM agents: a martingale analysis of the Model Context Protocol. In: Proc. AAMAS (2026). arXiv:2602.13320
  27. [27] Patel, K., Surendira, S., George, J., Kapale, S.: The Six Sigma agent: enterprise-grade reliability via consensus-driven decomposed execution. arXiv:2601.22290 (2026)
  28. [28] Tran-Truong, P.T., Le, X.B.: Measuring the unmeasurable: Markov chain reliability for LLM agents. arXiv:2604.24579 (2026)