pith. machine review for the scientific record.

arxiv: 2605.08563 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Why Retrying Fails: Context Contamination in LLM Agent Pipelines

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords context contamination · LLM agents · retry mechanisms · error propagation · budget allocation · tool-augmented tasks · success probability · cascade overhead

The pith

Retrying LLM agents after failure contaminates context and raises per-step error rates from a base level ε0 to a higher fixed ε1, requiring new formulas for success probability and budget splits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when an LLM agent fails a multi-step tool task and retries, the failed attempt stays in the context window and increases the chance of errors on the next try. It formalizes this as the Context-Contaminated Restart Model and derives exact expressions for overall success within K attempts plus the extra attempts caused by the contamination. A reader would care because common retry strategies in agent systems perform worse than expected under this effect, and the model supplies concrete ways to choose pipeline depth versus number of retries for a fixed total budget.

Core claim

Under the Context-Contaminated Restart Model a task consists of T tool-call steps that each fail with probability ε0 on a clean attempt but with elevated probability ε1 > ε0 on every subsequent attempt because prior failures remain in context. The model supplies a closed-form probability of succeeding in at most K attempts, a theorem for the additional attempts ΔK caused by contamination, an optimal depth T* = sqrt(B * log(1/(1-ε1)) / log(1/(1-ε0))) that maximizes success for fixed budget B = K T, an information-theoretic lower bound showing this K is tight up to a constant, and a theorem quantifying the gain from clearing context before each retry. Validation on SWE-bench Verified data shows the IID model overestimates pass@3 by 17.4 percentage points, while CCRM fits with error below 0.001, implying a cascade ratio ε1/ε0 = 7.1.

What carries the argument

The Context-Contaminated Restart Model (CCRM): a sequence of T tool-call steps in which any failure leaves the context contaminated, raising the per-step failure probability from base rate ε0 to a constant higher rate ε1 for all later attempts.

If this is right

  • Exact closed-form formula for the probability of succeeding within at most K attempts under contamination.
  • Cascade-overhead theorem that states the precise number of extra attempts required compared with a clean-restart baseline.
  • Optimal budget-allocation theorem that identifies the pipeline depth T* maximizing success for any fixed total budget B = K T.
  • Information-theoretic lower bound via Le Cam's method showing the required number of attempts is within a constant factor of the best possible.
  • Clean-restart dominance theorem that quantifies the exact improvement obtained by clearing context before each retry.
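The first and last items can be sketched directly from the model definition. Below is a minimal numeric check of the closed form against simulation, with illustrative parameter values rather than the paper's fitted ones:

```python
import random

def p_success_ccrm(eps0, eps1, T, K):
    """Closed-form P(succeed within K attempts) under CCRM:
    attempt 1 runs in clean context, every later attempt contaminated."""
    p0 = (1 - eps0) ** T          # clean attempt succeeds
    p1 = (1 - eps1) ** T          # contaminated attempt succeeds
    return 1 - (1 - p0) * (1 - p1) ** (K - 1)

def p_success_clean(eps0, T, K):
    """Clean-restart baseline: every attempt runs at the base rate."""
    return 1 - (1 - (1 - eps0) ** T) ** K

def simulate(eps0, eps1, T, K, trials=200_000, seed=0):
    """Monte Carlo estimate of the same quantity, attempt by attempt."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        contaminated = False
        for _ in range(K):
            eps = eps1 if contaminated else eps0
            if all(rng.random() >= eps for _ in range(T)):
                wins += 1
                break
            contaminated = True   # a failed attempt stays in context
    return wins / trials

eps0, eps1, T, K = 0.03, 0.20, 5, 4   # illustrative rates, not the paper's fit
exact = p_success_ccrm(eps0, eps1, T, K)
mc = simulate(eps0, eps1, T, K)
print(f"closed form {exact:.4f}  monte carlo {mc:.4f}  "
      f"clean restart {p_success_clean(eps0, T, K):.4f}")
```

The gap between the contaminated and clean-restart numbers for the same budget is exactly what the clean-restart dominance theorem quantifies.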

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent implementations should add context-clearing or summarization steps before retries to avoid paying the contamination penalty.
  • The optimal T* formula can be used directly once practitioners estimate ε0 and ε1 from their own logs or benchmarks.
  • The same contamination pattern may appear in other sequential LLM workflows such as long chain-of-thought reasoning or multi-turn planning.
  • Testing variable ε1 that depends on failure type or history length would be a direct extension of the constant-rate assumption.
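The second point above can be made concrete. A sketch of the stated closed form, where the ε values and budget are hypothetical stand-ins for log-derived estimates:

```python
import math

def optimal_depth(eps0, eps1, B):
    """Paper's stated closed form for fixed budget B = K*T:
    T* = sqrt(B * log(1/(1-eps1)) / log(1/(1-eps0))), K* = B / T*.
    Round to a feasible integer split in practice."""
    t_star = math.sqrt(B * math.log(1 / (1 - eps1)) / math.log(1 / (1 - eps0)))
    return t_star, B / t_star

# hypothetical rates estimated from a practitioner's own agent logs
eps0, eps1 = 0.03, 0.20
T_star, K_star = optimal_depth(eps0, eps1, B=100)
print(f"T* = {T_star:.1f} steps per attempt, K* = {K_star:.1f} attempts")
```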

Load-bearing premise

The elevated error rate after contamination is a fixed constant ε1 that does not depend on the details of the failure or on how many prior attempts have occurred.

What would settle it

Measure the actual per-step success rate on the first attempt versus on immediate retries for the same agent and tasks on SWE-bench or a similar benchmark; the ratio of those rates should equal the model's ε1/ε0 if the central claim holds.
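As a rough consistency check of this test against the numbers the paper reports, one can invert the closed form on the published Verdent pass@1/pass@3 figures. The pipeline depth T below is an assumed value, not one stated here; other depths shift the implied ratio modestly:

```python
import math

def implied_rates(pass1, pass3, T):
    """Invert the CCRM closed form: pass@1 = (1-eps0)^T, and
    pass@3 = 1 - (1-pass1)*(1-p1)^2 with p1 = (1-eps1)^T."""
    p1 = 1 - math.sqrt((1 - pass3) / (1 - pass1))  # contaminated per-attempt success
    eps0 = 1 - pass1 ** (1 / T)
    eps1 = 1 - p1 ** (1 / T)
    return eps0, eps1

# Verdent SWE-bench Verified numbers quoted in the paper; T = 8 is assumed
eps0, eps1 = implied_rates(0.761, 0.812, T=8)
print(f"eps0 = {eps0:.3f}, eps1 = {eps1:.3f}, cascade ratio = {eps1 / eps0:.1f}")
```

With the assumed depth of eight steps the implied ratio lands near the paper's reported 7.1, which is the kind of agreement the proposed measurement would either confirm or break.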

Figures

Figures reproduced from arXiv: 2605.08563 by Zhanfu Yang.

Figure 1
Figure 1: CCRM: attempt k has T steps with per-step error rate εi, where i = Zk ∈ {0, 1}. A failed attempt flips Zk+1 = 1, elevating the error rate.
Figure 2
Figure 2: Synthetic validation (n = 30,000 MC trials, diamonds). (a) Formula matches simulation for all cascade strengths. (b) Cascade overhead diverges (phase transition) as ε1/ε0 increases. (c) Optimal T* agrees with simulated optimum.
Figure 3
Figure 3: Real-data validation. (a) CCRM fit to Verdent [14] SWE-bench Verified: pass@1 = 0.761, pass@3 = 0.812 (consecutive attempts, same context). IID overestimates by 0.174 at K = 3. (b) For independent fresh-start retries (USEagent [19], OpenHands), IID overestimates due to correlated task difficulty, a distinct phenomenon.
Original abstract

When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Context-Contaminated Restart Model (CCRM) for LLM agent pipelines. In this model, a chain of T tool-call steps each fail with base rate ε₀, but after failure the next attempt has elevated error rate ε₁ > ε₀ due to context contamination. The authors derive (R1) closed-form P(success in ≤K attempts), (R2) cascade-overhead theorem, (R3) optimal T* = sqrt(B log(1/(1-ε₁))/log(1/(1-ε₀))) for budget B=KT, (R4) Le Cam lower bound, (R5) clean-restart dominance. Validation on SWE-bench Verified shows CCRM error <0.001 vs IID overestimating pass@3 by 17.4pp, with cascade ratio 7.1.

Significance. The results provide a formal treatment of a common practical issue in LLM agents. The closed-form theorems and tight empirical validation (error <0.001) are notable strengths, as is the explicit optimal-allocation result. If the key assumptions hold, the framework could guide retry strategies in agent systems.

major comments (2)
  1. [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.
  2. [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.
minor comments (1)
  1. [Abstract] Add explicit equation numbers when referring to the closed forms in the abstract and results summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the concerns are valid, and outline specific revisions that will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Model Definition and (R3)] The assumption that the contaminated error rate ε₁ is fixed and does not depend on the specific failure details or history length is central to all derived results, including the optimal T* formula in (R3). The manuscript provides no empirical test of this constancy (e.g., by stratifying fits by attempt index or error type), relying instead on a single aggregate fit to SWE-bench data. This makes the optimality theorem's applicability conditional on an unverified assumption.

    Authors: We agree that the constancy of ε₁ is a modeling assumption required for the closed-form results, including the optimal T* derivation. The current validation relies on an aggregate fit to the full SWE-bench Verified trajectories. In the revised manuscript we will add a new subsection that stratifies the observed per-step error rates by retry index (first attempt, second attempt, etc.) and by error category where possible. This will provide a direct empirical check on whether ε₁ remains approximately constant. The results of this stratification will be reported, and if material deviations appear we will qualify the applicability of the optimality theorem accordingly. revision: yes

  2. Referee: [Empirical Validation] The parameters ε₀ and ε₁ are fitted to the SWE-bench Verified data, which is also used to compute the reported cascade ratio of 7.1 and the fit error <0.001. This circularity means the validation does not independently confirm the model's predictive power for the optimal allocation; a hold-out or cross-validation approach would strengthen the claims.

    Authors: We acknowledge that fitting and evaluating on the identical dataset limits the strength of the predictive claims. While the primary empirical contribution is the demonstration that CCRM explains the observed data far better than the IID baseline, we agree that an independent test is needed to support the optimal-allocation result. In the revision we will perform a hold-out experiment: parameters will be fit on 70% of the SWE-bench tasks and the predicted success probabilities, cascade overhead, and optimal T* will be evaluated on the remaining 30%. The results of this out-of-sample check will be added to the empirical section. revision: yes
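A toy version of the proposed hold-out check, simulated at the attempt level since the real trajectories are not available here. The per-attempt rates stand in for fitted values (p1 ≈ 0.113 is a value consistent with the reported pass@1/pass@3); the split and sample size are illustrative:

```python
import random

def run_task(p0, p1, K, rng):
    """One task under CCRM at the attempt level: attempt 1 succeeds w.p. p0,
    later (contaminated) attempts w.p. p1. Returns attempt index or None."""
    if rng.random() < p0:
        return 1
    for k in range(2, K + 1):
        if rng.random() < p1:
            return k
    return None

def fit_and_holdout(p0_true=0.761, p1_true=0.113, K=3, n=5000, seed=1):
    rng = random.Random(seed)
    tasks = [run_task(p0_true, p1_true, K, rng) for _ in range(n)]
    split = int(0.7 * n)                     # 70/30 split as in the rebuttal
    train, test = tasks[:split], tasks[split:]
    # fit attempt-level success rates on the training split only
    p0_hat = sum(t == 1 for t in train) / len(train)
    retried = [t for t in train if t != 1]
    p1_hat = sum(t == 2 for t in retried) / len(retried)
    # predict pass@K out of sample and compare with the held-out rate
    pred = 1 - (1 - p0_hat) * (1 - p1_hat) ** (K - 1)
    obs = sum(t is not None for t in test) / len(test)
    return pred, obs

pred, obs = fit_and_holdout()
print(f"predicted pass@3 {pred:.3f} vs held-out {obs:.3f}")
```

On real logs the same comparison would be the out-of-sample evidence the referee asks for; here it merely shows the bookkeeping of the proposed experiment.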

Circularity Check

0 steps flagged

No circularity: mathematical derivation of T* is independent of the data fit

full rationale

The paper defines the CCRM with fixed parameters ε0 and ε1, then derives the closed-form T* by optimizing the explicit success-probability expression P(success within K) = 1 − [1 − (1−ε0)^T] · [1 − (1−ε1)^T]^(K−1) under the budget constraint B = K T. This is a standard calculus exercise on the model equations and does not reduce to the SWE-bench fit. The fit is performed afterward solely for model validation and to obtain numerical ε values; it is not an input to the proof of the closed form. No self-citation, self-definition, or renaming of fitted quantities as predictions occurs in the derivation chain.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The central claims rest on two fitted error rates and the assumption of constant state-dependent failure probabilities; the optimal allocation theorem is derived from these quantities.

free parameters (3)
  • epsilon_0
    Base per-step failure probability in clean context; estimated from data to fit observed success rates.
  • epsilon_1
    Elevated per-step failure probability in contaminated context; estimated from data to fit observed success rates.
  • cascade_ratio
    Ratio epsilon_1 / epsilon_0 reported as 7.1 after fitting to SWE-bench Verified data.
axioms (2)
  • domain assumption Each tool-call step fails independently with a constant probability that depends only on whether the context is clean or contaminated.
    Core modeling choice stated in the CCRM definition.
  • domain assumption Contamination persists uniformly for the entire retry attempt once a failure has occurred.
    Assumed when defining the elevated error rate ε1 for subsequent attempts.
invented entities (1)
  • Contaminated context state no independent evidence
    purpose: To represent the mechanism that elevates error rate after a failed attempt.
    New state introduced to explain the difference between clean and retry performance.

pith-pipeline@v0.9.0 · 5625 in / 1888 out tokens · 99247 ms · 2026-05-12T01:53:48.597038+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley (2006)
  2. [2] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.R.: SWE-bench: Can language models resolve real-world GitHub issues? In: Proc. ICLR (2024)
  3. [3] Le Cam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1(1), 38–53 (1973)
  4. [4] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., et al.: ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv:2307.16789 (2023)
  5. [5] Rausand, M., Barros, A., Hoyland, A.: System Reliability Theory, 3rd edn. Wiley (2020)
  6. [6] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. In: Proc. NeurIPS (2023)
  7. [7] Trivedi, K.S.: Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edn. Wiley (2002)
  8. [8] Wang, L., Ma, C., Feng, X., Zhang, Z., et al.: A survey on large language model based autonomous agents. Front. Comput. Sci. 18(6), 186345 (2024)
  9. [9] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proc. ICLR (2023)
  10. [10] Wang, F., Liu, H., Dai, Z., Zeng, J., et al.: AgentTTS: LLM agent for test-time compute-optimal scaling in complex tasks. arXiv:2508.00890 (2025)
  11. [11] Cemri, M., Pan, M.Z., Yang, S., Agrawal, L.A., et al.: Why do multi-agent LLM systems fail? A systematic study. In: Proc. NeurIPS D&B (2025)
  12. [12] Chen, Z., Zheng, Y., Lai, Z., Yang, Z., Li, C., Liu, Y., Lin, L.: Quadratic coreset selection: certifying and reconciling sequence and token mining for efficient instruction tuning. In: Proc. NeurIPS (2025)
  13. [13] Noël, V.: Catching contamination before generation: spectral kill switches for agents. arXiv:2511.05804 (2025)
  14. [14] Verdent AI: SWE-bench Verified Technical Report: 76.1% pass@1 and 81.2% pass@3. https://www.verdent.ai/blog/swe-bench-verified-technical-report (2025)
  15. [15] Liu, T., Wang, Z., Miao, J., Hsu, I., Yan, J., Chen, J., Han, R., Xu, F., Chen, Y., Jiang, K., Daruki, S., Liang, Y., Wang, W.Y., Pfister, T., Lee, C.Y.: Budget-aware tool-use enables effective agent scaling. arXiv:2511.17006 (2025)
  16. [16] Yang, Z., et al.: Language models as implicit tree search. In: Proc. ICML (2025)
  17. [17] Zhu, K., Liu, Z., Li, B., Tian, M., Yang, Y., Zhang, J., Han, P., Xie, Q., Cui, F., Zhang, W., et al.: Where LLM agents fail and how they can learn from failures. arXiv:2509.25370 (2025)
  18. [18] Lee, N., Erdogan, L.E., John, C.J., Krishnapillai, S., Mahoney, M.W., Keutzer, K., Gholami, A.: Agentic test-time scaling for web agents. arXiv:2602.12276 (2026)
  19. [19] Applis, L., et al.: USEagent: Unified software engineering agent. In: Proc. ICSE (2026)
  20. [20] Li, Y., Deng, W., Li, J., Li, X.: Spend less, reason better: budget-aware value tree search for LLM agents. arXiv:2603.12634 (2026)
  21. [21] Li, X., Ming, R., Setlur, P., Paladugu, A., Tang, A., Kang, H., Shao, S., Jin, R., Xiong, C.: Benchmark test-time scaling of general LLM agents. arXiv:2602.18998 (2026)
  22. [22] Datadog: State of AI Engineering 2026. https://www.datadoghq.com/state-of-ai-engineering/ (2026)
  23. [23] Khanal, A., Tao, Y., Zhou, J.: Beyond pass@1: A reliability science framework for long-horizon LLM agents. arXiv:2603.29231 (2026)
  24. [24] Wang, X.J., Bai, H., Sun, Y., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., Nowak, R.D.: The long-horizon task mirage: diagnosing where and why agentic systems break. arXiv:2604.11978 (2026)
  25. [25] LogRocket Blog: The LLM context problem in 2026. https://blog.logrocket.com/llm-context-problem-strategies-2026/ (2026)
  26. [26] Fan, F.X., Tan, C., Wattenhofer, R., Ong, Y.S.: Information fidelity in tool-using LLM agents: a martingale analysis of the Model Context Protocol. In: Proc. AAMAS (2026). arXiv:2602.13320
  27. [27] Patel, K., Surendira, S., George, J., Kapale, S.: The Six Sigma agent: enterprise-grade reliability via consensus-driven decomposed execution. arXiv:2601.22290 (2026)
  28. [28] Tran-Truong, P.T., Le, X.B.: Measuring the unmeasurable: Markov chain reliability for LLM agents. arXiv:2604.24579 (2026)