pith. machine review for the scientific record.

arxiv: 2601.21684 · v2 · submitted 2026-01-29 · 💻 cs.CL · cs.LG

Recognition: 1 theorem link · Lean Theorem

Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: test-time scaling · recycling search experience · LLM reasoning · experience bank · rollout reuse · efficient inference · math benchmarks

The pith

Recycling rollout experience into a shared bank turns isolated LLM searches into a cumulative process that reduces redundant computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Recycling Search Experience (RSE), a self-guided and training-free method that distills raw search trajectories into a shared experience bank. This bank supports positive recycling of useful intermediate conclusions to avoid repeating derivations and negative recycling of failure patterns to skip known dead ends. The approach is analyzed theoretically for efficiency gains over independent sampling and tested empirically on math and reasoning benchmarks where it outperforms baselines under matched compute budgets. If the distillation works as claimed, test-time scaling shifts from throwing away each rollout to building reusable knowledge across trials.
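The recycling loop described here can be sketched as a batched search in which each batch both consults and extends a shared bank. This is an illustrative reconstruction, not the paper's implementation: `solve` and `distill` stand in for the LLM rollout and the distillation prompt, and the bank layout is an assumption.

```python
# Hypothetical sketch of the RSE loop; all names (rse_search, solve,
# distill, the bank's dict layout) are illustrative, not the paper's API.

def rse_search(problem, solve, distill, n_batches=4, batch_size=8):
    """Run rollouts in batches, recycling distilled experience between batches."""
    bank = {"positive": [], "negative": []}  # shared experience bank
    best = None
    for _ in range(n_batches):
        # every rollout in this batch can consult the bank built so far
        trajectories = [solve(problem, bank) for _ in range(batch_size)]
        for traj in trajectories:
            if traj["solved"]:
                best = best or traj
                bank["positive"].extend(distill(traj))  # reusable conclusions
            else:
                bank["negative"].extend(distill(traj))  # failure patterns
    return best, bank
```

In this sketch the bank only grows between batches, so rollouts within one batch remain independent, matching the batched structure the figures describe.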

Core claim

By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretical analysis formalizes the efficiency gains of RSE over independent sampling in solving complex reasoning tasks, while experiments on HMMT24, HMMT25, IMO-Bench, and HLE show consistent outperformance under comparable computational budgets.

What carries the argument

The shared experience bank that distills positive and negative insights from raw trajectories to guide subsequent searches in a self-guided manner.

If this is right

  • Search shifts from isolated trials to an accumulating process that reuses prior insights.
  • Redundant re-derivations of already-discovered conclusions are avoided across multiple attempts.
  • Known dead ends are pruned early, freeing compute for unexplored branches.
  • Solve rates on hard reasoning problems increase without raising the total inference budget.
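As a toy illustration of why recycling can beat independent sampling at a matched rollout count (a simplified model, not the paper's theoretical analysis): treat a problem as a chain of subgoals, each cracked with probability p per attempt, and compare rollouts that restart from scratch against rollouts that bank solved subgoals.

```python
# Toy Monte Carlo comparison; the chain-of-subgoals model and all
# parameter values are illustrative assumptions, not from the paper.
import random

def independent_success(p_step, n_steps, budget, rng):
    """Each rollout must rediscover every step from scratch."""
    for _ in range(budget):
        if all(rng.random() < p_step for _ in range(n_steps)):
            return True
    return False

def recycled_success(p_step, n_steps, budget, rng):
    """Steps solved once are banked and skipped in later rollouts."""
    solved = set()
    for _ in range(budget):
        for step in range(n_steps):
            if step in solved:
                continue
            if rng.random() < p_step:
                solved.add(step)
            else:
                break  # dead end: this rollout stops here
        if len(solved) == n_steps:
            return True
    return False

rng = random.Random(0)
trials = 2000
ind = sum(independent_success(0.5, 6, 8, rng) for _ in range(trials)) / trials
rec = sum(recycled_success(0.5, 6, 8, rng) for _ in range(trials)) / trials
```

Under this model the recycled variant solves the chain far more often at the same budget of 8 rollouts, because partial progress is never discarded.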

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recycling idea could apply to search in code generation or theorem proving outside the tested benchmarks.
  • Experience bank size and update frequency may need explicit controls when search horizons grow longer.
  • RSE could be combined with other test-time methods such as self-consistency or tree search to compound gains.

Load-bearing premise

Raw trajectories can be reliably distilled into a shared experience bank without introducing errors or biases that cancel out the efficiency gains.

What would settle it

A direct comparison where maintaining and consulting the experience bank produces lower accuracy than independent sampling at the same total rollout count, due to propagated distillation errors.

Figures

Figures reproduced from arXiv: 2601.21684 by Boyuan Pan, Chuyi Tan, Jiayi Shi, Ji Zhang, Kan Li, Peiwen Yuan, Shaoxiong Feng, Xinglin Wang, Yao Hu, Yiwei Li, Yueqi Zhang.

Figure 1: From memoryless rollouts to experience-guided search. (Left) Existing test-time scaling paradigms (parallel, sequential, and hybrid) largely treat rollouts as disposable: intermediate conclusions are repeatedly re-derived and dead ends are revisited across rollouts. (Right) Recycling Search Experience (RSE) runs rollouts in batches, distills reusable trajectory information into a shared Experience Bank, an… view at source ↗
Figure 2: Scalability and Efficiency Analysis of Test-Time Search. We evaluate the scaling behaviors of different search strategies across three dimensions: (a) search depth, (b) search width, and (c) computational efficiency. view at source ↗
Figure 3: Non-truncated Pass@1 across varying-difficulty problems. Samples are stratified by baseline Non-truncated Pass@1, with gray bars indicating the sample count distribution per bin. In the extremely-hard bracket, while PaCoRe stagnates due to a behavioral collapse towards passive reference verification, RSE sustains performance gains by preserving independent exploration capacity. Analysis of reason… view at source ↗
Figure 4: Word Cloud Analysis of Reasoning Content. Left: RSE; Right: PaCoRe. The Verification-Centric Bottleneck: the PaCoRe word cloud (Right) is dominated by meta-cognitive verification terms such as "reference", "verify", and "check". This lexical distribution reveals a fundamental shift in the model's behavior: instead of engaging in independent problem-solving, the model repurposes its compute budget to vali… view at source ↗
Figure 5: Truncated rate across varying-difficulty problems. C.5 Quality Verification of Distilled Experiences: to verify the reliability of our distilled experiences, we employed GEMINI-3-PRO-PREVIEW (DeepMind, 2025) as an automated validator. The model was prompted to evaluate each experience against the original problem, determining whether Positive Experiences are mathematically valid and whether Negative Experience… view at source ↗
Figure 6: Analysis of Reasoning Components. The figure illustrates the problem statement, key positive constraints (Green), and critical failure modes (Red). The bottom section displays reasoning slices where the model successfully utilizes the intermediate conclusions (marked ①, ②, ④) and actively avoids the identified failure patterns (marked ③, ⑤, ⑥). In conclusion, this case exemplifies how RSE optim… view at source ↗
Figure 7: Default System Prompt. We apply this system instruction across all evaluated models to enforce step-by-step reasoning and standardized answer formatting. E.2 PaCoRe Input Serialization Template: You are given a problem and a list of reference responses. Your job is to analyze these references and provide your own response. Original Problem: {{ original prompt }} Reference Responses: {% for response in ref r… view at source ↗
Figure 8: Input Serialization Template for PaCoRe. Adopted from the PaCoRe implementation, this template embeds the current problem x (denoted as original prompt) and the reference message set M (denoted as ref responses) into the model's context via Jinja2 syntax. view at source ↗
Figure 9: Prompt for Experience Distillation. view at source ↗
Figure 10: Prompt for Experience-Guided Solver. E.5 Experience Validation Prompt: [System Prompt] You are a rigorous mathematical validator. Your task is to evaluate whether each given mathematical statement is logically valid and correct in the context of the provided problem. Instructions: 1. Carefully read the original problem. 2. Analyze each statement in the provided list. 3. For each statement, determine if it … view at source ↗
Figure 11: Prompt for Experience Validation. view at source ↗
read the original abstract

Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This wasted rollout-level experience leads to substantial computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative, experience-guided process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines under comparable computational budgets, establishing a strong compute-efficiency frontier for test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Recycling Search Experience (RSE), a self-guided, training-free approach that distills raw trajectories from test-time rollouts into a shared experience bank. Positive recycling reuses intermediate conclusions to shortcut derivations, while negative recycling prunes known dead ends. A theoretical analysis formalizes efficiency gains relative to independent sampling, and experiments on HMMT24, HMMT25, IMO-Bench, and HLE report consistent outperformance over strong baselines under matched computational budgets.

Significance. If the efficiency claims hold, RSE would represent a practical advance in test-time scaling by converting disposable rollouts into reusable experience, potentially lowering the compute required for complex reasoning without additional training. The training-free, self-guided design is a notable strength if the distillation process can be shown to remain reliable.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (theoretical analysis): the formalization of efficiency gains over independent sampling assumes that distilled items provide reliable positive shortcuts and negative prunings, but the analysis does not appear to quantify the overhead or error rate introduced by self-guided extraction of intermediate conclusions from raw trajectories.
  2. [§3] §3 (method): the shared experience bank is described as accumulating items without a verification, decay, or conflict-resolution mechanism; this leaves open the possibility that locally consistent but globally invalid steps are recycled, which would directly undermine the claimed reduction in total search cost on hard instances.
  3. [§5] §5 (experiments): the reported gains on HMMT24/25, IMO-Bench, and HLE are presented under comparable budgets, yet the manuscript does not detail how the experience bank size, retrieval cost, or distillation frequency are accounted for in the compute budget, making it difficult to confirm that the efficiency frontier is strictly superior rather than an artifact of unmeasured overhead.
minor comments (2)
  1. [§3] Notation for the experience bank (e.g., how items are keyed and retrieved) should be introduced earlier and used consistently across the theoretical and experimental sections.
  2. [§5] The abstract claims 'extensive experiments'; the main text should include an ablation isolating the contributions of positive versus negative recycling to clarify which component drives the observed gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (theoretical analysis): the formalization of efficiency gains over independent sampling assumes that distilled items provide reliable positive shortcuts and negative prunings, but the analysis does not appear to quantify the overhead or error rate introduced by self-guided extraction of intermediate conclusions from raw trajectories.

    Authors: We agree that §4 presents an idealized analysis assuming reliable extraction to establish the efficiency bounds relative to independent sampling. This choice highlights the potential gains when distillation succeeds. In the revised manuscript we will augment §4 with a new subsection that incorporates an error-rate term into the formalization and reports the observed extraction accuracy (measured as the fraction of distilled items that are consistent with ground-truth solutions on a held-out subset of trajectories). Empirical error rates from our runs on HMMT24/25 will be included to quantify the gap between the idealized bound and observed performance. revision: partial

  2. Referee: [§3] §3 (method): the shared experience bank is described as accumulating items without a verification, decay, or conflict-resolution mechanism; this leaves open the possibility that locally consistent but globally invalid steps are recycled, which would directly undermine the claimed reduction in total search cost on hard instances.

    Authors: The concern is valid: without safeguards, erroneous items could accumulate. The current design relies on the fact that items are only added from trajectories that produced a final answer (positive) or explicit failure (negative), and retrieval is gated by embedding similarity. To address the risk of globally invalid steps, we will revise §3 to introduce (i) a lightweight self-consistency check during distillation (re-querying the model on the extracted step) and (ii) an exponential decay on item utility scores. These additions remain training-free and will be accompanied by an ablation showing their effect on final accuracy. revision: partial
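The two safeguards proposed in this response could look like the following sketch; the decay factor, utility floor, and bank layout are illustrative assumptions, not details given in the paper or rebuttal.

```python
# Hypothetical sketch of the rebuttal's two safeguards: a self-consistency
# gate before insertion and exponential decay of item utility scores.
# gamma, floor, and init_utility are illustrative values.

def maybe_add(bank, item, consistency_check, init_utility=1.0):
    """Insert a distilled item only if it passes a self-consistency re-check."""
    if consistency_check(item):
        bank.append({"content": item, "utility": init_utility})
    return bank

def decay_utilities(bank, gamma=0.9, floor=0.05):
    """Decay every item's utility each batch; evict items below the floor."""
    for item in bank:
        item["utility"] *= gamma
    return [item for item in bank if item["utility"] >= floor]
```

The decay ensures that items which are never re-validated or re-used eventually drop out of the bank instead of accumulating indefinitely.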

  3. Referee: [§5] §5 (experiments): the reported gains on HMMT24/25, IMO-Bench, and HLE are presented under comparable budgets, yet the manuscript does not detail how the experience bank size, retrieval cost, or distillation frequency are accounted for in the compute budget, making it difficult to confirm that the efficiency frontier is strictly superior rather than an artifact of unmeasured overhead.

    Authors: We acknowledge that the original experimental section did not break down the overhead components. In the revised §5 we will add a dedicated paragraph and supplementary table that reports: average bank size per problem, wall-clock time for retrieval and distillation, and the fraction of total FLOPs attributable to these operations (measured at <4 % across all benchmarks). Updated efficiency curves will be plotted after subtracting this overhead, confirming that the reported gains remain strictly superior to the baselines under the corrected budgets. revision: yes
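The overhead-corrected budget comparison promised here amounts to simple bookkeeping; a minimal sketch, with an illustrative function and numbers since the actual FLOP accounting is not in the reviewed text:

```python
# Illustrative overhead accounting; the function name and the split into
# retrieval vs. distillation FLOPs are assumptions for demonstration.

def effective_budget(total_flops, retrieval_flops, distill_flops):
    """Split a total compute budget into solver FLOPs and bank overhead.

    Returns (FLOPs left for rollouts, overhead fraction of the budget).
    """
    overhead = retrieval_flops + distill_flops
    return total_flops - overhead, overhead / total_flops
```

An efficiency curve plotted against the first return value, rather than the nominal budget, is what would confirm the claimed strict superiority.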

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained

full rationale

The paper introduces RSE as a training-free distillation of raw trajectories into an experience bank for positive and negative recycling during test-time search. Its theoretical section formalizes efficiency gains relative to independent sampling via direct comparison of search costs, without reducing to fitted parameters, self-definitions, or prior self-citations as load-bearing premises. Empirical claims rest on explicit budget-matched comparisons against baselines on HMMT24/25, IMO-Bench, and HLE rather than any renaming or ansatz smuggling. No step in the provided derivation equates a claimed prediction or uniqueness result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone; the method is presented as training-free and self-guided.

pith-pipeline@v0.9.0 · 5534 in / 961 out tokens · 33407 ms · 2026-05-16T09:50:55.353813+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  3. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 2 Pith papers · 4 internal anchors

    Output your decisions as a Python-style boolean list in the following format: <decision>[True, False, True, ...]</decision> Important: - The list must contain exactly the same number of boolean values as the number of statements provided. - Use True if the statement is CORRECT, False if it is INCORRECT or FLAWED. - For propositions: Check if the intermedi...