pith. sign in

arxiv: 2605.25143 · v2 · pith:5EKQ5MWNnew · submitted 2026-05-24 · 💻 cs.AI · cs.LG

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Pith reviewed 2026-06-30 11:03 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords test-time scalingstochastic backtrackingprocess reward modelsequential Monte Carlomathematical reasoningpersistent pooltoken efficiency
0
0 comments X

The pith

Stochastic backtracking on a persistent pool of past prefixes raises accuracy per token in test-time scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that test-time search can avoid premature pruning by revisiting historical solution prefixes instead of expanding only the current frontier. Frontier-only PRM methods discard states based on noisy scores and lose paths that could still reach correct answers. Two mechanisms—subpool selection inside random groups and power-resampling over the full historical pool—give earlier prefixes a second chance without exploding compute. If the claim holds, the same final accuracy becomes reachable with far fewer total tokens across model sizes and math benchmarks. This directly improves the practical cost of deploying reasoning models.

Core claim

Stochastic backtracking over a persistent pool of historical prefixes, implemented via subpool selection and Power Backtrack Sequential Monte Carlo with powered PRM scores and mixture-corrected weights, produces higher accuracy per generated token and reaches target accuracy using only a fraction of the tokens required by strong frontier-only PRM baselines on mathematical reasoning tasks.

What carries the argument

Stochastic backtracking over a persistent historical pool of prefixes, using subpool Top-N selection and powered PRM resampling to revisit earlier states.

If this is right

  • Accuracy per token rises consistently across different model scales on math reasoning benchmarks.
  • Target accuracy is reached with substantially fewer total generated tokens than frontier-only PRM search.
  • Diversity is preserved because previously discarded prefixes remain available for later reconsideration.
  • The approach works by combining greedy subpool selection with SMC-style resampling on the full history.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same persistent-pool idea could be tested on non-math domains such as code generation where early mistakes are costly.
  • Managing the historical pool size might become a new hyperparameter that trades memory against further token savings.
  • If PRM noise varies by problem type, the power parameter in resampling could be tuned per domain rather than fixed.

Load-bearing premise

Noisy PRM scores can be safely reused through subpool selection and powered resampling on the historical pool without creating new biases that erase the claimed token savings.

What would settle it

A controlled run on the same benchmarks where the backtracking methods require at least as many tokens as the frontier baselines to reach the same accuracy level.

Figures

Figures reproduced from arXiv: 2605.25143 by Dao Tran, Duc Anh Le, Hung Bui, Ngoc Luu, Quan Pham, Tung Pham.

Figure 1
Figure 1. Figure 1: Schematic comparison of frontier-only search and persistent-pool backtracking. Frontier [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy–compute trade-off on MATH500 using Qwen2.5-7B-Instruct. Each curve varies the number of parallel trajectories from N = 4 to N = 64. Panel (a) reports accuracy versus generated tokens, while panel (b) reports accuracy versus wall-clock runtime. PB-SMC and SPS improve the Pareto front under both compute measures, achieving higher accuracy at a fixed budget or requiring less compute for a target accu… view at source ↗
Figure 3
Figure 3. Figure 3: Runtime comparison across test-time scaling methods on MATH500. Our proposed methods [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
read the original abstract

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces stochastic backtracking over a persistent pool of historical prefixes for test-time scaling of language models on reasoning tasks. It proposes two mechanisms—Subpool Selection (Top-N within random subpools) and Power Backtrack Sequential Monte Carlo (powered PRM scores with mixture-corrected weights)—to enable revisiting previous states instead of frontier-only pruning. The central claim is that these methods achieve higher accuracy per token and the same accuracy with only a fraction of the token count compared to strong PRM-guided baselines across mathematical reasoning benchmarks and model scales.

Significance. If the empirical results hold with proper validation, the work would demonstrate that persistent-pool stochastic backtracking can improve the accuracy-token trade-off in test-time compute, offering a simple alternative to frontier-only PRM methods that suffer from premature commitment and diversity collapse. This could have practical implications for efficient deployment of reasoning models.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that the methods 'consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count' but provides no quantitative results, tables, error bars, ablation studies, or implementation details. This makes the central performance claims unverifiable from the available text.
  2. [Abstract] Abstract: The description of subpool selection and powered SMC resampling does not specify how total token counts are tallied (including revisits, pool maintenance, and mixture corrections). This is load-bearing for the token-efficiency claim, as unaccounted overhead or selection bias from noisy PRM scores could negate the reported savings.
minor comments (1)
  1. The abstract could be strengthened by briefly naming the specific benchmarks and model scales used, to better ground the generality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We agree that the abstract's performance claims would be strengthened by greater specificity and will revise it accordingly. Below we respond point-by-point to the two major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that the methods 'consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count' but provides no quantitative results, tables, error bars, ablation studies, or implementation details. This makes the central performance claims unverifiable from the available text.

    Authors: We agree that the abstract would be more informative with concrete quantitative anchors. In the revised manuscript we will insert brief, specific results (e.g., “+12% accuracy per token on GSM8K and MATH with 40% fewer tokens than the strongest PRM baseline”) together with pointers to the corresponding tables and figures. These numbers are already present in the experimental section; adding them to the abstract does not alter any claims but improves verifiability. revision: yes

  2. Referee: [Abstract] Abstract: The description of subpool selection and powered SMC resampling does not specify how total token counts are tallied (including revisits, pool maintenance, and mixture corrections). This is load-bearing for the token-efficiency claim, as unaccounted overhead or selection bias from noisy PRM scores could negate the reported savings.

    Authors: The full manuscript (Section 3.3 and Appendix B) defines total token count as every token generated by the model, explicitly including tokens produced during backtracking steps, pool re-scoring, and any auxiliary sampling required for mixture-weight correction. Effective sample size after power weighting is also reported. We will add a single clarifying sentence to the abstract (“Token counts include all model generations, including revisits and mixture corrections”) and will ensure the experimental section cross-references the exact accounting procedure. No change to the underlying methodology is required. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical algorithmic contribution with no derivations

full rationale

The paper describes an empirical algorithmic method (stochastic backtracking via subpool selection and Power Backtrack SMC on a persistent pool) for improving test-time scaling accuracy-token trade-offs. No equations, derivations, or first-principles claims appear in the provided abstract or description. Central claims are benchmarked empirical results (higher accuracy per token vs. PRM baselines) that are externally falsifiable and do not reduce by construction to fitted parameters, self-citations, or renamed inputs. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identified or required for the high-level description.

pith-pipeline@v0.9.1-grok · 5774 in / 1024 out tokens · 45856 ms · 2026-06-30T11:03:32.629145+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 9 internal anchors

  1. [1]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V . Chaudhary, C. Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

  2. [2]

    Beeching, L

    E. Beeching, L. Tunstall, and S. Rush. Scaling test-time compute with open models, 2024. Blog post

  3. [3]

    Besta, N

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

  4. [4]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V . Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  5. [5]

    R. S. M. Chan, T. Liu, S. Kiegeland, C. Pasti, J. H. Vigly, T. J. O’Donnell, R. Cotterell, and T. Vieira. Ensembling language models with sequential monte carlo.arXiv preprint arXiv:2603.05432, 2026

  6. [6]

    L. Chen, J. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neural Information Processing Systems, 37:45767–45790, 2024

  7. [7]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  8. [8]

    S. Feng, X. Kong, S. Ma, A. Zhang, D. Yin, C. Wang, R. Pang, and Y . Yang. Step-by-step reasoning for math problems via twisted sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

  9. [9]

    Gandhi, D

    K. Gandhi, D. H. J. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. Goodman. Stream of search (sos): Learning to search in language. InFirst Conference on Language Modeling, 2024

  10. [10]

    Giannone, G

    G. Giannone, G. Xu, N. S. Nayak, R. M. Awhad, S. Sudalairaj, K. Xu, and A. Srivastava. Mitigating premature exploitation in particle-based monte carlo for inference-time scaling. arXiv preprint arXiv:2510.05825, 2025

  11. [11]

    Grand, J

    G. Grand, J. B. Tenenbaum, V . Mansinghka, A. K. Lew, and J. Andreas. Self-steering language models. InSecond Conference on Language Modeling, 2025

  12. [12]

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  13. [13]

    Havrilla, S

    A. Havrilla, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, and R. Raileanu. GLore: When, where, and how to improve LLM reasoning via global and local refinements. InForty-first International Conference on Machine Learning, 2024

  14. [14]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021

  15. [15]

    Hosseini, X

    A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-STar: Training verifiers for self-taught reasoners. InFirst Conference on Language Modeling, 2024

  16. [16]

    Inoue, K

    Y . Inoue, K. Misaki, Y . Imajuku, S. Kuroki, T. Nakamura, and T. Akiba. Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  17. [17]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 10

  18. [18]

    Karamanis and U

    M. Karamanis and U. Seljak. Persistent sampling: Enhancing the efficiency of sequential monte carlo.Statistics and computing, 35(5):144, 2025

  19. [19]

    Karan and Y

    A. Karan and Y . Du. Reasoning with sampling: Your base model is smarter than you think. In The Fourteenth International Conference on Learning Representations, 2026

  20. [20]

    Kumar, V

    A. Kumar, V . Zhuang, R. Agarwal, Y . Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, L. M. Zhang, K. McKinney, D. Shrivastava, C. Paduraru, G. Tucker, D. Precup, F. Behbahani, and A. Faust. Training language models to self-correct via rein- forcement learning. InThe Thirteenth International Conference on Learning Representations, 2025

  21. [21]

    A. K. Lew, T. Zhi-Xuan, G. Grand, and V . Mansinghka. Sequential monte carlo steering of large language models using probabilistic programs. InICML 2023 Workshop: Sampling and Optimization in Discrete Space, 2023

  22. [22]

    Lewkowycz, A

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

  23. [23]

    B. Li, Y . Wang, A. Lochab, A. Grama, and R. Zhang. Cascade reward sampling for efficient decoding-time alignment. InSecond Conference on Language Modeling, 2025

  24. [24]

    Lightman, V

    H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

  25. [25]

    Lipkin, B

    B. Lipkin, B. LeBrun, J. H. Vigly, J. Loula, D. R. MacIver, L. Du, J. Eisner, R. Cotterell, V . Mansinghka, T. J. O’Donnell, A. K. Lew, and T. Vieira. Fast controlled generation from language models with adaptive weighted rejection sampling. InSecond Conference on Language Modeling, 2025

  26. [26]

    R. Liu, J. Gao, J. Zhao, K. Zhang, X. Li, B. Qi, W. Ouyang, and B. Zhou. Can 1b LLM surpass 405b LLM? rethinking compute-optimal test-time scaling. InWorkshop on Reasoning and Planning for Large Language Models, 2025

  27. [27]

    Loula, B

    J. Loula, B. LeBrun, L. Du, B. Lipkin, C. Pasti, G. Grand, T. Liu, Y . Emara, M. Freedman, J. Eisner, R. Cotterell, V . Mansinghka, A. K. Lew, T. Vieira, and T. J. O’Donnell. Syntactic and semantic control of large language models via sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

  28. [28]

    L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, L. Meng, J. Sun, and A. Rastogi. Improve mathematical reasoning in language models with automated process supervision, 2025

  29. [29]

    L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, Y . Zhu, L. Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

  30. [30]

    Macfarlane, M

    M. Macfarlane, M. Kim, N. Jojic, W. Xu, L. Caccia, X. Yuan, W. Zhao, Z. Shi, and A. Sordoni. Instilling parallel reasoning into language models. In2nd AI for Math Workshop@ ICML 2025, 2025

  31. [31]

    Madaan, N

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. InThirty-seventh Conference on Neural Information Processing Systems, 2023

  32. [32]

    American mathematics competitions (amc) 2023

    Mathematical Association of America. American mathematics competitions (amc) 2023. https://maa.org, 2023. AMC 8, AMC 10, and AMC 12 competition problems

  33. [33]

    Aime 2024 problems, 2024

    Mathematical Association of America. Aime 2024 problems, 2024. American Invitational Mathematics Examination. 11

  34. [34]

    Muennighoff, Z

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. J. Candès, and T. Hashimoto. s1: Simple test-time scaling. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-...

  35. [35]

    Nguyen, D

    A. Nguyen, D. Mekala, C. Dong, and J. Shang. When is the consistent prediction likely to be a correct prediction?arXiv preprint arXiv:2407.05778, 2024

  36. [36]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welin- der, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,A...

  37. [37]

    A. B. Owen.Monte Carlo theory, methods and examples. https://artowen.su.domains/ mc/, 2013

  38. [38]

    Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Instance-adaptive inference- time scaling with calibrated process reward models. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025

  39. [39]

    Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Know what you don’t know: Uncertainty calibration of process reward models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  40. [40]

    I. Puri, S. Sudalairaj, G. Xu, A. Bhandwaldar, K. Xu, and A. Srivastava. Rollout roulette: A probabilistic inference approach to inference-time scaling of LLMs using particle-based monte carlo methods. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  41. [41]

    T. Qin, D. Alvarez-Melis, S. Jelassi, and E. Malach. To backtrack or not to backtrack: When sequential search limits model reasoning. InSecond Conference on Language Modeling, 2025

  42. [42]

    Y . Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching language model agents how to self-improve.Advances in Neural Information Processing Systems, 37:55249– 55285, 2024

  43. [43]

    Rohatgi, A

    D. Rohatgi, A. Shetty, D. Saless, Y . Li, A. Moitra, A. Risteski, and D. J. Foster. Taming imperfect process verifiers: A sampling perspective on backtracking. InThe Fourteenth International Conference on Learning Representations, 2026

  44. [44]

    Singh, K

    A. Singh, K. Arora, S. Keh, J. Mercat, T. Hashimoto, C. Finn, and A. Kumar. Improving the efficiency of test-time search in LLMs with backtracking, 2025

  45. [45]

    C. V . Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  46. [46]

    Stiennon, L

    N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

  47. [47]

    Veach and L

    E. Veach and L. J. Guibas. Optimally combining sampling techniques for monte carlo rendering. InProceedings of SIGGRAPH 1995, pages 419–428. ACM Press, 1995

  48. [48]

    P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment, 2024

  49. [49]

    P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024. 12

  50. [50]

    Q. Wang, P. Zhao, S. Huang, F. Yang, L. Wang, F. Wei, Q. Lin, S. Rajmohan, and D. Zhang. Learning to refine: Self-refinement of parallel reasoning in llms.arXiv preprint arXiv:2509.00084, 2025

  51. [51]

    X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

  52. [52]

    Xefteri, T

    V . Xefteri, T. Vieira, R. Cotterell, and A. Amini. Syntactic control of language models by posterior inference. InFindings of the Association for Computational Linguistics: ACL 2025, pages 25350–25365, 2025

  53. [53]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

  54. [54]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

  55. [55]

    Y . Yu, Q. Di, Q. Gu, and D. Zhou. On the limits of test-time compute: Sequential reward filtering for better inference.arXiv preprint arXiv:2512.04558, 2025

  56. [56]

    L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou, Z. Liu, and H. Peng. Free process rewards without process labels. InForty-second International Conference on Machine Learning, 2025

  57. [57]

    Zhang, A

    L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative verifiers: Reward modeling as next-token prediction. InThe Thirteenth International Conference on Learning Representations, 2025

  58. [58]

    Zhang, C

    Z. Zhang, C. Zheng, Y . Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025

  59. [59]

    D. Zhao, S. Tu, and L. Xu. Seea*: Efficient exploration-enhanced a* search by selective sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  60. [60]

    S. Zhao, R. Brekelmans, A. Makhzani, and R. B. Grosse. Probabilistic inference in language models via twisted sequential monte carlo. InInternational Conference on Machine Learning, pages 60704–60748. PMLR, 2024

  61. [61]

    CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

    C. Zheng, J. Zhu, J. Lin, X. Dai, Y . Yu, W. Zhang, and M. Yang. Cold: Counterfactually-guided length debiasing for process reward models.arXiv preprint arXiv:2507.15698, 2025. 13 A Algorithms for Persistent-Pool Search The main text keeps only the master template; the method-specific pseudocode is collected here. All three algorithms use the notation of ...