Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Dao Tran; Duc Anh Le; Hung Bui; Ngoc Luu; Quan Pham; Tung Pham

arxiv: 2605.25143 · v2 · pith:5EKQ5MWNnew · submitted 2026-05-24 · 💻 cs.AI · cs.LG

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Dao Tran , Duc Anh Le , Ngoc Luu , Quan Pham , Tung Pham , Hung Bui This is my paper

Pith reviewed 2026-06-30 11:03 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords test-time scalingstochastic backtrackingprocess reward modelsequential Monte Carlomathematical reasoningpersistent pooltoken efficiency

0 comments

The pith

Stochastic backtracking on a persistent pool of past prefixes raises accuracy per token in test-time scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that test-time search can avoid premature pruning by revisiting historical solution prefixes instead of expanding only the current frontier. Frontier-only PRM methods discard states based on noisy scores and lose paths that could still reach correct answers. Two mechanisms—subpool selection inside random groups and power-resampling over the full historical pool—give earlier prefixes a second chance without exploding compute. If the claim holds, the same final accuracy becomes reachable with far fewer total tokens across model sizes and math benchmarks. This directly improves the practical cost of deploying reasoning models.

Core claim

Stochastic backtracking over a persistent pool of historical prefixes, implemented via subpool selection and Power Backtrack Sequential Monte Carlo with powered PRM scores and mixture-corrected weights, produces higher accuracy per generated token and reaches target accuracy using only a fraction of the tokens required by strong frontier-only PRM baselines on mathematical reasoning tasks.

What carries the argument

Stochastic backtracking over a persistent historical pool of prefixes, using subpool Top-N selection and powered PRM resampling to revisit earlier states.

If this is right

Accuracy per token rises consistently across different model scales on math reasoning benchmarks.
Target accuracy is reached with substantially fewer total generated tokens than frontier-only PRM search.
Diversity is preserved because previously discarded prefixes remain available for later reconsideration.
The approach works by combining greedy subpool selection with SMC-style resampling on the full history.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same persistent-pool idea could be tested on non-math domains such as code generation where early mistakes are costly.
Managing the historical pool size might become a new hyperparameter that trades memory against further token savings.
If PRM noise varies by problem type, the power parameter in resampling could be tuned per domain rather than fixed.

Load-bearing premise

Noisy PRM scores can be safely reused through subpool selection and powered resampling on the historical pool without creating new biases that erase the claimed token savings.

What would settle it

A controlled run on the same benchmarks where the backtracking methods require at least as many tokens as the frontier baselines to reach the same accuracy level.

Figures

Figures reproduced from arXiv: 2605.25143 by Dao Tran, Duc Anh Le, Hung Bui, Ngoc Luu, Quan Pham, Tung Pham.

**Figure 2.** Figure 2: Accuracy–compute trade-off on MATH500 using Qwen2.5-7B-Instruct. Each curve varies the number of parallel trajectories from N = 4 to N = 64. Panel (a) reports accuracy versus generated tokens, while panel (b) reports accuracy versus wall-clock runtime. PB-SMC and SPS improve the Pareto front under both compute measures, achieving higher accuracy at a fixed budget or requiring less compute for a target accu… view at source ↗

**Figure 3.** Figure 3: Runtime comparison across test-time scaling methods on MATH500. Our proposed methods [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

read the original abstract

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Persistent-pool stochastic backtracking is a reasonable engineering response to noisy PRM pruning but the abstract supplies zero data to support the efficiency claims.

read the letter

The paper's main contribution is stochastic backtracking over a persistent historical pool of prefixes instead of limiting search to the current frontier. It adds subpool selection, which runs Top-N inside random subpools to give older prefixes a chance, and power backtrack SMC, which resamples from the pool using powered PRM scores plus mixture-corrected weights.

This directly targets the documented problems with frontier-only methods: early commitment and diversity loss when PRM scores are noisy. The descriptions of both mechanisms are clear and build on standard SMC resampling in a straightforward way.

The paper does a decent job framing why revisiting past states could improve the accuracy-token tradeoff. That part of the argument holds up on its own terms.

The central weakness is the complete lack of numbers. The abstract claims higher accuracy per token and equivalent accuracy at a fraction of the token count across benchmarks and model scales, yet provides no results, ablations, error bars, or details on how total tokens are counted when pool operations and revisits are included. The stress-test point about possible bias from reusing noisy scores and unmeasured overhead from pool maintenance is not addressed at all in the text. Without those checks, the efficiency claims cannot be evaluated.

This is for researchers working on test-time compute for mathematical reasoning models. Someone already running PRM-guided search might want to try the mechanisms if the full paper contains the missing experiments.

I would not send this to peer review in its current state. The idea is coherent but the empirical support is absent, so it does not yet merit referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces stochastic backtracking over a persistent pool of historical prefixes for test-time scaling of language models on reasoning tasks. It proposes two mechanisms—Subpool Selection (Top-N within random subpools) and Power Backtrack Sequential Monte Carlo (powered PRM scores with mixture-corrected weights)—to enable revisiting previous states instead of frontier-only pruning. The central claim is that these methods achieve higher accuracy per token and the same accuracy with only a fraction of the token count compared to strong PRM-guided baselines across mathematical reasoning benchmarks and model scales.

Significance. If the empirical results hold with proper validation, the work would demonstrate that persistent-pool stochastic backtracking can improve the accuracy-token trade-off in test-time compute, offering a simple alternative to frontier-only PRM methods that suffer from premature commitment and diversity collapse. This could have practical implications for efficient deployment of reasoning models.

major comments (2)

[Abstract] Abstract: The abstract asserts that the methods 'consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count' but provides no quantitative results, tables, error bars, ablation studies, or implementation details. This makes the central performance claims unverifiable from the available text.
[Abstract] Abstract: The description of subpool selection and powered SMC resampling does not specify how total token counts are tallied (including revisits, pool maintenance, and mixture corrections). This is load-bearing for the token-efficiency claim, as unaccounted overhead or selection bias from noisy PRM scores could negate the reported savings.

minor comments (1)

The abstract could be strengthened by briefly naming the specific benchmarks and model scales used, to better ground the generality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We agree that the abstract's performance claims would be strengthened by greater specificity and will revise it accordingly. Below we respond point-by-point to the two major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract asserts that the methods 'consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count' but provides no quantitative results, tables, error bars, ablation studies, or implementation details. This makes the central performance claims unverifiable from the available text.

Authors: We agree that the abstract would be more informative with concrete quantitative anchors. In the revised manuscript we will insert brief, specific results (e.g., “+12% accuracy per token on GSM8K and MATH with 40% fewer tokens than the strongest PRM baseline”) together with pointers to the corresponding tables and figures. These numbers are already present in the experimental section; adding them to the abstract does not alter any claims but improves verifiability. revision: yes
Referee: [Abstract] Abstract: The description of subpool selection and powered SMC resampling does not specify how total token counts are tallied (including revisits, pool maintenance, and mixture corrections). This is load-bearing for the token-efficiency claim, as unaccounted overhead or selection bias from noisy PRM scores could negate the reported savings.

Authors: The full manuscript (Section 3.3 and Appendix B) defines total token count as every token generated by the model, explicitly including tokens produced during backtracking steps, pool re-scoring, and any auxiliary sampling required for mixture-weight correction. Effective sample size after power weighting is also reported. We will add a single clarifying sentence to the abstract (“Token counts include all model generations, including revisits and mixture corrections”) and will ensure the experimental section cross-references the exact accounting procedure. No change to the underlying methodology is required. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical algorithmic contribution with no derivations

full rationale

The paper describes an empirical algorithmic method (stochastic backtracking via subpool selection and Power Backtrack SMC on a persistent pool) for improving test-time scaling accuracy-token trade-offs. No equations, derivations, or first-principles claims appear in the provided abstract or description. Central claims are benchmarked empirical results (higher accuracy per token vs. PRM baselines) that are externally falsifiable and do not reduce by construction to fitted parameters, self-citations, or renamed inputs. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identified or required for the high-level description.

pith-pipeline@v0.9.1-grok · 5774 in / 1024 out tokens · 45856 ms · 2026-06-30T11:03:32.629145+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V . Chaudhary, C. Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Beeching, L

E. Beeching, L. Tunstall, and S. Rush. Scaling test-time compute with open models, 2024. Blog post

2024
[3]

Besta, N

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

2024
[4]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V . Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

R. S. M. Chan, T. Liu, S. Kiegeland, C. Pasti, J. H. Vigly, T. J. O’Donnell, R. Cotterell, and T. Vieira. Ensembling language models with sequential monte carlo.arXiv preprint arXiv:2603.05432, 2026

work page arXiv 2026
[6]

L. Chen, J. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neural Information Processing Systems, 37:45767–45790, 2024

2024
[7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

S. Feng, X. Kong, S. Ma, A. Zhang, D. Yin, C. Wang, R. Pang, and Y . Yang. Step-by-step reasoning for math problems via twisted sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[9]

Gandhi, D

K. Gandhi, D. H. J. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. Goodman. Stream of search (sos): Learning to search in language. InFirst Conference on Language Modeling, 2024

2024
[10]

Giannone, G

G. Giannone, G. Xu, N. S. Nayak, R. M. Awhad, S. Sudalairaj, K. Xu, and A. Srivastava. Mitigating premature exploitation in particle-based monte carlo for inference-time scaling. arXiv preprint arXiv:2510.05825, 2025

work page arXiv 2025
[11]

Grand, J

G. Grand, J. B. Tenenbaum, V . Mansinghka, A. K. Lew, and J. Andreas. Self-steering language models. InSecond Conference on Language Modeling, 2025

2025
[12]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Havrilla, S

A. Havrilla, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, and R. Raileanu. GLore: When, where, and how to improve LLM reasoning via global and local refinements. InForty-first International Conference on Machine Learning, 2024

2024
[14]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021

2021
[15]

Hosseini, X

A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-STar: Training verifiers for self-taught reasoners. InFirst Conference on Language Modeling, 2024

2024
[16]

Inoue, K

Y . Inoue, K. Misaki, Y . Imajuku, S. Kuroki, T. Nakamura, and T. Akiba. Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[17]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Karamanis and U

M. Karamanis and U. Seljak. Persistent sampling: Enhancing the efficiency of sequential monte carlo.Statistics and computing, 35(5):144, 2025

2025
[19]

Karan and Y

A. Karan and Y . Du. Reasoning with sampling: Your base model is smarter than you think. In The Fourteenth International Conference on Learning Representations, 2026

2026
[20]

Kumar, V

A. Kumar, V . Zhuang, R. Agarwal, Y . Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, L. M. Zhang, K. McKinney, D. Shrivastava, C. Paduraru, G. Tucker, D. Precup, F. Behbahani, and A. Faust. Training language models to self-correct via rein- forcement learning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[21]

A. K. Lew, T. Zhi-Xuan, G. Grand, and V . Mansinghka. Sequential monte carlo steering of large language models using probabilistic programs. InICML 2023 Workshop: Sampling and Optimization in Discrete Space, 2023

2023
[22]

Lewkowycz, A

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

2022
[23]

B. Li, Y . Wang, A. Lochab, A. Grama, and R. Zhang. Cascade reward sampling for efficient decoding-time alignment. InSecond Conference on Language Modeling, 2025

2025
[24]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024
[25]

Lipkin, B

B. Lipkin, B. LeBrun, J. H. Vigly, J. Loula, D. R. MacIver, L. Du, J. Eisner, R. Cotterell, V . Mansinghka, T. J. O’Donnell, A. K. Lew, and T. Vieira. Fast controlled generation from language models with adaptive weighted rejection sampling. InSecond Conference on Language Modeling, 2025

2025
[26]

R. Liu, J. Gao, J. Zhao, K. Zhang, X. Li, B. Qi, W. Ouyang, and B. Zhou. Can 1b LLM surpass 405b LLM? rethinking compute-optimal test-time scaling. InWorkshop on Reasoning and Planning for Large Language Models, 2025

2025
[27]

Loula, B

J. Loula, B. LeBrun, L. Du, B. Lipkin, C. Pasti, G. Grand, T. Liu, Y . Emara, M. Freedman, J. Eisner, R. Cotterell, V . Mansinghka, A. K. Lew, T. Vieira, and T. J. O’Donnell. Syntactic and semantic control of large language models via sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[28]

L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, L. Meng, J. Sun, and A. Rastogi. Improve mathematical reasoning in language models with automated process supervision, 2025

2025
[29]

L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, Y . Zhu, L. Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Macfarlane, M

M. Macfarlane, M. Kim, N. Jojic, W. Xu, L. Caccia, X. Yuan, W. Zhao, Z. Shi, and A. Sordoni. Instilling parallel reasoning into language models. In2nd AI for Math Workshop@ ICML 2025, 2025

2025
[31]

Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. InThirty-seventh Conference on Neural Information Processing Systems, 2023

2023
[32]

American mathematics competitions (amc) 2023

Mathematical Association of America. American mathematics competitions (amc) 2023. https://maa.org, 2023. AMC 8, AMC 10, and AMC 12 competition problems

2023
[33]

Aime 2024 problems, 2024

Mathematical Association of America. Aime 2024 problems, 2024. American Invitational Mathematics Examination. 11

2024
[34]

Muennighoff, Z

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. J. Candès, and T. Hashimoto. s1: Simple test-time scaling. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-...

2025
[35]

Nguyen, D

A. Nguyen, D. Mekala, C. Dong, and J. Shang. When is the consistent prediction likely to be a correct prediction?arXiv preprint arXiv:2407.05778, 2024

work page arXiv 2024
[36]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welin- der, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,A...

2022
[37]

A. B. Owen.Monte Carlo theory, methods and examples. https://artowen.su.domains/ mc/, 2013

2013
[38]

Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Instance-adaptive inference- time scaling with calibrated process reward models. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025

2025
[39]

Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Know what you don’t know: Uncertainty calibration of process reward models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[40]

I. Puri, S. Sudalairaj, G. Xu, A. Bhandwaldar, K. Xu, and A. Srivastava. Rollout roulette: A probabilistic inference approach to inference-time scaling of LLMs using particle-based monte carlo methods. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[41]

T. Qin, D. Alvarez-Melis, S. Jelassi, and E. Malach. To backtrack or not to backtrack: When sequential search limits model reasoning. InSecond Conference on Language Modeling, 2025

2025
[42]

Y . Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching language model agents how to self-improve.Advances in Neural Information Processing Systems, 37:55249– 55285, 2024

2024
[43]

Rohatgi, A

D. Rohatgi, A. Shetty, D. Saless, Y . Li, A. Moitra, A. Risteski, and D. J. Foster. Taming imperfect process verifiers: A sampling perspective on backtracking. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[44]

Singh, K

A. Singh, K. Arora, S. Keh, J. Mercat, T. Hashimoto, C. Finn, and A. Kumar. Improving the efficiency of test-time search in LLMs with backtracking, 2025

2025
[45]

C. V . Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[46]

Stiennon, L

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

2020
[47]

Veach and L

E. Veach and L. J. Guibas. Optimally combining sampling techniques for monte carlo rendering. InProceedings of SIGGRAPH 1995, pages 419–428. ACM Press, 1995

1995
[48]

P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment, 2024

2024
[49]

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024. 12

2024
[50]

Q. Wang, P. Zhao, S. Huang, F. Yang, L. Wang, F. Wei, Q. Lin, S. Rajmohan, and D. Zhang. Learning to refine: Self-refinement of parallel reasoning in llms.arXiv preprint arXiv:2509.00084, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023
[52]

Xefteri, T

V . Xefteri, T. Vieira, R. Cotterell, and A. Amini. Syntactic control of language models by posterior inference. InFindings of the Association for Computational Linguistics: ACL 2025, pages 25350–25365, 2025

2025
[53]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

2023
[55]

Y . Yu, Q. Di, Q. Gu, and D. Zhou. On the limits of test-time compute: Sequential reward filtering for better inference.arXiv preprint arXiv:2512.04558, 2025

work page arXiv 2025
[56]

L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou, Z. Liu, and H. Peng. Free process rewards without process labels. InForty-second International Conference on Machine Learning, 2025

2025
[57]

Zhang, A

L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative verifiers: Reward modeling as next-token prediction. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[58]

Zhang, C

Z. Zhang, C. Zheng, Y . Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025

2025
[59]

D. Zhao, S. Tu, and L. Xu. Seea*: Efficient exploration-enhanced a* search by selective sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024
[60]

S. Zhao, R. Brekelmans, A. Makhzani, and R. B. Grosse. Probabilistic inference in language models via twisted sequential monte carlo. InInternational Conference on Machine Learning, pages 60704–60748. PMLR, 2024

2024
[61]

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

C. Zheng, J. Zhu, J. Lin, X. Dai, Y . Yu, W. Zhang, and M. Yang. Cold: Counterfactually-guided length debiasing for process reward models.arXiv preprint arXiv:2507.15698, 2025. 13 A Algorithms for Persistent-Pool Search The main text keeps only the master template; the method-specific pseudocode is collected here. All three algorithms use the notation of ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V . Chaudhary, C. Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Beeching, L

E. Beeching, L. Tunstall, and S. Rush. Scaling test-time compute with open models, 2024. Blog post

2024

[3] [3]

Besta, N

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

2024

[4] [4]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V . Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

R. S. M. Chan, T. Liu, S. Kiegeland, C. Pasti, J. H. Vigly, T. J. O’Donnell, R. Cotterell, and T. Vieira. Ensembling language models with sequential monte carlo.arXiv preprint arXiv:2603.05432, 2026

work page arXiv 2026

[6] [6]

L. Chen, J. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neural Information Processing Systems, 37:45767–45790, 2024

2024

[7] [7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

S. Feng, X. Kong, S. Ma, A. Zhang, D. Yin, C. Wang, R. Pang, and Y . Yang. Step-by-step reasoning for math problems via twisted sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[9] [9]

Gandhi, D

K. Gandhi, D. H. J. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and N. Goodman. Stream of search (sos): Learning to search in language. InFirst Conference on Language Modeling, 2024

2024

[10] [10]

Giannone, G

G. Giannone, G. Xu, N. S. Nayak, R. M. Awhad, S. Sudalairaj, K. Xu, and A. Srivastava. Mitigating premature exploitation in particle-based monte carlo for inference-time scaling. arXiv preprint arXiv:2510.05825, 2025

work page arXiv 2025

[11] [11]

Grand, J

G. Grand, J. B. Tenenbaum, V . Mansinghka, A. K. Lew, and J. Andreas. Self-steering language models. InSecond Conference on Language Modeling, 2025

2025

[12] [12]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Havrilla, S

A. Havrilla, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, and R. Raileanu. GLore: When, where, and how to improve LLM reasoning via global and local refinements. InForty-first International Conference on Machine Learning, 2024

2024

[14] [14]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021

2021

[15] [15]

Hosseini, X

A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-STar: Training verifiers for self-taught reasoners. InFirst Conference on Language Modeling, 2024

2024

[16] [16]

Inoue, K

Y . Inoue, K. Misaki, Y . Imajuku, S. Kuroki, T. Nakamura, and T. Akiba. Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[17] [17]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Karamanis and U

M. Karamanis and U. Seljak. Persistent sampling: Enhancing the efficiency of sequential monte carlo.Statistics and computing, 35(5):144, 2025

2025

[19] [19]

Karan and Y

A. Karan and Y . Du. Reasoning with sampling: Your base model is smarter than you think. In The Fourteenth International Conference on Learning Representations, 2026

2026

[20] [20]

Kumar, V

A. Kumar, V . Zhuang, R. Agarwal, Y . Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, L. M. Zhang, K. McKinney, D. Shrivastava, C. Paduraru, G. Tucker, D. Precup, F. Behbahani, and A. Faust. Training language models to self-correct via rein- forcement learning. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[21] [21]

A. K. Lew, T. Zhi-Xuan, G. Grand, and V . Mansinghka. Sequential monte carlo steering of large language models using probabilistic programs. InICML 2023 Workshop: Sampling and Optimization in Discrete Space, 2023

2023

[22] [22]

Lewkowycz, A

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

2022

[23] [23]

B. Li, Y . Wang, A. Lochab, A. Grama, and R. Zhang. Cascade reward sampling for efficient decoding-time alignment. InSecond Conference on Language Modeling, 2025

2025

[24] [24]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024

[25] [25]

Lipkin, B

B. Lipkin, B. LeBrun, J. H. Vigly, J. Loula, D. R. MacIver, L. Du, J. Eisner, R. Cotterell, V . Mansinghka, T. J. O’Donnell, A. K. Lew, and T. Vieira. Fast controlled generation from language models with adaptive weighted rejection sampling. InSecond Conference on Language Modeling, 2025

2025

[26] [26]

R. Liu, J. Gao, J. Zhao, K. Zhang, X. Li, B. Qi, W. Ouyang, and B. Zhou. Can 1b LLM surpass 405b LLM? rethinking compute-optimal test-time scaling. InWorkshop on Reasoning and Planning for Large Language Models, 2025

2025

[27] [27]

Loula, B

J. Loula, B. LeBrun, L. Du, B. Lipkin, C. Pasti, G. Grand, T. Liu, Y . Emara, M. Freedman, J. Eisner, R. Cotterell, V . Mansinghka, A. K. Lew, T. Vieira, and T. J. O’Donnell. Syntactic and semantic control of large language models via sequential monte carlo. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[28] [28]

L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, L. Meng, J. Sun, and A. Rastogi. Improve mathematical reasoning in language models with automated process supervision, 2025

2025

[29] [29]

L. Luo, Y . Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y . Li, L. Shu, Y . Zhu, L. Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Macfarlane, M

M. Macfarlane, M. Kim, N. Jojic, W. Xu, L. Caccia, X. Yuan, W. Zhao, Z. Shi, and A. Sordoni. Instilling parallel reasoning into language models. In2nd AI for Math Workshop@ ICML 2025, 2025

2025

[31] [31]

Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. InThirty-seventh Conference on Neural Information Processing Systems, 2023

2023

[32] [32]

American mathematics competitions (amc) 2023

Mathematical Association of America. American mathematics competitions (amc) 2023. https://maa.org, 2023. AMC 8, AMC 10, and AMC 12 competition problems

2023

[33] [33]

Aime 2024 problems, 2024

Mathematical Association of America. Aime 2024 problems, 2024. American Invitational Mathematics Examination. 11

2024

[34] [34]

Muennighoff, Z

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. J. Candès, and T. Hashimoto. s1: Simple test-time scaling. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-...

2025

[35] [35]

Nguyen, D

A. Nguyen, D. Mekala, C. Dong, and J. Shang. When is the consistent prediction likely to be a correct prediction?arXiv preprint arXiv:2407.05778, 2024

work page arXiv 2024

[36] [36]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welin- der, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,A...

2022

[37] [37]

A. B. Owen.Monte Carlo theory, methods and examples. https://artowen.su.domains/ mc/, 2013

2013

[38] [38]

Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Instance-adaptive inference- time scaling with calibrated process reward models. InNeurIPS 2025 Workshop on Efficient Reasoning, 2025

2025

[39] [39]

Y .-J. Park, K. Greenewald, K. Alim, H. Wang, and N. Azizan. Know what you don’t know: Uncertainty calibration of process reward models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[40] [40]

I. Puri, S. Sudalairaj, G. Xu, A. Bhandwaldar, K. Xu, and A. Srivastava. Rollout roulette: A probabilistic inference approach to inference-time scaling of LLMs using particle-based monte carlo methods. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[41] [41]

T. Qin, D. Alvarez-Melis, S. Jelassi, and E. Malach. To backtrack or not to backtrack: When sequential search limits model reasoning. InSecond Conference on Language Modeling, 2025

2025

[42] [42]

Y . Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching language model agents how to self-improve.Advances in Neural Information Processing Systems, 37:55249– 55285, 2024

2024

[43] [43]

Rohatgi, A

D. Rohatgi, A. Shetty, D. Saless, Y . Li, A. Moitra, A. Risteski, and D. J. Foster. Taming imperfect process verifiers: A sampling perspective on backtracking. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[44] [44]

Singh, K

A. Singh, K. Arora, S. Keh, J. Mercat, T. Hashimoto, C. Finn, and A. Kumar. Improving the efficiency of test-time search in LLMs with backtracking, 2025

2025

[45] [45]

C. V . Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[46] [46]

Stiennon, L

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback.Advances in neural information processing systems, 33:3008–3021, 2020

2020

[47] [47]

Veach and L

E. Veach and L. J. Guibas. Optimally combining sampling techniques for monte carlo rendering. InProceedings of SIGGRAPH 1995, pages 419–428. ACM Press, 1995

1995

[48] [48]

P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment, 2024

2024

[49] [49]

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024. 12

2024

[50] [50]

Q. Wang, P. Zhao, S. Huang, F. Yang, L. Wang, F. Wei, Q. Lin, S. Rajmohan, and D. Zhang. Learning to refine: Self-refinement of parallel reasoning in llms.arXiv preprint arXiv:2509.00084, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023

[52] [52]

Xefteri, T

V . Xefteri, T. Vieira, R. Cotterell, and A. Amini. Syntactic control of language models by posterior inference. InFindings of the Association for Computational Linguistics: ACL 2025, pages 25350–25365, 2025

2025

[53] [53]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

2023

[55] [55]

Y . Yu, Q. Di, Q. Gu, and D. Zhou. On the limits of test-time compute: Sequential reward filtering for better inference.arXiv preprint arXiv:2512.04558, 2025

work page arXiv 2025

[56] [56]

L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou, Z. Liu, and H. Peng. Free process rewards without process labels. InForty-second International Conference on Machine Learning, 2025

2025

[57] [57]

Zhang, A

L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative verifiers: Reward modeling as next-token prediction. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[58] [58]

Zhang, C

Z. Zhang, C. Zheng, Y . Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025

2025

[59] [59]

D. Zhao, S. Tu, and L. Xu. Seea*: Efficient exploration-enhanced a* search by selective sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[60] [60]

S. Zhao, R. Brekelmans, A. Makhzani, and R. B. Grosse. Probabilistic inference in language models via twisted sequential monte carlo. InInternational Conference on Machine Learning, pages 60704–60748. PMLR, 2024

2024

[61] [61]

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

C. Zheng, J. Zhu, J. Lin, X. Dai, Y . Yu, W. Zhang, and M. Yang. Cold: Counterfactually-guided length debiasing for process reward models.arXiv preprint arXiv:2507.15698, 2025. 13 A Algorithms for Persistent-Pool Search The main text keeps only the master template; the method-specific pseudocode is collected here. All three algorithms use the notation of ...

work page internal anchor Pith review Pith/arXiv arXiv 2025