pith. sign in

arxiv: 2606.05464 · v1 · pith:5XH4K5R6new · submitted 2026-06-03 · 💻 cs.AI

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

Pith reviewed 2026-06-28 05:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords optimization-like reasoningLLM trainingsearch spacespolicy optimizationoffline RLstep-by-step reasoningfeasibility checkerreward shaping
0
0 comments X

The pith

Training LLMs on optimization-style tasks with expanding search spaces improves their step-by-step reasoning for high-value feasible plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OPT*, a family of tasks that require finding high-value feasible plans among many valid alternatives, each equipped with a feasibility checker and evaluator while a complexity parameter grows the search space without new labels. It studies two regimes: solver-guided online policy optimization that uses a solver as value oracle and applies rank-based reward shaping, and search-based offline RL when solvers are absent. Theoretical analysis links success in large spaces to information extracted per unit search budget. Empirically, ablations isolate the ingredients that make search efficient, and training on OPT* demonstrably improves optimization-like reasoning.

Core claim

Training on OPT* improves step-by-step optimization-like reasoning, as shown by ablations of search-efficiency ingredients and results from the two regimes.

What carries the argument

OPT*, a scalable family of optimization-style tasks that supplies a feasibility checker and evaluator while a complexity parameter expands the search space.

If this is right

  • Solver-guided online policy optimization with rank-based reward shaping reinforces better next steps using the solver as value oracle.
  • Search-based offline RL enables training when solvers are unavailable.
  • Ablations identify the specific ingredients required for efficient search over expanding spaces.
  • Success scales with the information a reasoner extracts per unit of search budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The OPT* construction could be adapted to generate synthetic data for domains where real labels are scarce but feasibility and value can be checked programmatically.
  • The per-budget information extraction relation suggests a way to allocate compute between model size and search depth during inference.
  • Hybrid training that alternates between the two regimes might reduce dependence on external solvers while retaining their guidance.

Load-bearing premise

The feasibility checker and evaluator supplied by each OPT* task sufficiently capture the structure of real-world tasks that require selecting high-value feasible plans among many valid alternatives.

What would settle it

A controlled test in which models trained on OPT* show no gain over baselines when evaluated on a new planning domain whose checker and evaluator encode different structural constraints.

Figures

Figures reproduced from arXiv: 2606.05464 by Mihaela van der Schaar, Nabeel Seedat, Nicol\'as Astorga.

Figure 1
Figure 1. Figure 1: OPT⋆ overview using a Traveling Salesman-style example. (a) Tasks can be instantiated by humans, programs, or LLM-assisted generators. (b1) Classical optimization problems naturally expose objectives and constraints, making them auto-verifiable. (b2) Difficulty can be scaled by changing a complexity parameter α, such as the number of customers or the structure of the instance. Increasing α enlarges the sea… view at source ↗
Figure 2
Figure 2. Figure 2: Traditional optimization tasks. Top: Task description detailing objectives, constraints, required reasoning, and scaling behavior with α. Bottom: Solver experimental results averaged over eight runs. Plots illustrate search space growth, the number of unique solutions, and the fraction representing the top-5 solutions as a function of complexity parameter α. Traditional optimization tasks. Mathematical opt… view at source ↗
Figure 3
Figure 3. Figure 3: Cols 1-4: Distribution of solution quality (histograms). Col 5: Accuracy on Knapsack problems across four αi. The νMCTS method (orange) approaches solver-guided performance (green), particularly at lower complexities. This suggests that verification-guided self-improvement can recover much of the gain from solver-guided supervision. checker or duplicate merging increases the effective branching factor and … view at source ↗
Figure 4
Figure 4. Figure 4: 2D Spatial task. Many skills can be learned by solving optimization problems. Left: Solving polyomino tasks improves the capabilities required for the task. Right: Pass@8 accuracy (%) on polyomino and spatial capability tasks when training sets are generated using different strategies over αi. The mixed α1–α4 set contains 4k examples total, with 1k per difficulty; the α1-only and α4-only sets each contain … view at source ↗
Figure 5
Figure 5. Figure 5: Polyomino target cover examples. This shows the optimal solution found by the solver. [PITH_FULL_IMAGE:figures/full_fig_p039_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a) Qwen2.5-3B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p048_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: (b) Qwen2.5-7B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p048_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: (c) Llama3.2-3B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p048_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Accuracy and solution-quality histograms for baseline, pure MCTS, and solver-guided [PITH_FULL_IMAGE:figures/full_fig_p052_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Effective branching factor versus OPT⋆ difficulty level, separated by task and search family. Each column is a task, and each row is a search family. The y-axis is logarithmic, with lower values indicating better search concentration. Across tasks, the full S1 configuration achieves the lowest or near-lowest beff among model-based generators, while removing checking/pruning or duplicate merging increases … view at source ↗
Figure 12
Figure 12. Figure 12: Global search-only metrics across the OPT [PITH_FULL_IMAGE:figures/full_fig_p054_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: QAP-focused search metrics. QAP follows the same component pattern as the global [PITH_FULL_IMAGE:figures/full_fig_p054_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Proposal-level component diagnostics. The full S1 configuration retains fewer raw [PITH_FULL_IMAGE:figures/full_fig_p055_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Sample-complexity and crossover readouts. Lower [PITH_FULL_IMAGE:figures/full_fig_p055_15.png] view at source ↗
read the original abstract

Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces OPT*, a scalable family of optimization-style tasks for LLMs that supply explicit feasibility checkers and evaluators along with a tunable complexity parameter to expand search spaces without new human labels. It examines two regimes—solver-guided online policy optimization with rank-based reward shaping and search-based offline RL—provides a theoretical relation linking success in large spaces to information extracted per search budget, and reports empirical ablations showing that training on OPT* improves step-by-step optimization-like reasoning.

Significance. If the empirical results and ablations hold under scrutiny, the work supplies a new, controllable benchmark family for optimization-like reasoning that extends beyond math and coding domains, together with a theoretical framing of search efficiency. The ablations on search-efficiency ingredients and the two-regime design are concrete strengths that could support follow-on work on plan selection.

major comments (1)
  1. [Abstract and OPT* task definition] The central motivation—that improvements on OPT* will transfer to real-world tasks requiring selection of high-value feasible plans among many valid alternatives—rests on the assumption that the supplied feasibility checkers and evaluators induce the same reasoning demands as tasks lacking such oracles. This assumption is load-bearing for the broader claim yet is not tested; all reported regimes and ablations operate inside the oracle-provided setting (Abstract; description of the two regimes).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract and OPT* task definition] The central motivation—that improvements on OPT* will transfer to real-world tasks requiring selection of high-value feasible plans among many valid alternatives—rests on the assumption that the supplied feasibility checkers and evaluators induce the same reasoning demands as tasks lacking such oracles. This assumption is load-bearing for the broader claim yet is not tested; all reported regimes and ablations operate inside the oracle-provided setting (Abstract; description of the two regimes).

    Authors: We acknowledge that the empirical regimes operate with provided feasibility checkers and evaluators (enabling verifiable rewards) and that the solver-guided regime additionally uses a solver oracle. The search-based offline RL regime removes the solver but retains the checkers/evaluators. We agree that direct empirical testing of transfer to settings with no oracles at all is absent and constitutes a limitation for the broader transfer claim. The OPT* design isolates optimization-like reasoning under verifiable signals, analogous to math/coding benchmarks; we will revise the abstract and regime descriptions to explicitly scope the claims, clarify oracle dependence, and flag oracle-free transfer as future work. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper defines OPT* tasks independently via explicit feasibility checkers, evaluators, and a tunable complexity parameter that expands search space without new labels. It then evaluates two regimes (solver-guided online policy optimization and search-based offline RL) and reports ablations showing performance gains on these tasks. The theoretical relation between search budget and extracted information is stated as a relation rather than derived from fitted parameters or self-citations. No equations or claims reduce the reported improvements to the task definitions by construction, and no self-citation chains or ansatzes are invoked as load-bearing for the central empirical result. The derivation is therefore self-contained against the introduced benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the contribution centers on task design and training procedures.

pith-pipeline@v0.9.1-grok · 5709 in / 990 out tokens · 62274 ms · 2026-06-28T05:44:15.132413+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associ...

  2. [2]

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. InInternational Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM

  3. [3]

    Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

    Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

  4. [4]

    Re- act: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- act: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  5. [5]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  6. [6]

    Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023. URLhttps://arxiv.org/abs/2308.01825

  7. [7]

    Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri...

  8. [8]

    Teaching large language models to self-debug.arXiv preprint arXiv:2304.05128, 2023

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug.arXiv preprint arXiv:2304.05128, 2023

  9. [9]

    Eric Zelikman, Georges Raif Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking. InConference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=oRXPiSOGH9

  10. [10]

    V-STaR: Training verifiers for self-taught reasoners

    Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-STaR: Training verifiers for self-taught reasoners. InConference on Language Modeling, 2024. URL https://openreview.net/forum?id=stmqBSW2dV

  11. [11]

    Symbolic chain-of-thought distillation: Small models can also “think” step-by-step

    Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also “think” step-by-step. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2665–

  12. [12]

    doi: 10.18653/v1/2023.acl-long.150

    Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.150. URL https://aclanthology.org/2023.acl-long.150/

  13. [13]

    Bridging offline and online reinforcement learning for llms.arXiv preprint arXiv:2506.21495, 2025

    Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, et al. Bridging offline and online reinforcement learning for llms.arXiv preprint arXiv:2506.21495, 2025

  14. [14]

    Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

  15. [15]

    Mastering the game of go without human knowledge.nature, 550(7676):354–359, 2017

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge.nature, 550(7676):354–359, 2017

  16. [16]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 10

  17. [17]

    Group sequence policy optimization, 2025

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  18. [18]

    Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  19. [19]

    Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  20. [20]

    Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.arXiv preprint arXiv:2505.19914, 2025

    Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, et al. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.arXiv preprint arXiv:2505.19914, 2025

  21. [21]

    Logicpuzzlerl: Cultivating robust mathematical reasoning in llms via reinforcement learning.arXiv preprint arXiv:2506.04821, 2025

    Zhen Hao Wong, Jingwen Deng, Runming He, Zirong Chen, Qijie You, Hejun Dong, Hao Liang, Chengyu Shen, Bin Cui, and Wentao Zhang. Logicpuzzlerl: Cultivating robust mathematical reasoning in llms via reinforcement learning.arXiv preprint arXiv:2506.04821, 2025

  22. [22]

    Satbench: Benchmarking llms’ logical reasoning via automated puzzle generation from sat formulas.arXiv preprint arXiv:2505.14615, 2025

    Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, and Alex Aiken. Satbench: Benchmarking llms’ logical reasoning via automated puzzle generation from sat formulas.arXiv preprint arXiv:2505.14615, 2025

  23. [23]

    Autologi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models.arXiv preprint arXiv:2502.16906, 2025

    Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, and Junyang Lin. Autologi: Automated generation of logic puzzles for evaluating reasoning abilities of large language models.arXiv preprint arXiv:2502.16906, 2025

  24. [24]

    Nature625, 476–482 (2024).https://doi.org/ 10.1038/s41586-023-06747-5

    Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 2024. doi: 10.1038/s41586-023-06747-5

  25. [25]

    Ai achieves silver-medal standard solving in- ternational mathematical olympiad problems

    AlphaProof and AlphaGeometry teams. Ai achieves silver-medal standard solving in- ternational mathematical olympiad problems. https://deepmind.google/discover/blog/ ai-solves-imo-problems-at-silver-medal-level/, July 2024. Accessed 2025-09-25

  26. [26]

    How to get your llm to generate challenging problems for evaluation.arXiv preprint arXiv:2502.14678, 2025

    Arkil Patel, Siva Reddy, and Dzmitry Bahdanau. How to get your llm to generate challenging problems for evaluation.arXiv preprint arXiv:2502.14678, 2025

  27. [27]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  28. [28]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. InAdvances in Neural Information Processing Systems, 2021. arXiv:2105.09938

  29. [29]

    Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  30. [30]

    Competition-Level Code Generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Ec- cles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Mas- son d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, P...

  31. [31]

    REASONING GYM: Reasoning environments for reinforcement learning with verifiable rewards, 2025

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. REASONING GYM: Reasoning environments for reinforcement learning with verifiable rewards, 2025. URLhttps://arxiv.org/abs/2505.24760. 11

  32. [32]

    InternBootcamp technical report: Boosting LLM reasoning with verifiable task scaling.arXiv preprint arXiv:2508.08636, 2025

    Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Xiaozhe Li, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, and Kai Chen. InternBootcamp technical report: Boosting LLM reasoning with verifiable task scaling.arXiv preprint arXiv:2508.08636, 2025. URLhttps://arxiv.org/abs...

  33. [33]

    Analysing mathematical reasoning abilities of neural models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. InInternational Conference on Learning Representations (ICLR), 2019. arXiv:1904.01557

  34. [34]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  35. [35]

    Griffiths, Yuan Cao, and Karthik Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems, volume 36, 2023

  36. [36]

    Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training.arXiv preprint arXiv:2309.17179, 2023

    Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training.arXiv preprint arXiv:2309.17179, 2023. URLhttps://arxiv.org/abs/2309.17179

  37. [37]

    Monte carlo tree search boosts reasoning via iterative preference learning.arXiv preprint arXiv:2405.00451, 2024

    Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning.arXiv preprint arXiv:2405.00451, 2024

  38. [38]

    Alphamath almost zero: process supervision without process, 2024

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process, 2024. URLhttps://arxiv.org/abs/2405.03553

  39. [39]

    Step-level value preference optimization for mathematical reasoning

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 7889–7903, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.463. URL https://aclanthology....

  40. [40]

    Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

  41. [41]

    Rest-mcts*: Llm self-training via process reward guided tree search.arXiv preprint arXiv:2406.03816, 2024

    Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search.arXiv preprint arXiv:2406.03816, 2024

  42. [42]

    Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b.arXiv preprint arXiv:2406.07394, 2024

    Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, and Wanli Ouyang. Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b.arXiv preprint arXiv:2406.07394, 2024

  43. [43]

    rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519, 2025

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519, 2025

  44. [44]

    Graph of thoughts: Solving elaborate problems with large language models.Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models.Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024

  45. [45]

    Bandit Based Monte-Carlo Planning

    Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. InMachine Learning: ECML 2006, pages 282–293. Springer, 2006. doi: 10.1007/11871842_29

  46. [46]

    Efficient selectivity and backup operators in Monte-Carlo tree search

    Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In H. Jaap van den Herik, Paolo Ciancarini, and H. H. L. M. Donkers, editors,Computers and Games: CG 2006, volume 4630 ofLecture Notes in Computer Science, pages 72–83. Springer, 2007. doi: 10.1007/978-3-540-75538-8_7

  47. [47]

    Browne, Edward Powley, Daniel Whitehouse, Simon M

    Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods.IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43,

  48. [48]

    doi: 10.1109/TCIAIG.2012.2186810. 12

  49. [49]

    Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models, 2025

    Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and Nick Haber. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models, 2025. URL https://arxiv.org/abs/2502. 17387

  50. [50]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Conference on Computer Systems. ACM, 2025. doi: 10.1145/3689031.3696075

  51. [51]

    Qwen2.5 technical report, 2024

    Qwen Team. Qwen2.5 technical report, 2024. URLhttps://arxiv.org/abs/2412.15115

  52. [52]

    Llama 3.2 3b instruct model card

    Meta. Llama 3.2 3b instruct model card. Hugging Face model card, 2024. URL https://huggingface. co/meta-llama/Llama-3.2-3B-Instruct

  53. [53]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Con- ference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  54. [54]

    American mathematics competitions (AMC) 2023

    Mathematical Association of America. American mathematics competitions (AMC) 2023. Mathematical Association of America, 2023. URL https://maa.org/student-programs/amc/. Accessed 2026-06- 03

  55. [55]

    American invitational mathematics examination (AIME) 2024

    Mathematical Association of America. American invitational mathematics examination (AIME) 2024. Mathematical Association of America, 2024. URL https://maa.org/ maa-invitational-competitions/. Accessed 2026-06-03

  56. [56]

    American invitational mathematics examination (AIME) 2025

    Mathematical Association of America. American invitational mathematics examination (AIME) 2025. Mathematical Association of America, 2025. URL https://maa.org/ maa-invitational-competitions/. Accessed 2026-06-03. 13 Appendix Index Quick guide to the appendix Ref. Content AInformation-theoretic formalization of reasoning on OPT⋆ A.1 Effective optima and sc...

  57. [57]

    a prompt function that converts the current structured state into a natural-language prompt

  58. [58]

    a JSON action schema, specified by the required action keys

  59. [59]

    a step validator, which implementschk t(s, a)

  60. [60]

    a state-transition rule, which applies a valid action to the current state

  61. [61]

    a terminal-state predicate

  62. [62]

    facility

    an objective function, which implementseval t(·)on terminal states. The model is asked to provide a short reasoning trace and then exactly one action. In the default experiments, the action is parsed from {"answer": [{...}]}. The parser also records whether the response was valid JSON, whether it contained the required keys, and whether the parsed action ...

  63. [63]

    the task has not already been selected

  64. [64]

    the worker has not already been used

  65. [65]

    the worker is eligible for the task

  66. [66]

    all resource budgets remain satisfied

  67. [67]

    "=empty,

    all hard Boolean clauses remain satisfied. The rollout terminates when no additional task-worker pair can be added without violating these conditions. Objective and tie-breakThe learning objective is to maximize the total weight of satisfied soft clauses. For oracle reporting, ties are broken lexicographically: first minimize resource usage in the fixed r...

  68. [68]

    Assign exactly one candidate to each role

  69. [69]

    Each candidate can be used at most once

  70. [70]

    Maximize X chosen fit scores − X conflict penalties among selected candidates

  71. [71]

    answer": [ {

    Tie-breaker: among equal total scores, prefer the assignment with the highest minimum individual fit. Current state • Number of roles:6 • Number of candidates:7 • Already assigned role-candidate pairs:[] Fit matrix Rows are candidates and columns are roles. candidate 0: role scores {0: 9, 1: 9, 2: 1, 3: 3, 4: 8, 5: 3} candidate 1: role scores {0: 6, 1: 4,...

  72. [72]

    Maximize the total weight of satisfied soft clauses, subject to budgets, worker eligibil- ity, and hard logic

  73. [73]

    Tie-breaker A: among equal soft-clause scores, minimize resource usage lexicographi- cally

  74. [74]

    answer": [ {

    Tie-breaker B: among remaining ties, prefer fewer selected tasks. Budgets Money= 12,Time= 11. Workers worker_index 0: W1 worker_index 1: W2 worker_index 2: W3 worker_index 3: W4 Tasks task_index 0, A: costs {Money: 2, Time: 1}, eligible workers [2] task_index 1, B: costs {Money: 2, Time: 4}, eligible workers [1] task_index 2, C: costs {Money: 4, Time: 3},...