pith. machine review for the scientific record.

arxiv: 2604.17957 · v1 · submitted 2026-04-20 · 💻 cs.CL

Recognition: unknown

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards


Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords Process Reward Models · PDDL · Planning Domain Definition Language · Step-level Rewards · LLM Reasoning · Synthetic Data Generation · Chain of Thought

The pith

Augmenting PRM datasets with PDDL-derived reasoning steps improves performance on mathematical and non-mathematical benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Process reward models give feedback on each step of an LLM's chain of thought rather than only the final answer. Existing training data for these models is expensive to build, often contains annotation mistakes, and stays mostly within mathematics. The paper generates roughly one million reasoning steps from PDDL planning problems across multiple domains and mixes them with standard PRM datasets. Training on the combined data produces clear gains on several reasoning benchmarks that cover both math and non-math tasks. This points to planning problems as a practical, scalable source of precise step-level supervision.
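The labeling idea the paper relies on can be sketched in a few lines: a symbolic planner's world model makes step correctness mechanically checkable. The STRIPS-style encoding, toy Blocksworld actions, and labeling rule below are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: label each step of a plan by whether its
# preconditions hold in the simulated state. The Action encoding and
# the toy domain are assumptions for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def label_steps(initial_state, steps):
    """Return (action_name, is_correct) per step; only valid steps advance the state."""
    state = set(initial_state)
    labels = []
    for act in steps:
        ok = act.preconditions <= state
        labels.append((act.name, ok))
        if ok:
            state = (state - act.del_effects) | act.add_effects
    return labels

# Toy trace with one corrupted step: picking b while already holding a.
pick = Action("pick(a)", frozenset({"clear(a)", "handempty"}),
              frozenset({"holding(a)"}), frozenset({"clear(a)", "handempty"}))
bad = Action("pick(b)", frozenset({"clear(b)", "handempty"}),
             frozenset({"holding(b)"}), frozenset())
print(label_steps({"clear(a)", "clear(b)", "handempty"}, [pick, bad]))
# → [('pick(a)', True), ('pick(b)', False)]
```

Because labels come from the domain semantics rather than an annotator, the precision the paper claims is essentially free once the PDDL problems exist.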

Core claim

Incorporating reasoning steps generated from PDDL planning problems into existing PRM training sets yields substantial improvements in both mathematical and non-mathematical reasoning, as measured across multiple benchmarks. The approach produces a corpus of approximately one million steps that can be used directly for training.

What carries the argument

PDDL-based generation of step-level reasoning traces with explicit correctness labels for training process reward models.

If this is right

  • PRMs can be trained with less reliance on costly human annotations or noisy LLM self-generated data.
  • Planning domains provide a controllable way to create fine-grained supervision signals that transfer to general reasoning.
  • The method extends step-level reward modeling beyond mathematics into broader domains.
  • Larger PDDL corpora could further scale the precision of process supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If PDDL steps reliably simulate LLM error distributions, the same pipeline could generate targeted datasets for specific reasoning failure modes.
  • The approach may generalize to other formal domains such as code or logical puzzles to create additional reward-model training resources.
  • Benchmark gains attributed to data quality would be strengthened by ablations that isolate the PDDL contribution from other training variables.

Load-bearing premise

The distribution of correct and incorrect steps produced by PDDL planning problems closely matches the error patterns that appear in real LLM chains of thought.
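This premise is directly measurable. One hedged way to quantify the match is the total variation distance between empirical error-type distributions from the two sources; the error categories and counts below are invented for illustration, not from the paper.

```python
# Sketch: compare two empirical error-type distributions via total
# variation distance (0 = identical, 1 = disjoint). Categories and
# counts are hypothetical.
from collections import Counter

def tv_distance(errors_a, errors_b):
    """Total variation distance between two lists of categorical error labels."""
    ca, cb = Counter(errors_a), Counter(errors_b)
    na, nb = sum(ca.values()), sum(cb.values())
    types = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in types)

llm_errors  = ["invalid_precondition"] * 40 + ["missing_subgoal"] * 35 + ["arithmetic"] * 25
pddl_errors = ["invalid_precondition"] * 50 + ["missing_subgoal"] * 45 + ["wrong_object"] * 5
print(round(tv_distance(llm_errors, pddl_errors), 3))
# → 0.25
```

A large distance would suggest PDDL-derived negatives teach the PRM to catch mistakes LLMs rarely make, weakening the transfer story.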

What would settle it

Training identical PRMs on the standard datasets versus the PDDL-augmented versions and finding no difference or a drop in benchmark scores would falsify the claim.
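That comparison reduces to a paired test over per-item benchmark correctness. A minimal sketch, assuming 0/1 correctness vectors from the baseline and PDDL-augmented PRMs on the same items (the data here is synthetic):

```python
# Sketch: paired bootstrap over per-item correctness. Returns the
# fraction of resamples where the augmented PRM fails to beat the
# baseline; near 0 supports the claim, near or above 0.5 undercuts it.
import random

def paired_bootstrap(base, aug, n_boot=2000, seed=0):
    rng = random.Random(seed)
    n = len(base)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        if sum(aug[i] - base[i] for i in idx) <= 0:
            not_better += 1
    return not_better / n_boot

# Synthetic illustration: augmented model correct on 10 extra items of 100.
base = [1] * 60 + [0] * 40
aug  = [1] * 70 + [0] * 30
p = paired_bootstrap(base, aug)
```

Pairing on items matters: benchmark-level score deltas alone conflate the PDDL contribution with run-to-run variance.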

Figures

Figures reproduced from arXiv: 2604.17957 by Raffaele Pisano, Roberto Navigli.

Figure 1: End-to-end workflow of the proposed framework: from PDDL problem generation, to dataset construction
Figure 2: Example from the MATH subset of ProcessBench. We report scores from two PRMs based on Llama +
Figure 3: Example from the Medicine subset of MR-Ben. We report scores from two PRMs based on Qwen2.5-
Original abstract

Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes generating large-scale Process Reward Model (PRM) training data from PDDL planning problems, producing approximately one million reasoning steps across multiple domains. It claims that augmenting existing PRM datasets with this PDDL-derived data yields substantial improvements in both mathematical and non-mathematical LLM reasoning, as shown on multiple benchmarks.

Significance. If the empirical gains hold and generalize, the method would offer a scalable, low-cost alternative to human or LLM-generated step-level annotations, extending PRM training beyond the math domain that currently dominates the literature.

major comments (2)
  1. [Abstract] The claim of 'substantial improvements' from PDDL augmentation is presented without any quantitative results, baseline comparisons, statistical details, or description of evaluation metrics, so the data-to-claim link cannot be verified.
  2. [Method and Experiments] The central assumption that PDDL planning traces produce step-level correctness signals whose error patterns (invalid preconditions, missing subgoals, etc.) are representative of the mistakes LLMs make in free-form CoTs is not validated by any error-distribution comparison or ablation that isolates the contribution of PDDL data from added volume or domain coverage.
minor comments (2)
  1. Clarify how individual reasoning steps are extracted and labeled from PDDL traces (e.g., what constitutes a 'step' and how correctness is determined automatically).
  2. Provide the list of PDDL domains used and the exact size of the generated corpus per domain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and valuable feedback on our work regarding the use of PDDL planning problems for generating Process Reward Model training data. We address the major comments point by point below, providing clarifications and outlining revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'substantial improvements' from PDDL augmentation is presented without any quantitative results, baseline comparisons, statistical details, or description of evaluation metrics, so the data-to-claim link cannot be verified.

    Authors: The abstract is intended to provide a concise overview of the paper's contributions and findings. Specific quantitative results, including the exact improvements on mathematical and non-mathematical reasoning benchmarks, baseline comparisons, and evaluation metrics are detailed in the Experiments section of the manuscript. To address this concern and make the data-to-claim link more verifiable from the abstract, we will revise the abstract to include key quantitative highlights and a brief mention of the metrics used. revision: yes

  2. Referee: [Method and Experiments] The central assumption that PDDL planning traces produce step-level correctness signals whose error patterns (invalid preconditions, missing subgoals, etc.) are representative of the mistakes LLMs make in free-form CoTs is not validated by any error-distribution comparison or ablation that isolates the contribution of PDDL data from added volume or domain coverage.

    Authors: We agree that validating the representativeness of error patterns would strengthen the methodological claims. While the current manuscript shows empirical benefits through performance improvements on diverse benchmarks, we acknowledge the lack of direct error-distribution comparisons or volume-controlled ablations. We will add such analyses in the revised version, including a comparison of error types and an ablation to isolate the PDDL data's contribution beyond mere volume increase. revision: yes
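The promised volume-controlled ablation amounts to training on equal-size data mixes so that any gain cannot be attributed to corpus size alone. A minimal sketch of constructing such mixes (dataset names and the 50/50 split are assumptions, not the authors' design):

```python
# Sketch: build equal-size training mixes for a volume-controlled
# ablation. "base" stands for an existing PRM dataset, "pddl" for the
# PDDL-derived steps; both are placeholder string lists here.
import random

def build_mixes(base, pddl, total, seed=0):
    """Return same-size mixes so gains can't be explained by data volume."""
    rng = random.Random(seed)
    half = total // 2
    return {
        "base_only":      rng.sample(base, total),
        "base_plus_pddl": rng.sample(base, total - half) + rng.sample(pddl, half),
    }

base = [f"math_step_{i}" for i in range(1000)]
pddl = [f"pddl_step_{i}" for i in range(1000)]
mixes = build_mixes(base, pddl, total=600)
assert all(len(v) == 600 for v in mixes.values())
```

Training one PRM per mix and comparing benchmark scores then isolates the PDDL contribution from the extra data volume.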

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external PDDL sources and empirical benchmarks

Full rationale

The paper's core chain generates ~1M step-level traces from independent PDDL planning domains, augments existing PRM datasets, trains models, and reports benchmark gains on MATH/GSM8K and non-math tasks. No equations, fitted parameters, or self-citations are described that reduce the claimed improvements to the inputs by construction. Data generation uses external logical problems rather than LLM CoTs or model outputs; results are presented as empirical outcomes of standard training and evaluation. This matches the reader's assessment of no circularity and satisfies the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that PDDL planning problems supply automatically labelable, precise, and representative reasoning steps; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Planning problems expressed in PDDL allow automatic, error-free labeling of individual reasoning steps as correct or incorrect.
    The method depends on this property to generate scalable, precise training data without human annotation.

pith-pipeline@v0.9.0 · 5475 in / 1357 out tokens · 33634 ms · 2026-05-10T04:37:42.638816+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 29 canonical work pages · 10 internal anchors
