Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
Pith reviewed 2026-05-10 04:37 UTC · model grok-4.3
The pith
Augmenting PRM datasets with PDDL-derived reasoning steps improves performance on mathematical and non-mathematical benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Incorporating reasoning steps generated from PDDL planning problems into existing PRM training sets yields substantial improvements in both mathematical and non-mathematical reasoning, as measured across multiple benchmarks. The approach produces a corpus of approximately one million steps that can be used directly for training.
What carries the argument
PDDL-based generation of step-level reasoning traces with explicit correctness labels for training process reward models.
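The mechanism can be made concrete with a minimal sketch. In STRIPS-style PDDL, a step is an action application, and its correctness is decidable by checking whether the action's preconditions hold in the current state. The action and predicate names below are an invented Blocksworld-like fragment for illustration, not the paper's actual pipeline.

```python
# Minimal sketch: labeling reasoning steps via STRIPS-style
# precondition checks. Actions and facts are hypothetical.

def is_applicable(state, action):
    """A step is 'correct' iff the action's preconditions hold."""
    return action["pre"].issubset(state)

def apply_action(state, action):
    """Progress the state: remove delete effects, insert add effects."""
    return (state - action["del"]) | action["add"]

def label_trace(init_state, trace):
    """Walk a candidate plan and emit (step, correct?) pairs.
    Invalid steps do not advance the simulated state."""
    state, labels = set(init_state), []
    for action in trace:
        ok = is_applicable(state, action)
        labels.append((action["name"], ok))
        if ok:
            state = apply_action(state, action)
    return labels

# Toy Blocksworld-like fragment
pickup_a = {"name": "pickup(A)",
            "pre": {"clear(A)", "ontable(A)", "handempty"},
            "add": {"holding(A)"},
            "del": {"ontable(A)", "clear(A)", "handempty"}}
stack_ab = {"name": "stack(A,B)",
            "pre": {"holding(A)", "clear(B)"},
            "add": {"on(A,B)", "handempty"},
            "del": {"holding(A)", "clear(B)"}}

init = {"ontable(A)", "clear(A)", "clear(B)", "handempty"}
print(label_trace(init, [pickup_a, stack_ab]))
# → [('pickup(A)', True), ('stack(A,B)', True)]
print(label_trace(init, [stack_ab, pickup_a]))
# → [('stack(A,B)', False), ('pickup(A)', True)]
```

The second trace shows the key property: an out-of-order step is flagged automatically, with no human annotation, which is what makes the labeling "error-free" by construction.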
If this is right
- PRMs can be trained with less reliance on costly human annotations or noisy LLM self-generated data.
- Planning domains provide a controllable way to create fine-grained supervision signals that transfer to general reasoning.
- The method extends step-level reward modeling beyond mathematics into broader domains.
- Larger PDDL corpora could further scale the precision of process supervision.
Where Pith is reading between the lines
- If PDDL steps reliably simulate LLM error distributions, the same pipeline could generate targeted datasets for specific reasoning failure modes.
- The approach may generalize to other formal domains such as code or logical puzzles to create additional reward-model training resources.
- Benchmark gains attributed to data quality would be strengthened by ablations that isolate the PDDL contribution from other training variables.
Load-bearing premise
The distribution of correct and incorrect steps produced by PDDL planning problems closely matches the error patterns that appear in real LLM chains of thought.
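One way to probe this premise is to compare the frequency of error types in PDDL-generated negative steps against error types annotated in real LLM chains of thought. The sketch below uses invented category names and placeholder counts; only the comparison method (total variation distance between the two empirical distributions) is the point.

```python
# Illustrative check of the load-bearing premise: compare error-type
# distributions from PDDL-generated steps vs. annotated LLM CoTs.
# Categories and counts are hypothetical placeholders.

from collections import Counter

def error_distribution(labels):
    """Normalize a list of error-type labels into frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """TV distance in [0, 1]; 0 means identical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

pddl_errors = (["invalid_precondition"] * 60
               + ["missing_subgoal"] * 30
               + ["wrong_object"] * 10)
llm_errors = (["invalid_precondition"] * 40
              + ["missing_subgoal"] * 25
              + ["arithmetic_slip"] * 20
              + ["wrong_object"] * 15)

tv = total_variation(error_distribution(pddl_errors),
                     error_distribution(llm_errors))
print(f"TV distance: {tv:.2f}")  # → TV distance: 0.25
```

A large distance (here inflated by `arithmetic_slip`, an error mode PDDL traces cannot produce) would undercut the premise even if benchmark gains persist.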
What would settle it
Training identical PRMs on the standard datasets versus the PDDL-augmented versions and finding no difference or a drop in benchmark scores would falsify the claim.
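The falsification test above can be sketched as a paired comparison: score the same benchmark problems with the baseline-trained and augmented-trained PRMs, then bootstrap the per-problem gap. The score vectors below are synthetic stand-ins, not results from the paper.

```python
# Sketch of the proposed falsification test: two otherwise identical
# PRMs (baseline vs. PDDL-augmented training data) scored on the same
# problems, with a paired bootstrap on the gap. Scores are synthetic.

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples where the augmented model wins on average.
    Values near 0.5 would support the null (no real difference)."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n_resamples

baseline  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # per-problem correctness
augmented = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]
print(paired_bootstrap(baseline, augmented))
```

A win fraction hovering near 0.5 across benchmarks, or below it, is exactly the outcome that would falsify the claim; a paired design is needed because the two models see identical problems.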
Original abstract
Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes generating large-scale Process Reward Model (PRM) training data from PDDL planning problems, producing approximately one million reasoning steps across multiple domains. It claims that augmenting existing PRM datasets with this PDDL-derived data yields substantial improvements in both mathematical and non-mathematical LLM reasoning, as shown on multiple benchmarks.
Significance. If the empirical gains hold and generalize, the method would offer a scalable, low-cost alternative to human or LLM-generated step-level annotations, extending PRM training beyond the math domain that currently dominates the literature.
major comments (2)
- [Abstract] The claim of 'substantial improvements' from PDDL augmentation is presented without quantitative results, baseline comparisons, statistical details, or a description of evaluation metrics, so the data-to-claim link cannot be verified.
- [Method and Experiments] The central assumption, that PDDL planning traces produce step-level correctness signals whose error patterns (invalid preconditions, missing subgoals, etc.) are representative of the mistakes LLMs make in free-form CoTs, is validated neither by an error-distribution comparison nor by an ablation that isolates the contribution of PDDL data from added volume or domain coverage.
minor comments (2)
- Clarify how individual reasoning steps are extracted and labeled from PDDL traces (e.g., what constitutes a 'step' and how correctness is determined automatically).
- Provide the list of PDDL domains used and the exact size of the generated corpus per domain.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and valuable feedback on our work regarding the use of PDDL planning problems for generating Process Reward Model training data. We address the major comments point by point below, providing clarifications and outlining revisions where appropriate.
Point-by-point responses
Referee: [Abstract] The claim of 'substantial improvements' from PDDL augmentation is presented without quantitative results, baseline comparisons, statistical details, or a description of evaluation metrics, so the data-to-claim link cannot be verified.
Authors: The abstract is intended as a concise overview of the paper's contributions and findings. Specific quantitative results, including the exact improvements on mathematical and non-mathematical reasoning benchmarks, baseline comparisons, and evaluation metrics, are detailed in the Experiments section. To make the data-to-claim link verifiable from the abstract itself, we will revise it to include key quantitative highlights and a brief mention of the metrics used. revision: yes
Referee: [Method and Experiments] The central assumption, that PDDL planning traces produce step-level correctness signals whose error patterns (invalid preconditions, missing subgoals, etc.) are representative of the mistakes LLMs make in free-form CoTs, is validated neither by an error-distribution comparison nor by an ablation that isolates the contribution of PDDL data from added volume or domain coverage.
Authors: We agree that validating the representativeness of error patterns would strengthen the methodological claims. While the current manuscript demonstrates empirical benefits through performance improvements on diverse benchmarks, we acknowledge the lack of direct error-distribution comparisons and volume-controlled ablations. We will add such analyses in the revision, including a comparison of error types and an ablation that isolates the PDDL data's contribution beyond mere volume increase. revision: yes
Circularity Check
No significant circularity; the derivation relies on external PDDL sources and empirical benchmarks.
Full rationale
The paper's core chain generates roughly one million step-level traces from independent PDDL planning domains, augments existing PRM datasets, trains models, and reports benchmark gains on MATH/GSM8K and non-math tasks. No equations, fitted parameters, or self-citations are described that would reduce the claimed improvements to the inputs by construction. Data generation uses external logical problems rather than LLM CoTs or model outputs, and results are presented as empirical outcomes of standard training and evaluation. This matches the reader's assessment of no circularity and satisfies the criteria for a self-contained, non-circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: planning problems expressed in PDDL allow automatic, error-free labeling of individual reasoning steps as correct or incorrect.