CLORE: Content-Level Optimization for Reasoning Efficiency
Pith reviewed 2026-05-22 05:47 UTC · model grok-4.3
The pith
Editing correct reasoning traces to remove repetition improves the accuracy-efficiency trade-off in language model post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLORE improves reasoning efficiency by editing correct on-policy rollouts. An external augmentation model deletes repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch.
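The auxiliary objective described above can be sketched in a few lines. The page does not give the exact loss, so this assumes a common reference-free pairwise form, -log sigmoid(beta * (log p(augmented) - log p(original))), added to the policy-gradient loss with a weight. The names `beta`, `aux_weight`, and both function names are illustrative assumptions, not taken from the paper.

```python
import math

def reference_free_dpo_loss(logp_augmented, logp_original, beta=0.1):
    """Pairwise preference loss with no reference model (assumed form).

    Prefers the augmented (edited) rollout over the original one:
    loss = -log(sigmoid(beta * (logp_augmented - logp_original))).
    """
    margin = beta * (logp_augmented - logp_original)
    # softplus(-margin) equals -log(sigmoid(margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def combined_loss(policy_gradient_loss, pairs, beta=0.1, aux_weight=0.1):
    """Policy-gradient loss plus the weighted mean auxiliary loss.

    `pairs` holds (logp_augmented, logp_original) sequence log-probabilities
    under the current policy. `aux_weight` is a hypothetical mixing weight.
    """
    if not pairs:
        return policy_gradient_loss
    aux = sum(reference_free_dpo_loss(a, o, beta) for a, o in pairs) / len(pairs)
    return policy_gradient_loss + aux_weight * aux
```

Under this assumed form, the loss falls as the policy assigns more probability to the edited rollout than to the original, which is the direction of preference the core claim describes.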
What carries the argument
Content-level editing of correct on-policy rollouts by an external augmentation model, followed by auxiliary reference-free DPO optimization.
If this is right
- The accuracy-efficiency trade-off improves on mathematical reasoning benchmarks for models such as DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B.
- The method remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune.
- Content analyses show reductions in repetitive reasoning, illegible content, and post-answer exploration.
- Content-level supervision acts as a complementary direction to length-level control.
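One simple way to quantify the repetition mentioned in the list above is the fraction of repeated word n-grams in a trace. This is a generic proxy and an assumption on our part; the page does not say which metric the paper's content analyses use, and `repeated_ngram_fraction` is a hypothetical helper.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

Comparing this value before and after training would give a rough check on whether repetitive reasoning actually decreases.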
Where Pith is reading between the lines
- The editing approach could be tested on non-mathematical tasks such as code generation or scientific explanation to check for broader applicability.
- Jointly training the augmentation model with the main policy might reduce dependence on a separate external model.
- Direct content supervision of this kind could be combined with length-based rewards to further refine reasoning traces.
Load-bearing premise
The external augmentation model can reliably delete only repetitive, illegible, or post-answer segments from correct trajectories without introducing factual errors or changing the final answer, and the resulting edited rollouts remain sufficiently close to the policy distribution to avoid harmful off-policy effects.
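Two parts of this premise can be checked mechanically: that an edit only deletes text, and that the final answer is unchanged. The sketch below assumes whitespace tokenization and that the final answer appears in a LaTeX boxed expression, both of which are assumptions rather than details from the paper. It tests necessary conditions only and says nothing about whether the remaining reasoning is still valid.

```python
import re

def is_deletion_only(original_tokens, edited_tokens):
    """True if edited_tokens is a subsequence of original_tokens."""
    remaining = iter(original_tokens)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in edited_tokens)

def final_answer(text):
    """Content of the last boxed expression without nested braces, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def edit_is_safe(original, edited):
    """Deletion-only edit with an unchanged, non-missing final answer."""
    answer = final_answer(original)
    same_answer = answer is not None and answer == final_answer(edited)
    return same_answer and is_deletion_only(original.split(), edited.split())
```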
What would settle it
If running CLORE on the five mathematical reasoning benchmarks produces lower accuracy or worse efficiency metrics than standard RL post-training without the editing step, the central claim would be falsified.
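Because the criterion involves two metrics, a comparison is only clear-cut when one method is no worse on both. A minimal sketch of that dominance check follows, assuming each result is summarized as a hypothetical (accuracy, mean response tokens) pair.

```python
def dominates(a, b):
    """True if result a is no worse than b on both axes and better on one.

    Each result is (accuracy, mean_tokens); higher accuracy and fewer
    tokens are better.
    """
    acc_a, tok_a = a
    acc_b, tok_b = b
    no_worse = acc_a >= acc_b and tok_a <= tok_b
    strictly_better = acc_a > acc_b or tok_a < tok_b
    return no_worse and strictly_better
```

If neither result dominates the other, the outcome is a genuine trade-off and the falsification test needs an agreed way to weigh accuracy against length.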
Original abstract
Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLORE, a content-level optimization framework for LLM reasoning efficiency. It uses an external augmentation model to edit correct on-policy rollouts via local deletions of repetitive, illegible, or post-answer segments (while preserving the final answer), then optimizes the resulting augmented-original pairs with an auxiliary reference-free DPO objective alongside standard policy-gradient training. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks report improved accuracy-efficiency trade-offs and compatibility with GRPO, DAPO, Training Efficient, and ThinkPrune. Content analyses show reductions in repetitive, illegible, and post-answer content.
Significance. If the augmentation preserves reasoning validity, CLORE supplies a complementary content-level supervision signal to existing length-based or reward-shaping approaches for efficient reasoning. The reported compatibility with multiple RL methods and the supporting content-level analyses would strengthen the case for content supervision as a distinct direction.
major comments (2)
- [Abstract and Methods] The central claim that CLORE improves the accuracy-efficiency trade-off via content-level supervision rests on the external augmentation model performing only safe, local deletions on correct trajectories without introducing factual errors, changing solution paths, or altering the final answer. No quantitative checks (edit-induced answer change rate, semantic similarity, or human-rated reasoning quality) are reported on the augmented data, leaving open the possibility that observed gains arise from implicit filtering or length reduction rather than genuine content supervision.
- [Experiments] The abstract reports accuracy-efficiency gains on five benchmarks but provides no details on data splits, run-to-run variance, statistical significance testing, or full hyperparameter settings for the auxiliary DPO component, preventing verification of the robustness of the claimed improvements.
minor comments (2)
- [Methods] Clarify in the methods section how the reference-free DPO loss is exactly formulated and weighted relative to the primary policy-gradient objective.
- [Experiments] Ensure all benchmark results include both accuracy and efficiency metrics (e.g., token count or latency) in the same table for direct trade-off comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results and experimental details.
Point-by-point responses
- Referee: [Abstract and Methods] The central claim that CLORE improves the accuracy-efficiency trade-off via content-level supervision rests on the external augmentation model performing only safe, local deletions on correct trajectories without introducing factual errors, changing solution paths, or altering the final answer. No quantitative checks (edit-induced answer change rate, semantic similarity, or human-rated reasoning quality) are reported on the augmented data, leaving open the possibility that observed gains arise from implicit filtering or length reduction rather than genuine content supervision.
Authors: We agree that quantitative validation of the augmentation process would provide stronger support for the claim that gains stem from content-level supervision rather than incidental effects. By construction, augmentation is restricted to correct on-policy rollouts and limited to local deletions that preserve the final answer, which keeps edited trajectories close to the original distribution. However, we acknowledge the absence of explicit metrics in the current version. In the revised manuscript, we will add an analysis subsection reporting: (i) the edit-induced answer change rate (expected to be zero by construction, but measured across the dataset), (ii) semantic similarity (e.g., embedding cosine similarity) between original and augmented trajectories, and (iii) a small-scale human evaluation of reasoning quality on a random sample of 100 pairs. These additions will appear in the Methods and Experiments sections to address the concern directly.
revision: yes
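The first two proposed checks are straightforward to compute once answers and vector representations are available. The sketch below is illustrative: `answer_change_rate` takes already-extracted answer pairs, and `bag_of_words` is a crude stand-in for the learned embedding the authors mention, used here only so the example runs without an external model.

```python
import math
from collections import Counter

def answer_change_rate(answer_pairs):
    """Fraction of (original_answer, edited_answer) pairs whose answers differ."""
    if not answer_pairs:
        return 0.0
    changed = sum(1 for before, after in answer_pairs if before != after)
    return changed / len(answer_pairs)

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def bag_of_words(text):
    """Word-count vector; a placeholder for a sentence embedding."""
    return Counter(text.lower().split())
```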
- Referee: [Experiments] The abstract reports accuracy-efficiency gains on five benchmarks but provides no details on data splits, run-to-run variance, statistical significance testing, or full hyperparameter settings for the auxiliary DPO component, preventing verification of the robustness of the claimed improvements.
Authors: We concur that additional experimental details are necessary for reproducibility and to demonstrate robustness. In the revised manuscript, we will expand the Experiments section to include: (i) explicit description of training/validation/test splits for each benchmark, (ii) mean and standard deviation of accuracy and efficiency metrics across at least three independent runs with different random seeds, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing CLORE against baselines, and (iv) complete hyperparameter settings for the auxiliary DPO objective, including the DPO beta coefficient, learning rate, batch size, and number of epochs. These details will also be summarized in a new table or appendix for clarity.
revision: yes
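The seed-level summary and paired test named in the response can be sketched with the standard library. The function names are hypothetical, and the inputs are assumed to be per-seed metric values matched by seed between CLORE and a baseline.

```python
import math
import statistics

def summarize(values):
    """Mean and sample standard deviation across runs (needs two or more values)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(treatment, baseline):
    """t statistic for paired samples, with len(treatment) - 1 degrees of freedom."""
    if len(treatment) != len(baseline) or len(treatment) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`. With only three seeds the test has little power, so reporting the per-seed values alongside it is useful.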
Circularity Check
No circularity: CLORE framework is empirically grounded without self-referential derivations
Full rationale
The paper introduces CLORE as a practical framework that augments correct on-policy rollouts via an external model and applies auxiliary DPO alongside standard policy-gradient training. No mathematical derivations, equations, or first-principles predictions are presented that reduce the accuracy-efficiency improvements to fitted parameters defined by the result itself or to self-citation chains. The method explicitly builds on existing techniques (GRPO, DAPO, DPO) with stated assumptions about the augmentation model's behavior, and experimental results on multiple benchmarks provide independent empirical support. The central claims do not collapse into tautological redefinitions or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The external augmentation model deletes only superfluous content while preserving correctness and the final answer.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented–original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.