Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Pith reviewed 2026-05-18 22:23 UTC · model grok-4.3
The pith
Length Controlled Preference Optimization shortens large reasoning model outputs by over 50% without accuracy loss
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LCPO can effectively learn length preference with limited data and training. By analyzing generation path distributions, filtering trajectories, and balancing the implicit reward related to NLL loss under the Bradley-Terry framework, the method reduces average output length of LRMs by over 50% across multiple benchmarks while maintaining reasoning performance.
What carries the argument
Length Controlled Preference Optimization (LCPO), a preference objective that directly balances the implicit reward related to NLL loss to enforce shorter reasoning trajectories from filtered data.
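As a rough illustration of that objective, the paper passage quoted further down this page gives the per-pair loss as -λ log σ(log-odds(y_w) - log-odds(y_l)), where log-odds(y) = log(p_θ(y|x) / (1 - p_θ(y|x))). The sketch below implements only that formula for a single pair. The function names are hypothetical, and how p_θ(y|x) is obtained from token log-probabilities (for example, whether it is length-normalised) is an assumption left to the caller, since this page does not say.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(logp: float) -> float:
    """Return log(p / (1 - p)) given logp = log p, for p strictly inside (0, 1)."""
    # -expm1(logp) equals 1 - p and stays accurate when p is close to 0.
    return logp - math.log(-math.expm1(logp))


def lcpo_pair_loss(logp_chosen: float, logp_rejected: float, lam: float = 1.0) -> float:
    """Per-pair loss in the form quoted on this page (hypothetical helper).

    logp_chosen:   log p_theta(y_w | x) for the preferred (shorter) trajectory
    logp_rejected: log p_theta(y_l | x) for the dispreferred (longer) trajectory
    lam:           the weight lambda; its value here is an assumption
    """
    margin = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -lam * log_sigmoid(margin)


# The loss shrinks as the model puts more probability on the chosen trajectory.
print(lcpo_pair_loss(math.log(0.5), math.log(0.5)))
print(lcpo_pair_loss(math.log(0.8), math.log(0.2)))
```

Working in log space avoids underflow when sequence probabilities are very small, which is the usual situation for long reasoning traces.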
Where Pith is reading between the lines
- The same filtering-plus-balancing pattern could be tested on non-reasoning generation tasks such as summarization or code completion.
- Length control via NLL reward balancing might combine with other efficiency techniques like quantization without further accuracy loss.
- If the Bradley-Terry convergence insight holds more broadly, similar small-scale preference methods could prune other forms of verbose model output.
Load-bearing premise
The convergence analysis of preference objectives under the unified Bradley-Terry loss framework identifies a length-control signal that does not trade off against reasoning accuracy when applied to filtered trajectories.
What would settle it
Apply LCPO to a new set of benchmarks and measure whether average output length drops by roughly 50% while accuracy on reasoning tasks stays within a few percent of the original model.
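One way to make this check mechanical is sketched below. It compares a base model and a tuned model on a single benchmark. The `RunStats` container, the 50% reduction threshold, and the 3-point accuracy tolerance are all assumptions chosen to mirror the wording above ("roughly 50%", "within a few percent"); none of them is taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one model on one benchmark (hypothetical container)."""
    mean_tokens: float  # average output length in tokens
    accuracy: float     # fraction of problems solved, in [0, 1]


def length_reduction(base: RunStats, tuned: RunStats) -> float:
    """Relative drop in mean output length; 0.5 means outputs are half as long."""
    return 1.0 - tuned.mean_tokens / base.mean_tokens


def accuracy_change(base: RunStats, tuned: RunStats) -> float:
    """Signed accuracy difference (tuned minus base)."""
    return tuned.accuracy - base.accuracy


def claim_holds(base: RunStats, tuned: RunStats,
                min_reduction: float = 0.5,
                max_accuracy_drop: float = 0.03) -> bool:
    """True if length fell by at least min_reduction and accuracy fell by at
    most max_accuracy_drop. Both thresholds are assumed, not from the paper."""
    return (length_reduction(base, tuned) >= min_reduction
            and accuracy_change(base, tuned) >= -max_accuracy_drop)


# Made-up numbers for illustration only; they are not results from the paper.
base = RunStats(mean_tokens=8000.0, accuracy=0.80)
tuned = RunStats(mean_tokens=3600.0, accuracy=0.79)
print(f"{length_reduction(base, tuned):.0%}", claim_holds(base, tuned))
```

Running the same check per benchmark, rather than on a pooled average, would show whether the reduction is uniform or driven by a few datasets.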
Original abstract
Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry-loss-based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our method significantly reduces the average output length of LRMs by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Length Controlled Preference Optimization (LCPO) to prune long Chain-of-Thought outputs in Large Reasoning Models. It filters trajectories via difficulty estimation, analyzes convergence properties of preference objectives under a unified Bradley-Terry loss framework, and introduces LCPO to directly balance an implicit NLL-related reward. The central empirical claim is that LCPO achieves >50% average output length reduction across benchmarks while preserving reasoning performance, using only limited data and training.
Significance. If the results hold under rigorous controls, the work would be significant for practical deployment of LRMs by offering a low-resource method to control generation length and mitigate overthinking without accuracy trade-offs. The unified BT-loss convergence analysis and the emphasis on small-scale tuning are potentially valuable contributions to efficient reasoning research.
major comments (2)
- [Abstract / Experiments] Abstract and experimental sections: the reported >50% length reduction with maintained performance provides no error bars, no explicit data exclusion rules for the filtered trajectories, and no comparison against strong length-regularized baselines; these omissions make it difficult to assess robustness of the central claim that length control does not trade off against accuracy.
- [Method / Convergence Analysis] Convergence analysis under unified Bradley-Terry loss: the claim that LCPO isolates a pure length-control signal orthogonal to reasoning accuracy on difficulty-filtered trajectories is not fully supported, because difficulty estimation may correlate path length with solution quality in the selected data, allowing a hidden trade-off to persist in the learned preference.
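The error bars requested in the first comment could be produced along the following lines. This is a minimal sketch that reports mean and sample standard deviation across random seeds; the function names are hypothetical and the per-seed values are made up for illustration.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds.

    With a single seed there is no spread to estimate, so 0.0 is returned.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def format_with_error_bar(values: Sequence[float], digits: int = 1) -> str:
    """Render a metric as 'mean ± std' for a results table."""
    mean, std = mean_and_std(values)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Made-up per-seed values for illustration; not results from the paper.
accuracy_by_seed = [71.2, 70.4, 72.0]
mean_tokens_by_seed = [3510.0, 3640.0, 3580.0]
print("accuracy (%):", format_with_error_bar(accuracy_by_seed))
print("mean tokens: ", format_with_error_bar(mean_tokens_by_seed, digits=0))
```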
minor comments (1)
- [Method] Notation for the implicit NLL-related reward in the LCPO objective could be clarified with an explicit equation reference to avoid ambiguity when comparing to standard DPO or IPO formulations.
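To make the requested comparison concrete, the standard per-pair DPO loss can be set beside the LCPO form quoted in the paper passage on this page. The DPO symbols β (temperature) and π_ref (frozen reference policy) follow the usual DPO presentation and are not notation confirmed from the paper; the expectation over the preference dataset is omitted in both lines.

```latex
% Standard DPO: implicit reward is a scaled log-ratio against a reference policy.
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)

% LCPO as quoted on this page: implicit reward is the log-odds of each trajectory.
\mathcal{L}_{\mathrm{LCPO}}
  = -\lambda \log \sigma\!\left(
      \log \frac{p_\theta(y_w \mid x)}{1 - p_\theta(y_w \mid x)}
    - \log \frac{p_\theta(y_l \mid x)}{1 - p_\theta(y_l \mid x)}
    \right)
```

Read this way, the visible differences are that the quoted LCPO form has no reference-policy term and that λ scales the whole loss rather than the margin inside the sigmoid.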
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to improve the clarity and robustness of our work.
- Referee: [Abstract / Experiments] Abstract and experimental sections: the reported >50% length reduction with maintained performance provides no error bars, no explicit data exclusion rules for the filtered trajectories, and no comparison against strong length-regularized baselines; these omissions make it difficult to assess robustness of the central claim that length control does not trade off against accuracy.
  Authors: We agree that the inclusion of error bars and explicit details on data filtering would enhance the transparency and allow better evaluation of the results' robustness. In the revised manuscript, we will add error bars (standard deviations from multiple seeds where applicable) to the reported metrics in the experimental sections and abstract if space permits. We will also provide a detailed description of the data exclusion rules used in the difficulty estimation filtering process in the Method section. Regarding comparisons to strong length-regularized baselines, our current experiments include relevant preference optimization and length control methods; however, we acknowledge that additional baselines could further strengthen the evaluation. We will include a discussion of this and, if feasible within our computational budget, add one or two more baselines in the revision. These changes will better support the central claim.
  revision: yes
- Referee: [Method / Convergence Analysis] Convergence analysis under unified Bradley-Terry loss: the claim that LCPO isolates a pure length-control signal orthogonal to reasoning accuracy on difficulty-filtered trajectories is not fully supported, because difficulty estimation may correlate path length with solution quality in the selected data, allowing a hidden trade-off to persist in the learned preference.
  Authors: We appreciate this insightful observation regarding potential correlations in the filtered data. To clarify, our difficulty estimation is based on the model's ability to solve the problem correctly rather than directly on path length, and we select trajectories where shorter paths still lead to correct solutions. We will revise the convergence analysis section to include an explicit analysis or appendix demonstrating the low correlation between path length and solution quality in the selected subset, thereby supporting that the length preference is isolated. This will address the concern about hidden trade-offs and strengthen the theoretical justification for LCPO.
  revision: partial
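The filtering and the correlation check described in this exchange might look roughly like the sketch below. It is not the authors' procedure: the pass-rate threshold, the choice of the shortest correct trajectory as "chosen" and the longest correct one as "rejected", and all names are assumptions made for illustration, since the exact exclusion rules are not given on this page.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Trajectory:
    """One sampled reasoning trace (hypothetical container)."""
    problem_id: str
    num_tokens: int
    correct: bool


def pass_rate(trajs: Sequence[Trajectory]) -> float:
    """Fraction of sampled trajectories that solve the problem (difficulty proxy)."""
    return sum(t.correct for t in trajs) / len(trajs)


def build_length_pairs(by_problem: Dict[str, List[Trajectory]],
                       min_pass_rate: float = 0.5) -> List[Tuple[Trajectory, Trajectory]]:
    """Keep problems the model solves often enough, then pair the shortest
    correct trajectory (chosen) with the longest correct one (rejected)."""
    pairs = []
    for trajs in by_problem.values():
        if not trajs or pass_rate(trajs) < min_pass_rate:
            continue  # too hard under the assumed threshold: skip
        correct = sorted((t for t in trajs if t.correct), key=lambda t: t.num_tokens)
        if len(correct) >= 2 and correct[0].num_tokens < correct[-1].num_tokens:
            pairs.append((correct[0], correct[-1]))
    return pairs


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 by convention if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom if denom > 0 else 0.0


def length_correctness_correlation(trajs: Sequence[Trajectory]) -> float:
    """Correlation between trace length and correctness in a set of trajectories."""
    return pearson([float(t.num_tokens) for t in trajs],
                   [float(t.correct) for t in trajs])
```

A correlation near zero on the retained problems would support the authors' position; a clearly negative or positive value would support the referee's concern that length and solution quality are entangled in the selected data.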
Circularity Check
No significant circularity in LCPO derivation chain
The paper filters trajectories via difficulty estimation then derives LCPO from convergence analysis of preference objectives under a unified Bradley-Terry loss framework, proposing to balance the implicit NLL-related reward. No load-bearing step reduces by construction to its inputs: length preference emerges from optimization on the filtered data rather than being defined into the loss or fitted directly to the target metric. The central claim of >50% length reduction without accuracy loss is presented as an empirical outcome verified across benchmarks, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- length preference weight in LCPO
axioms (1)
- domain assumption: the Bradley-Terry model accurately ranks length-preferring trajectories
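For reference, the Bradley-Terry assumption named in this ledger can be written out as follows. This is the textbook form of the model; the reward symbols r_w and r_l are generic placeholders for whatever implicit reward a given preference objective assigns, not notation taken from the paper.

```latex
% Bradley-Terry: probability that the chosen response y_w beats the rejected y_l.
P(y_w \succ y_l \mid x)
  = \frac{\exp(r_w)}{\exp(r_w) + \exp(r_l)}
  = \sigma(r_w - r_l)

% Negative log-likelihood of one observed preference, i.e. a log-sigmoid loss.
\mathcal{L}_{\mathrm{BT}} = -\log \sigma(r_w - r_l)
```

This is why different preference objectives can be compared in a single log-sigmoid form: they differ only in how the reward margin inside the sigmoid is defined.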
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We reformulate the objective functions of different methods into a log-sigmoid function form: $-\log \sigma(R(y_w, y_l \mid x))$. ... we find that the implicit reward associated with NLL loss can hinder length preference alignment. ... $\mathcal{L}_{\mathrm{LCPO}} = -\lambda \log \sigma\left(\log \frac{p_\theta(y_w \mid x)}{1 - p_\theta(y_w \mid x)} - \log \frac{p_\theta(y_l \mid x)}{1 - p_\theta(y_l \mid x)}\right)$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.