Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

Bin Hong; Jianwen Sun; Jiayu Liu; Kai Zhang; Mengdi Zhang; Zhenya Huang

arxiv: 2508.10164 · v2 · submitted 2025-08-13 · 💻 cs.AI

Pith reviewed 2026-05-18 22:23 UTC · model grok-4.3

classification 💻 cs.AI

keywords large reasoning models · chain-of-thought · preference optimization · length reduction · efficient reasoning · small-scale tuning

The pith

Length Controlled Preference Optimization shortens large reasoning model outputs by over 50% without accuracy loss

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large reasoning models can be trained to produce much shorter chain-of-thought responses without losing accuracy on complex tasks. It filters generated trajectories by difficulty estimation and then applies a new method called Length Controlled Preference Optimization. LCPO is built from convergence analysis of preference objectives under a unified Bradley-Terry loss framework and directly balances the implicit reward tied to negative log likelihood loss. A sympathetic reader cares because shorter outputs lower compute costs and reduce overthinking while the method needs only limited data and training steps. This contrasts with prior approaches that often trade away quality or demand heavy resources.

Core claim

The central claim is that LCPO can effectively learn length preference with limited data and training. By analyzing generation path distributions, filtering trajectories, and balancing the implicit reward related to NLL loss under the Bradley-Terry framework, the method reduces average output length of LRMs by over 50% across multiple benchmarks while maintaining reasoning performance.
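
The filtering step can be pictured with a small sketch. It assumes that difficulty is estimated from the pass rate of several sampled rollouts per problem and that a length-preference pair takes the shortest correct rollout as "chosen" and the longest correct rollout as "rejected". The `Rollout` type, the bucket thresholds, and the pairing rule are illustrative assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rollout:
    """One sampled reasoning trajectory (hypothetical container)."""
    text: str
    num_tokens: int
    is_correct: bool


def pass_rate(rollouts: List[Rollout]) -> float:
    """Fraction of sampled rollouts that reach the correct answer."""
    return sum(r.is_correct for r in rollouts) / len(rollouts)


def difficulty_bucket(rate: float, easy_min: float = 0.8, hard_max: float = 0.3) -> str:
    """Map a pass rate to easy/medium/hard. Thresholds are assumptions."""
    if rate >= easy_min:
        return "easy"
    if rate <= hard_max:
        return "hard"
    return "medium"


def length_preference_pair(rollouts: List[Rollout]) -> Optional[Tuple[Rollout, Rollout]]:
    """Return (chosen, rejected) = (shortest correct, longest correct).

    Returns None when fewer than two correct rollouts exist or when all
    correct rollouts have the same length, so no preference can be formed.
    """
    correct = sorted((r for r in rollouts if r.is_correct), key=lambda r: r.num_tokens)
    if len(correct) < 2 or correct[0].num_tokens == correct[-1].num_tokens:
        return None
    return correct[0], correct[-1]


# Toy rollouts for one problem (made-up values, for illustration only).
sample = [
    Rollout("a", 1200, True),
    Rollout("b", 400, True),
    Rollout("c", 900, False),
    Rollout("d", 2500, True),
]
bucket = difficulty_bucket(pass_rate(sample))
pair = length_preference_pair(sample)
```

Both members of each pair are correct in this sketch, so the only thing that separates them is length; whether the paper builds its pairs exactly this way is not stated on this page.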

What carries the argument

Length Controlled Preference Optimization (LCPO), a preference objective that directly balances the implicit reward related to NLL loss to enforce shorter reasoning trajectories from filtered data.
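
As a rough illustration, the LCPO objective as it is quoted from the paper further down this page, a log-sigmoid of the difference between the log-odds of the preferred and dispreferred responses scaled by λ, can be written in a few lines of plain Python. The inputs are sequence log-probabilities log p_θ(y|x); how the paper computes them (for example, whether they are length-normalized) is not stated here, so treat them as placeholders. The function names are hypothetical.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, for 0 < p < 1."""
    if logp >= 0.0:
        raise ValueError("logp must be negative so that p < 1")
    # log(1 - p) = log(-expm1(logp)), which stays accurate when p is small.
    return logp - math.log(-math.expm1(logp))


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0.0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def lcpo_loss(logp_w: float, logp_l: float, lam: float = 1.0) -> float:
    """-lam * log sigmoid(log_odds(p_w) - log_odds(p_l)) for one pair.

    logp_w: log-probability of the preferred (shorter) response.
    logp_l: log-probability of the dispreferred (longer) response.
    lam:    the lambda weight; its value here is an assumption.
    """
    return -lam * log_sigmoid(log_odds(logp_w) - log_odds(logp_l))
```

When both responses are equally likely the loss equals λ·log 2, and it falls as the model assigns more probability to the preferred response than to the dispreferred one.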

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering-plus-balancing pattern could be tested on non-reasoning generation tasks such as summarization or code completion.
Length control via NLL reward balancing might combine with other efficiency techniques like quantization without further accuracy loss.
If the Bradley-Terry convergence insight holds more broadly, similar small-scale preference methods could prune other forms of verbose model output.

Load-bearing premise

The convergence analysis of preference objectives under the unified Bradley-Terry loss framework identifies a length-control signal that does not trade off against reasoning accuracy when applied to filtered trajectories.

What would settle it

Apply LCPO to a new set of benchmarks and measure whether average output length drops by roughly 50% while accuracy on reasoning tasks stays within a few percent of the original model.
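
A minimal sketch of that check, assuming per-example token counts and correctness flags are available for the original and the tuned model. The 50% reduction threshold comes from the claim itself; the 3-point accuracy tolerance is an assumed reading of "a few percent", and all numbers below are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def length_reduction(base_lengths: Sequence[int], tuned_lengths: Sequence[int]) -> float:
    """Relative drop in average output length (0.55 means 55% shorter)."""
    return 1.0 - mean(tuned_lengths) / mean(base_lengths)


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of examples answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def settles_claim(
    base_lengths: Sequence[int],
    tuned_lengths: Sequence[int],
    base_correct: Sequence[bool],
    tuned_correct: Sequence[bool],
    min_reduction: float = 0.5,   # "roughly 50%" from the claim
    max_acc_drop: float = 0.03,   # assumed tolerance for "a few percent"
) -> bool:
    """True if length drops enough and accuracy does not fall too far."""
    reduction = length_reduction(base_lengths, tuned_lengths)
    acc_drop = accuracy(base_correct) - accuracy(tuned_correct)
    return reduction >= min_reduction and acc_drop <= max_acc_drop


# Toy data, not results from the paper.
toy_base_len, toy_tuned_len = [1000, 2000, 3000], [400, 900, 1400]
toy_base_ok, toy_tuned_ok = [True, True, False], [True, True, False]
verdict = settles_claim(toy_base_len, toy_tuned_len, toy_base_ok, toy_tuned_ok)
```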

Figures

Figures reproduced from arXiv: 2508.10164 by Bin Hong, Jianwen Sun, Jiayu Liu, Kai Zhang, Mengdi Zhang, Zhenya Huang.

**Figure 2.** (a) As the number of output tokens increases, model performance tends to deteriorate. The ...

**Figure 3.** The distribution of output length on MATH-500 shifts after preference optimization.

**Figure 4.** Changes in accuracy and generation length of DeepSeek-R1-Distill-Qwen-7B across ...

**Figure 5.** Evaluation prompt. Dataset generated from LIMR: in this paper, we apply a heuristic pipeline of self-distillation (rollout), filtering, and ranking to the data. This process yields three splits (easy, medium, and hard) with respect to the different models used for rollout. Here we list all the statistics of the dataset generated along this line in ...

read the original abstract

Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our method significantly reduces the average output length of LRMs by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LCPO gives a practical, low-data way to shorten long-CoT outputs in reasoning models by over 50 percent while keeping benchmark scores, but the supporting experiments still need tighter controls and clearer baselines.

The main takeaway is that this work gives a usable lever for trimming inference cost in long-CoT models without the usual accuracy hit. They filter trajectories by difficulty, then train a length-controlled preference objective derived from a Bradley-Terry convergence analysis that directly balances the NLL-related reward term. The result is a method that runs on limited data and cuts average output length substantially across several benchmarks while holding reasoning performance steady.

Referee Report

2 major / 1 minor

Summary. The paper proposes Length Controlled Preference Optimization (LCPO) to prune long Chain-of-Thought outputs in Large Reasoning Models. It filters trajectories via difficulty estimation, analyzes convergence properties of preference objectives under a unified Bradley-Terry loss framework, and introduces LCPO to directly balance an implicit NLL-related reward. The central empirical claim is that LCPO achieves >50% average output length reduction across benchmarks while preserving reasoning performance, using only limited data and training.

Significance. If the results hold under rigorous controls, the work would be significant for practical deployment of LRMs by offering a low-resource method to control generation length and mitigate overthinking without accuracy trade-offs. The unified BT-loss convergence analysis and the emphasis on small-scale tuning are potentially valuable contributions to efficient reasoning research.

major comments (2)

[Abstract / Experiments] Abstract and experimental sections: the reported >50% length reduction with maintained performance provides no error bars, no explicit data exclusion rules for the filtered trajectories, and no comparison against strong length-regularized baselines; these omissions make it difficult to assess robustness of the central claim that length control does not trade off against accuracy.
[Method / Convergence Analysis] Convergence analysis under unified Bradley-Terry loss: the claim that LCPO isolates a pure length-control signal orthogonal to reasoning accuracy on difficulty-filtered trajectories is not fully supported, because difficulty estimation may correlate path length with solution quality in the selected data, allowing a hidden trade-off to persist in the learned preference.

minor comments (1)

[Method] Notation for the implicit NLL-related reward in the LCPO objective could be clarified with an explicit equation reference to avoid ambiguity when comparing to standard DPO or IPO formulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to improve the clarity and robustness of our work.

Referee: [Abstract / Experiments] Abstract and experimental sections: the reported >50% length reduction with maintained performance provides no error bars, no explicit data exclusion rules for the filtered trajectories, and no comparison against strong length-regularized baselines; these omissions make it difficult to assess robustness of the central claim that length control does not trade off against accuracy.

Authors: We agree that the inclusion of error bars and explicit details on data filtering would enhance the transparency and allow better evaluation of the results' robustness. In the revised manuscript, we will add error bars (standard deviations from multiple seeds where applicable) to the reported metrics in the experimental sections and abstract if space permits. We will also provide a detailed description of the data exclusion rules used in the difficulty estimation filtering process in the Method section. Regarding comparisons to strong length-regularized baselines, our current experiments include relevant preference optimization and length control methods; however, we acknowledge that additional baselines could further strengthen the evaluation. We will include a discussion of this and, if feasible within our computational budget, add one or two more baselines in the revision. These changes will better support the central claim.

revision: yes

Referee: [Method / Convergence Analysis] Convergence analysis under unified Bradley-Terry loss: the claim that LCPO isolates a pure length-control signal orthogonal to reasoning accuracy on difficulty-filtered trajectories is not fully supported, because difficulty estimation may correlate path length with solution quality in the selected data, allowing a hidden trade-off to persist in the learned preference.

Authors: We appreciate this insightful observation regarding potential correlations in the filtered data. To clarify, our difficulty estimation is based on the model's ability to solve the problem correctly rather than directly on path length, and we select trajectories where shorter paths still lead to correct solutions. We will revise the convergence analysis section to include an explicit analysis or appendix demonstrating the low correlation between path length and solution quality in the selected subset, thereby supporting that the length preference is isolated. This will address the concern about hidden trade-offs and strengthen the theoretical justification for LCPO.

revision: partial
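
The two checks discussed in these responses, error bars across seeds and the correlation between path length and solution quality, can be sketched with the standard library alone. This is one plausible way to run them, not the authors' procedure: the function names are hypothetical, and correctness is treated as a 0/1 variable so that the Pearson coefficient coincides with the point-biserial correlation.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_std(per_seed_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs >= 2 seeds)."""
    return mean(per_seed_scores), stdev(per_seed_scores)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences.

    With ys restricted to 0/1 (incorrect/correct) this equals the
    point-biserial correlation between path length and correctness.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A coefficient near zero on the filtered subset would be consistent with the authors' position that length and correctness are decoupled there, while a clearly positive or negative value would support the referee's concern about a hidden trade-off.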

Circularity Check

0 steps flagged

No significant circularity in LCPO derivation chain

The paper filters trajectories via difficulty estimation then derives LCPO from convergence analysis of preference objectives under a unified Bradley-Terry loss framework, proposing to balance the implicit NLL-related reward. No load-bearing step reduces by construction to its inputs: length preference emerges from optimization on the filtered data rather than being defined into the loss or fitted directly to the target metric. The central claim of >50% length reduction without accuracy loss is presented as an empirical outcome verified across benchmarks, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the Bradley-Terry preference model, the validity of difficulty-based trajectory filtering, and the assumption that NLL-related implicit reward can be balanced against length without external validation.

free parameters (1)

length preference weight in LCPO
Hyper-parameter that directly trades off length against correctness in the unified loss; value chosen to achieve reported 50% reduction.
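
To see what this weight does mechanically, take the LCPO objective as quoted on this page at face value and write z for the log-odds difference between the preferred and dispreferred responses. The short derivation below is a sketch under that assumption; how λ interacts with any other loss terms in the paper is not shown here.

```latex
\begin{aligned}
z &= \log\frac{p_\theta(y_w \mid x)}{1 - p_\theta(y_w \mid x)}
   - \log\frac{p_\theta(y_l \mid x)}{1 - p_\theta(y_l \mid x)}, \\
\mathcal{L}_{\mathrm{LCPO}} &= -\lambda \log \sigma(z), \\
\frac{\partial \mathcal{L}_{\mathrm{LCPO}}}{\partial z}
  &= -\lambda \bigl(1 - \sigma(z)\bigr) = -\lambda\, \sigma(-z).
\end{aligned}
```

The gradient with respect to z is scaled linearly by λ and vanishes as z grows, so under this reading λ sets how strongly each pair pushes the model toward the shorter response until the pair is already well separated.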

axioms (1)

domain assumption: the Bradley-Terry model accurately ranks length-preferring trajectories
Invoked when unifying preference objectives under a single loss framework.

pith-pipeline@v0.9.0 · 5720 in / 1198 out tokens · 44763 ms · 2026-05-18T22:23:11.483591+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We reformulate the objective functions of different methods into a log-sigmoid function form: $-\log \sigma(R(y_w, y_l \mid x))$. ... we find that the implicit reward associated with NLL loss can hinder length preference alignment. ... $\mathcal{L}_{\mathrm{LCPO}} = -\lambda \log \sigma\left(\log \frac{p_\theta(y_w \mid x)}{1 - p_\theta(y_w \mid x)} - \log \frac{p_\theta(y_l \mid x)}{1 - p_\theta(y_l \mid x)}\right)$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 2 Pith papers · 20 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwq-32b: Embracing the power of reinforcement learning,

Q. Team, “Qwq-32b: Embracing the power of reinforcement learning,” March 2025. [Online]. Available: https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[3]

Qwen2.5 Technical Report

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qi...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Chain-of- thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of- thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

work page 2022
[5]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

work page 2022
[6]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y . Zhou, T. Gao, and W. Che, “Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,” arXiv preprint arXiv:2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

F. Xu, Q. Hao, Z. Zong, J. Wang, Y . Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng et al., “Towards large reasoning models: A survey of reinforced reasoning with large language models,”arXiv preprint arXiv:2501.09686, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

S. Feng, G. Fang, X. Ma, and X. Wang, “Efficient reasoning models: A survey,”arXiv preprint arXiv:2504.10903, 2025

work page arXiv 2025
[10]

Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, and 1 others

X. Qu, Y . Li, Z. Su, W. Sun, J. Yan, D. Liu, G. Cui, D. Liu, S. Liang, J. Heet al., “A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond,”arXiv preprint arXiv:2503.21614, 2025

work page arXiv 2025
[11]

Efficient inference for large reasoning models: A survey,

Y . Liu, J. Wu, Y . He, H. Gao, H. Chen, B. Bi, J. Zhang, Z. Huang, and B. Hooi, “Efficient inference for large reasoning models: A survey,”arXiv preprint arXiv:2503.23077, 2025

work page arXiv 2025
[12]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Y . Sui, Y .-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, H. Chen et al., “Stop overthinking: A survey on efficient reasoning for large language models,”arXiv preprint arXiv:2503.16419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

arXiv preprint arXiv:2502.18600 , year=

S. Xu, W. Xie, L. Zhao, and P. He, “Chain of draft: Thinking faster by writing less,” arXiv preprint arXiv:2502.18600, 2025

work page arXiv 2025
[14]

s1: Simple test-time scaling,

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candes, and T. Hashimoto, “s1: Simple test-time scaling,” inWorkshop on Reasoning and Planning for Large Language Models, 2025

work page 2025
[15]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

P. Aggarwal and S. Welleck, “L1: Controlling how long a reasoning model thinks with rein- forcement learning,”arXiv preprint arXiv:2503.04697, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao et al., “Kimi k1. 5: Scaling reinforcement learning with llms,” arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

B. Hou, Y . Zhang, J. Ji, Y . Liu, K. Qian, J. Andreas, and S. Chang, “Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning,”arXiv preprint arXiv:2504.01296, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Training language models to reason efficiently.arXiv preprint arXiv:2502.04463,2025

D. Arora and A. Zanette, “Training language models to reason efficiently,” 2025. [Online]. Available: https://arxiv.org/abs/2502.04463

work page arXiv 2025
[19]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhu, Y . Yuan, X. Zuo, Y . Yue, T. Fan, G. Liu, L. Liu, X. Liuet al., “Dapo: An open-source llm reinforcement learning system at scale,” arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang et al., “Do not think that much for 2+ 3=? on the overthinking of o1-like llms,” arXiv preprint arXiv:2412.21187, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Y . Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang, “Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?” arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Dast: Difficulty-adaptive slow-thinking for large reasoning models,

Y . Shen, J. Zhang, J. Huang, S. Shi, W. Zhang, J. Yan, N. Wang, K. Wang, and S. Lian, “Dast: Difficulty-adaptive slow-thinking for large reasoning models,”arXiv preprint arXiv:2503.04472, 2025

work page arXiv 2025
[23]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney et al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Introducing openai o3 and o4-mini,

OpenAI, “Introducing openai o3 and o4-mini,” https://openai.com/index/ introducing-o3-and-o4-mini/, 2025, accessed: September 11, 2025

work page 2025
[25]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Advances in Neural Information Processing Systems, vol. 36, pp. 53 728–53 741, 2023

work page 2023
[27]

Simpo: Simple preference optimization with a reference-free reward,

Y . Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,”Advances in Neural Information Processing Systems, vol. 37, pp. 124 198–124 235, 2024

work page 2024
[28]

Simper: A minimalist approach to preference alignment without hyperparameters,

T. Xiao, Y . Yuan, Z. Chen, M. Li, S. Liang, Z. Ren, and V . G. Honavar, “Simper: A minimalist approach to preference alignment without hyperparameters,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[29]

Orpo: Monolithic preference optimization without reference model,

J. Hong, N. Lee, and J. Thorne, “Orpo: Monolithic preference optimization without reference model,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 11 170–11 189

work page 2024
[30]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems , S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 824–24 837. [Onl...

work page 2022
[31]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations, 2022

work page 2022
[32]

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, Z. Guo, Y . Wang, I. King, X. Liu, and C. Ma, “What, how, where, and how well? a survey on test-time scaling in large language models,” arXiv preprint arXiv:2503.24235, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?arXiv preprint arXiv:2502.12215,

Z. Zeng, Q. Cheng, Z. Yin, Y . Zhou, and X. Qiu, “Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?” arXiv preprint arXiv:2502.12215, 2025. 12

work page arXiv 2025
[34]

Distilling system 2 into system 1,

P. Yu, J. Xu, J. E. Weston, and I. Kulikov, “Distilling system 2 into system 1,” in The First Workshop on System-2 Reasoning at Scale, NeurIPS’24, 2024

work page 2024
[35]

C3ot: Generating shorter chain-of-thought without compromising effectiveness,

Y . Kang, X. Sun, L. Chen, and W. Zou, “C3ot: Generating shorter chain-of-thought without compromising effectiveness,” inProceedings of the AAAI Conference on Artificial Intelligence, no. 23, 2025, pp. 24 312–24 320

work page 2025
[36]

Limr: Less is more for rl scaling.arXiv preprint arXiv:2502.11886,

X. Li, H. Zou, and P. Liu, “Limr: Less is more for rl scaling,”arXiv preprint arXiv:2502.11886, 2025

work page arXiv 2025
[37]

Understanding R1-Zero-Like Training: A Critical Perspective

Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin, “Understanding r1-zero-like training: A critical perspective,” 2025. [Online]. Available: https://arxiv.org/abs/2503.20783

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Rank analysis of incomplete block designs: I. the method of paired comparisons,

R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,”Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952

work page 1952
[39]

On information and sufficiency,

S. Kullback and R. A. Leibler, “On information and sufficiency,”The annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951

work page 1951
[40]

Let’s verify step by step,

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” inThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[41]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[42]

Solving quantitative reasoning problems with language models,

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al., “Solving quantitative reasoning problems with language models,” Advances in neural information processing systems, vol. 35, pp. 3843–3857, 2022

work page 2022
[43]

Art of problem solving,

A. Online, “Art of problem solving,” https://artofproblemsolving.com/wiki/index.php/AIME_ Problems_and_Solutions, 2025, accessed: September 11, 2025

work page 2025
[44]

American mathematics competitions,

M. A. of America, “American mathematics competitions,” https://maa.org/student-programs/ amc/, 2025, accessed: September 11, 2025

work page 2025
[45]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems,

C. He, R. Luo, Y . Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y . Huang, Y . Zhang et al., “Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 3828–3850

work page 2024
[46]

Measuring massive multitask language understanding,

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,”Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[47]

Aligning ai with shared human values,

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning ai with shared human values,”Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[48]

Measuring mathematical problem solving with the math dataset,

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,”NeurIPS, 2021

work page 2021
[49]

Deepscaler: Surpassing o1- preview with a 1.5b model by scaling rl,

M. Luo, S. Tan, J. Wong, X. Shi, W. Y . Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica, “Deepscaler: Surpassing o1- preview with a 1.5b model by scaling rl,” https://pretty-radio-b75.notion.site/ DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025, notion Blog

work page 2025
[50]

Process Reinforcement through Implicit Rewards

G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y . Fan, T. Yu, Q. Xu, W. Chenet al., “Process reinforcement through implicit rewards,”arXiv preprint arXiv:2502.01456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Adam: A Method for Stochastic Optimization

D. P. Kingma, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014. 13

work page internal anchor Pith review Pith/arXiv arXiv 2014
[52]

Automatic differentiation in pytorch,

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017

work page 2017
[53]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) . Bangkok, Thailand: Association for Computational Linguistics, 2024. [Online]. Available: ht...

[54] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[55] F. Jelinek, R. L. Mercer, L. R. Bahl, and J. K. Baker, “Perplexity—a measure of the difficulty of speech recognition tasks,” The Journal of the Acoustical Society of America, vol. 62, no. S1, pp. S63–S63, 1977.

A Brief Introduction of Datasets and Baselines

Math datasets. We use 6 math datasets covering in-domain and out-of-domain data for evaluation. The ...


