On the Optimal Reasoning Length for RL-Trained Language Models

Daisuke Nohara; Rio Yokota; Taishi Nakamura

arxiv: 2602.09591 · v3 · pith:3XZMI2FZnew · submitted 2026-02-10 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL cs.AI cs.LG

keywords accuracy, length, models, reasoning, language, length-accuracy, length-control, methods

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
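
The distinction between sample accuracy and mode accuracy can be illustrated with a small sketch. It assumes that sample accuracy is the fraction of sampled answers that are correct and that mode accuracy scores only the most frequent sampled answer; the abstract does not spell out these definitions, and the function names and example answers below are hypothetical.

```python
from collections import Counter


def sample_accuracy(answers, reference):
    """Fraction of sampled answers that match the reference (assumed definition)."""
    return sum(a == reference for a in answers) / len(answers)


def mode_accuracy(answers, reference):
    """1.0 if the most frequent sampled answer matches the reference, else 0.0
    (assumed definition)."""
    mode_answer, _ = Counter(answers).most_common(1)[0]
    return float(mode_answer == reference)


# Hypothetical final answers sampled for one problem whose reference is "42".
concentrated = ["42", "42", "42", "41", "42", "42", "40", "42"]
dispersed = ["42", "17", "42", "8", "42", "3", "99", "5"]

for name, answers in [("concentrated", concentrated), ("dispersed", dispersed)]:
    print(name, sample_accuracy(answers, "42"), mode_accuracy(answers, "42"))
```

In both hypothetical sets the most frequent answer is correct, so mode accuracy is the same, while the more dispersed set has a lower sample accuracy. Under these assumed definitions, this is the pattern the abstract describes: a correct center with growing dispersion around it.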

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
cs.AI 2026-05 unverdicted novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...