
arxiv: 2602.09591 · v3 · pith:3XZMI2FZnew · submitted 2026-02-10 · 💻 cs.CL · cs.AI · cs.LG

On the Optimal Reasoning Length for RL-Trained Language Models

classification 💻 cs.CL cs.AI cs.LG
keywords accuracy, length, models, reasoning, language, length-accuracy, length-control, methods

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
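The distinction between sample accuracy and mode accuracy can be illustrated with a minimal sketch. It assumes, since the abstract does not define the terms, that sample accuracy is the fraction of individually sampled answers that are correct and that mode accuracy scores only the most frequent answer among the samples for each problem. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sample_accuracy(answers, reference):
    """Fraction of individual sampled answers that equal the reference."""
    return sum(a == reference for a in answers) / len(answers)


def mode_correct(answers, reference):
    """True if the most frequent sampled answer equals the reference.

    Assumption: "mode accuracy" means scoring the modal answer per problem.
    """
    mode_answer, _ = Counter(answers).most_common(1)[0]
    return mode_answer == reference


def evaluate(problems):
    """Average both metrics over (sampled_answers, reference) pairs."""
    n = len(problems)
    sample_acc = sum(sample_accuracy(a, r) for a, r in problems) / n
    mode_acc = sum(mode_correct(a, r) for a, r in problems) / n
    return sample_acc, mode_acc


# Hypothetical toy data: five sampled final answers per problem.
problems = [
    (["42", "42", "41", "42", "7"], "42"),    # mode is correct
    (["3", "5", "3", "8", "9"], "3"),         # mode is correct despite dispersion
    (["10", "12", "12", "11", "12"], "10"),   # mode is wrong
]

sample_acc, mode_acc = evaluate(problems)
print(f"sample accuracy: {sample_acc:.3f}, mode accuracy: {mode_acc:.3f}")
```

In the second toy problem most samples are wrong, yet the modal answer is right. Under the assumed definitions, that is the kind of dispersion around a correct center that lets mode accuracy exceed sample accuracy.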



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

    cs.AI 2026-05 unverdicted novelty 6.0

    ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...