RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
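The step-level measurement described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes each reasoning trajectory has already been mapped to a sequence of integer step ids (the abstract does not say how steps are identified), and it assumes the decay is exponential in rank, which the abstract also does not specify. The names `build_graph` and `decay_rate` are hypothetical.

```python
from collections import Counter, defaultdict
import math


def build_graph(trajectories):
    """Build a directed step-transition graph from trajectories.

    Each trajectory is a list of step ids (an assumed representation).
    Returns (visits, degree): per-node visit counts and per-node
    degree (distinct in-neighbours plus distinct out-neighbours).
    """
    visits = Counter()
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for traj in trajectories:
        visits.update(traj)
        for a, b in zip(traj, traj[1:]):
            if a != b:  # ignore self-loops
                out_edges[a].add(b)
                in_edges[b].add(a)
    degree = {n: len(out_edges[n]) + len(in_edges[n]) for n in visits}
    return visits, degree


def decay_rate(values):
    """Estimate a decay rate for a rank-ordered distribution.

    Sorts positive values in descending order and fits
    log(value) = c - rate * rank by ordinary least squares.
    The exponential form is an assumption made for illustration.
    """
    ys = [math.log(v) for v in sorted(values, reverse=True) if v > 0]
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two positive values")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    return -sxy / sxx


# Toy trajectories, made up for illustration only.
visits, degree = build_graph([[0, 1, 2], [0, 1, 3], [0, 2]])
rate = decay_rate(visits.values())
```

Under these assumptions, a larger `rate` means visits are concentrated in fewer steps and a smaller one means they are spread more evenly. Comparing the rate before and after a training stage is the kind of comparison the abstract reports. Betweenness centrality could be handled the same way with a graph library.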
Forward citations
Cited by 4 Pith papers
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
  UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
  CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
  DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.