Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
Pith reviewed 2026-05-08 03:29 UTC · model grok-4.3
The pith
Reinforcement learning post-training improves LLM generalization by inducing a compact set of task-agnostic features that mediate performance on out-of-domain tasks, unlike the many specialized features introduced by supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RL induces more restrained and continually evolving feature changes that largely preserve the base model's representations, in contrast to SFT, which rapidly introduces many highly specialized features that stabilize early in training. On samples where RL succeeds but the base model fails, a compact, task-agnostic set of features directly mediates generalization across diverse tasks. Feature-level interventions establish these features' causal role: disabling them degrades RL performance, while amplifying them improves base-model performance.
What carries the argument
Alignment of internal activations from the RL and SFT models into a shared feature space, followed by identification of a compact task-agnostic feature set and targeted interventions that disable or amplify those features to test their effect on generalization.
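To make the alignment step concrete, here is a minimal sketch of one plausible form it could take: a ridge-regularized linear map fit between paired activations collected from the two models on the same prompts at the same layer. The activation matrices, dimensions, and penalty strength below are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch of aligning activations from two models into a shared
# feature space. This assumes a simple linear map fit by least squares on
# hypothetical activation matrices; the paper's own alignment objective
# may differ.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: (n_tokens, d_model), same prompts, same layer.
acts_base = rng.normal(size=(4096, 768))   # base model
acts_rl = rng.normal(size=(4096, 768))     # RL-tuned model

# Fit W so that acts_rl @ W approximates acts_base. The ridge penalty
# keeps the map stable when activation dimensions are correlated.
lam = 1e-3
d = acts_rl.shape[1]
W = np.linalg.solve(
    acts_rl.T @ acts_rl + lam * np.eye(d),
    acts_rl.T @ acts_base,
)

# Once mapped into the base model's space, per-feature statistics become
# directly comparable across models.
aligned = acts_rl @ W
per_feature_shift = np.abs(aligned.mean(axis=0) - acts_base.mean(axis=0))
print("features with largest post-training shift:",
      np.argsort(per_feature_shift)[-10:])
```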
If this is right
- RL post-training succeeds by making limited, ongoing adjustments rather than overwriting base representations with specialized ones.
- A small number of features can be edited to control whether a model generalizes on reasoning tasks it previously failed.
- The same features appear to operate across multiple distinct tasks, suggesting a shared mechanism for RL-driven generalization.
- SFT's early stabilization of many narrow features explains why it often reduces capabilities outside the fine-tuning distribution.
Where Pith is reading between the lines
- Training procedures could be designed to target induction of this small feature set directly instead of relying on full RL.
- The contrast between RL and SFT may extend to other post-training methods such as preference optimization or continued pre-training.
- If these features can be located early in training, models might be monitored or steered toward generalization without full RL runs.
Load-bearing premise
Aligning activations from different models in one feature space captures the true mechanistic differences without artifacts from the alignment procedure or the choice of how features are extracted.
What would settle it
An experiment in which the identified features are disabled inside a trained RL model yet its generalization accuracy on held-out tasks stays essentially unchanged, or in which amplifying the same features inside a base model produces no measurable gain.
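For illustration, a minimal sketch of such an intervention, assuming the mediating features correspond to fixed dimensions of a layer's output. The toy layer and the feature indices are placeholders, not the paper's model or its recovered feature set.

```python
# Sketch of a disable/amplify intervention via a forward hook, assuming
# features live along fixed dimensions of one layer's output.
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 64)            # stand-in for one transformer layer
mediating_features = [3, 17, 42]           # hypothetical "compact set"

def intervene(scale):
    """Return a forward hook that rescales the mediating feature dims."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., mediating_features] *= scale
        return output
    return hook

x = torch.randn(8, 64)

# scale=0.0 disables the features (the RL-degradation test);
# scale>1.0 amplifies them (the base-model-improvement test).
for scale in (0.0, 1.0, 2.0):
    handle = layer.register_forward_hook(intervene(scale))
    out = layer(x)
    handle.remove()
    print(f"scale={scale}: mean |activation| on set =",
          out[..., mediating_features].abs().mean().item())
```

In the settling experiment, the quantity of interest would be held-out task accuracy under each scale, not the raw activation magnitudes printed here.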
Original abstract
Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at https://github.com/danshi777/RL-generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RL post-training improves LLM generalization over SFT by inducing more restrained, continually evolving feature changes that largely preserve base-model representations, while SFT introduces many specialized features early. Using a controlled setup with RL- and SFT-tuned models trained from the same base on identical data, the authors align internal activations in a shared feature space, identify a compact task-agnostic set of features that mediate generalization on RL-success/base-failure samples, and validate causality via interventions: disabling the features degrades RL generalization while amplifying them improves base-model performance.
Significance. If the central claims hold, the work supplies a mechanistic account of why RL generalizes where SFT does not, grounded in a controlled experimental design, direct feature-level interventions, and open code. These elements could guide more effective post-training techniques and advance interpretability research on LLM capabilities.
major comments (1)
- [interpretability framework / activation alignment] The shared feature space alignment procedure (described in the interpretability framework section) is load-bearing for the identification of the compact task-agnostic feature set and all subsequent causal claims. The manuscript does not report sensitivity analyses to the alignment objective, regularization, choice of backbone, or alternative maps (linear vs. nonlinear). If the selected features or intervention effects vary under different alignment choices, the reported mediation and causality results could be artifacts rather than intrinsic differences.
minor comments (2)
- Expand the methods section with full details on feature extraction, error analysis, and explicit checks for confounds in feature selection to make the evidence more conclusive.
- Clarify how the task-agnostic property of the identified features is quantified across the diverse tasks (one candidate metric is sketched below).
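As a concrete, hypothetical example of such a quantification, one could report the Jaccard overlap of the top-k features recovered independently on each task; the task names, feature counts, and scores below are random placeholders.

```python
# Quantifying "task-agnosticism" as cross-task overlap of top-k features.
# Feature importance scores here are random stand-ins.
import itertools
import numpy as np

rng = np.random.default_rng(0)
tasks = ["math", "code", "qa", "logic"]
n_features, k = 2048, 50

# Hypothetical per-task feature importance scores -> top-k feature sets.
top_k = {
    t: set(np.argsort(rng.normal(size=n_features))[-k:].tolist())
    for t in tasks
}

for a, b in itertools.combinations(tasks, 2):
    jaccard = len(top_k[a] & top_k[b]) / len(top_k[a] | top_k[b])
    print(f"{a} vs {b}: Jaccard = {jaccard:.2f}")
```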
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. The concern about the robustness of the activation alignment procedure is substantive, and we address it directly below while committing to revisions that will strengthen the evidence.
Point-by-point responses
- Referee: [interpretability framework / activation alignment] The shared feature space alignment procedure (described in the interpretability framework section) is load-bearing for the identification of the compact task-agnostic feature set and all subsequent causal claims. The manuscript does not report sensitivity analyses to the alignment objective, regularization, choice of backbone, or alternative maps (linear vs. nonlinear). If the selected features or intervention effects vary under different alignment choices, the reported mediation and causality results could be artifacts rather than intrinsic differences.
Authors: We agree that the alignment procedure is central to identifying the task-agnostic features and to the subsequent causal interventions, and that the lack of reported sensitivity analyses is a genuine limitation of the current manuscript. The linear alignment was selected for its interpretability and empirical stability in preserving cross-model correspondences on our controlled setup, but this choice requires explicit validation. In the revised version we will add a dedicated sensitivity subsection that systematically varies the alignment objective (including cosine-based and contrastive losses), regularization strength, backbone model, and alignment map (linear versus nonlinear MLPs or kernel methods). For each variant we will quantify overlap in the recovered feature set, consistency of mediation on RL-success/base-failure samples, and stability of the intervention effects. These results will be reported alongside the original findings so that readers can evaluate robustness directly.
Revision: yes
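One slice of the promised sensitivity analysis might look like the following sketch: refit a ridge-regularized linear alignment under several regularization strengths and report the overlap of the recovered top-shift feature sets against a reference setting. All data and the selection rule are placeholders.

```python
# Sketch of a sensitivity check: does the recovered feature set stay
# stable as the alignment's regularization strength varies?
import numpy as np

rng = np.random.default_rng(1)
acts_rl = rng.normal(size=(2048, 256))     # placeholder activations
acts_base = rng.normal(size=(2048, 256))

def top_features(lam, k=30):
    """Fit a ridge-regularized linear alignment, return top-shift features."""
    d = acts_rl.shape[1]
    W = np.linalg.solve(acts_rl.T @ acts_rl + lam * np.eye(d),
                        acts_rl.T @ acts_base)
    shift = np.abs(acts_rl @ W - acts_base).mean(axis=0)
    return set(np.argsort(shift)[-k:].tolist())

reference = top_features(1e-3)
for lam in (1e-4, 1e-2, 1e-1, 1.0):
    overlap = len(reference & top_features(lam)) / len(reference)
    print(f"lambda={lam:g}: overlap with reference set = {overlap:.2f}")
```

The same comparison would be repeated across alignment objectives, backbones, and linear versus nonlinear maps, as the rebuttal commits to.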
Circularity Check
No significant circularity in empirical mechanistic chain
Full rationale
The paper's derivation proceeds via controlled training of RL and SFT models from the same base on identical data, activation alignment into a shared space, selection of features that differ on RL-success/base-failure samples, and direct causal interventions (disabling or amplifying features). By the paper's own descriptions and equations, none of these steps reduces to self-defined quantities, fitted inputs renamed as predictions, or load-bearing self-citations; the interventions supply independent, falsifiable evidence outside the identification procedure itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Internal activations of different models can be aligned in a shared feature space that preserves mechanistic meaning.