STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Bo Zhang; Guojian Zhan; Jiang Wu; Jingliang Duan; Kehua Sheng; Keqiang Li; Letian Tao; Shengbo Eben Li; Shiqi Liu; Yang Guan

arxiv: 2602.15620 · v5 · pith:SGTFOMT5new · submitted 2026-02-17 · 💻 cs.CL · cs.AI

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

Pith reviewed 2026-05-15 21:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords reinforcement learning · large language models · spurious tokens · policy optimization · mathematical reasoning · entropy stability · gradient suppression · fine-tuning stability

The pith

Silencing gradients from a tiny fraction of spurious tokens stabilizes RL fine-tuning of LLMs and raises math reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that roughly 0.01 percent of tokens in LLM reasoning traces receive the full sequence-level reward despite contributing almost nothing to the final answer, which inflates their gradients and drives entropy spikes followed by performance collapse late in training. It introduces the Silencing Spurious Tokens (S2T) mechanism to zero out those gradients selectively and folds the change into a group-based policy objective called STAPO. Across Qwen 1.7B, 8B, and 14B models on six math benchmarks, the method keeps entropy stable and lifts average accuracy by 11.49 percent at ρ_T=1.0, top-p=1.0 and by 3.73 percent at ρ_T=0.7, top-p=0.9, relative to GRPO, 20-Entropy, and JustRL. A sympathetic reader cares because current RL recipes for reasoning models still rely on ad-hoc fixes that fail at scale, and a targeted token-level intervention could remove the need for them.

Core claim

The central claim is that a small set of spurious tokens inherits the full outcome reward, producing outsized gradient updates that destabilize the policy and degrade reasoning quality. The authors define a unified evaluation of token-level effects across spurious risk, gradient norm, and entropy change, then propose the S2T mechanism to suppress gradients from these tokens inside a group-relative objective. The resulting STAPO algorithm produces stable entropy trajectories and consistent accuracy gains on mathematical reasoning tasks for Qwen models of three sizes.
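Two standard identities help make the claim concrete. They are a sketch under stated assumptions (a GRPO-style outcome reward with clipping omitted, and a softmax policy over logits z); the paper's own definitions of spurious risk and its selection rule are not reproduced on this page.

```latex
% Group-relative advantage: one scalar per sampled response i in a group of size G,
% shared unchanged by every token t of that response.
\hat{A}_{i,t} \;=\; \hat{A}_i \;=\;
  \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Gradient of a sampled token's log-probability with respect to the logits z,
% where e_a is the one-hot vector for the sampled token a.
\nabla_{z} \log \pi(a) \;=\; e_a - \pi,
\qquad
\lVert e_a - \pi \rVert_2^2 \;=\; 1 - 2\,\pi(a) + \sum_{v} \pi(v)^2
```

Under these assumptions every token in a response is scaled by the same advantage, while the logit-gradient norm grows as the sampled token's probability π(a) shrinks. That is consistent with the idea that a few rare tokens can dominate an update, though it does not by itself show that those tokens are the ones the paper labels spurious.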

What carries the argument

The Silencing Spurious Tokens (S2T) mechanism, which identifies low-contribution tokens and suppresses their gradient contributions within the group-based policy update.
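A minimal sketch of how such a suppression step might sit inside a group-relative update, assuming a plain policy-gradient surrogate without clipping. The names `group_relative_advantages`, `silenced_policy_loss`, and `silence_mask` are hypothetical, and the mask is taken as a given input because the paper's identification rule is not stated on this page.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: one scalar advantage per sampled response."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def silenced_policy_loss(logprobs, rewards, silence_mask):
    """Token-mean surrogate loss with silenced tokens excluded.

    logprobs[i][t]     : log-probability of token t in response i
    rewards[i]         : scalar outcome reward for response i
    silence_mask[i][t] : True where the token should be dropped
                         (HYPOTHETICAL input; selection rule not shown here)
    """
    advantages = group_relative_advantages(rewards)
    total, kept = 0.0, 0
    for adv, lps, mask in zip(advantages, logprobs, silence_mask):
        for lp, silenced in zip(lps, mask):
            if silenced:
                continue  # token adds nothing to the loss, hence no gradient
            total += -adv * lp
            kept += 1
    # Averaging over kept tokens only is an assumption of this sketch.
    return total / max(kept, 1)
```

In an autodiff framework the same effect would usually come from multiplying the per-token loss by a 0/1 mask. Whether to normalize by the kept tokens or by all tokens is a design choice this sketch does not settle.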

If this is right

Late-stage performance collapse in RL fine-tuning of reasoning models can be prevented by token-level gradient editing rather than global entropy regularization.
The same S2T logic can be added to other group-relative objectives without changing their sampling or reward structure.
Entropy remains controlled across training without extra regularization terms once spurious gradient contributions are removed.
Accuracy gains appear consistently across 1.7B to 14B model scales on math benchmarks under both full and top-p sampling.
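The entropy claims above presuppose a per-token entropy that is tracked over training. A minimal sketch of that quantity, assuming access to raw logits at each generated position; the function names are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(batch_logits):
    """Average entropy over every generated position in a batch.

    batch_logits[i][t] is the logit vector at position t of response i.
    """
    ents = [token_entropy(l) for seq in batch_logits for l in seq]
    return sum(ents) / len(ents)
```

Logging `mean_policy_entropy` at each training step would give the kind of trajectory in which a late-stage spike or collapse becomes visible.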

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could transfer to non-math RL tasks such as code generation where similar low-value tokens might receive oversized credit.
Detecting spurious tokens automatically rather than by fixed frequency thresholds would make the method easier to apply to new domains.
If spurious tokens also appear in preference data, the same silencing step might reduce reward-model exploitation in standard RLHF.

Load-bearing premise

That the identified spurious tokens are the dominant source of instability and that zeroing their gradients removes noise without discarding useful reasoning information or creating new biases.

What would settle it

Run identical STAPO training on the same Qwen models but disable S2T gradient suppression; if entropy still stays flat and accuracy matches the reported gains, the causal role of spurious tokens would be falsified.

Original abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STAPO gives a concrete token-silencing trick that stabilizes RL training and lifts math scores, but the causal role of the specific spurious set still needs a random-mask control.

The core idea is straightforward: a tiny slice of tokens (about 0.01%) picks up outsized gradients during RL because it inherits the full sequence reward, and muting their updates inside a group objective keeps training from collapsing late. STAPO packages this into a new objective that combines the silencing step with the usual policy gradient. On Qwen 1.7B–14B models across six math benchmarks it reports more stable entropy curves and average gains of 11.49% and 3.73% over GRPO, 20-Entropy, and JustRL under two sampling regimes. That is the usable takeaway for anyone running large-scale reasoning RL right now.

The unified token-impact framework (spurious risk plus gradient norm plus entropy shift) is the part that feels fresh relative to the baselines they cite. It gives a reproducible way to flag the tokens instead of relying on hand-tuned entropy bonuses. The experiments cover multiple model sizes and report consistent directionality, which is better than many RL-for-LLM papers that only show one scale.

The soft spot is exactly the one the stress-test flags. Without an ablation that replaces the identified set with a random 0.01% mask while holding everything else fixed, it is still possible that the gains come from generic low-frequency suppression rather than from correctly locating the “spurious” tokens. The abstract does not describe that control, so the specificity claim rests on the identification rule alone. Minor gaps include missing error bars and limited detail on how token selection interacts with the reward model.

This paper is for groups already running RL fine-tuning on reasoning models and looking for a drop-in stabilizer. It is coherent on its own terms and shows clear engineering thinking, so it deserves a full referee rather than a desk reject. I would send it to review with a request for the random-mask ablation and a bit more on whether silencing changes downstream reasoning quality.

Referee Report

3 major / 2 minor

Summary. The paper claims that a small fraction (~0.01%) of spurious tokens causes instability in RL fine-tuning of LLMs by receiving amplified gradients from sequence-level rewards. The authors introduce a unified framework to identify these tokens based on spurious risk, gradient norms, and entropy changes, and propose the S2T mechanism to silence their gradients. This is incorporated into STAPO, a group-based policy optimization method, which shows superior entropy stability and performance gains of 11.49% (ρ_T=1.0, top-p=1.0) and 3.73% (ρ_T=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL on six math reasoning benchmarks with Qwen 1.7B, 8B, and 14B models.

Significance. If the results hold and the improvements are specifically due to silencing the identified spurious tokens rather than generic regularization, the work could provide a targeted approach to stabilizing RL training for LLMs, reducing reliance on heuristic entropy methods and improving reliability for scaling reasoning in large models. The cross-model-size empirical results would be a strength if the attribution is validated.

major comments (3)

Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
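The random-mask control requested in the second major comment can be sketched as follows. This assumes the identified tokens are available as a per-response boolean mask; `random_mask_like` is a hypothetical helper, not part of the paper.

```python
import random


def random_mask_like(identified_mask, seed=0):
    """Silence the same NUMBER of tokens as identified_mask, chosen uniformly.

    identified_mask[i][t] is True where the (hypothetical) identification
    rule flagged token t of response i. The returned mask has the same
    shape and the same count of True entries, but random positions.
    """
    rng = random.Random(seed)
    positions = [
        (i, t)
        for i, row in enumerate(identified_mask)
        for t in range(len(row))
    ]
    k = sum(1 for row in identified_mask for m in row if m)
    chosen = set(rng.sample(positions, k))
    return [
        [(i, t) in chosen for t in range(len(row))]
        for i, row in enumerate(identified_mask)
    ]
```

Training once with the identified mask and once with this control, all other settings fixed, would separate the effect of which tokens are silenced from the effect of silencing a similar fraction of tokens at all. A stricter control might also match the random tokens on probability or frequency, which this sketch does not attempt.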

minor comments (2)

Abstract: The word 'consistently' should be qualified by stating whether improvements hold on every benchmark or only on average.
Notation: The parameters ρ_T and top-p appear in the results tables but their precise definitions and selection process could be stated more explicitly in the main text for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the paper. We address each major comment below and will incorporate revisions to provide more rigorous empirical support.

Referee: Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.

Authors: We fully agree that error bars, multiple runs, and statistical tests are essential to substantiate the performance claims. In the revised manuscript, we will rerun the experiments with at least 3 different random seeds, report mean and standard deviation for all metrics, and include p-values from statistical tests (such as Wilcoxon signed-rank test) to demonstrate the significance of the improvements over baselines. revision: yes

Referee: Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.

Authors: This is a valid concern for attributing the benefits specifically to our framework. We will add a new ablation experiment in the revised paper where we randomly select and silence an equivalent fraction (0.01%) of tokens without using our identification criteria, and compare the results to STAPO on both stability and benchmark performance. This control will help confirm that the targeted silencing of spurious tokens is key. revision: yes

Referee: S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.

Authors: We appreciate this point and will enhance the manuscript with additional verification. Specifically, we will include experiments showing the effect of silencing on individual reasoning steps, such as by comparing the correctness of generated solutions with and without the S2T mechanism in controlled settings, and analyze potential biases by examining the distribution of generated tokens or reward signals post-silencing. This will support that reasoning quality is preserved. revision: yes
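The reporting promised in the first response (mean and standard deviation across seeds, plus a Wilcoxon signed-rank test) can be sketched with the standard library alone. This is an illustrative implementation, not the authors' analysis code: the exact test below enumerates all sign patterns, so it suits only small samples, and it assumes no zero differences and no tied absolute differences.

```python
import itertools
import statistics


def mean_std(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.

    Assumes no zero differences and no tied |differences| (sketch only).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null, each rank carries a + or - sign with equal probability.
    count = 0
    for signs in itertools.product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** n
```

One design point follows directly from the enumeration: with only three paired values the smallest achievable two-sided p-value is 2/8 = 0.25, so a test paired over three seeds alone cannot reach conventional significance. Pairing over more units, for example each benchmark and model combination, would be needed for the test to be informative.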

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper motivates STAPO via an empirical analysis of token-level statistics (spurious risk, gradient norms, entropy changes) to flag ~0.01% spurious tokens, then defines a silencing mechanism inside a group-based policy objective. Performance gains are reported as experimental outcomes on held-out benchmarks rather than as quantities derived from fitted parameters that reduce to the identification rule by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the token-selection rule is not shown to be a direct function of the same reward signal used for the final policy update. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that sequence-level rewards are the source of spurious-token amplification and on the empirical observation that 0.01% of tokens dominate gradient disruption.

free parameters (2)

ρ_T
Token silencing threshold used in the reported runs (values 0.7 and 1.0)
top-p
Sampling parameter varied in the two reported settings

axioms (1)

domain assumption: A small fraction of tokens inherit the full sequence reward yet contribute negligibly to the final reasoning outcome
Stated as the key factor behind instability

invented entities (2)

Spurious tokens (no independent evidence)
purpose: Explain source of gradient instability
Defined as ~0.01% of tokens with low contribution but high gradient impact
S2T mechanism (no independent evidence)
purpose: Suppress gradient perturbations from spurious tokens
New component introduced to implement silencing

pith-pipeline@v0.9.0 · 5613 in / 1391 out tokens · 43067 ms · 2026-05-15T21:38:10.962118+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
cs.LG 2026-05 unverdicted novelty 6.0

Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.