STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Pith reviewed 2026-05-15 21:38 UTC · model grok-4.3
The pith
Silencing gradients from a tiny fraction of spurious tokens stabilizes RL fine-tuning of LLMs and raises math reasoning performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a small set of spurious tokens inherits the full outcome reward, producing outsized gradient updates that destabilize the policy and degrade reasoning quality. The authors define a unified evaluation of token-level effects across spurious risk, gradient norm, and entropy change, then propose the S2T mechanism to suppress gradients from these tokens inside a group-relative objective. The resulting STAPO algorithm produces stable entropy trajectories and consistent accuracy gains on mathematical reasoning tasks for Qwen models of three sizes.
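The reward-inheritance step can be made concrete with a minimal sketch of how a group-relative objective assigns credit. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function names and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one sampled group (GRPO-style, assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, lengths):
    """Every token of response i inherits the same sequence-level advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]

# Four sampled responses to one prompt: two judged correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 2, 4, 1]  # token counts per response (made-up values)
token_advantages = broadcast_to_tokens(group_relative_advantages(rewards), lengths)
```

Because the advantage is constant along a response, a token that played no role in reaching the answer is credited exactly as much as the decisive reasoning steps, which is the situation the core claim describes.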
What carries the argument
The Silencing Spurious Tokens (S2T) mechanism, which identifies low-contribution tokens and suppresses their gradient contributions within the group-based policy update.
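A rough sketch of what gradient silencing inside a token-level surrogate loss might look like is given below. The identification rule is not specified in this summary, so the sketch takes a caller-supplied boolean mask (`silence_mask`, a hypothetical name) and only shows the suppression step; it omits the importance ratios and clipping a real group-relative objective would include.

```python
def masked_policy_loss(logprobs, advantages, silence_mask):
    """Token-mean policy-gradient surrogate that skips silenced tokens.

    All three arguments are nested lists indexed as [response][token].
    silence_mask[i][t] == True means token t of response i is left out of
    the loss, so it contributes no gradient to the policy update.
    """
    total, kept = 0.0, 0
    for lp_row, adv_row, mask_row in zip(logprobs, advantages, silence_mask):
        for lp, adv, silenced in zip(lp_row, adv_row, mask_row):
            if silenced:
                continue  # silenced token: no term in the loss
            total += -adv * lp
            kept += 1
    # Dividing by the number of kept tokens is an assumption of this sketch.
    return total / max(kept, 1)
```

Whether the normalizer should count silenced tokens is a design choice the summary does not settle; dividing by kept tokens keeps the loss scale unchanged when very few tokens are masked.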
If this is right
- Late-stage performance collapse in RL fine-tuning of reasoning models can be prevented by token-level gradient editing rather than global entropy regularization.
- The same S2T logic can be added to other group-relative objectives without changing their sampling or reward structure.
- Entropy remains controlled across training without extra regularization terms once spurious gradient contributions are removed.
- Accuracy gains appear consistently across 1.7B to 14B model scales on math benchmarks under both full and top-p sampling.
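The entropy trajectory these points refer to is usually tracked as the mean Shannon entropy of the policy's next-token distribution. A minimal sketch, assuming per-position logits are available as plain Python lists (the function names are illustrative, not from the paper):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def mean_entropy(batch_logits):
    """Average entropy over all token positions in a batch."""
    entropies = [token_entropy(z) for z in batch_logits]
    return sum(entropies) / len(entropies)
```

Logging `mean_entropy` once per training step gives the curve on which a collapse toward zero or a sudden spike would be visible.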
Where Pith is reading between the lines
- The approach could transfer to non-math RL tasks such as code generation where similar low-value tokens might receive oversized credit.
- Detecting spurious tokens automatically rather than by fixed frequency thresholds would make the method easier to apply to new domains.
- If spurious tokens also appear in preference data, the same silencing step might reduce reward-model exploitation in standard RLHF.
Load-bearing premise
That the identified spurious tokens are the dominant source of instability and that zeroing their gradients removes noise without discarding useful reasoning information or creating new biases.
What would settle it
Run identical STAPO training on the same Qwen models but disable S2T gradient suppression; if entropy still stays flat and accuracy matches the reported gains, the causal role of spurious tokens would be falsified.
Original abstract
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a small fraction (~0.01%) of tokens, termed spurious tokens, causes instability in RL fine-tuning of LLMs because these tokens receive amplified gradients from sequence-level rewards. The authors introduce a unified framework that identifies these tokens by spurious risk, gradient norm, and entropy change, and propose the S2T mechanism to silence their gradients. S2T is incorporated into STAPO, a group-based policy optimization method, which is reported to show superior entropy stability and average performance gains of 11.49% (ρ_T=1.0, top-p=1.0) and 3.73% (ρ_T=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL on six math reasoning benchmarks with Qwen 1.7B, 8B, and 14B models.
Significance. If the results hold and the improvements are specifically due to silencing the identified spurious tokens rather than generic regularization, the work could provide a targeted approach to stabilizing RL training for LLMs, reducing reliance on heuristic entropy methods and improving reliability for scaling reasoning in large models. The cross-model-size empirical results would be a strength if the attribution is validated.
major comments (3)
- Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
- Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
- S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
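The control requested in the second major comment can be sketched as follows: given the mask produced by the paper's identification rule, draw a random mask with the same shape and the same number of silenced positions. The name `random_mask_like` and the nested-list mask format are assumptions made for illustration.

```python
import random

def random_mask_like(spurious_mask, seed=0):
    """Random control mask matching the size of an identified-token mask.

    spurious_mask is a nested list of booleans indexed as [response][token].
    The returned mask silences the same number of positions, chosen
    uniformly at random without replacement.
    """
    rng = random.Random(seed)
    positions = [(i, t) for i, row in enumerate(spurious_mask)
                 for t in range(len(row))]
    k = sum(sum(row) for row in spurious_mask)  # number of silenced tokens
    chosen = set(rng.sample(positions, k))
    return [[(i, t) in chosen for t in range(len(row))]
            for i, row in enumerate(spurious_mask)]
```

Training with this mask in place of the identified one, with every other hyperparameter fixed, isolates the effect of which tokens are silenced from the effect of how many.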
minor comments (2)
- Abstract: The phrase 'consistent gains' should be qualified with whether improvements hold on every benchmark or are driven by averages.
- Notation: The parameters ρ_T and top-p appear in the results tables but their precise definitions and selection process could be stated more explicitly in the main text for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas to strengthen the paper. We address each major comment below and will incorporate revisions to provide more rigorous empirical support.
- Referee: Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
  Authors: We fully agree that error bars, multiple runs, and statistical tests are essential to substantiate the performance claims. In the revised manuscript, we will rerun the experiments with at least 3 different random seeds, report mean and standard deviation for all metrics, and include p-values from statistical tests (such as Wilcoxon signed-rank test) to demonstrate the significance of the improvements over baselines.
  Revision: yes
- Referee: Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
  Authors: This is a valid concern for attributing the benefits specifically to our framework. We will add a new ablation experiment in the revised paper where we randomly select and silence an equivalent fraction (0.01%) of tokens without using our identification criteria, and compare the results to STAPO on both stability and benchmark performance. This control will help confirm that the targeted silencing of spurious tokens is key.
  Revision: yes
- Referee: S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
  Authors: We appreciate this point and will enhance the manuscript with additional verification. Specifically, we will include experiments showing the effect of silencing on individual reasoning steps, such as by comparing the correctness of generated solutions with and without the S2T mechanism in controlled settings, and analyze potential biases by examining the distribution of generated tokens or reward signals post-silencing. This will support that reasoning quality is preserved.
  Revision: yes
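The seed-level reporting promised in the first response can be sketched with the standard library alone. The block below computes mean and sample standard deviation across seeds and an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments. It assumes paired scores with no ties and no zero differences, and the accuracy values are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def exact_wilcoxon_signed_rank(x, y):
    """Exact two-sided p-value for paired samples (no ties, no zero diffs)."""
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the absolute differences from smallest (1) to largest (n).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null, every sign assignment is equally likely.
    count = 0
    for signs in product([0, 1], repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** len(diffs)

# Placeholder per-seed accuracies (illustrative only).
method_scores = [0.52, 0.55, 0.54]
baseline_scores = [0.48, 0.50, 0.47]
summary = (mean(method_scores), stdev(method_scores))
p_value = exact_wilcoxon_signed_rank(method_scores, baseline_scores)
```

With only three paired observations the smallest two-sided p-value this test can return is 2/8 = 0.25, even when every seed favors the same method. A test of this kind therefore needs more pairs to reach conventional thresholds, for example seeds crossed with benchmarks, which is an assumption about how the pairs would be formed.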
Circularity Check
No significant circularity detected in derivation chain
The paper motivates STAPO via an empirical analysis of token-level statistics (spurious risk, gradient norms, entropy changes) to flag ~0.01% spurious tokens, then defines a silencing mechanism inside a group-based policy objective. Performance gains are reported as experimental outcomes on held-out benchmarks rather than as quantities derived from fitted parameters that reduce to the identification rule by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the token-selection rule is not shown to be a direct function of the same reward signal used for the final policy update. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- ρ_T
- top-p
axioms (1)
- Domain assumption: A small fraction of tokens inherit the full sequence reward yet contribute negligibly to the final reasoning outcome.
invented entities (2)
- Spurious tokens: no independent evidence
- S2T mechanism: no independent evidence
Forward citations
Cited by 2 Pith papers
- Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
  DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
- When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
  Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.