The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage while penalizing failed ones. In contrast, we study critic-free RL from a token-level perspective, revealing the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL training. To explain this phenomenon, we further show that a token's change in probability is not fully determined by its own advantage; coupled gradient interactions with other tokens also play a non-negligible role. Specifically, these token coupling effects occur primarily between identical tokens that are both predicted with low confidence. Building upon this analysis, we propose the cancellation hypothesis: as a result of coupling, opposing signals cancel out for tokens shared by positive and negative rollouts, while tokens more specific to successful rollouts receive stronger reinforcement, thereby inducing hidden token-level credit assignment from rollout-level rewards. We support this hypothesis with complementary empirical evidence. (1) Compared with training on only positive rollouts, critic-free RL shifts updates from template and formatting tokens toward reasoning tokens; (2) Tokens boosted by critic-free RL consistently demonstrate higher value than suppressed tokens, regardless of whether they originate from positive or negative rollouts. Guided by this view, we implement two batching interventions to encourage or preserve cancellation in critic-free RL training: query-preserved mini-batching and reward-balanced batching. Despite their simplicity, these interventions improve RLVR training across multiple model scales, supporting cancellation as both an explanatory principle and a practical design criterion for critic-free RL training.
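The cancellation idea in the abstract can be illustrated with a toy, context-free sketch. This is not the paper's method or experimental setup: it assumes a GRPO-style group advantage (mean-centered, std-normalized rewards) and simply sums each rollout's advantage over the tokens it contains, ignoring context, token confidence, and the shared-parameter coupling the paper analyzes. The names `group_advantages` and `net_token_credit` and the example rollouts are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Mean-centered, std-normalized advantages within one rollout group.

    Assumption: a GRPO-style estimator; the abstract does not name one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def net_token_credit(rollouts, rewards):
    """Add each rollout's advantage to every token occurrence it contains.

    Toy model: credit depends only on token identity, not on context.
    """
    advantages = group_advantages(rewards)
    credit = defaultdict(float)
    for tokens, adv in zip(rollouts, advantages):
        for tok in tokens:
            credit[tok] += adv
    return dict(credit)


# Hypothetical group of four rollouts for one query; reward 1.0 means correct.
rollouts = [
    ["<think>", "so", "x", "=", "4", "</think>"],  # correct
    ["<think>", "so", "x", "=", "5", "</think>"],  # incorrect
    ["<think>", "so", "x", "=", "4", "</think>"],  # correct
    ["<think>", "so", "x", "=", "7", "</think>"],  # incorrect
]
rewards = [1.0, 0.0, 1.0, 0.0]

credit = net_token_credit(rollouts, rewards)
for tok, value in sorted(credit.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>9s}  {value:+.3f}")
```

In this toy group, template tokens such as `<think>` appear equally in positive and negative rollouts, so their net credit is zero, while the answer token that appears only in successful rollouts keeps a positive net credit. If only the positive rollouts were used, every token, template tokens included, would receive positive credit.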

representative citing papers

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

Rethinking Groups in Critic-Free RLVR

cs.LG · 2026-06-15 · unverdicted · novelty 6.0

Negative token filtering enables single-rollout critic-free RL training by avoiding false penalties on negative samples, matching group-based methods on reasoning tasks and exceeding them on agentic tasks.

citing papers explorer

  • Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL · cs.LG · 2026-07-01 · unverdicted · none · ref 12 · internal anchor


  • Rethinking Groups in Critic-Free RLVR · cs.LG · 2026-06-15 · unverdicted · none · ref 24 · internal anchor
