Group Sequence Policy Optimization

An Yang; Bowen Yu; Chang Gao; Chujie Zheng; Jingren Zhou; Junyang Lin; Kai Dang; Mingze Li; Rui Men; Shixuan Liu

arxiv: 2507.18071 · v2 · submitted 2025-07-24 · 💻 cs.LG · cs.AI· cs.CL

Group Sequence Policy Optimization

Chujie Zheng , Shixuan Liu , Mingze Li , Xiong-Hui Chen , Bowen Yu , Chang Gao , Kai Dang , Yuqiong Liu

show 4 more authors

Rui Men An Yang Jingren Zhou Junyang Lin

This is my paper

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords reinforcement learninglarge language modelspolicy optimizationsequence levelimportance samplingmixture of expertsGSPOGRPO

0 comments

The pith

GSPO optimizes LLM policies using sequence-level importance ratios and clipping instead of token-level operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Group Sequence Policy Optimization as a reinforcement learning method for large language models that computes importance ratios from the likelihood of complete output sequences. It performs clipping, rewarding, and policy updates at the sequence level rather than breaking them down token by token. This produces better training efficiency and final performance than the prior GRPO approach, with particular gains in stabilizing training runs on Mixture-of-Experts architectures. The design also points toward simpler reinforcement learning pipelines that require less custom token handling.

Core claim

GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. Unlike token-level methods, this yields superior training efficiency and performance compared to the GRPO algorithm, stabilizes Mixture-of-Experts RL training, and has the potential for simplifying the design of RL infrastructure.

What carries the argument

The sequence-level importance ratio, defined as the ratio of the current policy probability to the old policy probability over an entire sequence, which drives importance sampling, clipping, and gradient updates.

If this is right

GSPO achieves superior training efficiency and performance compared to the GRPO algorithm.
GSPO notably stabilizes Mixture-of-Experts RL training.
GSPO has the potential for simplifying the design of RL infrastructure.
These changes contributed to the remarkable improvements observed in the latest Qwen3 models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sequence-level ratios could reduce the sensitivity of training to individual token sampling noise, allowing more consistent updates on long generations.
The approach might let practitioners drop some token-level masking logic in existing RLHF codebases.
Testing GSPO on non-MoE dense models would clarify whether the reported stability gains are tied specifically to expert routing dynamics.

Load-bearing premise

Shifting importance sampling, clipping, and optimization from the token level to the full sequence level will reliably improve stability and performance without creating new biases or optimization issues.

What would settle it

A controlled experiment on the same MoE model and tasks where GSPO produces equal or higher instability and lower final scores than GRPO under matched hyperparameters and compute budgets.

read the original abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GSPO moves importance ratios and clipping to the full sequence level instead of per-token, claiming better stability for MoE RL training, but the abstract supplies no numbers or ablations to check whether the change actually delivers.

read the letter

GSPO is GRPO with the importance ratio, clipping, reward, and optimization all computed over the entire sequence likelihood rather than token by token. The authors report that this produces more efficient training, stronger final performance, and notably better stability when the policy is a Mixture-of-Experts model. They also state that the same change contributed to the Qwen3 release. That is the concrete technical move on the table. If the full experiments hold, the formulation could reduce the need for some of the per-token bookkeeping that current RL pipelines carry. The paper does a reasonable job of spelling out why token-level ratios can be fragile at scale and why treating the whole response as one unit might simplify gradient flow and infrastructure. The link to a deployed model family gives the claim some external grounding that pure theory papers lack. The main weakness is the missing evidence. The abstract asserts superiority and stabilization benefits without reporting any quantitative deltas, baseline tables, statistical tests, or ablations that isolate the sequence-level change from other implementation details. That leaves open the exact worry in the stress-test note: whether sequence-level ratios introduce higher variance or distort the gradient on partial trajectories. Until those results are shown, it is hard to know whether the method is a net improvement or simply trades one set of pathologies for another. This paper is aimed at engineers and researchers who run RL post-training on large LLMs and who care about MoE stability and training cost. A reader who wants a concrete alternative formulation to try can get value from the description even before the numbers are fully vetted. It deserves peer review because the idea is clearly stated, tied to real model work, and addresses a practical pain point, though the experimental section will need substantial strengthening before it can be evaluated properly.

Referee Report

3 major / 2 minor

Summary. The paper introduces Group Sequence Policy Optimization (GSPO), an RL algorithm for LLMs that replaces token-level importance ratios (as in GRPO) with sequence-level likelihood ratios, followed by sequence-level clipping, reward aggregation, and optimization. It claims this yields superior training efficiency and performance, stabilizes RL for Mixture-of-Experts models, simplifies RL infrastructure, and contributed to the Qwen3 models.

Significance. If the empirical and theoretical claims hold, GSPO could meaningfully simplify and stabilize RLHF pipelines for large-scale LLMs, especially MoE architectures where token-level methods reportedly suffer instability. The infrastructure-simplification angle is practically attractive, but the significance remains provisional given the absence of detailed quantitative support, variance analysis, or ablations in the provided manuscript.

major comments (3)

[Abstract and §1] Abstract and §1: the central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
[§3] §3 (Algorithm): no derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.
[§4] §4 (Experiments): the reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.

minor comments (2)

[§3] Notation for the sequence likelihood ratio is introduced without an explicit equation number or comparison to the standard token-level ratio; adding a side-by-side definition would improve clarity.
[§2] The manuscript cites GRPO but does not include a concise recap of its token-level clipping rule; a short comparison table would help readers follow the claimed differences.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our claims.

read point-by-point responses

Referee: [Abstract and §1] Abstract and §1: the central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.

Authors: We agree that the abstract and §1 would benefit from explicit quantitative support. The current manuscript asserts the benefits based on the experiments in §4, but does not embed specific numbers, baselines, or ablations in the introductory sections. We will revise the abstract and §1 to include key reported metrics on training efficiency, performance gains, and MoE stability observations, with direct references to the corresponding results and any statistical details available in §4. revision: yes
Referee: [§3] §3 (Algorithm): no derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.

Authors: The referee correctly notes the absence of a formal derivation or variance bound in §3. The sequence-level formulation is chosen to align the importance ratio with the sequence-level reward, avoiding token-level mismatch. We will expand §3 with a discussion of this motivation and its empirical implications for variance, but we do not currently have a rigorous proof or bound establishing unbiasedness or controlled variance relative to the token-level case. revision: partial
Referee: [§4] §4 (Experiments): the reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.

Authors: We acknowledge that the existing comparisons do not isolate the sequence- versus token-level treatment through controlled ablations, nor do they report gradient variance or effective sample size. We will add such ablations to §4 while holding other factors fixed, and include the requested metrics to provide quantitative support for the stability and efficiency claims. revision: yes

standing simulated objections not resolved

Absence of a derivation or bound establishing that the sequence-level importance ratio is unbiased or exhibits controlled variance relative to the token-level ratio.

Circularity Check

0 steps flagged

No circularity: GSPO defined by explicit sequence-level change with no self-referential derivation or fitted prediction.

full rationale

The paper introduces GSPO by directly defining the importance ratio on full sequence likelihood (rather than token-level) and applying sequence-level clipping/optimization. No equations, derivations, or parameter fits are shown that reduce the claimed advantages back to the inputs by construction. The comparison to GRPO is presented as an empirical demonstration rather than a mathematical necessity derived from prior self-work. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The algorithm is self-contained as a straightforward redefinition of the policy gradient components at sequence granularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, so no free parameters, axioms, or invented entities can be identified. The central claim rests entirely on an empirical comparison whose details are not provided.

pith-pipeline@v0.9.0 · 5422 in / 1050 out tokens · 44587 ms · 2026-05-10T19:17:24.969774+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the sequence-level importance weight πθ(yi|x)/πθold(yi|x) has a clear theoretical meaning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
cs.AI 2026-05 unverdicted novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG 2026-05 unverdicted novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...
Learning from Language Feedback via Variational Policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
cs.LG 2026-05 conditional novelty 7.0

ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG 2026-05 unverdicted novelty 7.0

Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Relative Score Policy Optimization for Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
cs.LG 2026-05 unverdicted novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
cs.LG 2026-05 unverdicted novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
cs.CV 2026-05 unverdicted novelty 7.0

VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
cs.IR 2026-04 unverdicted novelty 7.0

Beam-search negatives induce partial AUC optimization in GRPO for LLM recommenders; Windowed Partial AUC and TAWin improve Top-K alignment on four datasets.
ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
cs.LG 2026-04 unverdicted novelty 7.0

ReCast repairs all-zero groups and uses contrastive updates on strongest positives and hardest negatives to improve RL in generative recommendation, yielding up to 36.6% better Pass@1 with only 4.1% of baseline rollou...
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
cs.LG 2026-04 unverdicted novelty 7.0

EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
S-GRPO: Unified Post-Training for Large Vision-Language Models
cs.LG 2026-04 unverdicted novelty 7.0

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
cs.IR 2026-04 unverdicted novelty 7.0

A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch
cs.LG 2026-04 unverdicted novelty 7.0

QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while...
Motion-o: Trajectory-Grounded Video Reasoning
cs.CV 2026-03 conditional novelty 7.0

Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
cs.DB 2026-03 unverdicted novelty 7.0

EvoMQL uses iterative Draft-Refine-Optimize cycles with execution feedback to reach 76.6% accuracy on EAI and 83.1% on TEND benchmarks for natural language to MongoDB query generation.
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
cs.LG 2026-03 unverdicted novelty 7.0

ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
cs.LG 2025-11 unverdicted novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI 2026-05 unverdicted novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO computes token-level advantages via Bayesian recursion on oracle signals, recovering distillation methods as a special case and improving over GRPO on math and code benchmarks.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...
Towards Context-Invariant Safety Alignment for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math ...
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
cs.LG 2026-05 unverdicted novelty 6.0

Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
cs.AI 2026-05 unverdicted novelty 6.0

SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
cs.LG 2026-05 conditional novelty 6.0

Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL 2026-05 unverdicted novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
cs.CV 2026-05 unverdicted novelty 6.0

Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agent...

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · cited by 170 Pith papers · 5 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review arXiv 2025
[3]

Learning to reason with LLMs , 2024

OpenAI . Learning to reason with LLMs , 2024. URL https://openai.com/index/learning-to-reason-with-llms/

work page 2024
[4]

Qwen3 Technical Report

Team Qwen. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Qwq-32b: Embracing the power of reinforcement learning, March 2025 b

Team Qwen. Qwq-32b: Embracing the power of reinforcement learning, March 2025 b . URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[6]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Click: Controllable text generation with sequence likelihood contrastive learning

Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable text generation with sequence likelihood contrastive learning. In Findings of the Association for Computational Linguistics: ACL 2023, 2023. URL https://aclanthology.org/2023.findings-acl.65/

work page 2023

[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review arXiv 2025

[3] [3]

Learning to reason with LLMs , 2024

OpenAI . Learning to reason with LLMs , 2024. URL https://openai.com/index/learning-to-reason-with-llms/

work page 2024

[4] [4]

Qwen3 Technical Report

Team Qwen. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Qwq-32b: Embracing the power of reinforcement learning, March 2025 b

Team Qwen. Qwq-32b: Embracing the power of reinforcement learning, March 2025 b . URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025

[6] [6]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Click: Controllable text generation with sequence likelihood contrastive learning

Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable text generation with sequence likelihood contrastive learning. In Findings of the Association for Computational Linguistics: ACL 2023, 2023. URL https://aclanthology.org/2023.findings-acl.65/

work page 2023