Group Sequence Policy Optimization
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
GSPO optimizes LLM policies using sequence-level importance ratios and clipping instead of token-level operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. The paper claims this yields superior training efficiency and performance relative to the token-level GRPO algorithm, stabilizes Mixture-of-Experts RL training, and has the potential to simplify the design of RL infrastructure.
What carries the argument
The sequence-level importance ratio: the ratio of the current policy's probability to the old policy's probability over an entire sequence. This single quantity drives importance sampling, clipping, and gradient updates.
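For concreteness, this can be written out; the length-normalizing exponent and the clipping form below follow the published GSPO paper rather than anything stated in this summary, so read it as a hedged reconstruction:

```latex
% Sequence-level importance ratio, length-normalized over response y_i
% (the 1/|y_i| exponent is from the arXiv paper, not this summary)
s_i(\theta)
  = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)

% Sequence-level clipped objective over a group of G sampled responses
J_{\mathrm{GSPO}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
      \min\left( s_i(\theta)\, \hat{A}_i,\;
                 \operatorname{clip}\bigl(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\, \hat{A}_i \right) \right]
```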
If this is right
- GSPO achieves superior training efficiency and performance compared to the GRPO algorithm.
- GSPO notably stabilizes Mixture-of-Experts RL training.
- GSPO has the potential to simplify the design of RL infrastructure.
- These changes contributed to the remarkable improvements observed in the latest Qwen3 models.
Where Pith is reading between the lines
- Sequence-level ratios could reduce the sensitivity of training to individual token sampling noise, allowing more consistent updates on long generations.
- The approach might let practitioners drop some token-level masking logic in existing RLHF codebases.
- Testing GSPO on non-MoE dense models would clarify whether the reported stability gains are tied specifically to expert routing dynamics.
Load-bearing premise
Shifting importance sampling, clipping, and optimization from the token level to the full sequence level will reliably improve stability and performance without creating new biases or optimization issues.
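A minimal sketch of the two loss shapes the premise contrasts, assuming fixed-length, unmasked responses and a GRPO-style sequence-level advantage broadcast to tokens; tensor names and the clipping range are illustrative, not the authors' implementation:

```python
import torch

def token_level_loss(logp_new, logp_old, adv, eps=0.2):
    """GRPO/PPO-style token-level loss: one ratio and one clip per token.

    logp_new, logp_old: [batch, seq_len] per-token log-probs under the
    current and behavior policies; adv: [batch] sequence-level advantage
    broadcast to every token, as in GRPO."""
    ratio = (logp_new - logp_old).exp()          # [batch, seq_len]
    adv = adv.unsqueeze(-1)                      # broadcast over tokens
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def sequence_level_loss(logp_new, logp_old, adv, eps=0.2):
    """GSPO-style sequence-level loss: one length-normalized ratio and
    one clip per sequence."""
    seq_len = logp_new.size(-1)
    # Length-normalized sequence likelihood ratio, one scalar per response.
    log_ratio = (logp_new - logp_old).sum(-1) / seq_len   # [batch]
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

The only structural change is where the ratio and the clip live, per token in the first function and per sequence in the second; that is exactly the shift the premise claims is safe.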
What would settle it
A controlled experiment on the same MoE model and tasks, under matched hyperparameters and compute budgets, varying only the sequence- versus token-level treatment: if GSPO shows equal or higher instability and lower final scores than GRPO, the core claim fails.
Original abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Group Sequence Policy Optimization (GSPO), an RL algorithm for LLMs that replaces token-level importance ratios (as in GRPO) with sequence-level likelihood ratios, followed by sequence-level clipping, reward aggregation, and optimization. It claims this yields superior training efficiency and performance, stabilizes RL for Mixture-of-Experts models, simplifies RL infrastructure, and contributed to the Qwen3 models.
Significance. If the empirical and theoretical claims hold, GSPO could meaningfully simplify and stabilize RLHF pipelines for large-scale LLMs, especially MoE architectures where token-level methods reportedly suffer instability. The infrastructure-simplification angle is practically attractive, but the significance remains provisional given the absence of detailed quantitative support, variance analysis, or ablations in the provided manuscript.
Major comments (3)
- [Abstract and §1] The central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
- [§3 (Algorithm)] No derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags. (A toy illustration of this variance concern follows the list.)
- [§4 (Experiments)] The reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
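On the variance point, a toy Monte Carlo (synthetic log-ratios, not real policy outputs) illustrating the pathology the second comment wants bounded: the raw product of near-unit token ratios has variance that grows with sequence length, while a length-normalized geometric-mean ratio of the kind GSPO uses stays tightly concentrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def ratio_variances(seq_len, n_samples=20_000, sigma=0.03):
    """Toy model: per-token log-ratios are i.i.d. N(-sigma^2/2, sigma^2),
    so each token-level ratio has mean 1. The raw product over the sequence
    also has mean 1, but its variance grows with length; the geometric-mean
    (length-normalized) ratio does not."""
    log_ratios = rng.normal(-sigma**2 / 2, sigma, size=(n_samples, seq_len))
    raw = np.exp(log_ratios.sum(axis=1))          # product of token ratios
    normalized = np.exp(log_ratios.mean(axis=1))  # GSPO-style geometric mean
    return raw.var(), normalized.var()

for length in (10, 100, 1000):
    v_raw, v_norm = ratio_variances(length)
    print(f"len={length:5d}  raw-product var={v_raw:.4f}  normalized var={v_norm:.6f}")
```

This is only an illustration of why the referee's request for a variance analysis matters; it says nothing about the actual distributions induced by policy training.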
Minor comments (2)
- [§3] Notation for the sequence likelihood ratio is introduced without an explicit equation number or comparison to the standard token-level ratio; adding a side-by-side definition would improve clarity (a sketch follows this list).
- [§2] The manuscript cites GRPO but does not include a concise recap of its token-level clipping rule; a short comparison table would help readers follow the claimed differences.
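The side-by-side the first minor comment asks for can be sketched directly; the token-level form is the standard PPO/GRPO ratio, and the sequence-level form is the geometric mean of those token ratios per the published GSPO paper, not the manuscript's own numbering:

```latex
% Token-level ratio (PPO/GRPO): one ratio per token t of response y_i
w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                       {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}

% Sequence-level ratio (GSPO): the length-normalized product, i.e. the
% geometric mean of the token-level ratios above
s_i(\theta) = \left( \prod_{t=1}^{|y_i|} w_{i,t}(\theta) \right)^{1/|y_i|}
```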
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our claims.
Point-by-point responses
- Referee: [Abstract and §1] the central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
  Authors: We agree that the abstract and §1 would benefit from explicit quantitative support. The current manuscript asserts the benefits based on the experiments in §4, but does not embed specific numbers, baselines, or ablations in the introductory sections. We will revise the abstract and §1 to include key reported metrics on training efficiency, performance gains, and MoE stability observations, with direct references to the corresponding results and any statistical details available in §4. (Revision: yes)
- Referee: [§3 (Algorithm)] no derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.
  Authors: The referee correctly notes the absence of a formal derivation or variance bound in §3. The sequence-level formulation is chosen to align the importance ratio with the sequence-level reward, avoiding token-level mismatch. We will expand §3 with a discussion of this motivation and its empirical implications for variance, but we do not currently have a rigorous proof or bound establishing unbiasedness or controlled variance relative to the token-level case. (Revision: partial)
- Referee: [§4 (Experiments)] the reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
  Authors: We acknowledge that the existing comparisons do not isolate the sequence- versus token-level treatment through controlled ablations, nor do they report gradient variance or effective sample size. We will add such ablations to §4 while holding other factors fixed, and include the requested metrics to provide quantitative support for the stability and efficiency claims. (Revision: yes)
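If the revision does report effective sample size, one standard normalized definition is sketched below; the function name and the log-space stabilization are illustrative assumptions, not taken from the manuscript:

```python
import numpy as np

def effective_sample_size(log_weights):
    """Normalized ESS = (sum w)^2 / (n * sum w^2) of importance weights,
    computed stably in log space. Values near 1 mean the batch is close to
    on-policy; values near 0 mean a few sequences dominate the update."""
    lw = np.asarray(log_weights, dtype=np.float64)
    w = np.exp(lw - lw.max())        # rescaling weights leaves ESS unchanged
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

# Per-sequence log importance ratios from a rollout batch (made up):
print(effective_sample_size([0.01, -0.02, 0.03, -0.01]))  # ~1.0, near on-policy
print(effective_sample_size([2.0, -3.0, -2.5, -3.5]))     # ~0.26, one weight dominates
```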
- Remaining open issue after rebuttal: no derivation or bound establishing that the sequence-level importance ratio is unbiased or exhibits controlled variance relative to the token-level ratio.
Circularity Check
No circularity: GSPO is defined by an explicit sequence-level change, with no self-referential derivation or fitted prediction.
Full rationale
The paper introduces GSPO by directly defining the importance ratio on full sequence likelihood (rather than token-level) and applying sequence-level clipping/optimization. No equations, derivations, or parameter fits are shown that reduce the claimed advantages back to the inputs by construction. The comparison to GRPO is presented as an empirical demonstration rather than a mathematical necessity derived from prior self-work. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The algorithm is self-contained as a straightforward redefinition of the policy gradient components at sequence granularity.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LedgerForcing.conservation_from_balance (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Linked passage: "the sequence-level importance weight π_θ(y_i|x) / π_θ_old(y_i|x) has a clear theoretical meaning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
  Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
  ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
  Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
- Relative Score Policy Optimization for Diffusion Language Models
  RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
- Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
  Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
- Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
  RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
  CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
  BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
  The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
- Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
  SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
- VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
  VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
- Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
  Beam-search negatives induce partial AUC optimization in GRPO for LLM recommenders; Windowed Partial AUC and TAWin improve Top-K alignment on four datasets.
- ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
  ReCast repairs all-zero groups and uses contrastive updates on strongest positives and hardest negatives to improve RL in generative recommendation, yielding up to 36.6% better Pass@1 with only 4.1% of baseline rollou...
- Near-Future Policy Optimization
  NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
- Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
  EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
- S-GRPO: Unified Post-Training for Large Vision-Language Models
  S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
- AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
  A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
- Skill-Conditioned Visual Geolocation for Vision-Language Models
  GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
- Skill-Conditioned Visual Geolocation for Vision-Language Models
  GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
- QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training-Inference Mismatch
  QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while...
- Motion-o: Trajectory-Grounded Video Reasoning
  Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
- Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
  EvoMQL uses iterative Draft-Refine-Optimize cycles with execution feedback to reach 76.6% accuracy on EAI and 83.1% on TEND benchmarks for natural language to MongoDB query generation.
- Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
  ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
  A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
- Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
  Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
- Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
  A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.
- Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
  Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
- Hölder Policy Optimisation
  HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
  Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
- Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
  Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
- dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
  dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
- AIPO: Learning to Reason from Active Interaction
  AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
  Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
  A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
  S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
  DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
  DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
  A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
  A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
Reference graph
Works this paper leans on
[1] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.
[2] MiniMax. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585, 2025.
[3] OpenAI. Learning to Reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/
[4] Team Qwen. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025a.
[5] Team Qwen. QwQ-32B: Embracing the Power of Reinforcement Learning, March 2025b. URL https://qwenlm.github.io/blog/qwq-32b/
[6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
[7] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300, 2024.
[8] Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning. In Findings of the Association for Computational Linguistics: ACL 2023, 2023. URL https://aclanthology.org/2023.findings-acl.65/