Group Sequence Policy Optimization
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
GSPO optimizes LLM policies using sequence-level importance ratios and clipping instead of token-level operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization, where earlier methods such as GRPO operate token by token. According to the paper, this yields superior training efficiency and performance compared to the GRPO algorithm, stabilizes Mixture-of-Experts RL training, and has the potential to simplify the design of RL infrastructure.
What carries the argument
The sequence-level importance ratio, defined as the ratio of the current policy probability to the old policy probability over an entire sequence, which drives importance sampling, clipping, and gradient updates.
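To make the distinction concrete, here is a minimal Python sketch of both ratios computed from per-token log-probabilities. The function names, the toy log-probabilities, and the `length_normalize` flag are assumptions of this sketch; the text above does not say whether the ratio is normalized by sequence length, so the flag is off by default.

```python
import math

def token_level_ratios(logp_new, logp_old):
    """One ratio pi_theta / pi_theta_old per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old, length_normalize=False):
    """Ratio of whole-sequence likelihoods, computed in log space.

    length_normalize is a hypothetical option (not taken from the text):
    when True, the log-ratio is divided by the number of tokens.
    """
    log_ratio = sum(logp_new) - sum(logp_old)
    if length_normalize:
        log_ratio /= len(logp_new)
    return math.exp(log_ratio)

# Toy per-token log-probabilities for one three-token response (made up).
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -2.3, -0.5]

tok = token_level_ratios(logp_new, logp_old)   # three numbers, one per token
seq = sequence_level_ratio(logp_new, logp_old) # a single number for the response
print(tok, seq)
```

Without length normalization, the sequence-level ratio equals the product of the token-level ratios, so the two views carry the same information; what changes is where clipping and weighting are applied.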
If this is right
- GSPO achieves superior training efficiency and performance compared to the GRPO algorithm.
- GSPO notably stabilizes Mixture-of-Experts RL training.
- GSPO has the potential for simplifying the design of RL infrastructure.
- These changes contributed to the remarkable improvements observed in the latest Qwen3 models.
Where Pith is reading between the lines
- Sequence-level ratios could reduce the sensitivity of training to individual token sampling noise, allowing more consistent updates on long generations.
- The approach might let practitioners drop some token-level masking logic in existing RLHF codebases.
- Testing GSPO on non-MoE dense models would clarify whether the reported stability gains are tied specifically to expert routing dynamics.
Load-bearing premise
Shifting importance sampling, clipping, and optimization from the token level to the full sequence level will reliably improve stability and performance without creating new biases or optimization issues.
What would settle it
A controlled experiment on the same MoE model and tasks where GSPO produces equal or higher instability and lower final scores than GRPO under matched hyperparameters and compute budgets.
Original abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Group Sequence Policy Optimization (GSPO), an RL algorithm for LLMs that replaces token-level importance ratios (as in GRPO) with sequence-level likelihood ratios, followed by sequence-level clipping, reward aggregation, and optimization. It claims this yields superior training efficiency and performance, stabilizes RL for Mixture-of-Experts models, simplifies RL infrastructure, and contributed to the Qwen3 models.
Significance. If the empirical and theoretical claims hold, GSPO could meaningfully simplify and stabilize RLHF pipelines for large-scale LLMs, especially MoE architectures where token-level methods reportedly suffer instability. The infrastructure-simplification angle is practically attractive, but the significance remains provisional given the absence of detailed quantitative support, variance analysis, or ablations in the provided manuscript.
major comments (3)
- [Abstract and §1] The central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
- [§3, Algorithm] No derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.
- [§4, Experiments] The reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
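The effective-sample-size metric requested in the last comment could be computed along these lines. This is a sketch using the standard Kish formula, (Σw)² / Σw², applied to a batch of importance weights; the function names and the toy weights are illustrative and do not come from the paper.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0

def weight_variance(weights):
    """Population variance of the importance weights."""
    mean = sum(weights) / len(weights)
    return sum((w - mean) ** 2 for w in weights) / len(weights)

# Toy batches of importance weights (made up for illustration).
uniform = [1.0, 1.0, 1.0, 1.0]  # every sample contributes equally
skewed = [4.0, 0.0, 0.0, 0.0]   # a single sample dominates the batch

print(effective_sample_size(uniform), weight_variance(uniform))
print(effective_sample_size(skewed), weight_variance(skewed))
```

Reporting these two numbers for token-level and sequence-level weights on the same rollouts would give a direct view of the variance concern raised above.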
minor comments (2)
- [§3] Notation for the sequence likelihood ratio is introduced without an explicit equation number or comparison to the standard token-level ratio; adding a side-by-side definition would improve clarity.
- [§2] The manuscript cites GRPO but does not include a concise recap of its token-level clipping rule; a short comparison table would help readers follow the claimed differences.
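A side-by-side definition of the kind these comments ask for might look like the following. The symbols w, r, ε, and the response index i are notation chosen for this sketch, not the paper's equation numbering; the last equality follows from the autoregressive factorization of the sequence likelihood.

```latex
% Token-level ratio (one per token t of response y_i):
w_{i,t}(\theta) =
  \frac{\pi_\theta(y_{i,t} \mid x,\, y_{i,<t})}
       {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x,\, y_{i,<t})}

% Sequence-level ratio (one per response y_i):
r_i(\theta) =
  \frac{\pi_\theta(y_i \mid x)}
       {\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}
  = \prod_{t=1}^{|y_i|} w_{i,t}(\theta)

% Token-level clipping acts on each w_{i,t}; sequence-level clipping acts once on r_i:
\mathrm{clip}\bigl(w_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)
\quad \text{vs.} \quad
\mathrm{clip}\bigl(r_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)
```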
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our claims.
Point-by-point responses
- Referee: [Abstract and §1] The central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
  Authors: We agree that the abstract and §1 would benefit from explicit quantitative support. The current manuscript asserts the benefits based on the experiments in §4, but does not embed specific numbers, baselines, or ablations in the introductory sections. We will revise the abstract and §1 to include key reported metrics on training efficiency, performance gains, and MoE stability observations, with direct references to the corresponding results and any statistical details available in §4.
  Revision: yes
- Referee: [§3, Algorithm] No derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.
  Authors: The referee correctly notes the absence of a formal derivation or variance bound in §3. The sequence-level formulation is chosen to align the importance ratio with the sequence-level reward, avoiding token-level mismatch. We will expand §3 with a discussion of this motivation and its empirical implications for variance, but we do not currently have a rigorous proof or bound establishing unbiasedness or controlled variance relative to the token-level case.
  Revision: partial
- Referee: [§4, Experiments] The reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
  Authors: We acknowledge that the existing comparisons do not isolate the sequence- versus token-level treatment through controlled ablations, nor do they report gradient variance or effective sample size. We will add such ablations to §4 while holding other factors fixed, and include the requested metrics to provide quantitative support for the stability and efficiency claims.
  Revision: yes
- Unresolved objection: absence of a derivation or bound establishing that the sequence-level importance ratio is unbiased or exhibits controlled variance relative to the token-level ratio.
Circularity Check
No circularity: GSPO defined by explicit sequence-level change with no self-referential derivation or fitted prediction.
The paper introduces GSPO by directly defining the importance ratio on full sequence likelihood (rather than token-level) and applying sequence-level clipping/optimization. No equations, derivations, or parameter fits are shown that reduce the claimed advantages back to the inputs by construction. The comparison to GRPO is presented as an empirical demonstration rather than a mathematical necessity derived from prior self-work. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The algorithm is self-contained as a straightforward redefinition of the policy gradient components at sequence granularity.
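As an illustration of what "sequence-level clipping/optimization" could look like in code, here is a minimal sketch of a PPO-style clipped surrogate applied once per response. The group-normalized advantage, the clip range `eps=0.2`, and all function names are assumptions of this sketch rather than details confirmed by the text above.

```python
import math

def group_advantages(rewards):
    """Assumed group-relative advantage: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_sequence_objective(seq_ratios, advantages, eps=0.2):
    """Mean over responses of min(s * A, clip(s, 1 - eps, 1 + eps) * A)."""
    terms = []
    for s, a in zip(seq_ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)

# Toy group of two responses (made-up ratios and rewards).
ratios = [1.5, 0.5]
advs = group_advantages([1.0, 0.0])
print(clipped_sequence_objective(ratios, advs))
```

Each response contributes a single clipped term, so a response whose ratio drifts outside the clip range stops contributing gradient as a whole rather than token by token.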
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LedgerForcing, conservation_from_balance (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the sequence-level importance weight πθ(yi|x)/πθold(yi|x) has a clear theoretical meaning"
Forward citations
Cited by 60 Pith papers
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
-
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
-
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
-
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
-
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
-
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
-
Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
Beam-search negatives induce partial AUC optimization in GRPO for LLM recommenders; Windowed Partial AUC and TAWin improve Top-K alignment on four datasets.
-
ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
ReCast repairs all-zero groups and uses contrastive updates on strongest positives and hardest negatives to improve RL in generative recommendation, yielding up to 36.6% better Pass@1 with only 4.1% of baseline rollou...
-
Near-Future Policy Optimization
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
-
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
-
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch
QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while...
-
Motion-o: Trajectory-Grounded Video Reasoning
Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
-
Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
EvoMQL uses iterative Draft-Refine-Optimize cycles with execution feedback to reach 76.6% accuracy on EAI and 83.1% on TEND benchmarks for natural language to MongoDB query generation.
-
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
-
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
-
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
-
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
-
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
OPPO computes token-level advantages via Bayesian recursion on oracle signals, recovering distillation methods as a special case and improving over GRPO on math and code benchmarks.
-
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...
-
Towards Context-Invariant Safety Alignment for Large Language Models
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
-
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
-
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math ...
-
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
-
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
-
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
-
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
-
Self-Supervised On-Policy Distillation for Reasoning Language Models
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
-
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.
-
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
-
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agent...