Group Sequence Policy Optimization
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
GSPO optimizes LLM policies using sequence-level importance ratios and clipping instead of token-level operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. The paper claims this yields superior training efficiency and performance relative to the token-level GRPO algorithm, stabilizes Mixture-of-Experts RL training, and has the potential to simplify the design of RL infrastructure.
What carries the argument
The sequence-level importance ratio: the ratio of the current policy's probability to the old policy's probability over an entire sequence. This single quantity drives importance sampling, clipping, and gradient updates.
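For concreteness, this can be written out; the length-normalizing exponent and the clipping form below follow the published GSPO paper rather than anything stated in this summary, so read it as a hedged reconstruction:

```latex
% Sequence-level importance ratio, length-normalized over response y_i
% (the 1/|y_i| exponent is from the arXiv paper, not this summary)
s_i(\theta)
  = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)

% Sequence-level clipped objective over a group of G sampled responses
J_{\mathrm{GSPO}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
      \min\left( s_i(\theta)\, \hat{A}_i,\;
                 \operatorname{clip}\bigl(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\, \hat{A}_i \right) \right]
```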
If this is right
- GSPO achieves superior training efficiency and performance compared to the GRPO algorithm.
- GSPO notably stabilizes Mixture-of-Experts RL training.
- GSPO has the potential to simplify the design of RL infrastructure.
- These changes contributed to the remarkable improvements observed in the latest Qwen3 models.
Where Pith is reading between the lines
- Sequence-level ratios could reduce the sensitivity of training to individual token sampling noise, allowing more consistent updates on long generations.
- The approach might let practitioners drop some token-level masking logic in existing RLHF codebases.
- Testing GSPO on non-MoE dense models would clarify whether the reported stability gains are tied specifically to expert routing dynamics.
Load-bearing premise
Shifting importance sampling, clipping, and optimization from the token level to the full sequence level will reliably improve stability and performance without creating new biases or optimization issues.
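A minimal sketch of the two loss shapes the premise contrasts, assuming fixed-length, unmasked responses and a GRPO-style sequence-level advantage broadcast to tokens; tensor names and the clipping range are illustrative, not the authors' implementation:

```python
import torch

def token_level_loss(logp_new, logp_old, adv, eps=0.2):
    """GRPO/PPO-style token-level loss: one ratio and one clip per token.

    logp_new, logp_old: [batch, seq_len] per-token log-probs under the
    current and behavior policies; adv: [batch] sequence-level advantage
    broadcast to every token, as in GRPO."""
    ratio = (logp_new - logp_old).exp()          # [batch, seq_len]
    adv = adv.unsqueeze(-1)                      # broadcast over tokens
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def sequence_level_loss(logp_new, logp_old, adv, eps=0.2):
    """GSPO-style sequence-level loss: one length-normalized ratio and
    one clip per sequence."""
    seq_len = logp_new.size(-1)
    # Length-normalized sequence likelihood ratio, one scalar per response.
    log_ratio = (logp_new - logp_old).sum(-1) / seq_len   # [batch]
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

The only structural change is where the ratio and the clip live, per token in the first function and per sequence in the second; that is exactly the shift the premise claims is safe.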
What would settle it
A controlled experiment on the same MoE model and tasks, under matched hyperparameters and compute budgets, varying only the sequence- versus token-level treatment: if GSPO shows equal or higher instability and lower final scores than GRPO, the core claim fails.
Original abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Group Sequence Policy Optimization (GSPO), an RL algorithm for LLMs that replaces token-level importance ratios (as in GRPO) with sequence-level likelihood ratios, followed by sequence-level clipping, reward aggregation, and optimization. It claims this yields superior training efficiency and performance, stabilizes RL for Mixture-of-Experts models, simplifies RL infrastructure, and contributed to the Qwen3 models.
Significance. If the empirical and theoretical claims hold, GSPO could meaningfully simplify and stabilize RLHF pipelines for large-scale LLMs, especially MoE architectures where token-level methods reportedly suffer instability. The infrastructure-simplification angle is practically attractive, but the significance remains provisional given the absence of detailed quantitative support, variance analysis, or ablations in the provided manuscript.
Major comments (3)
- [Abstract and §1] The central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
- [§3 (Algorithm)] No derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags. (A toy illustration of this variance concern follows the list.)
- [§4 (Experiments)] The reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
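On the variance point, a toy Monte Carlo (synthetic log-ratios, not real policy outputs) illustrating the pathology the second comment wants bounded: the raw product of near-unit token ratios has variance that grows with sequence length, while a length-normalized geometric-mean ratio of the kind GSPO uses stays tightly concentrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def ratio_variances(seq_len, n_samples=20_000, sigma=0.03):
    """Toy model: per-token log-ratios are i.i.d. N(-sigma^2/2, sigma^2),
    so each token-level ratio has mean 1. The raw product over the sequence
    also has mean 1, but its variance grows with length; the geometric-mean
    (length-normalized) ratio does not."""
    log_ratios = rng.normal(-sigma**2 / 2, sigma, size=(n_samples, seq_len))
    raw = np.exp(log_ratios.sum(axis=1))          # product of token ratios
    normalized = np.exp(log_ratios.mean(axis=1))  # GSPO-style geometric mean
    return raw.var(), normalized.var()

for length in (10, 100, 1000):
    v_raw, v_norm = ratio_variances(length)
    print(f"len={length:5d}  raw-product var={v_raw:.4f}  normalized var={v_norm:.6f}")
```

This is only an illustration of why the referee's request for a variance analysis matters; it says nothing about the actual distributions induced by policy training.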
Minor comments (2)
- [§3] Notation for the sequence likelihood ratio is introduced without an explicit equation number or comparison to the standard token-level ratio; adding a side-by-side definition would improve clarity (a sketch follows this list).
- [§2] The manuscript cites GRPO but does not include a concise recap of its token-level clipping rule; a short comparison table would help readers follow the claimed differences.
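The side-by-side the first minor comment asks for can be sketched directly; the token-level form is the standard PPO/GRPO ratio, and the sequence-level form is the geometric mean of those token ratios per the published GSPO paper, not the manuscript's own numbering:

```latex
% Token-level ratio (PPO/GRPO): one ratio per token t of response y_i
w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                       {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}

% Sequence-level ratio (GSPO): the length-normalized product, i.e. the
% geometric mean of the token-level ratios above
s_i(\theta) = \left( \prod_{t=1}^{|y_i|} w_{i,t}(\theta) \right)^{1/|y_i|}
```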
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our claims.
Point-by-point responses
- Referee: [Abstract and §1] the central claim that sequence-level importance sampling and clipping produce better efficiency and MoE stability than token-level GRPO is asserted without any reported numbers, baselines, statistical tests, or ablation isolating the sequence-level change; this directly undermines assessment of the claim and leaves the skeptic's variance/bias concern unaddressed.
  Authors: We agree that the abstract and §1 would benefit from explicit quantitative support. The current manuscript asserts the benefits based on the experiments in §4, but does not embed specific numbers, baselines, or ablations in the introductory sections. We will revise the abstract and §1 to include key reported metrics on training efficiency, performance gains, and MoE stability observations, with direct references to the corresponding results and any statistical details available in §4. (Revision: yes)
- Referee: [§3 (Algorithm)] no derivation or bound is provided showing that the sequence-level importance ratio remains unbiased or has controlled variance relative to the token-level ratio; the manuscript therefore does not establish that the proposed change avoids the high-variance pathology the stress-test note flags.
  Authors: The referee correctly notes the absence of a formal derivation or variance bound in §3. The sequence-level formulation is chosen to align the importance ratio with the sequence-level reward, avoiding token-level mismatch. We will expand §3 with a discussion of this motivation and its empirical implications for variance, but we do not currently have a rigorous proof or bound establishing unbiasedness or controlled variance relative to the token-level case. (Revision: partial)
- Referee: [§4 (Experiments)] the reported comparisons to GRPO lack ablations that hold all other factors fixed while varying only the sequence- vs. token-level treatment, and no variance-of-gradient or effective-sample-size metrics are shown; without these, the stability and efficiency claims cannot be evaluated as load-bearing evidence.
  Authors: We acknowledge that the existing comparisons do not isolate the sequence- versus token-level treatment through controlled ablations, nor do they report gradient variance or effective sample size. We will add such ablations to §4 while holding other factors fixed, and include the requested metrics to provide quantitative support for the stability and efficiency claims. (Revision: yes)
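If the revision does report effective sample size, one standard normalized definition is sketched below; the function name and the log-space stabilization are illustrative assumptions, not taken from the manuscript:

```python
import numpy as np

def effective_sample_size(log_weights):
    """Normalized ESS = (sum w)^2 / (n * sum w^2) of importance weights,
    computed stably in log space. Values near 1 mean the batch is close to
    on-policy; values near 0 mean a few sequences dominate the update."""
    lw = np.asarray(log_weights, dtype=np.float64)
    w = np.exp(lw - lw.max())        # rescaling weights leaves ESS unchanged
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

# Per-sequence log importance ratios from a rollout batch (made up):
print(effective_sample_size([0.01, -0.02, 0.03, -0.01]))  # ~1.0, near on-policy
print(effective_sample_size([2.0, -3.0, -2.5, -3.5]))     # ~0.26, one weight dominates
```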
- Remaining open issue after rebuttal: no derivation or bound establishing that the sequence-level importance ratio is unbiased or exhibits controlled variance relative to the token-level ratio.
Circularity Check
No circularity: GSPO is defined by an explicit sequence-level change, with no self-referential derivation or fitted prediction.
Full rationale
The paper introduces GSPO by directly defining the importance ratio on full sequence likelihood (rather than token-level) and applying sequence-level clipping/optimization. No equations, derivations, or parameter fits are shown that reduce the claimed advantages back to the inputs by construction. The comparison to GRPO is presented as an empirical demonstration rather than a mathematical necessity derived from prior self-work. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The algorithm is self-contained as a straightforward redefinition of the policy gradient components at sequence granularity.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LedgerForcing.conservation_from_balance (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Linked passage: "the sequence-level importance weight π_θ(y_i|x) / π_θ_old(y_i|x) has a clear theoretical meaning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
  Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
  ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
  Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
- Relative Score Policy Optimization for Diffusion Language Models
  RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
- Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
  Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
- Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
  RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
  CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
  BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
  The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
- Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
  SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
- VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
  VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
- Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
  Beam-search negatives induce partial AUC optimization in GRPO for LLM recommenders; Windowed Partial AUC and TAWin improve Top-K alignment on four datasets.
- ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
  ReCast repairs all-zero groups and uses contrastive updates on strongest positives and hardest negatives to improve RL in generative recommendation, yielding up to 36.6% better Pass@1 with only 4.1% of baseline rollou...
- Near-Future Policy Optimization
  NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
- Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
  EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
- S-GRPO: Unified Post-Training for Large Vision-Language Models
  S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
- AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
  A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
- Skill-Conditioned Visual Geolocation for Vision-Language Models
  GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
- Skill-Conditioned Visual Geolocation for Vision-Language Models
  GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
- QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training-Inference Mismatch
  QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while...
- Motion-o: Trajectory-Grounded Video Reasoning
  Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
- Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
  EvoMQL uses iterative Draft-Refine-Optimize cycles with execution feedback to reach 76.6% accuracy on EAI and 83.1% on TEND benchmarks for natural language to MongoDB query generation.
- Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
  ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
  A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
- Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
  Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
- Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
  A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.
- Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
  Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
- Hölder Policy Optimisation
  HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
  Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
- Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
  Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
- dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
  dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
- AIPO: Learning to Reason from Active Interaction
  AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
  Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
  A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
- Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
  S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
  DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
  DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
  A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
  A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
Reference graph
Works this paper leans on
[1] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.
[2] MiniMax. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585, 2025.
[3] OpenAI. Learning to Reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/
[4] Team Qwen. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025a.
[5] Team Qwen. QwQ-32B: Embracing the Power of Reinforcement Learning, March 2025b. URL https://qwenlm.github.io/blog/qwq-32b/
[6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
[7] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300, 2024.
[8] Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning. In Findings of the Association for Computational Linguistics: ACL 2023, 2023. URL https://aclanthology.org/2023.findings-acl.65/