ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Mingxuan Yuan; Tao Zhong; Xiaojin Fu; Yuxuan Yao; Zehua Liu

arxiv: 2606.03070 · v3 · pith:4KOVMX3Jnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 11:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: asynchronous reinforcement learning · LLM post-training · policy optimization · scale imbalance · behavior-free methods · group-relative RL · mathematical reasoning

The pith

ASymPO stabilizes asynchronous LLM post-training by normalizing token losses with only current-policy probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that asynchronous reinforcement learning for language models creates a scale-imbalance failure mode when stale responses are scored under the current policy, so that positive and negative loss terms no longer cancel at the response level. It introduces Asymmetric-Scale Policy Optimization, which divides each response's token loss by that response's average token negative log-probability under the current policy. The resulting objective needs no behavior-policy probabilities, restores response-level zero-sum balance, and keeps a nonzero learning signal. The method is evaluated on mathematical-reasoning post-training together with a simpler fixed-scale baseline.

Core claim

When stale responses are evaluated under the current policy, positive and negative loss terms appear at different negative-log-probability scales, breaking the zero-sum property of group-relative advantages; ASymPO restores balance by normalizing each response's token loss by its own current average token negative log-probability, eliminating any requirement for behavior-policy information while preserving a learning signal.
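
One way to read this claim in symbols is sketched below. The notation ($A_i$, $\ell_i$, $T_i$, $\operatorname{sg}$) is assumed for illustration and is not taken from the paper. In particular, treating the normalizer as a constant (stop-gradient) is an assumption, since the abstract does not say how the denominator is handled.

```latex
% Assumed notation: G responses y_1..y_G in a group, advantages A_i with \sum_i A_i = 0,
% \ell_i(\theta) is the current average token negative log-probability of response i.
\ell_i(\theta) = -\frac{1}{T_i}\sum_{t=1}^{T_i} \log \pi_\theta\!\left(y_{i,t}\mid x, y_{i,<t}\right)

% Un-normalized response-level contributions need not cancel when the \ell_i differ:
\sum_{i=1}^{G} A_i\,\ell_i(\theta) \neq 0 \quad \text{in general}

% Dividing by the (assumed detached) current scale restores the zero-sum value ...
\sum_{i=1}^{G} A_i\,\frac{\ell_i(\theta)}{\operatorname{sg}\!\left[\ell_i(\theta)\right]}
  = \sum_{i=1}^{G} A_i = 0

% ... while the gradient is a reweighted sum that is generally nonzero:
\nabla_\theta \sum_{i=1}^{G} A_i\,\frac{\ell_i(\theta)}{\operatorname{sg}\!\left[\ell_i(\theta)\right]}
  = \sum_{i=1}^{G} \frac{A_i}{\ell_i(\theta)}\,\nabla_\theta \ell_i(\theta)
```

Under these assumptions the normalized loss value is balanced by construction, and the learning signal lives entirely in the reweighted gradient. Whether the paper's objective matches this form can only be checked against its own equations.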

What carries the argument

The ASymPO normalization step, which divides each response's token loss by its current average token negative log-probability to equalize loss scales across responses.

If this is right

Asynchronous training can proceed without storing or aligning behavior log-probabilities across rollout and learner systems.
Response-level zero-sum balance is recovered even when data are stale.
A nonzero learning signal remains after normalization.
The same current-policy-only approach can be applied to other group-relative objectives in delayed-update settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The normalization may simplify distributed training pipelines that previously required synchronized versioned probability storage.
Similar per-response scaling could be tested on non-group-relative objectives to see whether the same scale-imbalance issue appears.
The fixed-scale baseline SPO provides a cheap reference point for measuring how much of the gain comes from the adaptive normalization itself.

Load-bearing premise

Normalizing each response's loss by its average token negative log-probability is enough to fix the scale imbalance without adding bias or removing the learning signal.

What would settle it

Measure the summed loss contributions of positive and negative responses within each group on stale data, before and after normalization; if the un-normalized sum is far from zero and the normalized sum is not near zero either, the correction has failed.
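
A minimal sketch of this check follows, assuming a response-level loss of the form advantage times mean token negative log-probability. The function names (`mean_nll`, `group_balance`) and the numbers are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the paper.

```python
def mean_nll(token_logprobs):
    """Average token negative log-probability of one response."""
    return -sum(token_logprobs) / len(token_logprobs)


def group_balance(advantages, group_token_logprobs):
    """Return (un-normalized sum, normalized sum) of response-level loss values.

    Assumption: each response contributes advantage * mean NLL, and the
    normalized variant divides that contribution by the same mean NLL.
    """
    raw, normalized = 0.0, 0.0
    for adv, logps in zip(advantages, group_token_logprobs):
        scale = mean_nll(logps)        # current-policy scale of this response
        response_loss = adv * scale    # token-mean loss weighted by advantage
        raw += response_loss
        normalized += response_loss / scale
    return raw, normalized


# Made-up group: zero-sum advantages, but very different NLL scales.
advantages = [1.0, -1.0]
group_token_logprobs = [
    [-0.25, -0.25, -0.25, -0.25],  # positive response, mean NLL 0.25
    [-2.0, -2.0],                  # negative response, mean NLL 2.0
]
raw, normalized = group_balance(advantages, group_token_logprobs)
print(raw, normalized)  # -1.75 0.0
```

In this toy group the un-normalized sum is dominated by the high-NLL negative response, while the normalized sum cancels. A real diagnostic would run the same bookkeeping on logged current-policy log-probabilities, and would need a guard for responses whose mean NLL is close to zero.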

Original abstract

Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ASymPO is a normalization fix for scale imbalance in current-policy-only async RL that avoids behavior logprobs, but the abstract alone gives no way to check if the math or results hold.

The main takeaway is that ASymPO normalizes each response's token loss by its average token negative log-probability under the current policy. This is meant to restore response-level zero-sum balance when stale responses create mismatched loss scales for positive and negative terms.

What the paper does is identify a concrete failure mode in asynchronous group-relative RL: without behavior probabilities, importance ratios, or clipping, the loss contributions can become unbalanced even when advantages sum to zero. The proposed normalization uses only current-policy quantities, which removes the need for token-aligned, versioned behavior logs across rollout and learner. They also add SPO as a fixed negative-scaling baseline and test both on asynchronous math reasoning post-training.

The practical angle is useful. Many production setups decouple generation from optimization for throughput, and dropping the behavior-policy requirement simplifies the pipeline. The diagnosis of scale imbalance follows directly from how negative log-probabilities behave on stale data.

The soft spot is that the abstract contains no equations, no derivation of why this particular normalization preserves a nonzero signal, and no experimental numbers. Without those, it is impossible to verify whether the fix actually works or whether it trades one imbalance for another, such as over-weighting short responses. The claim that it restores balance sounds plausible on the surface but needs the full math and ablations to confirm.
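
The length concern can be made concrete with a small sketch. It assumes a token-mean loss per response divided by that response's mean negative log-probability (treated as a constant); the abstract does not confirm either choice, and `per_token_weight` is a hypothetical helper, not something from the paper.

```python
def per_token_weight(advantage, token_logprobs):
    """Weight applied to each token's -log p under the assumed objective.

    Assumed form: advantage * (1 / length) * sum_t(-log p_t) / mean_nll,
    with mean_nll treated as a constant, so every token in the response
    carries the weight advantage / (length * mean_nll).
    """
    length = len(token_logprobs)
    mean_nll = -sum(token_logprobs) / length
    return advantage / (length * mean_nll)


# Same advantage and same per-token NLL, different lengths (made-up values).
short = per_token_weight(1.0, [-0.5] * 4)    # 4-token response
long_ = per_token_weight(1.0, [-0.5] * 16)   # 16-token response
print(short, long_)  # 0.5 0.125
```

Under these assumptions each token of the short response gets four times the weight of a token in the long one, which is the kind of side effect the full paper's ablations would need to rule in or out.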

This is for groups already running async RL post-training on reasoning tasks and looking for simpler alternatives to importance sampling. It deserves peer review because the underlying constraint is real and the proposed solution is lightweight; a referee can check the derivations and results that the abstract omits.

Referee Report

2 major / 0 minor

Summary. The paper claims that asynchronous group-relative RL for LLM post-training suffers from a scale-imbalance failure mode when stale responses are evaluated under the current policy, causing positive and negative loss terms to appear at different negative log-probability scales. It proposes ASymPO, which normalizes each response's token loss by its current average token negative log-probability, to restore response-level zero-sum balance without requiring behavior-policy probabilities, importance ratios, or clipping. A fixed negative-scaling baseline (SPO) is also introduced, and both are evaluated in asynchronous mathematical reasoning post-training.

Significance. If the normalization step in ASymPO can be shown to correct the identified imbalance while preserving a nonzero learning signal and avoiding bias, the method would remove a key practical barrier (need for versioned, token-aligned behavior log-probabilities) in asynchronous LLM post-training, potentially enabling higher throughput without distribution-drift controls.

major comments (2)

[Abstract] Abstract (paragraph on scale-imbalance and ASymPO proposal): the claim that normalizing each response's token loss by its current average token negative log-probability is sufficient to restore zero-sum balance and preserve a nonzero learning signal is presented without any derivation, equation, or proof; the weakest assumption identified in the reader's report therefore remains unaddressed and load-bearing for the central contribution.
[Abstract] Abstract (evaluation paragraph): no experimental results, tables, or quantitative comparisons are provided to demonstrate that ASymPO or SPO outperform standard behavior-corrected methods or avoid introducing bias in the asynchronous math-reasoning setting, so the practical effectiveness cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on the manuscript. We respond point-by-point to the major comments below.

Referee: [Abstract] Abstract (paragraph on scale-imbalance and ASymPO proposal): the claim that normalizing each response's token loss by its current average token negative log-probability is sufficient to restore zero-sum balance and preserve a nonzero learning signal is presented without any derivation, equation, or proof; the weakest assumption identified in the reader's report therefore remains unaddressed and load-bearing for the central contribution.

Authors: The abstract is a high-level summary. The scale-imbalance failure mode is identified and the effect of the proposed normalization (dividing each response's token loss by its current average token negative log-probability) is derived with supporting equations in Section 3, showing restoration of response-level zero-sum balance while keeping a nonzero learning signal. We can add a short equation or explicit reference to Section 3 in the abstract.

Revision: partial

Referee: [Abstract] Abstract (evaluation paragraph): no experimental results, tables, or quantitative comparisons are provided to demonstrate that ASymPO or SPO outperform standard behavior-corrected methods or avoid introducing bias in the asynchronous math-reasoning setting, so the practical effectiveness cannot be assessed.

Authors: Abstracts conventionally summarize findings at a high level without tables or detailed numbers. Quantitative comparisons of ASymPO and SPO against behavior-corrected baselines, including performance metrics and bias analysis in the asynchronous mathematical reasoning post-training setting, appear with tables and figures in Section 4.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and available description present a conceptual proposal for ASymPO normalization without exhibiting any equations, derivations, fitted parameters, or self-citations. No load-bearing step reduces to its own inputs by construction, no uniqueness theorem is invoked, and no renaming of known results occurs. The derivation chain is therefore self-contained at the level of the provided text, with the normalization step offered as an independent design choice rather than a statistical or definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations or methods section to extract free parameters, axioms, or invented entities.
