AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Bokun Wang; Daren Zha; Jun Xiao; Miaobo Hu; Ruohan Wang; Shuhao Hu; Xiaobo Guo; Xin Wang

arxiv: 2605.20722 · v1 · pith:UORISE2Rnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Pith reviewed 2026-05-21 07:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Adaptive Group Policy Optimization · GRPO · LLM reasoning · adaptive clipping · temperature sampling · reinforcement learning · group statistics · math benchmarks

The pith

AGPO adapts clipping and temperature from group statistics to improve LLM reasoning over fixed PPO and GRPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning sharpens large language models on reasoning tasks, yet PPO and GRPO often rely on fixed clipping ranges and sampling temperatures that make training unstable and require heavy manual tuning. AGPO replaces those fixed choices with two controllers that read group-level statistics such as reward dispersion, skewness, probe vote entropy, policy entropy, and step-wise KL drift. One controller sets the trust-region size for policy updates; the other raises or lowers decoding temperature around a baseline according to how uncertain the current responses are. If the statistics supply clean signals, training stays stable while exploration adjusts automatically, and the same token budget yields higher accuracy on math and STEM problems. The paper reports gains on nine English and Chinese benchmarks, with the method transferring across model families and both controllers proving necessary in ablations.
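As a rough illustration of the group-level statistics named above, the sketch below computes reward dispersion, skewness, and vote entropy for one group of sampled responses. The helper name `group_stats` and the specific estimators (population standard deviation, Fisher skewness, Shannon entropy over final answers) are assumptions made for illustration; this summary does not give the paper's formulas.

```python
import math
from collections import Counter

def group_stats(rewards, answers):
    """Summarize one group of sampled responses (hypothetical helper).

    rewards: list of float rewards, one per response.
    answers: list of final answers (hashable), one per response.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Fisher skewness; defined here as 0 when all rewards are identical.
    skew = 0.0 if std == 0 else sum((r - mean) ** 3 for r in rewards) / (n * std ** 3)
    # Shannon entropy (in nats) of the empirical distribution of final answers.
    counts = Counter(answers)
    vote_entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {"mean": mean, "std": std, "skew": skew, "vote_entropy": vote_entropy}

# Toy group: three correct responses agreeing on "42", one incorrect.
stats = group_stats([1.0, 0.0, 1.0, 1.0], ["42", "41", "42", "42"])
```

A group in which every response earns the same reward has zero dispersion and zero vote entropy under these estimators, which is the regime where such statistics carry no signal.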

Core claim

AGPO is a critic-free refinement of GRPO that maintains a shared probe-derived statistical state to drive adaptive clipping, which determines update magnitude from reward dispersion, skewness, entropies, and KL drift, together with bidirectional adaptive temperature sampling that heats or cools responses around a base temperature according to centered uncertainty relative to a running baseline. When applied to Qwen2.5-14B under a fixed generated-token budget, this produces 67.3 percent on GSM8K and 40.5 percent on MATH, exceeding PPO and GRPO results, with similar lifts observed when the same procedure is applied to Llama-3-8B and Gemma-2-9B.
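One way to picture bidirectional temperature adaptation is a controller that tracks a running baseline of an uncertainty signal and scales a base temperature up or down depending on whether the current signal sits above or below that baseline. The class below is a minimal sketch under that assumption. The exponential mapping, the `gain`, the EMA factor, and the temperature bounds are all hypothetical choices, not the paper's rule.

```python
import math

class TemperatureController:
    """Hypothetical bidirectional temperature controller (illustrative only)."""

    def __init__(self, tau_base=1.0, gain=0.5, ema=0.9, tau_min=0.3, tau_max=1.5):
        self.tau_base, self.gain, self.ema = tau_base, gain, ema
        self.tau_min, self.tau_max = tau_min, tau_max
        self.baseline = None  # running mean of the uncertainty signal

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty
        # Centered uncertainty: positive heats, negative cools.
        centered = uncertainty - self.baseline
        tau = self.tau_base * math.exp(self.gain * centered)
        tau = min(self.tau_max, max(self.tau_min, tau))
        # Update the running baseline after the temperature has been chosen.
        self.baseline = self.ema * self.baseline + (1 - self.ema) * uncertainty
        return tau

ctrl = TemperatureController()
t_first = ctrl.step(0.5)   # first call defines the baseline, so tau == tau_base
t_hot = ctrl.step(1.0)     # above baseline: temperature rises
t_cold = ctrl.step(0.1)    # below baseline: temperature falls
```

Centering on a running baseline means the controller reacts to changes in uncertainty rather than to its absolute level, which is what lets it cool as well as heat.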

What carries the argument

Dual statistical feedback controllers that translate group-level reward and entropy measures into dynamic clipping thresholds and bidirectional temperature adjustments.
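To make the idea of translating statistics into a clipping threshold concrete, the sketch below maps the five named signals to a bounded clip radius. The functional form, the coefficients, and the direction in which each signal pushes the radius are assumptions for illustration only; the summary does not specify the paper's mapping.

```python
def adaptive_clip(std, skew, vote_entropy, policy_entropy, kl_drift,
                  eps_base=0.2, eps_min=0.05, eps_max=0.4):
    """Hypothetical mapping from group statistics to a clip radius."""
    # Signals assumed to argue for a wider trust region.
    widen = 1.0 + std + 0.5 * (vote_entropy + policy_entropy)
    # Signals assumed to argue for caution (skewed rewards, fast policy drift).
    tighten = 1.0 + abs(skew) + 10.0 * kl_drift
    eps = eps_base * widen / tighten
    # Hard bounds keep the radius in a safe range whatever the inputs.
    return min(eps_max, max(eps_min, eps))

neutral = adaptive_clip(0.0, 0.0, 0.0, 0.0, 0.0)    # falls back to eps_base
cautious = adaptive_clip(0.0, 0.0, 0.0, 0.0, 1.0)   # large KL drift hits eps_min
wide = adaptive_clip(0.5, 0.0, 0.5, 0.5, 0.0)       # high uncertainty hits eps_max
```

The explicit bounds matter more than the exact formula here: they are one simple way to guarantee that noisy statistics cannot push the trust region to an extreme value.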

If this is right

Training requires less manual tuning of clipping and temperature for stable RL on reasoning tasks.
The same generated-token budget produces higher accuracy on English and Chinese math and STEM benchmarks.
Gains transfer to other base models including Llama-3-8B and Gemma-2-9B.
Ablation experiments show the clipping and temperature controllers are complementary.
An open-source implementation makes the method available for direct replication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Group statistics may serve as a general online signal for adjusting other hyperparameters such as learning rate in policy optimization.
The approach could lower the cost of hyperparameter search when applying RL to new reasoning domains or languages.
Similar intra-group variance monitoring might stabilize training in non-math tasks where reward signals are noisier.
One could test whether the same statistical state remains useful when the underlying reward model changes.

Load-bearing premise

Group-level statistics on reward spread, skewness, and various entropies supply reliable low-noise signals that can be mapped directly to clipping and temperature values without hidden instabilities or extra tuning.

What would settle it

Training Qwen2.5-14B with AGPO on GSM8K and MATH under the same token budget but obtaining scores at or below the GRPO baseline would falsify the performance advantage.

Figures

Figures reproduced from arXiv: 2605.20722 by Bokun Wang, Daren Zha, Jun Xiao, Miaobo Hu, Ruohan Wang, Shuhao Hu, Xiaobo Guo, Xin Wang.

**Figure 1.** Two-phase AGPO with ATS. A probe at $\tau_{\text{base}}$ estimates group statistics (including reward dispersion $\hat{\sigma}$); a train phase then uses $\tau_t$ and the adaptive clip $\varepsilon_{\text{adaptive}}$ to update the policy via Eq. (7). This coupling yields exploration when uncertain and stable updates otherwise. AGPO Objective: substituting $\varepsilon_{\text{adaptive}}$ into the GRPO objective yields the AGPO loss $L_{\text{AGPO}}(\theta) = -\mathbb{E}_{x,\{e_i\}_{i=1}^{G}}\big[\tfrac{1}{G}\sum_{i=1}^{G}\min \rho_i(\ldots$
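The loss in the caption is cut off in the source, so the sketch below falls back on the standard PPO-style clipped surrogate with group-normalized advantages, with the clip radius passed in as a parameter. It works at sequence level and omits any KL penalty; both simplifications, and the function name `clipped_group_loss`, are assumptions rather than a reconstruction of Eq. (7).

```python
import math

def clipped_group_loss(logp_new, logp_old, rewards, eps):
    """Clipped surrogate loss for one group (standard form, assumed here).

    logp_new / logp_old: sequence log-probabilities under the current and
    sampling policies; rewards: one scalar reward per response.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Group-normalized advantages: no learned critic is needed.
    adv = [(r - mean) / (std + 1e-8) for r in rewards]
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        rho = math.exp(ln - lo)                          # importance ratio
        clipped = min(1.0 + eps, max(1.0 - eps, rho))    # trust-region clip
        total += min(rho * a, clipped * a)               # pessimistic bound
    return -total / n

# The first response became twice as likely under the new policy; eps caps its credit.
loss = clipped_group_loss([math.log(2.0), 0.0], [0.0, 0.0], [1.0, 0.0], eps=0.2)
```

With a fixed `eps` this is the familiar clipped objective; an adaptive variant would differ only in how `eps` is chosen before each update.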

**Figure 2.** Maj@k exact-match accuracy on GSM8K dev for k ∈ {1, 4, 16, 64}. AGPO shows larger gains at intermediate sampling budgets, indicating a better diversity–consistency trade-off.
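For readers unfamiliar with the metric, Maj@k can be sketched as a majority vote over k sampled answers followed by an exact-match check. The helper names and the subsampling strategy below are illustrative assumptions; the paper's evaluation script may differ.

```python
import random
from collections import Counter

def maj_at_k(samples, gold, k, rng=None):
    """True if the majority answer among k sampled answers equals gold."""
    rng = rng or random.Random(0)
    drawn = samples if k >= len(samples) else rng.sample(samples, k)
    # Ties resolve to whichever tied answer was drawn first.
    vote, _ = Counter(drawn).most_common(1)[0]
    return vote == gold

def maj_at_k_accuracy(problems, k):
    """problems: list of (samples, gold) pairs; returns accuracy in [0, 1]."""
    return sum(maj_at_k(s, g, k) for s, g in problems) / len(problems)

toy = [(["7", "7", "8", "7"], "7"), (["3", "5", "5", "5"], "3")]
acc = maj_at_k_accuracy(toy, k=4)
```

In the toy data the first problem is solved by the vote and the second is not, because the majority answer "5" is wrong even though one sample was correct.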

**Figure 3.** Mean KL divergence and adaptive clip radius $\varepsilon_t$ over 300 training steps (3 seeds). AGPO keeps KL bounded while shrinking $\varepsilon_t$ as training stabilizes. Metrics: besides accuracy, the paper reports training stability via (1) mean KL to the reference policy, $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{ref}})$; (2) the percentage of steps hitting clip bounds (clip saturation rate); and (3) gradient-norm variance $\mathrm{Var}(\|\nabla L\|_2)$ per 1k steps.
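Two of the stability metrics listed in this caption are simple to compute from training logs. The sketch below assumes that a step counts as "hitting clip bounds" when any importance ratio in that step lies on or outside the clip interval, and that gradient-norm variance is taken over non-overlapping windows; both readings are assumptions, since the caption does not define them precisely.

```python
from statistics import pvariance

def clip_saturation_rate(ratios_per_step, eps):
    """Percentage of steps with at least one ratio at or beyond the clip bounds."""
    hit = sum(
        1 for ratios in ratios_per_step
        if any(r <= 1.0 - eps or r >= 1.0 + eps for r in ratios)
    )
    return 100.0 * hit / len(ratios_per_step)

def grad_norm_variance(grad_norms, window=1000):
    """Population variance of gradient norms in consecutive windows."""
    return [pvariance(grad_norms[i:i + window])
            for i in range(0, len(grad_norms), window)]

# Four toy steps; the second and third contain a clipped ratio.
rate = clip_saturation_rate([[1.0, 1.05], [1.3, 1.0], [0.7], [1.1]], eps=0.2)
variances = grad_norm_variance([1.0, 3.0, 2.0, 2.0], window=2)
```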

**Figure 4.** Batch accuracy versus adaptive temperature $\tau_t$ during training. ATS heats early uncertain batches and cools later, more confident batches.

Original abstract

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AGPO adds dual adaptive controllers for clipping and temperature driven by group statistics to GRPO, with reported benchmark gains but thin validation on controller stability.

The main takeaway is that AGPO adapts both the clipping threshold and the decoding temperature in a group policy optimization setup using statistics like reward dispersion and entropy measures. The authors claim this leads to stronger results on math reasoning benchmarks compared to fixed versions of PPO and GRPO. What the paper does is combine those two adaptations through a shared statistical probe. This is a step beyond standard GRPO.

They back it with results on Qwen2.5-14B reaching 67.3 percent on GSM8K and 40.5 percent on MATH, plus transfers to other base models and ablations that support the complementarity of the two controllers. Making the code public is helpful. The experiments cover English and Chinese benchmarks, which adds some breadth. If the implementation details check out, this could be a useful practical adjustment for reducing training sensitivity in LLM reasoning.

The concern is that the controllers rely directly on the group statistics without shown analysis of how sensitive the performance is to noise in those inputs or to the exact form of the mapping. The stress-test note highlights the lack of sensitivity analysis or stability bounds, and nothing in the description suggests those were added. This leaves open whether the gains hold up under different conditions or if they depend on careful choice of the adaptation rules.

People working on reinforcement learning for large language models, especially those focused on reasoning tasks, would be the natural audience. It offers a concrete method with reported improvements and open code, so a reader trying to improve their own RL setups might pick up ideas here. The work has enough substance to go to peer review. I recommend sending it for review, with the expectation that referees will ask for more validation on the adaptive rules.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO for RL-based LLM reasoning. It uses a shared probe-derived statistical state (reward dispersion, skewness, probe vote entropy, policy entropy, step-wise KL drift) to drive two controllers: adaptive clipping that sets trust-region size from these statistics, and bidirectional adaptive temperature sampling that heats or cools decoding around a base temperature based on centered uncertainty relative to a running baseline. The paper reports that Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under fixed token budget on nine English and Chinese math/STEM benchmarks (e.g., 67.3% on GSM8K, 40.5% on MATH), with gains transferring to Llama-3-8B and Gemma-2-9B, complementary ablations, and public code release.

Significance. If the dual controllers deliver stable, low-noise improvements without hidden instabilities or post-hoc tuning, AGPO could meaningfully reduce the hyperparameter sensitivity and brittleness of standard PPO/GRPO in LLM post-training. The public implementation and transfer results across model families are positive for reproducibility and broader applicability.

major comments (3)

[§3 (Controllers)] The adaptive clipping and temperature mappings are defined directly from the same group statistics they regulate. No sensitivity analysis, perturbation tests, or regime-specific bounds are provided (e.g., under high skewness, low group size, or drifting baselines), which is load-bearing for the central claim that these statistics supply reliable signals without introducing new instabilities or reducing to post-hoc adjustment.
[§5 (Experiments)] Benchmark results (67.3% GSM8K, 40.5% MATH) are presented without statistical significance tests, run-to-run variance, exact baseline implementation details, data split information, or confirmation that adaptive rules were not tuned on test sets. This undermines support for the outperformance claim under the same generated-token budget.
[Ablation studies] Complementarity of the two modules is shown, but the studies do not isolate whether gains persist when statistical inputs are noisy or when the running temperature baseline drifts, leaving the weakest assumption untested.
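The run-to-run variance and significance testing requested in these comments can be sketched with the standard library alone. The snippet below reports mean and sample standard deviation across seeds and computes a paired t statistic for two methods evaluated on the same seeds. The function names are hypothetical and the numbers are synthetic placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("identical per-seed differences; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Synthetic placeholder scores for three seeds (NOT from the paper).
method_a = [3.0, 5.0, 7.0]
method_b = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom; with only a handful of seeds the test has little power, which is worth stating alongside any p-value.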

minor comments (2)

[§3] Clarify the precise formulas for the probe-derived state, centered uncertainty computation, and how the running baseline is maintained (including any hyperparameters).
[Related Work] Add discussion of related adaptive RL or temperature-scheduling methods in the LLM literature for better context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Referee: [§3 (Controllers)] The adaptive clipping and temperature mappings are defined directly from the same group statistics they regulate. No sensitivity analysis, perturbation tests, or regime-specific bounds are provided (e.g., under high skewness, low group size, or drifting baselines), which is load-bearing for the central claim that these statistics supply reliable signals without introducing new instabilities or reducing to post-hoc adjustment.

Authors: We thank the referee for highlighting this important aspect. The design of the controllers is intended to use the statistics as direct signals for adaptation, with built-in normalization and bounds to prevent instability. However, we agree that explicit sensitivity analysis would strengthen the paper. In the revised manuscript, we have added a new subsection in §3 with perturbation tests under various regimes, including high skewness and low group sizes, demonstrating that the controllers remain stable and do not introduce additional instabilities. Revision: yes.

Referee: [§5 (Experiments)] Benchmark results (67.3% GSM8K, 40.5% MATH) are presented without statistical significance tests, run-to-run variance, exact baseline implementation details, data split information, or confirmation that adaptive rules were not tuned on test sets. This undermines support for the outperformance claim under the same generated-token budget.

Authors: This is a valid concern. We have updated §5 to include results from multiple random seeds with mean and standard deviation, along with statistical significance tests (e.g., paired t-tests). We have also provided exact details on baseline implementations, data splits, and confirmed that the adaptive rules are derived solely from training-time statistics without access to test sets. The public code release includes the exact configurations used. Revision: yes.

Referee: [Ablation studies] Complementarity of the two modules is shown, but the studies do not isolate whether gains persist when statistical inputs are noisy or when the running temperature baseline drifts, leaving the weakest assumption untested.

Authors: We appreciate this observation. To address it, we have extended the ablation studies in the revised version to include experiments with noisy statistical inputs (e.g., by adding Gaussian noise to the probe-derived features) and scenarios with drifting baselines. These additional results show that the performance gains persist, supporting the robustness of the approach. Revision: yes.
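A perturbation test of the kind discussed in this exchange could look roughly like the following: add Gaussian noise to the statistical features and measure how far a controller's output moves. The `toy_controller` is a stand-in defined only for this example, and the feature names and noise scale are likewise assumptions; nothing here reproduces the authors' experiments.

```python
import random

def perturb(features, sigma, rng):
    """Return a copy of the feature dict with zero-mean Gaussian noise added."""
    return {k: v + rng.gauss(0.0, sigma) for k, v in features.items()}

def output_spread(controller, features, sigma, trials=200, seed=0):
    """Range of controller outputs under repeated noisy inputs."""
    rng = random.Random(seed)
    outs = [controller(perturb(features, sigma, rng)) for _ in range(trials)]
    return max(outs) - min(outs)

def toy_controller(f):
    """Stand-in clip controller (hypothetical), bounded to [0.05, 0.4]."""
    eps = 0.2 * (1.0 + f["std"]) / (1.0 + abs(f["skew"]))
    return min(0.4, max(0.05, eps))

features = {"std": 0.4, "skew": -0.5}
spread_clean = output_spread(toy_controller, features, sigma=0.0)
spread_noisy = output_spread(toy_controller, features, sigma=0.1)
```

Sweeping `sigma` and plotting the spread gives a quick sensitivity curve; a controller whose output swings across its full bounded range at small noise levels would be a warning sign.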

Circularity Check

0 steps flagged

No circularity: AGPO is a direct algorithmic definition with empirical validation

The paper proposes AGPO as a critic-free refinement of GRPO that explicitly defines adaptive clipping and bidirectional temperature controllers from group-level statistics (reward dispersion, skewness, probe vote entropy, policy entropy, KL drift, and centered uncertainty relative to a running baseline). This constitutes the core of the method itself rather than any derivation in which a claimed result or prediction reduces to its inputs by construction. Performance gains on GSM8K, MATH, and other benchmarks are presented as empirical outcomes under fixed token budgets, supported by ablations confirming complementarity of the two modules. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the derivation chain. The algorithm is self-contained and externally testable on the stated benchmarks; absence of sensitivity analysis or bounds is a robustness concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that the chosen group statistics are sufficient and stable signals for adaptation; no free parameters are explicitly named in the abstract, but the controllers themselves introduce mapping rules whose calibration is not detailed.

axioms (1)

Domain assumption: Group-level statistics reliably indicate the appropriate trust-region size and exploration level for the current training step.
The adaptive clipping and temperature controllers are driven directly by these statistics without additional theoretical justification or validation that they avoid over- or under-correction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping... from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; (ii) bidirectional adaptive temperature sampling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
