arxiv: 2605.09923 · v2 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling

Mingxiong Lin , Zhangquan Gong , Maowen Tang , Qian Li , Chuangchuang Wang , Jian Ma , Sutian Huang , Kai Tang

show 1 more author

Haonan Lu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords policy optimizationKL regularizationcurriculum samplingLLM mathematical reasoningexplorationGRPOreinforcement learningadaptive penalty

0 comments

The pith

EXPO improves LLM mathematical reasoning by dynamically relaxing KL penalties during low performance and prioritizing moderately difficult problems for training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two inefficiencies in standard Group Relative Policy Optimization for training LLMs on math tasks: a fixed KL penalty that overly restricts exploration when the model needs to deviate from its reference, and uniform sampling that overlooks the most informative medium-difficulty questions. EXPO introduces Accuracy-Conditioned KL Scaling to adjust the penalty strength based on batch accuracy, easing restrictions when results are poor and tightening them when good. It pairs this with Gaussian Curriculum Sampling that weights questions according to a Gaussian distribution peaked at 0.5 accuracy. Experiments on 1.5B and 8B models across six benchmarks show consistent gains, with especially large improvements on pass@32 metrics, indicating expanded exploration within fixed inference budgets.

Core claim

The paper claims that Accuracy-Conditioned KL Scaling dynamically modulates KL regularization strength through a nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when results are strong, while Gaussian Curriculum Sampling assigns higher weights to questions near 0.5 accuracy, focusing optimization on the learning frontier and thereby enlarging the policy's exploration boundary compared to vanilla GRPO.

What carries the argument

Accuracy-Conditioned KL Scaling (AKL) and Gaussian Curriculum Sampling (GCS), which together adjust regularization and data selection to prioritize exploration at the model's current capability edge.

If this is right

Higher pass@32 scores on benchmarks such as AIME 2025 without increasing inference-time compute.
Larger relative gains on pass@32 than pass@1, indicating increased diversity of correct solutions.
Average pass@32 improvement of 2.66 points on the 8B model across tasks.
The two modules function as lightweight plug-ins that can be added to existing GRPO pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to other verifiable-reward reinforcement learning domains beyond mathematical reasoning.
The optimal accuracy center for the Gaussian distribution could shift with model scale or dataset characteristics.
Focusing samples on the accuracy frontier might shorten overall training time by reducing exposure to trivial or overly hard examples.

Load-bearing premise

The observed performance gains result specifically from the AKL and GCS modules rather than other uncontrolled aspects of the training setup.

What would settle it

Retraining the same models on the same data with AKL and GCS removed or with the Gaussian centered at a different accuracy value such as 0.3 and checking whether the pass@32 gains on AIME 2025 disappear.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model's learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model's exploration boundary under a fixed inference cost budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EXPO adds accuracy-based KL adjustment and Gaussian sampling to GRPO, reporting clear gains on math benchmarks, but the abstract supplies no ablations so the improvements cannot yet be pinned to those modules.

read the letter

The core contribution is two lightweight changes to GRPO. Accuracy-Conditioned KL Scaling relaxes the penalty when batch accuracy is low and tightens it when accuracy rises, using a smooth nonlinear function. Gaussian Curriculum Sampling weights training questions by a Gaussian centered at 0.5 accuracy so the model spends more time on problems at its current frontier. Both are presented as plug-ins that require little extra code. The reported numbers are concrete: a 13.34 point lift on AIME 2025 pass@32 for the 1.5B model and a 2.66 average pass@32 gain on the 8B model, with the larger effect appearing at higher pass rates. That pattern is consistent with the claim of an expanded exploration boundary under fixed inference budget. The approach is straightforward and directly targets two known practical issues in current RLVR runs. The main weakness is the absence of any ablation or controlled comparison. The abstract does not describe matched hyperparameter sweeps, seed averaging, or sensitivity checks on the 0.5 centering point, so the deltas could arise from incidental differences in data order or optimizer settings rather than the new modules. Without those checks the causal link remains unproven. The work is aimed at groups already running GRPO-style training on math and logic tasks who want simple levers to improve sample efficiency. It is worth sending to peer review because the ideas are easy to reproduce and the performance claims are specific enough to test; a referee can ask for the missing controls and decide whether the gains hold up.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Exploration-Prioritized Policy Optimization (EXPO) as an enhancement to Group Relative Policy Optimization (GRPO) for reinforcement learning with verifiable rewards in large language models focused on mathematical reasoning. It introduces Accuracy-Conditioned KL Scaling (AKL) to dynamically adjust the KL penalty coefficient based on batch average accuracy and Gaussian Curriculum Sampling (GCS) to prioritize sampling of questions with moderate difficulty (centered at 0.5 accuracy). Experiments on two models across six benchmarks report improvements over vanilla GRPO, notably a 13.34 absolute gain on AIME 2025 pass@32 (from 63.33% to 76.67%) and an average 2.66 improvement on pass@32 for the 8B model, suggesting better exploration under fixed inference budget.

Significance. If the reported gains prove robust and attributable to the proposed modules, EXPO could provide a lightweight, practical method for improving exploration in RLVR settings for LLMs. The focus on larger pass@32 versus pass@1 gains offers a potentially useful lens for evaluating exploration effectiveness under fixed inference budgets, which may influence future work on adaptive regularization and curriculum strategies in policy optimization.

major comments (2)

[Abstract] Abstract: The reported numerical improvements, such as the 13.34 gain on AIME 2025 pass@32 and 2.66 average on the 8B model, lack supporting details on experimental controls, ablation studies, hyperparameter matching, or statistical tests. This makes it impossible to confirm that the gains are attributable to the AKL and GCS modules rather than confounding factors like training schedule or data ordering.
[Abstract] Abstract: The interpretation that larger pass@32 gains compared to pass@1 demonstrate an enlarged exploration boundary relies on the assumption that the proposed modules are the cause; however, without sensitivity analysis for the Gaussian centering at accuracy 0.5 or comparisons under controlled conditions, this claim cannot be rigorously evaluated.

minor comments (1)

[Abstract] Abstract: The abstract uses 'pass@32' and 'pass@1' without defining these metrics, which could reduce clarity for readers not familiar with the evaluation protocol in LLM reasoning benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments on the abstract below, clarifying the experimental controls and interpretations while committing to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract: The reported numerical improvements, such as the 13.34 gain on AIME 2025 pass@32 and 2.66 average on the 8B model, lack supporting details on experimental controls, ablation studies, hyperparameter matching, or statistical tests. This makes it impossible to confirm that the gains are attributable to the AKL and GCS modules rather than confounding factors like training schedule or data ordering.

Authors: We agree the abstract is concise and would benefit from explicit references to controls. The full manuscript reports all comparisons under identical training schedules, data ordering, and hyperparameters, with ablations isolating AKL alone, GCS alone, and their combination (Section 4.2 and Tables 3-4). Results are averaged over multiple random seeds to assess stability. We will revise the abstract to briefly note these matched conditions and direct readers to the experimental section for full ablations and statistics. revision: yes
Referee: [Abstract] Abstract: The interpretation that larger pass@32 gains compared to pass@1 demonstrate an enlarged exploration boundary relies on the assumption that the proposed modules are the cause; however, without sensitivity analysis for the Gaussian centering at accuracy 0.5 or comparisons under controlled conditions, this claim cannot be rigorously evaluated.

Authors: The larger pass@32 gains relative to pass@1 are observed consistently across both models and all six benchmarks under fixed inference budgets, supporting the exploration interpretation. The full paper includes sensitivity analysis on the Gaussian center (Appendix B), showing robustness for values near 0.5, and controlled ablations that isolate each module while holding all other factors fixed. We will add a short clarification sentence in the abstract to emphasize the controlled experimental design. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on proposed modules and empirical results

full rationale

The abstract describes two algorithmic modules (AKL and GCS) as lightweight plug-ins to GRPO, with performance gains presented as direct experimental outcomes on six benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the text. The reported deltas (e.g., +13.34 pass@32 on AIME 2025) are framed as results of the interventions rather than reductions to inputs by construction. The derivation chain is absent; the paper is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. AKL and GCS likely involve tunable scalars (e.g., scaling function parameters, Gaussian mean/variance) but none are named or quantified.

pith-pipeline@v0.9.0 · 5595 in / 1223 out tokens · 32560 ms · 2026-05-14T22:07:29.679461+00:00 · methodology

Review history (2 revisions) →