Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

Boyao Yang; Jun Zhu; Peng Cui

arxiv: 2605.17003 · v2 · pith:DURSVGYQnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 15:45 UTC · model grok-4.3

keywords: data selection · reinforcement learning · LLM post-training · mathematical reasoning · online selection · policy gradient · compute efficiency

The pith

Learning-Zone Energy scores prompts to keep only 40 percent of data in RL post-training while matching or exceeding full-data results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Learning-Zone Energy as an online method to select which math-reasoning prompts deserve rollout and gradient updates during reinforcement learning post-training of large language models. It claims that current uniform approaches waste effort on prompts the model has already mastered or cannot yet solve. The method instead computes a single score from difficulty, uncertainty, and success-rate change, then uses that score to prune data while adding a replay buffer to catch forgetting. Experiments on Qwen models show the reduced set still reaches or beats baseline performance on GSM8K, MATH, and DAPO-MATH, with larger gains on harder out-of-distribution tests and lower overall compute.

Core claim

A closed-form Learning-Zone Energy Score fuses an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates; a forward pruner with replay then skips rollout generation for persistently solved prompts while periodically rechecking them.

What carries the argument

Learning-Zone Energy Score: a closed-form scalar combining difficulty anchor, outcome uncertainty, and pass-rate momentum that aligns with policy-gradient update size to guide data pruning.
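
The scoring-and-selection idea can be sketched as follows. This is a minimal illustration, not the paper's closed form: the container `PromptStats`, the weight `beta`, and the way the three signals are combined are all assumptions made for the example. Only the 4p(1−p) uncertainty shape, the default α = 0.3, and the 40% retention fraction come from the page itself.

```python
from dataclasses import dataclass


@dataclass
class PromptStats:
    """Per-prompt statistics tracked during training (hypothetical container)."""
    initial_pass_rate: float  # pass rate measured before training starts
    prev_pass_rate: float     # pass rate at the previous visit
    pass_rate: float          # pass rate from the current rollout group


def energy_score(s: PromptStats, alpha: float = 0.3, beta: float = 0.5) -> float:
    """Illustrative fusion of the three signals; NOT the paper's closed form."""
    # Uncertainty: equals 1 at p = 0.5 and 0 at p = 0 or p = 1.
    uncertainty = 4.0 * s.pass_rate * (1.0 - s.pass_rate)
    # Difficulty anchor (assumed form): prompts that started harder weigh more.
    anchor = 1.0 - s.initial_pass_rate
    # Momentum (assumed form): size of the most recent pass-rate change.
    momentum = abs(s.pass_rate - s.prev_pass_rate)
    return uncertainty * (1.0 + beta * anchor) + alpha * momentum


def select_top_fraction(stats: dict[str, PromptStats], keep: float = 0.4) -> list[str]:
    """Return the ids of the highest-scoring `keep` fraction of prompts."""
    k = max(1, round(keep * len(stats)))
    ranked = sorted(stats, key=lambda pid: energy_score(stats[pid]), reverse=True)
    return ranked[:k]
```

Under this assumed fusion, a prompt with a pass rate of exactly 0 or 1 and no recent change scores zero, so it is never ranked ahead of a prompt with mixed outcomes.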

If this is right

Retains only 40 percent of the training data per step
Matches or surpasses full-data baselines across GSM8K, MATH, and DAPO-MATH
Delivers larger gains on out-of-distribution sets such as AIME25 and AMC23
Cuts estimated training FLOPs by 36 percent
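
One way to see how a 40% retention rate and a 36% FLOPs saving can coexist is a back-of-envelope cost model. The model below is an assumption for illustration and not the paper's FLOP accounting: it supposes that rollouts are still generated for every prompt group, that only the policy update is restricted to the kept groups, and it ignores any extra saving from skipped rollouts. The symbols f and k are introduced here.

```latex
% Illustrative cost model (assumption, not the paper's accounting):
%   f = fraction of per-step FLOPs spent on the policy update
%   k = fraction of prompt groups kept for the update
\[
\frac{C_{\text{selected}}}{C_{\text{full}}} = (1 - f) + k f,
\qquad
\text{saving} = (1 - k)\, f .
\]
\[
k = 0.4:\quad \text{saving} = 0.6\, f,
\qquad
0.6\, f = 0.36 \iff f = 0.6 .
\]
```

Under these assumptions, the reported figure would correspond to the update step accounting for roughly 60% of per-step compute.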

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scoring logic could be tested on non-math RL tasks such as code generation or instruction following to check whether compute savings generalize.
If the alignment between score and gradient magnitude holds, the method might reduce the need for large replay buffers in other online RL settings.
Periodic forgetting checks could be combined with curriculum scheduling to further stabilize training when data volume is reduced.

Load-bearing premise

The fused score is provably aligned with the expected magnitude of group-relative policy gradient updates.

What would settle it

Apply the same selection rule to a new model family or task suite and measure whether performance falls below the full-data baseline while data usage stays at 40 percent.

Figures

Figures reproduced from arXiv: 2605.17003 by Boyao Yang, Jun Zhu, Peng Cui.

**Figure 1.** Overview of the proposed method. The backward scorer computes the Learning-Zone Energy Score for each prompt group at every training step, guiding Top-K selection for policy updates. The forward pruner tracks group pass rates at the epoch level and skips rollout generation for groups that have been stably solved, with replay providing a safety mechanism to detect forgetting.

**Figure 2.** Pass@1 vs. wall-clock time for Qwen2.5-Math models. The × annotations denote speedup multipliers relative to the full-data baseline. Efficiency and training progress. The proposed data selection method reduces theoretical training FLOPs by approximately 36% by restricting backpropagation to the top-40% fraction of prompt groups, while maintaining full rollout coverage to recompute Energy Scores at each s…

**Figure 3.** Ablation studies on Energy Score components, rollout count

**Figure 4.** Conceptual case study of the learning zone. Trivial prompts that are already solved (p ≈ 1) and overwhelmingly hard prompts that consistently fail (p ≈ 0) carry negligible Learning-Zone Energy and are therefore deprioritized. In contrast, frontier prompts with mixed rollout outcomes (p ≈ 0.5) receive the highest uncertainty weight and remain actively selected for GRPO updates.

**Figure 5.** Rollout-N sensitivity under an unconstrained budget. Each N runs for the same number of gradient steps; compute cost scales with N. Even under unconstrained conditions, LZE selection consistently outperforms the baseline at every N, and N=8 again achieves the best absolute performance.

**Figure 6.** Sensitivity to the momentum coefficient α. Both 1.5B and 7B models show consistent peaks at α = 0.3 (our default). At α = 0 the momentum term is absent; at α = 0.45 the score becomes overly reactive, destabilising selection. We sweep α ∈ {0, 0.15, 0.3, 0.45} on MATH with Qwen2.5-Math-1.5B and Qwen2.5-Math-7B, reporting Avg Pass@1 = (AMC23 + AIME25) / 2.

Original abstract

Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.
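
The forward pruner with replay described in the abstract can be sketched along these lines. The class name `ForwardPruner`, the `patience` rule for "persistently solved", the solved threshold, and the random replay probability are all assumptions chosen for illustration; the paper's actual criteria are not given on this page.

```python
import random


class ForwardPruner:
    """Skips rollouts for stably solved prompts, with occasional replay checks.

    Hypothetical sketch: the thresholds and the replay rule are assumptions.
    """

    def __init__(self, solved_threshold: float = 1.0, patience: int = 2,
                 replay_prob: float = 0.1, seed: int = 0) -> None:
        self.solved_threshold = solved_threshold
        self.patience = patience          # consecutive solved epochs before pruning
        self.replay_prob = replay_prob    # chance of re-checking a pruned prompt
        self.streak: dict[str, int] = {}  # prompt id -> consecutive solved epochs
        self.rng = random.Random(seed)

    def should_rollout(self, prompt_id: str) -> bool:
        """Return True if rollouts should be generated for this prompt now."""
        if self.streak.get(prompt_id, 0) < self.patience:
            return True
        # Pruned prompt: replay it occasionally to detect forgetting.
        return self.rng.random() < self.replay_prob

    def update(self, prompt_id: str, pass_rate: float) -> None:
        """Record an epoch-level pass rate; a failed check resets the streak."""
        if pass_rate >= self.solved_threshold:
            self.streak[prompt_id] = self.streak.get(prompt_id, 0) + 1
        else:
            self.streak[prompt_id] = 0  # forgetting detected, prompt is active again
```

A training loop would call `should_rollout` before generating samples for a prompt and `update` after scoring them, so a prompt that fails a replay check returns to the active pool immediately.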

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LZE introduces a closed-form online selector for RL post-training data that reports 40% retention with matching or better math performance and notable OOD gains, but the provable gradient alignment rests on an unshown derivation.

The paper's main offering is Learning-Zone Energy, a score that picks prompts for RL post-training by blending an initial difficulty anchor, normalized outcome uncertainty, and pass-rate momentum. The goal is to focus compute on the active learning frontier instead of spreading it evenly across all samples as GRPO and DAPO do. They add a forward pruner with replay to skip persistently solved prompts and check for forgetting later. This setup is fully online and comes with public code, which is useful for replication attempts.

Experiments on Qwen models from 1.5B to 8B across GSM8K, MATH, and DAPO-MATH show they keep only 40% of the data per step yet match or exceed full-data baselines, with clear lifts on AIME25 and AMC23 plus an estimated 36% FLOPs drop. Those efficiency numbers address a real pain point in scaling reasoning models. The fusion into a single scalar claimed to track expected group-relative policy gradient magnitude is presented as the novel piece. The practical pruning and replay mechanics also look like straightforward engineering that could save wall-clock time.

The soft spot sits in the theoretical claim. The abstract asserts the score is provably aligned with gradient update sizes, yet no derivation steps appear that equate the three signals to E[|advantage|] or the GRPO estimator. Normalization choices for uncertainty and the momentum term could embed data-dependent effects that weaken independence from the training distribution. Without error bars, exact baseline details, or checks for post-hoc selection, the reported gains need closer inspection to confirm they hold without hidden biases.

This is aimed at groups running RL post-training for math and science capabilities who want to lower the compute cost per iteration. Readers focused on efficient scaling of reasoning LLMs will find the data-retention and OOD results worth examining.

The concrete efficiency claims and the clear problem framing make it worth a serious referee's time even if the alignment math needs tightening. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes Learning-Zone Energy (LZE), an online data selection method for efficient RL post-training of LLMs on math reasoning. It introduces a closed-form LZE score fusing an initial-difficulty anchor, normalized outcome-uncertainty term, and pass-rate momentum, asserted to be provably aligned with the expected magnitude of group-relative policy gradient (GRPO) updates. A forward pruner with replay skips rollouts for solved prompts. On Qwen models (1.5B–8B) across GSM8K, MATH, and DAPO-MATH, the method retains 40% of data per step while matching or exceeding full-data baselines, with OOD gains on AIME25 (+45.9%) and AMC23 (+18.2%) and ~36% FLOPs reduction. Code is released.

Significance. If the claimed provable alignment holds and the efficiency gains prove robust without hidden selection bias, the work could meaningfully advance compute-efficient RL post-training by concentrating effort on the learning frontier. The OOD improvements and code release are positive indicators for practical impact in scaling mathematical reasoning.

major comments (2)

§3.2 (LZE Score Derivation): The central claim that the closed-form LZE score is 'provably aligned' with the expected magnitude of GRPO updates lacks explicit derivation steps equating the fused signals (initial-difficulty anchor + normalized uncertainty + pass-rate momentum) to E[|advantage|] or the GRPO estimator. This alignment is load-bearing for the justification of bias-free data selection.
§3.1 (Normalization): The normalization constants for the outcome-uncertainty term are not specified as fixed or batch-dependent; if data-dependent, this could undermine the claimed independence from the training distribution and the provable alignment.
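
For orientation, the uncertainty term on its own admits a short calculation under simple assumptions. The sketch below assumes binary rewards with pass rate p and a fixed baseline equal to p; it is a reader's illustration of why a 4p(1−p) shape tracks advantage magnitude, not the derivation requested in the first comment, and it says nothing about the anchor or momentum terms.

```latex
% Assumptions: reward r \in \{0,1\} with \Pr(r = 1) = p, fixed baseline b = p.
\[
A = r - p,
\qquad
\mathbb{E}\,|A| = p\,(1-p) + (1-p)\,p = 2p(1-p),
\qquad
\operatorname{Var}(r) = p(1-p).
\]
\[
4p(1-p) = 2\,\mathbb{E}\,|A| = 4\operatorname{Var}(r),
\qquad
\max_{p \in [0,1]} 4p(1-p) = 1 \ \text{at}\ p = \tfrac{1}{2},
\qquad
4p(1-p)\big|_{p \in \{0,1\}} = 0 .
\]
```

The factor 4 only rescales the term so that its maximum is 1. With a baseline estimated from the same rollout group, or with standard-deviation normalization of the advantage, the expressions change and this identity no longer holds exactly.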

minor comments (2)

Experiments section: Reported gains on AIME25 and AMC23 lack error bars, number of runs, or variance details, which would strengthen assessment of robustness.
[Introduction] The abstract and introduction could more explicitly contrast LZE against prior online data selection or curriculum methods in RL for LLMs.
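
The variance reporting asked for in the first minor comment could be as simple as the following sketch. The helper `summarize_runs` is hypothetical, and any scores passed to it would have to come from actual repeated runs with different seeds; no such numbers appear on this page.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)
```

Reporting the pair alongside the number of seeds would let a reader judge whether a gain on a small benchmark such as AIME25 exceeds run-to-run noise.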

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additions are warranted, we have revised the manuscript to strengthen the presentation of the LZE derivation and normalization details.

Referee: §3.2 (LZE Score Derivation): The central claim that the closed-form LZE score is 'provably aligned' with the expected magnitude of GRPO updates lacks explicit derivation steps equating the fused signals (initial-difficulty anchor + normalized uncertainty + pass-rate momentum) to E[|advantage|] or the GRPO estimator. This alignment is load-bearing for the justification of bias-free data selection.

Authors: We agree that the original presentation would benefit from more explicit steps. In the revised manuscript we have expanded §3.2 with a full derivation (now also summarized in a new appendix) that starts from the GRPO advantage estimator, takes the expectation of its absolute value, and shows term-by-term that the initial-difficulty anchor supplies the baseline scale, the normalized uncertainty term bounds the outcome variance contribution, and the pass-rate momentum corrects for temporal drift, yielding an expression proportional to E[|advantage|] under standard assumptions on the group-relative baseline. This establishes the claimed alignment without introducing selection bias. revision: yes
Referee: §3.1 (Normalization): The normalization constants for the outcome-uncertainty term are not specified as fixed or batch-dependent; if data-dependent, this could undermine the claimed independence from the training distribution and the provable alignment.

Authors: The normalization constants are fixed hyperparameters chosen once from a small calibration set of prompts evaluated before the main training run; they are never recomputed from the current training batch or distribution. We have added an explicit statement to this effect in the revised §3.1, together with the precise numerical values used, thereby confirming that the independence property and the subsequent alignment proof remain intact. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the claimed derivation.

The paper defines the LZE score as a closed-form fusion of an initial-difficulty anchor, normalized outcome-uncertainty term, and pass-rate momentum, then asserts that this scalar is provably aligned with the expected magnitude of group-relative policy gradient updates. This alignment is presented as a theoretical property of the construction rather than a fitted parameter or self-citation. No equations in the provided abstract reduce the score to its inputs by construction, nor does the text invoke self-citations for uniqueness or load-bearing premises. The central claim retains independent content as a proposed online selection heuristic, with empirical results evaluated on external benchmarks (GSM8K, MATH, AIME25) separate from the score definition itself. This is the most common honest finding for a paper whose derivation is self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the new score being both theoretically aligned and practically effective; this rests on the assumption that the three chosen signals capture learning potential without additional fitted components or domain-specific tuning beyond what is described.

free parameters (1)

normalization constants or fusion coefficients for the three signals
The abstract describes a fused scalar but does not specify whether fixed or data-tuned coefficients are used to combine difficulty, uncertainty, and momentum.

axioms (1)

domain assumption: The three signals (initial-difficulty anchor, normalized outcome-uncertainty term, pass-rate momentum) are complementary and together identify the active learning frontier.
Invoked when defining the closed-form score that is claimed to be provably aligned with policy gradient magnitude.

invented entities (1)

Learning-Zone Energy Score (no independent evidence)
purpose: Single scalar that ranks prompts for inclusion in RL rollouts and gradient updates
New quantity defined in the paper to guide online data selection; no independent falsifiable prediction outside the training loop is stated.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the uncertainty term is aligned with the expected GRPO gradient variance under a standard fixed-baseline approximation (Theorem 1)"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "4p(1−p) ... peaks at p=0.5 and vanishes at both extremes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

[1]

Amro Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, and Ari S. Morcos. Semd- edup: Data-efficient learning at web-scale through semantic deduplication.arXiv preprint arXiv:2303.09540, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Deepseek AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-094 22-z

work page doi:10.1038/s41586-025-094 2025
[3]

Knowledge-Centric Hallucination Detection

Ahmadian Arash, Cremer Chris, Gallé Matthias, Fadaee Marzieh, Kreutzer Julia, Pietquin Olivier, Üstün Ahmet, and Hooker Sara. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267....

work page doi:10.18653/v1/2024 2024
[4]

Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Associa- tion for Computing Machinery, New York, NY, USA

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, 2009. ISBN 9781605585161. doi: 10.1145/1553374.1553380

work page doi:10.1145/1553374.1553380 2009
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

work page 1901
[6]

Alpagasus: Training a better alpaca with fewer data

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. InInternational Conference on Learning Representations (ICLR), pages 34767–34797, 2024

work page 2024
[7]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

work page 2021
[8]

RAFT: Reward ranked finetuning for generative foundation model alignment.Transactions on Machine Learning Research, 2023

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment.Transactions on Machine Learning Research, 2023. ISSN 2835-8856

work page 2023
[9]

Reinforced self-training (rest) for language modeling, 2023

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023

work page 2023
[10]

JustRL: Scaling a 1.5b LLM with a simple RL recipe, 2025

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, and Zhiyuan Liu. JustRL: Scaling a 1.5b LLM with a simple RL recipe, 2025

work page 2025
[11]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InProceedings of the Neural Information Processing Systems Track on Datasets and Bench- marks, volume 1, 2021

work page 2021
[12]

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! InDeep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, 2019. 10

work page 2019
[13]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[14]

Kumar, Benjamin Packer, and Daphne Koller

M. Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. InAdvances in Neural Information Processing Systems, volume 23, pages 1189–1197, 2010

work page 2010
[15]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations (ICLR), pages 39578–39601, 2024

work page 2024
[16]

(2017) Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017. doi: 10.1109/ICCV.2017.324

work page doi:10.1109/iccv.2017.324 2017
[17]

Not all tokens are what you need for pretraining

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. Not all tokens are what you need for pretraining. InAdvances in Neural Information Processing Systems, volume 37, pages 29029–29063, 2024. doi: 10.52202/079017-0914

work page doi:10.52202/079017-0914 2024
[18]

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. In International Conference on Learning Representations (ICLR), pages 22353–22373, 2024

work page 2024
[19]

Understanding r1-zero-like training: A critical perspective, 2025

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025

work page 2025
[20]

#instag: Instruction tagging for analyzing supervised fine-tuning of large language models

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. #instag: Instruction tagging for analyzing supervised fine-tuning of large language models. InInternational Conference on Learning Representations (ICLR), pages 36456–36474, 2024

work page 2024
[21]

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Yixiu Mao, Yun Qu, Cheems Wang, Heming Zou, and Xiangyang Ji. Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models. InInternational Conference on Learning Representations (ICLR), 2026

work page 2026
[22]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[23]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[24]

Iterative Reasoning Preference Optimization

Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative Reasoning Preference Optimization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 116617–116637. Curran Associates, Inc.,

work page
[25]

doi: 10.52202/079017-3702

work page doi:10.52202/079017-3702
[26]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven explo- ration by self-supervised prediction. In2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 488–489, 2017. doi: 10.1109/CVPRW.2017.70

work page doi:10.1109/cvprw.2017.70 2017
[27]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017
[28]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. 11

work page 2024
[29]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3 689031.3696075

work page doi:10.1145/3689031.3696075 2025
[30]

Training region-based object detectors with online hard example mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, June 2016

work page 2016
[31]

Be- yond human data: Scaling self-training for problem-solving with language models.Transactions on Machine Learning Research, 2024

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexan- der A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey P...

work page 2024
[32]

Beyond neural scaling laws: beating power law scaling via data pruning

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. InAdvances in Neural Information Processing Systems, volume 35, pages 19523–19536, 2022

work page 2022
[33]

Solving math word problems with process- and outcome-based feedback, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

work page 2022
[34]

Learning from mistakes via cooperative study assistant for large language models

Danqing Wang and Lei Li. Learning from mistakes via cooperative study assistant for large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10667–10685, December 2023. doi: 10.18653/v1/2023.emnlp-m ain.659

work page doi:10.18653/v1/2023.emnlp-m 2023
[35]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[36]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

work page 2022
[37]

Williams

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning.Mach. Learn., 8(3-4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696

work page doi:10.1007/bf00992696 1992
[38]

LESS: Selecting influential data for targeted instruction tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. InForty-first International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 54104–54132, 2024

work page 2024
[39]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection for language models via importance resampling. InAdvances in Neural Information Processing Systems, volume 36, pages 34201–34227, 2023

work page 2023
[40]

A minimalist approach to LLM reasoning: from rejection sampling to reinforce, 2025

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, and Hanze Dong. A minimalist approach to LLM reasoning: from rejection sampling to reinforce, 2025

work page 2025
[41]

Qwen2.5 technical report, 2024

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

work page 2024
[42] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024.

[43] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report, 2025.

[44] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc., 2023.

[45] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao... DAPO: An open-source LLM reinforcement learning system at scale, 2025.

[46] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024.

[47] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023.

[48] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 15476–15488. Curran Associates, Inc., 2022.

[49] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Seyed Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. In International Conference on Learning Representations (ICLR), pages 12476–12505, 2025.

[50] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: Less is more for alignment. In Advances in Neural Information Processing Systems, volume 36, pages 55006–55021, 2023.
