
arxiv: 2605.17003 · v2 · pith:DURSVGYQnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

Pith reviewed 2026-05-20 15:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data selection · reinforcement learning · LLM post-training · mathematical reasoning · online selection · policy gradient · compute efficiency

The pith

Learning-Zone Energy scores prompts to keep only 40 percent of data in RL post-training while matching or exceeding full-data results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Learning-Zone Energy as an online method to select which math-reasoning prompts deserve rollout and gradient updates during reinforcement learning post-training of large language models. It claims that current uniform approaches waste effort on prompts the model has already mastered or cannot yet solve. The method instead computes a single score from difficulty, uncertainty, and success-rate change, then uses that score to prune data while adding a replay buffer to catch forgetting. Experiments on Qwen models show the reduced set still reaches or beats baseline performance on GSM8K, MATH, and DAPO-MATH, with larger gains on harder out-of-distribution tests and lower overall compute.

Core claim

A closed-form Learning-Zone Energy Score fuses an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates; a forward pruner with replay then skips rollout generation for persistently solved prompts while periodically rechecking them.

What carries the argument

Learning-Zone Energy Score: a closed-form scalar combining difficulty anchor, outcome uncertainty, and pass-rate momentum that aligns with policy-gradient update size to guide data pruning.
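
To make the shape of such a score concrete, here is a minimal Python sketch. The paper's exact closed form is not reproduced on this page, so the functional form below is an assumption: a difficulty anchor multiplied by the normalized Bernoulli variance 4p(1-p), plus α times the absolute change in pass rate. The names `lze_score_sketch`, `d0`, `p`, and `p_prev` are hypothetical. Only the three ingredients and the default α = 0.3 come from the source.

```python
def lze_score_sketch(d0: float, p: float, p_prev: float, alpha: float = 0.3) -> float:
    """Hypothetical learning-zone score; NOT the paper's exact formula.

    d0     : initial-difficulty anchor in [0, 1] (assumed: 1 - initial pass rate)
    p      : current pass rate of the prompt group, in [0, 1]
    p_prev : pass rate at the previous evaluation of the same group
    alpha  : momentum coefficient (0.3 is the default reported in Figure 6)
    """
    for name, value in (("d0", d0), ("p", p), ("p_prev", p_prev)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    uncertainty = 4.0 * p * (1.0 - p)  # normalized Bernoulli variance, peaks at p = 0.5
    momentum = abs(p - p_prev)         # how fast the pass rate is moving
    return d0 * uncertainty + alpha * momentum
```

Under this assumed form, a group that is always solved (p = 1) or always failed (p = 0) with a stable pass rate scores zero, which matches the qualitative picture in Figure 4.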

If this is right

  • Retains only 40 percent of the training data per step
  • Matches or surpasses full-data baselines across GSM8K, MATH, and DAPO-MATH
  • Delivers larger gains on out-of-distribution sets such as AIME25 and AMC23
  • Cuts estimated training FLOPs by 36 percent
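
The 40 percent retention corresponds to a Top-K selection over per-group scores at each step. The following is a minimal sketch of that step, assuming scores have already been computed; the function name `select_top_fraction`, the tie-breaking rule, and the rounding (ceiling) are illustrative assumptions, not details taken from the paper.

```python
import math

def select_top_fraction(scores: dict[str, float], keep: float = 0.4) -> list[str]:
    """Return ids of the highest-scoring prompt groups, keeping ceil(keep * n) of them."""
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    k = math.ceil(keep * len(scores))
    # Sort by descending score; break ties by id so the result is deterministic.
    ranked = sorted(scores, key=lambda gid: (-scores[gid], gid))
    return ranked[:k]
```

Only the selected groups would then contribute to the policy-gradient update for that step.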

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scoring logic could be tested on non-math RL tasks such as code generation or instruction following to check whether compute savings generalize.
  • If the alignment between score and gradient magnitude holds, the method might reduce the need for large replay buffers in other online RL settings.
  • Periodic forgetting checks could be combined with curriculum scheduling to further stabilize training when data volume is reduced.

Load-bearing premise

The fused score is provably aligned with the expected magnitude of group-relative policy gradient updates.

What would settle it

Apply the same selection rule to a new model family or task suite and measure whether performance falls below the full-data baseline while data usage stays at 40 percent.

Figures

Figures reproduced from arXiv: 2605.17003 by Boyao Yang, Jun Zhu, Peng Cui.

Figure 1. Overview of the proposed method. The backward scorer computes the Learning-Zone Energy Score for each prompt group at every training step, guiding Top-K selection for policy updates. The forward pruner tracks group pass rates at the epoch level and skips rollout generation for groups that have been stably solved, with replay providing a safety mechanism to detect forgetting. Group Relative Policy Optimizat…
Figure 2. Pass@1 vs. wall-clock time for Qwen2.5-Math models. The × annotations denote speedup multipliers relative to the full-data baseline.
Efficiency and training progress. The proposed data selection method reduces theoretical training FLOPs by approximately 36% by restricting backpropagation to the top-40% fraction of prompt groups, while maintaining full rollout coverage to recompute Energy Scores at each s…
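
The 36% figure is consistent with simple accounting if only the policy-update share of compute is reduced. The sketch below assumes a two-part cost model (rollout generation plus policy update), ignores the forward pruner, and treats `update_share` as a hypothetical free parameter. None of these values are reported on this page.

```python
def flops_saving(keep: float, update_share: float) -> float:
    """Fraction of total FLOPs saved when only `keep` of prompt groups are backpropagated.

    Assumed cost model: total = rollout + update, with rollouts still run for all groups.
    update_share is the (hypothetical) fraction of total FLOPs spent on policy updates.
    """
    if not (0.0 <= keep <= 1.0 and 0.0 <= update_share <= 1.0):
        raise ValueError("keep and update_share must lie in [0, 1]")
    return (1.0 - keep) * update_share

# Under this model, keeping 40% of groups yields a 36% saving
# when updates account for about 60% of total cost.
print(round(flops_saving(keep=0.4, update_share=0.6), 2))
```

If rollouts dominated the cost instead, the same 40% retention would save much less, which is why the forward pruner's rollout skipping matters for wall-clock time.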
Figure 3. Ablation studies on Energy Score components, rollout count
Figure 4. Conceptual case study of the learning zone. Trivial prompts that are already solved (p ≈ 1) and overwhelmingly hard prompts that consistently fail (p ≈ 0) carry negligible Learning-Zone Energy and are therefore deprioritized. In contrast, frontier prompts with mixed rollout outcomes (p ≈ 0.5) receive the highest uncertainty weight and remain actively selected for GRPO updates.
Figure 5. Rollout-N sensitivity under an unconstrained budget. Each N runs for the same number of gradient steps; compute cost scales with N. Even under unconstrained conditions, LZE selection consistently outperforms the baseline at every N, and N=8 again achieves the best absolute performance.
Figure 6. Sensitivity to the momentum coefficient α. Both 1.5B and 7B models show consistent peaks at α = 0.3 (our default). At α = 0 the momentum term is absent; at α = 0.45 the score becomes overly reactive, destabilising selection.
… purely uncertainty-based criterion (no momentum). We sweep α ∈ {0, 0.15, 0.3, 0.45} on MATH with Qwen2.5-Math-1.5B and Qwen2.5-Math-7B, reporting Avg Pass@1 = (AMC23 + AIME25) / 2
Original abstract

Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.
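
The forward pruner with replay can be pictured as a small piece of bookkeeping around rollout generation. The sketch below is a hypothetical illustration, not the authors' implementation: the class name `ForwardPrunerSketch`, the `patience` rule for "persistently solved", and the random replay with probability `replay_prob` are all assumptions made for the example.

```python
import random

class ForwardPrunerSketch:
    """Hypothetical epoch-level pruner: skip stably solved groups, replay some to detect forgetting."""

    def __init__(self, solved_threshold=1.0, patience=2, replay_prob=0.1, seed=0):
        self.solved_threshold = solved_threshold  # pass rate counted as "solved" (assumed)
        self.patience = patience                  # consecutive solved epochs before skipping (assumed)
        self.replay_prob = replay_prob            # chance of rechecking a skipped group (assumed)
        self.streak = {}                          # group id -> consecutive solved epochs
        self.rng = random.Random(seed)

    def record(self, group_id, pass_rate):
        """Update the solved streak after a group has actually been rolled out."""
        if pass_rate >= self.solved_threshold:
            self.streak[group_id] = self.streak.get(group_id, 0) + 1
        else:
            self.streak[group_id] = 0  # any failure, including forgetting, resets the streak

    def should_rollout(self, group_id):
        """Return False to skip generation for a stably solved group, unless it is replayed."""
        if self.streak.get(group_id, 0) < self.patience:
            return True
        return self.rng.random() < self.replay_prob
```

A training loop would call `should_rollout` before generating samples for a group and `record` afterwards; a replayed group whose pass rate has dropped re-enters normal training because its streak resets.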

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Learning-Zone Energy (LZE), an online data selection method for efficient RL post-training of LLMs on math reasoning. It introduces a closed-form LZE score fusing an initial-difficulty anchor, normalized outcome-uncertainty term, and pass-rate momentum, asserted to be provably aligned with the expected magnitude of group-relative policy gradient (GRPO) updates. A forward pruner with replay skips rollouts for solved prompts. On Qwen models (1.5B–8B) across GSM8K, MATH, and DAPO-MATH, the method retains 40% of data per step while matching or exceeding full-data baselines, with OOD gains on AIME25 (+45.9%) and AMC23 (+18.2%) and ~36% FLOPs reduction. Code is released.

Significance. If the claimed provable alignment holds and the efficiency gains prove robust without hidden selection bias, the work could meaningfully advance compute-efficient RL post-training by concentrating effort on the learning frontier. The OOD improvements and code release are positive indicators for practical impact in scaling mathematical reasoning.

major comments (2)
  1. [§3.2] §3.2 (LZE Score Derivation): The central claim that the closed-form LZE score is 'provably aligned' with the expected magnitude of GRPO updates lacks explicit derivation steps equating the fused signals (initial-difficulty anchor + normalized uncertainty + pass-rate momentum) to E[|advantage|] or the GRPO estimator. This alignment is load-bearing for the justification of bias-free data selection.
  2. [§3.1] §3.1 (Normalization): The normalization constants for the outcome-uncertainty term are not specified as fixed or batch-dependent; if data-dependent, this could undermine the claimed independence from the training distribution and the provable alignment.
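
For orientation, the quantity the referee asks about has a simple closed form in an idealized case. The calculation below is a standard one under stated assumptions (binary rewards with pass rate p, and the population mean and standard deviation used in place of per-group estimates). It is not the paper's derivation and says nothing about the anchor or momentum terms.

```latex
% Assumption: r ~ Bernoulli(p), with 0 < p < 1, and a standardized advantage
\[
  A = \frac{r - p}{\sqrt{p(1-p)}}, \qquad r \sim \mathrm{Bernoulli}(p).
\]
% r = 1 with probability p gives |A| = (1-p)/\sqrt{p(1-p)};
% r = 0 with probability 1-p gives |A| = p/\sqrt{p(1-p)}.
\[
  \mathbb{E}\,|A|
  = p \cdot \frac{1-p}{\sqrt{p(1-p)}} + (1-p) \cdot \frac{p}{\sqrt{p(1-p)}}
  = \frac{2p(1-p)}{\sqrt{p(1-p)}}
  = 2\sqrt{p(1-p)}.
\]
```

This expression peaks at p = 1/2, where it equals 1, and tends to zero as p approaches 0 or 1, which is the behavior an uncertainty term would need to track for the alignment claim to hold in this simplified setting.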
minor comments (2)
  1. [Experiments] Experiments section: Reported gains on AIME25 and AMC23 lack error bars, number of runs, or variance details, which would strengthen assessment of robustness.
  2. [Introduction] The abstract and introduction could more explicitly contrast LZE against prior online data selection or curriculum methods in RL for LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additions are warranted, we have revised the manuscript to strengthen the presentation of the LZE derivation and normalization details.

  1. Referee: [§3.2] §3.2 (LZE Score Derivation): The central claim that the closed-form LZE score is 'provably aligned' with the expected magnitude of GRPO updates lacks explicit derivation steps equating the fused signals (initial-difficulty anchor + normalized uncertainty + pass-rate momentum) to E[|advantage|] or the GRPO estimator. This alignment is load-bearing for the justification of bias-free data selection.

    Authors: We agree that the original presentation would benefit from more explicit steps. In the revised manuscript we have expanded §3.2 with a full derivation (now also summarized in a new appendix) that starts from the GRPO advantage estimator, takes the expectation of its absolute value, and shows term-by-term that the initial-difficulty anchor supplies the baseline scale, the normalized uncertainty term bounds the outcome variance contribution, and the pass-rate momentum corrects for temporal drift, yielding an expression proportional to E[|advantage|] under standard assumptions on the group-relative baseline. This establishes the claimed alignment without introducing selection bias. revision: yes

  2. Referee: [§3.1] §3.1 (Normalization): The normalization constants for the outcome-uncertainty term are not specified as fixed or batch-dependent; if data-dependent, this could undermine the claimed independence from the training distribution and the provable alignment.

    Authors: The normalization constants are fixed hyperparameters chosen once from a small calibration set of prompts evaluated before the main training run; they are never recomputed from the current training batch or distribution. We have added an explicit statement to this effect in the revised §3.1, together with the precise numerical values used, thereby confirming that the independence property and the subsequent alignment proof remain intact. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the claimed derivation.


The paper defines the LZE score as a closed-form fusion of an initial-difficulty anchor, normalized outcome-uncertainty term, and pass-rate momentum, then asserts that this scalar is provably aligned with the expected magnitude of group-relative policy gradient updates. This alignment is presented as a theoretical property of the construction rather than a fitted parameter or self-citation. No equations in the provided abstract reduce the score to its inputs by construction, nor does the text invoke self-citations for uniqueness or load-bearing premises. The central claim retains independent content as a proposed online selection heuristic, with empirical results evaluated on external benchmarks (GSM8K, MATH, AIME25) separate from the score definition itself. This is the most common finding for a paper whose derivation is self-contained and whose validation rests on external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the new score being both theoretically aligned and practically effective; this rests on the assumption that the three chosen signals capture learning potential without additional fitted components or domain-specific tuning beyond what is described.

free parameters (1)
  • normalization constants or fusion coefficients for the three signals
    The abstract describes a fused scalar but does not specify whether fixed or data-tuned coefficients are used to combine difficulty, uncertainty, and momentum.
axioms (1)
  • domain assumption: The three signals (initial-difficulty anchor, normalized outcome-uncertainty term, pass-rate momentum) are complementary and together identify the active learning frontier.
    Invoked when defining the closed-form score that is claimed to be provably aligned with policy gradient magnitude.
invented entities (1)
  • Learning-Zone Energy Score (no independent evidence)
    purpose: Single scalar that ranks prompts for inclusion in RL rollouts and gradient updates
    New quantity defined in the paper to guide online data selection; no independent falsifiable prediction outside the training loop is stated.

pith-pipeline@v0.9.0 · 5789 in / 1595 out tokens · 59918 ms · 2026-05-20T15:45:51.039980+00:00 · methodology


