pith. machine review for the scientific record.

arxiv: 2605.15012 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 1 theorem link

· Lean Theorem

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · verifiable rewards · few-shot guidance · large language models · sample efficiency · supervised fine-tuning · on-policy learning

The pith

FEST boosts RLVR performance using only 128 randomly selected demonstrations from SFT data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes FEST, a few-shot demonstration-guided algorithm for reinforcement learning with verifiable rewards. It demonstrates that randomly selecting just 128 examples from a supervised fine-tuning dataset suffices to achieve strong results on challenging tasks like math and coding. The success relies on combining supervised signals, on-policy signals, and decaying weights to avoid overfitting during repeated training epochs. This approach significantly reduces the amount of data needed compared to standard methods while matching their performance.
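To make the mechanism concrete, here is a minimal sketch of what such a combined objective could look like: a cross-entropy term on the 128 demonstrations mixed with a policy-gradient term on verifiable-reward rollouts, with a decaying weight on the supervised term. The linear decay, tensor names, and weighting are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a FEST-style combined objective (not the authors' code).
# The tensors are assumed to be produced elsewhere by the training loop, and the
# linear decay below is an illustrative choice of schedule.
import torch


def sft_weight(epoch: int, total_epochs: int, w0: float = 1.0) -> float:
    """Illustrative linearly decaying weight on the few-shot SFT term."""
    return w0 * max(0.0, 1.0 - epoch / total_epochs)


def fest_style_loss(
    demo_logprobs: torch.Tensor,     # log-probs of tokens from the 128 demonstrations
    rollout_logprobs: torch.Tensor,  # log-probs of tokens from on-policy rollouts
    advantages: torch.Tensor,        # verifiable-reward advantages for those rollouts
    epoch: int,
    total_epochs: int,
) -> torch.Tensor:
    sft_term = -demo_logprobs.mean()                             # supervised signal
    rl_term = -(advantages.detach() * rollout_logprobs).mean()   # on-policy signal
    return sft_weight(epoch, total_epochs) * sft_term + rl_term
```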

Core claim

FEST attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. Three components are vital: the supervised signal, the on-policy signal, and decaying weights on the few-shot SFT data to prevent overfitting during multi-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, and even matches the performance those baselines achieve with the full dataset.

What carries the argument

The FEST algorithm that integrates few-shot supervised guidance with on-policy RL signals under a decaying weight schedule on the demonstration data.

If this is right

  • RLVR methods can succeed with far less supervised data than previously required.
  • Random selection of demonstrations is sufficient for effective guidance.
  • Decaying weights enable multiple epochs of training on small datasets without overfitting.
  • Performance on math and coding tasks can match full-dataset approaches using minimal examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This technique could lower the barrier for training reasoning-capable LLMs by reducing data acquisition costs.
  • It may inspire similar few-shot guidance strategies in other reinforcement learning domains beyond language models.
  • Exploring adaptive selection or weighting beyond random choice could further improve results.

Load-bearing premise

Randomly chosen demonstrations from an SFT dataset will provide effective guidance when mixed with on-policy signals and subject to decaying weights.
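As a concrete illustration of this premise, and of how sensitive it might be to the particular draw, here is a hypothetical sketch of sampling a 128-example subset and measuring metric variance across seeds. `sft_pool` and `evaluate_fest` are stand-ins for the demonstration pool and the full train-and-evaluate pipeline, which are not reproduced here.

```python
# Hypothetical sketch: draw a random 128-demo subset and probe seed sensitivity.
# `sft_pool` is any list of demonstrations; `evaluate_fest` is a stand-in callable
# mapping a demo subset to a benchmark score and is not implemented here.
import random
import statistics


def sample_demos(sft_pool, k=128, seed=0):
    return random.Random(seed).sample(sft_pool, k)


def seed_sensitivity(sft_pool, evaluate_fest, seeds=range(5), k=128):
    """Mean and standard deviation of the metric across independent random subsets."""
    scores = [evaluate_fest(sample_demos(sft_pool, k, seed=s)) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

A small spread across seeds would support the premise; a large one would suggest the reported gains depend on a favorable draw.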

What would settle it

An experiment on a difficult new benchmark showing either that FEST with 128 random demonstrations performs no better than pure RLVR, or that non-random selection is required to match full-SFT results.

Figures

Figures reproduced from arXiv: 2605.15012 by Alexander G. Schwing, Kai Yan, Yu-Xiong Wang.

Figure 2
Figure 2: Performance scalability across varying shot counts. Dashed lines represent baseline results utilizing the full 46K SFT dataset. While FEST-GRPO provides higher robustness in the extreme few-shot case (64 shots), FEST-DPO exhibits stronger scaling ability with more data. To evaluate the scalability of our approach across varying sizes of DE, we further test our method with 64, 256, and 512 examples. The …
Figure 3
Figure 3: Evolution of SFT trajectory ratios and corresponding Pass@1 performance during HPT
Figure 4
Figure 4: Gradient norm comparison between DPO and GRPO objectives when applied independently.
Figure 5
Figure 5: Training set accuracy profiles for ReLIFT-G and HPT-G. Both variants exhibit significant …
Figure 6
Figure 6: Implicit advantage z as a function of β. Panel (a) shows that average z scales approximately linearly with β, i.e., the log-ratio difference is nearly consistent and our assumption holds for β ∈ [0.001, 0.1]. Panel (b) reveals that higher β values result in a wider distribution of z, where a few examples with very low z receive intense “switch-like” signals. Conversely, lower β values yield a more concentr…
Figure 7
Figure 7: Reward curves on DE during training. Results for the “higher group” (LUFFY, ReLIFT-G, HPT-G, and RL-G) are excluded from these plots as their direct RL training on DE leads to near 100% training accuracy, masking meaningful comparative dynamics. The performance implications of this overfitting are discussed in Appendix D.1 and Tab. 3. (Plot axes: Step on x, Reward (Avg@8) on y.)
Figure 8
Figure 8: Smoothed reward curves on DI utilizing a time-weighted exponential moving average. Panels: (a) DI (RL training set) accuracy; (b) zoomed-in c… (Plot axes: Step on x, Pass@1 on y.)
Figure 9
Figure 9: Smoothed Pass@1 performance on the test set through the training process.
Figure 10
Figure 10: Training and test set performance curves for the LIMOv2-8192 experiment described in …
Figure 11
Figure 11: Test set performance comparison between FEST, standalone SFT, and SPIN. Our method …
read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FEST, a few-shot demonstration-guided RLVR algorithm for LLMs that uses only 128 randomly selected demonstrations from an SFT dataset. It claims that combining supervised signals, on-policy signals, and decaying weights on the few-shot data prevents overfitting during multi-epoch training and yields compelling results on math and coding benchmarks, outperforming baselines that require orders of magnitude more SFT data and even matching full-dataset performance.

Significance. If the results hold under proper robustness checks, the work could meaningfully advance sample-efficient RLVR by showing that carefully weighted few-shot guidance from minimal random subsets can substitute for large-scale SFT, reducing data acquisition costs while maintaining or exceeding performance. The emphasis on the interplay of supervised, on-policy, and decaying-weight components offers actionable guidance for practitioners working on verifiable-reward fine-tuning.

major comments (2)
  1. [Experimental Results] The central claim that 'randomly selected' 128 demonstrations suffice depends on the assumption that subset variance is negligible. The experimental section reports results for a single (or unreported number of) random draw(s) but provides no statistics over multiple independent random selections of the 128 examples, no standard deviation across seeds, and no ablation isolating subset quality. This is load-bearing: if example difficulty or quality varies within the SFT pool, the reported gains may reflect a favorable draw rather than a general property of FEST. Please add results from at least five independent random 128-subsets with mean and std-dev metrics.
  2. [Ablation Studies] The paper states that supervised signal, on-policy signal, and decaying weights are 'vital,' yet the ablation tables do not quantify the performance drop when each component is removed individually while keeping the others fixed, nor do they isolate the effect of the decaying-weight schedule on multi-epoch overfitting. Without these controls, the necessity of all three components for the few-shot regime remains incompletely supported.
minor comments (2)
  1. [Abstract] The abstract refers to 'several benchmarks' and 'outperforms baselines' without naming the specific tasks, baselines, or quantitative deltas; moving at least the headline numbers into the abstract would improve readability.
  2. [Method] Notation for the decaying weight schedule (e.g., the functional form and hyper-parameters) should be defined once in a dedicated subsection and then referenced consistently in equations and figures.
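For illustration of what minor comment 2 asks for, two hypothetical functional forms a decaying-weight schedule could take are sketched below. Neither is claimed to be the schedule used in the paper; the actual form and hyper-parameters are not given in the material above.

```python
# Two hypothetical forms a decaying-weight schedule could take; neither is claimed
# to be the schedule used in the paper.
import math


def exponential_decay(epoch, w0=1.0, gamma=0.5):
    """w_t = w0 * gamma**t; halves the SFT weight each epoch when gamma = 0.5."""
    return w0 * gamma ** epoch


def cosine_decay(epoch, total_epochs, w0=1.0):
    """w_t = w0 * 0.5 * (1 + cos(pi * t / T)); decays smoothly to zero at epoch T."""
    return w0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```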

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We agree that additional robustness checks and more granular ablations will strengthen the manuscript and plan to incorporate them in the revision.

read point-by-point responses
  1. Referee: [Experimental Results] The central claim that 'randomly selected' 128 demonstrations suffice depends on the assumption that subset variance is negligible. The experimental section reports results for a single (or unreported number of) random draw(s) but provides no statistics over multiple independent random selections of the 128 examples, no standard deviation across seeds, and no ablation isolating subset quality. This is load-bearing: if example difficulty or quality varies within the SFT pool, the reported gains may reflect a favorable draw rather than a general property of FEST. Please add results from at least five independent random 128-subsets with mean and std-dev metrics.

    Authors: We acknowledge that reporting results from only a single random draw leaves the claim vulnerable to subset-specific effects. We will rerun the full evaluation pipeline on at least five independent random selections of 128 examples, report mean performance together with standard deviation, and include these statistics in the revised experimental section. revision: yes

  2. Referee: [Ablation Studies] The paper states that supervised signal, on-policy signal, and decaying weights are 'vital,' yet the ablation tables do not quantify the performance drop when each component is removed individually while keeping the others fixed, nor do they isolate the effect of the decaying-weight schedule on multi-epoch overfitting. Without these controls, the necessity of all three components for the few-shot regime remains incompletely supported.

    Authors: We agree that the current ablations do not fully isolate each factor. In the revision we will add controlled experiments that (i) remove the supervised signal, on-policy signal, and decaying-weight schedule one at a time while holding the other two fixed, and (ii) compare fixed-weight versus decaying-weight schedules across multiple training epochs to quantify the overfitting-prevention effect. These results will be presented in an expanded ablation table. revision: yes
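A hypothetical sketch of the one-factor-at-a-time ablation grid promised in this response; `run_fest` stands in for the real training-and-evaluation pipeline and is not implemented here.

```python
# Hypothetical one-factor-at-a-time ablation grid for the promised controls.
# `run_fest` is a stand-in for the real training-and-evaluation pipeline.
BASE = {"supervised_signal": True, "on_policy_signal": True, "decaying_weights": True}


def ablation_configs(base=BASE):
    """Yield the full method plus variants with exactly one component disabled."""
    yield dict(base)
    for component in base:
        cfg = dict(base)
        cfg[component] = False
        yield cfg


def run_ablation(run_fest):
    return {frozenset(cfg.items()): run_fest(**cfg) for cfg in ablation_configs()}
```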

Circularity Check

0 steps flagged

No circularity; empirical results rest on external comparisons

full rationale

The paper introduces FEST as an empirical algorithm combining supervised fine-tuning signals, on-policy RL signals, and decaying weights on a small random subset of 128 SFT demonstrations. Its central claims are supported by reported benchmark comparisons showing outperformance relative to baselines that use more data. No equations, uniqueness theorems, or derivations are present that reduce by construction to fitted inputs or self-citations. The three vital components are identified from experimental findings rather than being presupposed in the method definition. The approach is evaluated against external benchmarks, with no load-bearing self-citation chains or ansatzes smuggled in via prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard RL assumptions plus the empirical claim that the three listed components suffice when applied to randomly chosen few-shot data.

free parameters (2)
  • number of demonstrations = 128
Fixed at 128 and randomly selected from the SFT dataset; the exact count is a design choice that the results depend on.
  • decaying weight schedule
    Weights applied to the few-shot SFT data that must decay over epochs to prevent overfitting; the schedule is not specified in the abstract.
axioms (2)
  • domain assumption Demonstration guidance from SFT data can usefully supplement RLVR when correct rollouts are rare.
    Invoked when stating that prior works use SFT after RL fails and that FEST improves on this.
  • domain assumption On-policy signals and supervised signals can be productively combined in the same training loop.
    Stated as one of the three vital components without further justification in the abstract.

pith-pipeline@v0.9.0 · 5486 in / 1466 out tokens · 62425 ms · 2026-05-15T03:14:12.279889+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · 15 internal anchors
