pith. machine review for the scientific record. sign in

arxiv: 2601.04809 · v5 · submitted 2026-01-08 · 💻 cs.AI

Recognition: 1 theorem link

· Lean Theorem

SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning

Authors on Pith no claims yet

Pith reviewed 2026-05-16 16:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learninglarge language modelsreasoningsynthetic environmentsadaptive trainingscalable synthesisprogramming problemsRL for LLMs
0
0 comments X

The pith

SCALER sustains informative RL signals for LLM reasoning by adapting synthetic environments to model capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SCALER to address slowdowns in reinforcement learning for large language models when task difficulty misaligns with model progress or when training repeats narrow patterns. It introduces a synthesis pipeline that turns real programming problems into verifiable reasoning environments offering controllable difficulty and unlimited new instances. An adaptive multi-environment strategy then tracks the model's capability frontier by adjusting difficulty and curating environments to preserve diversity. This combination prevents reward sparsity and overfitting, yielding stronger performance on reasoning benchmarks and more stable training over long horizons than fixed-dataset baselines.

Core claim

SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity, preventing reward sparsity, mitigating overfitting to narrow task patterns, and supporting sustained improvement throughout training.

What carries the argument

The scalable synthesis pipeline that generates verifiable reasoning environments from programming problems, paired with an adaptive multi-environment RL strategy that co-adjusts difficulty and environment selection to the model's evolving frontier.

If this is right

  • RL training can proceed with unbounded new instances instead of exhausting a fixed dataset.
  • Difficulty adjustments keep reward signals informative as model capability grows.
  • Models reach higher performance across diverse reasoning benchmarks.
  • Training remains stable over long horizons without overfitting to recurring patterns.
  • Distributional diversity across environments reduces collapse to narrow solution strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to domains outside programming by synthesizing tasks from other verifiable sources such as math proofs or scientific hypotheses.
  • It suggests synthetic generation might eventually replace much of the need for static curated datasets in scaling RL for reasoning.
  • Testing the method on models larger than those in the experiments would reveal whether the co-adaptation continues to track capability frontiers effectively.

Load-bearing premise

The synthesis pipeline can reliably convert real programming problems into verifiable environments with controllable difficulty and strong correctness guarantees while the adaptive strategy avoids introducing distributional biases or reward sparsity.

What would settle it

If models trained with SCALER show no consistent outperformance over dataset-based RL baselines on standard reasoning benchmarks or if reward sparsity increases as training lengthens, the central claim would be falsified.

read the original abstract

Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SCALER, a framework for RL-based improvement of LLM reasoning. It introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments featuring controllable difficulty and unbounded instance generation while preserving strong correctness guarantees. It further describes an adaptive multi-environment RL strategy that dynamically adjusts difficulty and curates environments to track the model's capability frontier, maintain distributional diversity, prevent reward sparsity, and mitigate overfitting to narrow patterns. Experiments claim consistent outperformance over dataset-based RL baselines across diverse reasoning benchmarks together with more stable long-horizon training dynamics.

Significance. If the synthesis pipeline delivers sound, automated verification for unbounded instances and the adaptive curation avoids distributional collapse, SCALER would address two central bottlenecks in RL for reasoning—misaligned difficulty and pattern overfitting—potentially enabling sustained improvement beyond finite datasets. The reported stability in long-horizon dynamics would be a notable practical advantage for training large models.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.
  2. [Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.
minor comments (1)
  1. [Abstract] The abstract and experimental claims would benefit from explicit reporting of the number of environments, difficulty parameterization, and verification success rate to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer exposition of the verification and curation mechanisms. We address each major comment below and will incorporate clarifications and additional details in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.

    Authors: The synthesis pipeline starts from real-world programming problems whose solutions are already equipped with executable test suites. Each generated instance is produced by a semantics-preserving transformation (variable renaming, control-flow restructuring, and input scaling) whose equivalence to the original is checked by running the reference solution on both the original and transformed test cases. Any instance failing this check is discarded. This provides an automated, execution-based verification mechanism that scales to unbounded generation while inheriting the original problems' correctness guarantees. We agree the manuscript under-emphasizes this step and will add an explicit subsection in §3 describing the equivalence-checking procedure, failure rates, and soundness argument. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.

    Authors: The adaptive strategy maintains a dynamic environment pool whose composition is updated every K episodes according to two metrics: (1) per-environment success rate, used to estimate the current capability frontier and to up-weight environments near the frontier while down-weighting those that are either too easy or too hard; (2) environment-type entropy computed over a sliding window of recent trajectories, which triggers re-sampling when entropy falls below a threshold to restore diversity. Reward sparsity is monitored via the fraction of zero-reward episodes; when this exceeds a threshold, the system injects easier environments from the pool. These mechanisms are described at a high level in §4; we will expand the section with the precise update rules, threshold values, and ablation results showing their effect on distributional collapse and training stability. revision: yes

Circularity Check

0 steps flagged

No circularity detected in SCALER framework proposal

full rationale

The paper introduces SCALER as a novel framework with a synthesis pipeline for converting programming problems into verifiable environments and an adaptive multi-environment RL strategy for tracking capability frontiers. These elements are presented as original design contributions rather than derived from prior equations or parameters. Claims of outperformance and stable dynamics rest on reported experimental results across benchmarks, not on any self-referential reductions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that collapse back to the inputs by construction. The derivation chain is the proposal and empirical validation of the method itself, which remains self-contained and externally testable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the synthesis pipeline and adaptive strategy are described at high level without detailing underlying assumptions or fitted quantities.

pith-pipeline@v0.9.0 · 5506 in / 1100 out tokens · 21886 ms · 2026-05-16T16:19:30.273173+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 14 internal anchors

  1. [1]

    Amc 12 (2023) problems.https://artofproblemsolving.com/wiki/index.php/2023_AMC_ 12A_Problems, 2023

    Art of Problem Solving. Amc 12 (2023) problems.https://artofproblemsolving.com/wiki/index.php/2023_AMC_ 12A_Problems, 2023. We use the 2023 AMC 12A/12B problems as hosted by AoPS; a processed version is available at https://huggingface.co/datasets/zwhe99/amc23

  2. [2]

    Bytedance-Seed-Foundation-Code-Team, :, Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, ZhengyuChen, ShijieGeng, AoyanLi, BoLi, BowenLi, LinyiLi, BoyiLiu, JiahengLiu, KaiboLiu, QiLiu, ShukaiLiu, Siyao Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z. Y. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan S...

  3. [3]

    Towards understanding self-play for llm reasoning, 2025

    Justin Yang Chae, Md Tanvirul Alam, and Nidhi Rastogi. Towards understanding self-play for llm reasoning, 2025. URLhttps://arxiv.org/abs/2510.27072

  4. [4]

    Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K. Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning, 2025. URLhttps://arxiv.org/abs/ 2504.19162

  5. [5]

    Self-evolving curriculum for llm reasoning.arXivpreprintarXiv:2505.14970, 2025

    Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning.arXivpreprintarXiv:2505.14970, 2025

  6. [6]

    Fromdata-centrictosample- centric: Enhancing llm reasoning via progressive optimization, 2025

    XinjieChen,MinpengLiao,GuoxinChen,ChengxiLi,BiaoFu,KaiFan,andXinggaoLiu. Fromdata-centrictosample- centric: Enhancing llm reasoning via progressive optimization, 2025. URLhttps://arxiv.org/abs/2507.06573

  7. [7]

    Scaledowntospeedup: Dynamicdataselection for reinforcement learning.Training, 2500:3000, 2025

    ZhuoyueChen,JihaiZhang,BenLiu,FangquanLin,andWotaoYin. Scaledowntospeedup: Dynamicdataselection for reinforcement learning.Training, 2500:3000, 2025

  8. [8]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  9. [9]

    Serl: Self-play reinforcement learning for large language models with limited data, 2025

    Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. Serl: Self-play reinforcement learning for large language models with limited data, 2025. URL https://arxiv.org/abs/2505.20347

  10. [10]

    16 Klear-codetest: Scalable test case generation for code reinforcement learning, 2025

    Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, and Guorui Zhou. 16 Klear-codetest: Scalable test case generation for code reinforcement learning, 2025. URLhttps://arxiv.org/abs/ 2508.05710

  11. [11]

    Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D. Manning. Synthetic data generation & multi-step rl for reasoning & tool use, 2025. URLhttps://arxiv.org/abs/2504.04736

  12. [12]

    Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, 2025

    Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, 2025. URL https://arxiv.org/abs/2512.19682

  13. [13]

    Synthetic data rl: Task definition is all you need, 2025

    Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, and Yikang Shen. Synthetic data rl: Task definition is all you need, 2025. URLhttps://arxiv.org/abs/2505.17063

  14. [14]

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXivpreprintarXiv:2504.11456, 2025

  15. [15]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URLhttps://arxiv.org/abs/ 2103.03874

  16. [16]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

  17. [17]

    BIG-benchextrahard

    Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa,KateOlszewska,YiTay,VinhQ.Tran,QuocV.Le,andOrhanFirat. BIG-benchextrahard. In Proceedingsof the 63rdAnnual Meeting of the...

  18. [18]

    Learnalign: Reasoningdataselectionforreinforcementlearninginlargelanguagemodelsbasedonimprovedgradientalignment,

    ShipengLi,ShikunLi,ZhiqinYang,XinghuaZhang,GaodeChen,XiaoboXia,HengyuLiu,andZhePeng. Learnalign: Reasoningdataselectionforreinforcementlearninginlargelanguagemodelsbasedonimprovedgradientalignment,

  19. [19]

    URLhttps://arxiv.org/abs/2506.11480

  20. [20]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Axel Gimeno Gil, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robs...

  21. [21]

    Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning, 2025

    Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, and Weizhu Chen. Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs/2506.08989

  22. [22]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023. URL https://arxiv.org/abs/2305.20050. Releases PRM800K and the MATH-500 (500-problem) test subset via https://github.com/openai/prm800k

  23. [23]

    Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning

    Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, et al. Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. arXivpreprintarXiv:2506.24119, 2025

  24. [24]

    Saturn: Sat-based reinforcement learning to unleash language model reasoning.arXivpreprint arXiv:2505.16368, 2025

    Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, and Ge Li. Saturn: Sat-based reinforcement learning to unleash language model reasoning.arXivpreprint arXiv:2505.16368, 2025

  25. [25]

    Openai o1 system card,

    OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  26. [26]

    URLhttps://arxiv.org/abs/2412.16720

  27. [28]

    Curriculum reinforcement learning from easy to hard tasks improves llm reasoning

    Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, et al. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning. arXiv preprintarXiv:2506.06632, 2025

  28. [29]

    Vanishing gradients in reinforcement finetuning of language models.arXiv preprint arXiv:2310.20703, 2023

    Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin. Vanishing gradients in reinforcement finetuning of language models.arXiv preprint arXiv:2310.20703, 2023

  29. [30]

    What makes a reward model a good teacher? an optimization perspective.arXivpreprint arXiv:2503.15477, 2025

    Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D Lee, and Sanjeev Arora. What makes a reward model a good teacher? an optimization perspective.arXivpreprint arXiv:2503.15477, 2025

  30. [31]

    URLhttps://arxiv.org/abs/2406.14532

    AmrithSetlur,SaurabhGarg,XinyangGeng,NamanGarg,VirginiaSmith,andAviralKumar.Rlonincorrectsynthetic data scales the efficiency of llm math reasoning by eight-fold, 2024. URLhttps://arxiv.org/abs/2406.14532

  31. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [33]

    Efficient reinforcement finetuning via adaptive curriculum learning.arXivpreprintarXiv:2504.05520, 2025

    Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning.arXivpreprintarXiv:2504.05520, 2025

  33. [34]

    Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760, 2025

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas 18 Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760, 2025

  34. [35]

    Kimi k1.5: Scaling reinforcement learning with llms, 2025

    Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms, 2025

  35. [36]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

  36. [37]

    Scheduling Your LLM Reinforcement Learning with Reasoning Trees

    Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, and Jiawei Chen. Scheduling your llm reinforcement learning with reasoning trees, 2025. URLhttps://arxiv.org/abs/2510.24832

  37. [38]

    Improvingrationalityinthereasoningprocess of language models through self-playing game, 2025

    PinzhengWang,JuntaoLi,ZechengTang,HaijiaGui,andMinzhang. Improvingrationalityinthereasoningprocess of language models through self-playing game, 2025. URLhttps://arxiv.org/abs/2506.22920

  38. [40]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    URLhttps://arxiv.org/abs/2406.01574. Accepted at NeurIPS 2024 Datasets and Benchmarks (Spotlight). Dataset:https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

  39. [41]

    Synthrl: Scalingvisualreasoning with verifiable data synthesis, 2025

    ZijianWu,JinjieNi,XiangyanLiu,ZichenLiu,HangYan,andMichaelQizheShieh. Synthrl: Scalingvisualreasoning with verifiable data synthesis, 2025. URLhttps://arxiv.org/abs/2506.02096

  40. [42]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, WangZhang,HangZhu,JinhuaZhu,JiazeChen,JiangjieChen,ChengyiWang,HongliYu,YuxuanSong,Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Q...

  41. [43]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    AohanZeng,XinLv,QinkaiZheng,ZhenyuHou,BinChen,ChengxingXie,CunxiangWang,DaYin,HaoZeng,Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

  42. [44]

    Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXivpreprintarXiv:2511.07317, 2025

    Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, et al. Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXivpreprintarXiv:2511.07317, 2025

  43. [45]

    Confclip: Confidence-weighted and clipped reward for reinforcement learning in llms, 2025

    Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, and Guihai Chen. Confclip: Confidence-weighted and clipped reward for reinforcement learning in llms, 2025. URLhttps://arxiv.org/abs/2509.17730

  44. [46]

    Aime 2024.https://huggingface.co/datasets/math-ai/aime24, 2024

    Yifan Zhang and Team Math-AI. Aime 2024.https://huggingface.co/datasets/math-ai/aime24, 2024

  45. [47]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    AndrewZhao,YiranWu,YangYue,TongWu,QuentinXu,MatthieuLin,ShenzhiWang,QingyunWu,ZilongZheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025. 19