Recognition: 1 theorem link
SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
Pith reviewed 2026-05-16 16:19 UTC · model grok-4.3
The pith
SCALER sustains informative RL signals for LLM reasoning by adapting synthetic environments to model capability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity, preventing reward sparsity, mitigating overfitting to narrow task patterns, and supporting sustained improvement throughout training.
What carries the argument
The scalable synthesis pipeline that generates verifiable reasoning environments from programming problems, paired with an adaptive multi-environment RL strategy that co-adjusts difficulty and environment selection to the model's evolving frontier.
If this is right
- RL training can proceed with unbounded new instances instead of exhausting a fixed dataset.
- Difficulty adjustments keep reward signals informative as model capability grows.
- Models reach higher performance across diverse reasoning benchmarks.
- Training remains stable over long horizons without overfitting to recurring patterns.
- Distributional diversity across environments reduces collapse to narrow solution strategies.
Where Pith is reading between the lines
- The approach could extend to domains outside programming by synthesizing tasks from other verifiable sources such as math proofs or scientific hypotheses.
- It suggests synthetic generation might eventually replace much of the need for static curated datasets in scaling RL for reasoning.
- Testing the method on models larger than those in the experiments would reveal whether the co-adaptation continues to track capability frontiers effectively.
Load-bearing premise
The synthesis pipeline can reliably convert real programming problems into verifiable environments with controllable difficulty and strong correctness guarantees while the adaptive strategy avoids introducing distributional biases or reward sparsity.
What would settle it
If models trained with SCALER show no consistent outperformance over dataset-based RL baselines on standard reasoning benchmarks or if reward sparsity increases as training lengthens, the central claim would be falsified.
read the original abstract
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SCALER, a framework for RL-based improvement of LLM reasoning. It introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments featuring controllable difficulty and unbounded instance generation while preserving strong correctness guarantees. It further describes an adaptive multi-environment RL strategy that dynamically adjusts difficulty and curates environments to track the model's capability frontier, maintain distributional diversity, prevent reward sparsity, and mitigate overfitting to narrow patterns. Experiments claim consistent outperformance over dataset-based RL baselines across diverse reasoning benchmarks together with more stable long-horizon training dynamics.
Significance. If the synthesis pipeline delivers sound, automated verification for unbounded instances and the adaptive curation avoids distributional collapse, SCALER would address two central bottlenecks in RL for reasoning—misaligned difficulty and pattern overfitting—potentially enabling sustained improvement beyond finite datasets. The reported stability in long-horizon dynamics would be a notable practical advantage for training large models.
major comments (2)
- [Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.
- [Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.
minor comments (1)
- [Abstract] The abstract and experimental claims would benefit from explicit reporting of the number of environments, difficulty parameterization, and verification success rate to allow reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for clearer exposition of the verification and curation mechanisms. We address each major comment below and will incorporate clarifications and additional details in the revised manuscript.
read point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.
Authors: The synthesis pipeline starts from real-world programming problems whose solutions are already equipped with executable test suites. Each generated instance is produced by a semantics-preserving transformation (variable renaming, control-flow restructuring, and input scaling) whose equivalence to the original is checked by running the reference solution on both the original and transformed test cases. Any instance failing this check is discarded. This provides an automated, execution-based verification mechanism that scales to unbounded generation while inheriting the original problems' correctness guarantees. We agree the manuscript under-emphasizes this step and will add an explicit subsection in §3 describing the equivalence-checking procedure, failure rates, and soundness argument. revision: yes
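The execution-based check the rebuttal describes could be sketched as follows. All names here are hypothetical, and the actual pipeline presumably executes sandboxed code rather than Python callables; this is a minimal sketch of the discard logic, not the paper's implementation:

```python
def passes(solution, tests):
    """Run a reference solution (a callable) against (input, expected_output) pairs."""
    return all(solution(test_input) == expected for test_input, expected in tests)

def verify_instance(reference_solution, original_tests, transformed_tests):
    """Keep a synthesized instance only if the reference solution passes both
    the original test suite and the transformed instance's test suite.
    Any instance failing this execution-based equivalence check is discarded."""
    return passes(reference_solution, original_tests) and passes(
        reference_solution, transformed_tests
    )
```

Under this scheme a transformation that silently changes semantics is caught because the unchanged reference solution no longer reproduces the expected outputs on the transformed test cases.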
Referee: [Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.
Authors: The adaptive strategy maintains a dynamic environment pool whose composition is updated every K episodes according to two metrics: (1) per-environment success rate, used to estimate the current capability frontier and to up-weight environments near the frontier while down-weighting those that are either too easy or too hard; (2) environment-type entropy computed over a sliding window of recent trajectories, which triggers re-sampling when entropy falls below a threshold to restore diversity. Reward sparsity is monitored via the fraction of zero-reward episodes; when this exceeds a threshold, the system injects easier environments from the pool. These mechanisms are described at a high level in §4; we will expand the section with the precise update rules, threshold values, and ablation results showing their effect on distributional collapse and training stability. revision: yes
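A minimal sketch of one such curation step, under the three metrics the rebuttal names. The function name, the weighting scheme, and the threshold defaults are all hypothetical; the paper's precise update rules and threshold values are not given here:

```python
import math
from collections import Counter

def update_pool(env_stats, recent_env_ids, entropy_min=1.0, sparse_max=0.5):
    """One curation step over a pool of environments.

    env_stats: dict mapping env_id -> {'success_rate', 'zero_reward_frac'}.
    recent_env_ids: env ids from a sliding window of recent trajectories.
    Returns sampling weights plus flags for the two corrective actions described.
    """
    # (1) Per-environment success rate estimates the capability frontier:
    #     up-weight environments near 0.5 success, down-weight those that are
    #     nearly always solved (~1.0) or nearly never solved (~0.0).
    weights = {
        env_id: max(1e-3, s["success_rate"] * (1.0 - s["success_rate"]))
        for env_id, s in env_stats.items()
    }

    # (2) Environment-type entropy over the recent window; falling below the
    #     threshold triggers re-sampling to restore distributional diversity.
    counts = Counter(recent_env_ids)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    resample = entropy < entropy_min

    # (3) Reward-sparsity monitor: inject easier environments when the
    #     fraction of zero-reward episodes exceeds the threshold.
    mean_sparsity = sum(s["zero_reward_frac"] for s in env_stats.values()) / len(env_stats)
    inject_easier = mean_sparsity > sparse_max

    return weights, resample, inject_easier
```

The quadratic weighting `p * (1 - p)` is one simple way to peak sampling probability at the frontier; the paper may use a different frontier estimator.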
Circularity Check
No circularity detected in SCALER framework proposal
full rationale
The paper introduces SCALER as a novel framework with a synthesis pipeline for converting programming problems into verifiable environments and an adaptive multi-environment RL strategy for tracking capability frontiers. These elements are presented as original design contributions rather than derived from prior equations or parameters. Claims of outperformance and stable dynamics rest on reported experimental results across benchmarks, not on any self-referential reductions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that collapse back to the inputs by construction. The derivation chain is the proposal and empirical validation of the method itself, which remains self-contained and externally testable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"difficulty of the instances is characterized by the array length or the number of edges in the graph and we discretize the difficulty into distinct difficulty levels... d_{t+1} = clip(d_t + β · (acc_t - τ), 0, D)"
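The difficulty-update rule quoted in this excerpt can be sketched as a one-step function, using the excerpt's own notation (β is the adaptation step size, τ the target accuracy, and D the maximum difficulty level):

```python
def next_difficulty(d_t: float, acc_t: float, beta: float, tau: float, d_max: float) -> float:
    """Difficulty update d_{t+1} = clip(d_t + beta * (acc_t - tau), 0, D):
    raise difficulty when recent accuracy exceeds the target tau,
    lower it when accuracy falls short, clipped to the valid range [0, D]."""
    return min(max(d_t + beta * (acc_t - tau), 0.0), d_max)
```

With acc_t above τ the model is outpacing the current level and difficulty rises; with acc_t below τ it falls, which is what keeps the reward signal near the capability frontier.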
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.