arxiv: 2601.04809 · v5 · submitted 2026-01-08 · 💻 cs.AI

Recognition: 1 theorem link

· Lean Theorem

SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning

Caijun Xu , Changyi Xiao , Zhongyuan Peng , Xinrun Wang , Yixin Cao

Authors on Pith no claims yet

Pith reviewed 2026-05-16 16:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learninglarge language modelsreasoningsynthetic environmentsadaptive trainingscalable synthesisprogramming problemsRL for LLMs

0 comments

The pith

SCALER sustains informative RL signals for LLM reasoning by adapting synthetic environments to model capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SCALER to address slowdowns in reinforcement learning for large language models when task difficulty misaligns with model progress or when training repeats narrow patterns. It introduces a synthesis pipeline that turns real programming problems into verifiable reasoning environments offering controllable difficulty and unlimited new instances. An adaptive multi-environment strategy then tracks the model's capability frontier by adjusting difficulty and curating environments to preserve diversity. This combination prevents reward sparsity and overfitting, yielding stronger performance on reasoning benchmarks and more stable training over long horizons than fixed-dataset baselines.

Core claim

SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity, preventing reward sparsity, mitigating overfitting to narrow task patterns, and supporting sustained improvement throughout training.

What carries the argument

The scalable synthesis pipeline that generates verifiable reasoning environments from programming problems, paired with an adaptive multi-environment RL strategy that co-adjusts difficulty and environment selection to the model's evolving frontier.

If this is right

RL training can proceed with unbounded new instances instead of exhausting a fixed dataset.
Difficulty adjustments keep reward signals informative as model capability grows.
Models reach higher performance across diverse reasoning benchmarks.
Training remains stable over long horizons without overfitting to recurring patterns.
Distributional diversity across environments reduces collapse to narrow solution strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to domains outside programming by synthesizing tasks from other verifiable sources such as math proofs or scientific hypotheses.
It suggests synthetic generation might eventually replace much of the need for static curated datasets in scaling RL for reasoning.
Testing the method on models larger than those in the experiments would reveal whether the co-adaptation continues to track capability frontiers effectively.

Load-bearing premise

The synthesis pipeline can reliably convert real programming problems into verifiable environments with controllable difficulty and strong correctness guarantees while the adaptive strategy avoids introducing distributional biases or reward sparsity.

What would settle it

If models trained with SCALER show no consistent outperformance over dataset-based RL baselines on standard reasoning benchmarks or if reward sparsity increases as training lengthens, the central claim would be falsified.

read the original abstract

Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SCALER's synthesis pipeline for unlimited verifiable reasoning problems plus adaptive multi-environment RL is the real contribution, but it stands or falls on whether the verification stays sound at scale.

read the letter

The paper's main move is to generate fresh reasoning instances from real programming problems with controllable difficulty and no hard limit on quantity, then run RL across a changing set of environments that adjust on the fly to the model's current capability. This directly targets the common stall in dataset-based RL where difficulty stops matching and patterns repeat too much. The adaptive curation is meant to keep diversity and avoid sparse rewards over long training runs. If the experiments are clean, the reported gains on reasoning benchmarks and the more stable dynamics are the practical payoff. That part is worth taking seriously for anyone working on scaling RL for LLM reasoning. The soft spot is exactly where the stress-test flagged it: the synthesis has to deliver instances whose verifiers are fully automated and free of generation errors, otherwise the reward signal becomes unreliable and both the outperformance and stability claims weaken. The abstract does not detail the verification mechanism, so the full methods section needs to show concrete soundness arguments or equivalence checks rather than heuristics. Without that, the unbounded-generation advantage is more aspirational than demonstrated. This is for groups already running RL on reasoning tasks who need longer training horizons. It is coherent enough on its own terms to deserve a serious referee, even if the verification details turn out to require revision.

Referee Report

2 major / 1 minor

Summary. The paper proposes SCALER, a framework for RL-based improvement of LLM reasoning. It introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments featuring controllable difficulty and unbounded instance generation while preserving strong correctness guarantees. It further describes an adaptive multi-environment RL strategy that dynamically adjusts difficulty and curates environments to track the model's capability frontier, maintain distributional diversity, prevent reward sparsity, and mitigate overfitting to narrow patterns. Experiments claim consistent outperformance over dataset-based RL baselines across diverse reasoning benchmarks together with more stable long-horizon training dynamics.

Significance. If the synthesis pipeline delivers sound, automated verification for unbounded instances and the adaptive curation avoids distributional collapse, SCALER would address two central bottlenecks in RL for reasoning—misaligned difficulty and pattern overfitting—potentially enabling sustained improvement beyond finite datasets. The reported stability in long-horizon dynamics would be a notable practical advantage for training large models.

major comments (2)

[Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.
[Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.

minor comments (1)

[Abstract] The abstract and experimental claims would benefit from explicit reporting of the number of environments, difficulty parameterization, and verification success rate to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer exposition of the verification and curation mechanisms. We address each major comment below and will incorporate clarifications and additional details in the revised manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Synthesis Pipeline): the central claim that the pipeline 'preserves strong correctness guarantees' for unbounded instance generation is not supported by any described verification mechanism (e.g., formal solvers, equivalence-checked test-case generation, or soundness proofs). Without this, reward signals at scale become unreliable, directly undermining both the outperformance and stable-dynamics results.

Authors: The synthesis pipeline starts from real-world programming problems whose solutions are already equipped with executable test suites. Each generated instance is produced by a semantics-preserving transformation (variable renaming, control-flow restructuring, and input scaling) whose equivalence to the original is checked by running the reference solution on both the original and transformed test cases. Any instance failing this check is discarded. This provides an automated, execution-based verification mechanism that scales to unbounded generation while inheriting the original problems' correctness guarantees. We agree the manuscript under-emphasizes this step and will add an explicit subsection in §3 describing the equivalence-checking procedure, failure rates, and soundness argument. revision: yes
Referee: [Abstract and §4] Abstract and §4 (Adaptive Multi-Environment Strategy): the claim that dynamic curation 'tracks the model's capability frontier' and 'maintains distributional diversity' lacks any concrete mechanism or metric for detecting or preventing distributional collapse or reward sparsity. This assumption is load-bearing for the long-horizon stability result.

Authors: The adaptive strategy maintains a dynamic environment pool whose composition is updated every K episodes according to two metrics: (1) per-environment success rate, used to estimate the current capability frontier and to up-weight environments near the frontier while down-weighting those that are either too easy or too hard; (2) environment-type entropy computed over a sliding window of recent trajectories, which triggers re-sampling when entropy falls below a threshold to restore diversity. Reward sparsity is monitored via the fraction of zero-reward episodes; when this exceeds a threshold, the system injects easier environments from the pool. These mechanisms are described at a high level in §4; we will expand the section with the precise update rules, threshold values, and ablation results showing their effect on distributional collapse and training stability. revision: yes

Circularity Check

0 steps flagged

No circularity detected in SCALER framework proposal

full rationale

The paper introduces SCALER as a novel framework with a synthesis pipeline for converting programming problems into verifiable environments and an adaptive multi-environment RL strategy for tracking capability frontiers. These elements are presented as original design contributions rather than derived from prior equations or parameters. Claims of outperformance and stable dynamics rest on reported experimental results across benchmarks, not on any self-referential reductions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that collapse back to the inputs by construction. The derivation chain is the proposal and empirical validation of the method itself, which remains self-contained and externally testable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the synthesis pipeline and adaptive strategy are described at high level without detailing underlying assumptions or fitted quantities.

pith-pipeline@v0.9.0 · 5506 in / 1100 out tokens · 21886 ms · 2026-05-16T16:19:30.273173+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

difficulty of the instances is characterized by the array length or the number of edges in the graph and we discretize the difficulty into distinct difficulty levels... dt+1=clip(dt+β·(acct-τ),0,D)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 14 internal anchors

[1]

Amc 12 (2023) problems.https://artofproblemsolving.com/wiki/index.php/2023_AMC_ 12A_Problems, 2023

Art of Problem Solving. Amc 12 (2023) problems.https://artofproblemsolving.com/wiki/index.php/2023_AMC_ 12A_Problems, 2023. We use the 2023 AMC 12A/12B problems as hosted by AoPS; a processed version is available at https://huggingface.co/datasets/zwhe99/amc23

work page 2023
[2]

Bytedance-Seed-Foundation-Code-Team, :, Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, ZhengyuChen, ShĳieGeng, AoyanLi, BoLi, BowenLi, LinyiLi, BoyiLiu, JiahengLiu, KaiboLiu, QiLiu, ShukaiLiu, Siyao Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z. Y. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan S...

work page arXiv 2025
[3]

Towards understanding self-play for llm reasoning, 2025

Justin Yang Chae, Md Tanvirul Alam, and Nidhi Rastogi. Towards understanding self-play for llm reasoning, 2025. URLhttps://arxiv.org/abs/2510.27072

work page arXiv 2025
[4]

Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K. Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning, 2025. URLhttps://arxiv.org/abs/ 2504.19162

work page arXiv 2025
[5]

Self-evolving curriculum for llm reasoning.arXivpreprintarXiv:2505.14970, 2025

Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning.arXivpreprintarXiv:2505.14970, 2025

work page arXiv 2025
[6]

Fromdata-centrictosample- centric: Enhancing llm reasoning via progressive optimization, 2025

XinjieChen,MinpengLiao,GuoxinChen,ChengxiLi,BiaoFu,KaiFan,andXinggaoLiu. Fromdata-centrictosample- centric: Enhancing llm reasoning via progressive optimization, 2025. URLhttps://arxiv.org/abs/2507.06573

work page arXiv 2025
[7]

Scaledowntospeedup: Dynamicdataselection for reinforcement learning.Training, 2500:3000, 2025

ZhuoyueChen,JihaiZhang,BenLiu,FangquanLin,andWotaoYin. Scaledowntospeedup: Dynamicdataselection for reinforcement learning.Training, 2500:3000, 2025

work page 2025
[8]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Serl: Self-play reinforcement learning for large language models with limited data, 2025

Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. Serl: Self-play reinforcement learning for large language models with limited data, 2025. URL https://arxiv.org/abs/2505.20347

work page arXiv 2025
[10]

16 Klear-codetest: Scalable test case generation for code reinforcement learning, 2025

Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, and Guorui Zhou. 16 Klear-codetest: Scalable test case generation for code reinforcement learning, 2025. URLhttps://arxiv.org/abs/ 2508.05710

work page arXiv 2025
[11]

Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, and Christopher D. Manning. Synthetic data generation & multi-step rl for reasoning & tool use, 2025. URLhttps://arxiv.org/abs/2504.04736

work page arXiv 2025
[12]

Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, 2025

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, 2025. URL https://arxiv.org/abs/2512.19682

work page arXiv 2025
[13]

Synthetic data rl: Task definition is all you need, 2025

Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, and Yikang Shen. Synthetic data rl: Task definition is all you need, 2025. URLhttps://arxiv.org/abs/2505.17063

work page arXiv 2025
[14]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXivpreprintarXiv:2504.11456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URLhttps://arxiv.org/abs/ 2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

BIG-benchextrahard

Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa,KateOlszewska,YiTay,VinhQ.Tran,QuocV.Le,andOrhanFirat. BIG-benchextrahard. In Proceedingsof the 63rdAnnual Meeting of the...

work page doi:10.18653/v1/2025.acl-long.1285 2025
[18]

Learnalign: Reasoningdataselectionforreinforcementlearninginlargelanguagemodelsbasedonimprovedgradientalignment,

ShipengLi,ShikunLi,ZhiqinYang,XinghuaZhang,GaodeChen,XiaoboXia,HengyuLiu,andZhePeng. Learnalign: Reasoningdataselectionforreinforcementlearninginlargelanguagemodelsbasedonimprovedgradientalignment,

work page
[19]

URLhttps://arxiv.org/abs/2506.11480

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Axel Gimeno Gil, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robs...

work page doi:10.5281/zenodo.6975437 2022
[21]

Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning, 2025

Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, and Weizhu Chen. Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs/2506.08989

work page arXiv 2025
[22]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023. URL https://arxiv.org/abs/2305.20050. Releases PRM800K and the MATH-500 (500-problem) test subset via https://github.com/openai/prm800k

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, et al. Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. arXivpreprintarXiv:2506.24119, 2025

work page arXiv 2025
[24]

Saturn: Sat-based reinforcement learning to unleash language model reasoning.arXivpreprint arXiv:2505.16368, 2025

Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, and Ge Li. Saturn: Sat-based reinforcement learning to unleash language model reasoning.arXivpreprint arXiv:2505.16368, 2025

work page arXiv 2025
[25]

Openai o1 system card,

OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

work page
[26]

URLhttps://arxiv.org/abs/2412.16720

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Curriculum reinforcement learning from easy to hard tasks improves llm reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, et al. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning. arXiv preprintarXiv:2506.06632, 2025

work page arXiv 2025
[29]

Vanishing gradients in reinforcement finetuning of language models.arXiv preprint arXiv:2310.20703, 2023

Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin. Vanishing gradients in reinforcement finetuning of language models.arXiv preprint arXiv:2310.20703, 2023

work page arXiv 2023
[30]

What makes a reward model a good teacher? an optimization perspective.arXivpreprint arXiv:2503.15477, 2025

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D Lee, and Sanjeev Arora. What makes a reward model a good teacher? an optimization perspective.arXivpreprint arXiv:2503.15477, 2025

work page arXiv 2025
[31]

URLhttps://arxiv.org/abs/2406.14532

AmrithSetlur,SaurabhGarg,XinyangGeng,NamanGarg,VirginiaSmith,andAviralKumar.Rlonincorrectsynthetic data scales the efficiency of llm math reasoning by eight-fold, 2024. URLhttps://arxiv.org/abs/2406.14532

work page arXiv 2024
[32]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Efficient reinforcement finetuning via adaptive curriculum learning.arXivpreprintarXiv:2504.05520, 2025

Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning.arXivpreprintarXiv:2504.05520, 2025

work page arXiv 2025
[34]

Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760, 2025

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas 18 Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760, 2025

work page arXiv 2025
[35]

Kimi k1.5: Scaling reinforcement learning with llms, 2025

Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms, 2025

work page 2025
[36]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Scheduling Your LLM Reinforcement Learning with Reasoning Trees

Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, and Jiawei Chen. Scheduling your llm reinforcement learning with reasoning trees, 2025. URLhttps://arxiv.org/abs/2510.24832

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Improvingrationalityinthereasoningprocess of language models through self-playing game, 2025

PinzhengWang,JuntaoLi,ZechengTang,HaĳiaGui,andMinzhang. Improvingrationalityinthereasoningprocess of language models through self-playing game, 2025. URLhttps://arxiv.org/abs/2506.22920

work page arXiv 2025
[40]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

URLhttps://arxiv.org/abs/2406.01574. Accepted at NeurIPS 2024 Datasets and Benchmarks (Spotlight). Dataset:https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Synthrl: Scalingvisualreasoning with verifiable data synthesis, 2025

ZĳianWu,JinjieNi,XiangyanLiu,ZichenLiu,HangYan,andMichaelQizheShieh. Synthrl: Scalingvisualreasoning with verifiable data synthesis, 2025. URLhttps://arxiv.org/abs/2506.02096

work page arXiv 2025
[42]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, WangZhang,HangZhu,JinhuaZhu,JiazeChen,JiangjieChen,ChengyiWang,HongliYu,YuxuanSong,Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Q...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

AohanZeng,XinLv,QinkaiZheng,ZhenyuHou,BinChen,ChengxingXie,CunxiangWang,DaYin,HaoZeng,Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXivpreprintarXiv:2511.07317, 2025

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, et al. Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXivpreprintarXiv:2511.07317, 2025

work page arXiv 2025
[45]

Confclip: Confidence-weighted and clipped reward for reinforcement learning in llms, 2025

Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, and Guihai Chen. Confclip: Confidence-weighted and clipped reward for reinforcement learning in llms, 2025. URLhttps://arxiv.org/abs/2509.17730

work page arXiv 2025
[46]

Aime 2024.https://huggingface.co/datasets/math-ai/aime24, 2024

Yifan Zhang and Team Math-AI. Aime 2024.https://huggingface.co/datasets/math-ai/aime24, 2024

work page 2024
[47]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

AndrewZhao,YiranWu,YangYue,TongWu,QuentinXu,MatthieuLin,ShenzhiWang,QingyunWu,ZilongZheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025. 19

work page internal anchor Pith review Pith/arXiv arXiv 2025