pith. machine review for the scientific record.

arxiv: 2504.11456 · v2 · submitted 2025-04-15 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 10:27 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords mathematical reasoning · reinforcement learning · large language models · dataset · decontamination · verifiable answers · generalization · mathematical benchmarks

The pith

DeepMath-103K supplies 103K hard, clean math problems that let reinforcement learning reach state-of-the-art reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepMath-103K as a large collection of difficult mathematical problems intended for training language models with reinforcement learning. Its key features are high difficulty, thorough removal of overlaps with known test sets, and answers that can be checked automatically. Training on this data produces models that set new records on tough math tests while also improving on tasks in biology, physics, and chemistry. Readers should care because better data of this kind can unlock more reliable advances in AI systems that handle complex problems.

Core claim

DeepMath-103K is a dataset of 103,000 mathematical problems at high difficulty levels, decontaminated against many existing benchmarks and equipped with verifiable answers that can serve as reward signals in reinforcement learning. It also includes three distinct R1 solutions suitable for supervised fine-tuning and other training paradigms. Models trained on this dataset attain state-of-the-art results on challenging mathematical benchmarks and generalize to non-mathematical domains including biology, physics, and chemistry.
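
For orientation, the dataset is distributed through Hugging Face at the ID given in the abstract. A minimal loading sketch follows; the column names are an assumption about the schema and should be verified against the dataset card.

```python
from datasets import load_dataset

# Dataset ID is taken from the paper's abstract; the column names below
# ("question", "final_answer") are assumptions, not confirmed by the paper.
ds = load_dataset("zwhe99/DeepMath-103K", split="train")

example = ds[0]
print(example["question"])      # problem statement
print(example["final_answer"])  # verifiable answer used for the RL reward
# The three distinct R1 solutions mentioned in the abstract should appear
# as additional per-record fields intended for SFT-style training.
```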

What carries the argument

The DeepMath-103K dataset itself, which supplies scale, difficulty, decontamination, and verifiability to support rule-based rewards in RL training.
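
To make "verifiability" concrete: a rule-based reward only needs to extract a final answer from a rollout and compare it against the dataset's reference. The sketch below is a minimal illustration under our own assumptions; the regex-based extraction and whitespace normalization are ours, since the paper does not specify its verifier.

```python
import re

def extract_final_answer(rollout: str) -> str | None:
    """Pull the last \\boxed{...} expression from a model rollout.

    Hypothetical extraction rule; a production verifier would also
    handle 'Final answer:' markers and nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", rollout)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Crude canonicalization so '1/2' and ' 1/2' compare equal."""
    return answer.replace(" ", "").lower()

def rule_based_reward(rollout: str, reference: str) -> float:
    """Binary reward: 1.0 on a normalized exact-match answer, else 0.0."""
    predicted = extract_final_answer(rollout)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0

# A correct rollout earns reward 1.0; anything else earns 0.0.
assert rule_based_reward("... so the answer is \\boxed{42}.", "42") == 1.0
assert rule_based_reward("I am not sure.", "42") == 0.0
```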

Load-bearing premise

The decontamination step completely eliminates test-set overlap, and the selected problems are hard enough to produce genuine reasoning gains rather than overfitting.
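
A sketch of what such a decontamination filter could look like, combining the lexical and semantic checks the referee report below asks about. All specifics here are illustrative assumptions (the 13-gram window, the Jaccard and cosine thresholds, the Sentence-BERT model choice), not the paper's documented procedure.

```python
from sentence_transformers import SentenceTransformer, util

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; a 13-word window is a common contamination check."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 13) -> float:
    """Jaccard overlap between the n-gram sets of two problem statements."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def is_contaminated(candidate: str, test_items: list[str],
                    model: SentenceTransformer,
                    jaccard_thresh: float = 0.08,
                    cosine_thresh: float = 0.95) -> bool:
    """Flag a training problem that is too close to any test item under
    either lexical (n-gram Jaccard) or semantic (embedding cosine) similarity.
    Thresholds are illustrative, not the paper's."""
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    for item in test_items:
        if ngram_overlap(candidate, item) >= jaccard_thresh:
            return True
        item_emb = model.encode(item, convert_to_tensor=True)
        if util.cos_sim(cand_emb, item_emb).item() >= cosine_thresh:
            return True
    return False

# Usage: model = SentenceTransformer("all-MiniLM-L6-v2")  # Sentence-BERT family
```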

What would settle it

Train a model on the dataset and test whether its performance on standard math benchmarks exceeds prior methods. The claim would be undermined if the gains fail to materialize, or if hidden overlap with the benchmarks is later discovered.

read the original abstract

Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advancing reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization beyond math such as biology, physics and chemistry, underscoring its broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DeepMath-103K, a dataset of 103K high-difficulty (primarily levels 5-9) mathematical problems that has undergone rigorous decontamination against numerous benchmarks and includes verifiable answers suitable for rule-based RL rewards. It also supplies three distinct R1 solutions to support diverse training paradigms such as SFT. The central claims are that models trained on this dataset achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization to non-mathematical domains including biology, physics, and chemistry.

Significance. If the decontamination procedure is shown to be effective and the reported gains are reproducible with proper baselines, the dataset would constitute a useful resource for RL-based reasoning research by supplying large-scale, challenging, and verifiable training data. The provision of multiple solution formats is a practical strength that could facilitate varied training setups. Cross-domain generalization, if substantiated, would further indicate utility for scientific reasoning tasks beyond mathematics.

major comments (3)
  1. [Decontamination subsection] The abstract states 'rigorous decontamination against numerous benchmarks' but provides no explicit list of those benchmarks, no similarity metric (exact string, n-gram, or embedding cosine), and no overlap threshold. This information is load-bearing for the claim that SOTA results reflect genuine reasoning improvements rather than train-test leakage.
  2. [Experimental Results section] No details are given on training protocols (e.g., RL hyperparameters, model sizes), baseline models, or the precise benchmarks and metrics where SOTA is claimed. Without these, the central empirical assertions cannot be evaluated.
  3. [Generalization paragraph] The claim of generalization to biology, physics, and chemistry lacks any description of the evaluation tasks, quantitative results, or controls showing that gains arise from improved reasoning rather than domain-specific artifacts.
minor comments (2)
  1. [Abstract] Consider adding one or two key quantitative performance numbers to make the SOTA claim more concrete for readers.
  2. [Dataset description] Clarify the exact number of problems per difficulty level and the topic distribution to allow better assessment of coverage.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We agree that additional details are needed to support the claims regarding decontamination, experimental results, and cross-domain generalization. We will revise the manuscript to address these points and provide the requested information.

read point-by-point responses
  1. Referee: [Decontamination subsection] The abstract states 'rigorous decontamination against numerous benchmarks' but provides no explicit list of those benchmarks, no similarity metric (exact string, n-gram, or embedding cosine), and no overlap threshold. This information is load-bearing for the claim that SOTA results reflect genuine reasoning improvements rather than train-test leakage.

    Authors: We agree that the decontamination procedure must be documented with greater specificity. In the revised manuscript, we will expand the Decontamination subsection to provide an explicit list of all benchmarks used, detail the similarity metrics applied (exact string matching, n-gram overlap, and embedding cosine similarity), and state the overlap thresholds employed for removal of contaminated items. This will allow readers to evaluate the effectiveness of the procedure and confirm the absence of train-test leakage. (revision: yes)

  2. Referee: [Experimental Results section] No details are given on training protocols (e.g., RL hyperparameters, model sizes), baseline models, or the precise benchmarks and metrics where SOTA is claimed. Without these, the central empirical assertions cannot be evaluated.

    Authors: We acknowledge that the Experimental Results section requires more comprehensive documentation. The revised version will include full details on the RL training protocols (hyperparameters and model sizes), the baseline models used for comparison, and the exact benchmarks and metrics on which state-of-the-art performance is reported. These additions will make the empirical claims fully evaluable. (revision: yes)

  3. Referee: [Generalization paragraph] The claim of generalization to biology, physics, and chemistry lacks any description of the evaluation tasks, quantitative results, or controls showing that gains arise from improved reasoning rather than domain-specific artifacts.

    Authors: We agree that the generalization claims need supporting details. In the revised manuscript, we will expand the Generalization paragraph to describe the specific evaluation tasks in biology, physics, and chemistry, report the quantitative results, and include controls or analyses showing that the observed gains derive from improved reasoning rather than domain-specific artifacts. (revision: yes)

Circularity Check

0 steps flagged

No circularity: dataset release relies on external benchmarks and independent verification

full rationale

The paper presents a new dataset, DeepMath-103K, constructed via collection, decontamination, and verification steps, then evaluates models trained on it against external mathematical and cross-domain benchmarks. There is no derivation chain, equation set, parameter fitting, or load-bearing self-citation that would reduce the claims to their inputs by construction. The SOTA and generalization results are empirical outcomes from training and testing on held-out data, with decontamination presented as a procedural safeguard rather than a self-referential proof. This is a standard dataset contribution whose validity rests on the reproducibility of the data pipeline and on independent benchmark performance, not on internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper rather than a theoretical derivation, so no free parameters, axioms, or invented entities are required for the central claim.

pith-pipeline@v0.9.0 · 5520 in / 1020 out tokens · 41808 ms · 2026-05-16T10:27:45.145056+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  2. Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

  3. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 7.0

    AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...

  4. Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

    cs.LG 2026-04 unverdicted novelty 7.0

    EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...

  5. On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

    cs.LG 2026-01 unverdicted novelty 7.0

    Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.

  6. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  7. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  8. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  9. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  10. Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.

  11. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  12. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  13. Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

    cs.CL 2026-01 unverdicted novelty 6.0

    CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.

  14. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  15. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  16. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    cs.LG 2026-05 unverdicted novelty 5.0

    RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

  17. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...

  18. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.

  19. Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

    cs.AI 2026-04 unverdicted novelty 5.0

    Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.

  20. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  21. Your Model Diversity, Not Method, Determines Reasoning Strategy

    cs.AI 2026-04 unverdicted novelty 5.0

    The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
