Recognition: 1 theorem link
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
Pith reviewed 2026-05-16 10:27 UTC · model grok-4.3
The pith
DeepMath-103K supplies 103K hard, clean math problems that let reinforcement learning reach state-of-the-art reasoning performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepMath-103K is a dataset of 103,000 high-difficulty mathematical problems, decontaminated against many existing benchmarks and equipped with verifiable answers that can serve as reward signals in reinforcement learning. It also includes three distinct R1 solutions suitable for supervised fine-tuning and other training paradigms. Models trained on this dataset attain state-of-the-art results on challenging mathematical benchmarks and generalize to non-mathematical domains including biology, physics, and chemistry.
What carries the argument
The DeepMath-103K dataset itself, which supplies scale, difficulty, decontamination, and verifiability to support rule-based rewards in RL training.
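Verifiability is what makes the rule-based rewards possible: a final answer that can be checked programmatically yields a scalar reward without a learned reward model. A minimal sketch, assuming a \boxed{...} final-answer convention and exact string comparison; the paper's actual verifier is not described in the material above:

```python
import re

def rule_based_reward(model_output: str, gold_answer: str) -> float:
    """Binary RL reward from a verifiable answer: extract the last
    \\boxed{...} expression in the model output and compare it to the
    gold answer. Exact string match is an illustrative simplification;
    a real verifier would normalize equivalent math expressions."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    if not answers:
        return 0.0  # no parseable final answer, no reward
    return 1.0 if answers[-1].strip() == gold_answer.strip() else 0.0
```

In a PPO- or GRPO-style loop, this scalar would score each sampled rollout against the problem's stored answer.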
Load-bearing premise
The decontamination step eliminates test-set overlap entirely, and the selected problems are hard enough to produce genuine reasoning gains rather than overfitting.
What would settle it
Train a model on the dataset and check whether its performance on standard math benchmarks fails to exceed that of prior methods, or whether hidden overlap with those benchmarks is later discovered; either outcome would undermine the claim.
Original abstract
Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advancing reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization beyond math such as biology, physics and chemistry, underscoring its broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepMath-103K, a dataset of 103K high-difficulty (primarily levels 5-9) mathematical problems that has undergone rigorous decontamination against numerous benchmarks and includes verifiable answers suitable for rule-based RL rewards. It also supplies three distinct R1 solutions to support diverse training paradigms such as SFT. The central claims are that models trained on this dataset achieve state-of-the-art results on challenging mathematical benchmarks and demonstrate generalization to non-mathematical domains including biology, physics, and chemistry.
Significance. If the decontamination procedure is shown to be effective and the reported gains are reproducible with proper baselines, the dataset would constitute a useful resource for RL-based reasoning research by supplying large-scale, challenging, and verifiable training data. The provision of multiple solution formats is a practical strength that could facilitate varied training setups. Cross-domain generalization, if substantiated, would further indicate utility for scientific reasoning tasks beyond mathematics.
major comments (3)
- [Decontamination subsection] The abstract states 'rigorous decontamination against numerous benchmarks' but provides no explicit list of those benchmarks, no similarity metric (exact string, n-gram, or embedding cosine), and no overlap threshold. This information is load-bearing for the claim that SOTA results reflect genuine reasoning improvements rather than train-test leakage (a minimal sketch of such a check follows this list).
- [Experimental Results section] No details are given on training protocols (e.g., RL hyperparameters, model sizes), baseline models, or the precise benchmarks and metrics where SOTA is claimed. Without these, the central empirical assertions cannot be evaluated.
- [Generalization paragraph] The claim of generalization to biology, physics, and chemistry lacks any description of the evaluation tasks, quantitative results, or controls showing that gains arise from improved reasoning rather than domain-specific artifacts.
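For concreteness, here is a minimal sketch of the kind of n-gram overlap check the decontamination comment names. The n-gram size and flagging threshold below are illustrative placeholders, since the manuscript reports neither:

```python
def ngram_set(text: str, n: int = 10) -> set:
    """Word-level n-grams of a problem statement."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_item: str, test_item: str,
                    n: int = 10, threshold: float = 0.5) -> bool:
    """Flag a training problem whose n-gram overlap with a benchmark item
    exceeds a threshold. n=10 and threshold=0.5 are placeholder values,
    not the paper's (unreported) settings; embedding cosine similarity
    would be a complementary, paraphrase-robust check."""
    train_grams, test_grams = ngram_set(train_item, n), ngram_set(test_item, n)
    if not test_grams:
        return False  # benchmark item too short to compare at this n
    overlap = len(train_grams & test_grams) / len(test_grams)
    return overlap >= threshold
```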
minor comments (2)
- [Abstract] Consider adding one or two key quantitative performance numbers to make the SOTA claim more concrete for readers.
- [Dataset description] Clarify the exact number of problems per difficulty level and the topic distribution to allow better assessment of coverage (a loading sketch follows this list).
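An audit along these lines could be run directly against the public release. The 'difficulty' and 'topic' column names and the 'train' split below are assumptions about the Hugging Face schema, not something the abstract confirms:

```python
from collections import Counter
from datasets import load_dataset  # pip install datasets

# Dataset ID comes from the abstract's URL; column names are assumed.
ds = load_dataset("zwhe99/DeepMath-103K", split="train")
print(Counter(ds["difficulty"]))             # problems per difficulty level
print(Counter(ds["topic"]).most_common(10))  # ten most frequent topics
```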
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We agree that additional details are needed to support the claims regarding decontamination, experimental results, and cross-domain generalization. We will revise the manuscript to address these points and provide the requested information.
Point-by-point responses
- Referee: [Decontamination subsection] The abstract states 'rigorous decontamination against numerous benchmarks' but provides no explicit list of those benchmarks, no similarity metric (exact string, n-gram, or embedding cosine), and no overlap threshold. This information is load-bearing for the claim that SOTA results reflect genuine reasoning improvements rather than train-test leakage.
Authors: We agree that the decontamination procedure must be documented with greater specificity. In the revised manuscript, we will expand the Decontamination subsection to provide an explicit list of all benchmarks used, detail the similarity metrics applied (exact string matching, n-gram overlap, and embedding cosine similarity), and state the overlap thresholds employed for removal of contaminated items. This will allow readers to evaluate the effectiveness of the procedure and confirm the absence of train-test leakage.
Revision: yes
- Referee: [Experimental Results section] No details are given on training protocols (e.g., RL hyperparameters, model sizes), baseline models, or the precise benchmarks and metrics where SOTA is claimed. Without these, the central empirical assertions cannot be evaluated.
Authors: We acknowledge that the Experimental Results section requires more comprehensive documentation. The revised version will include full details on the RL training protocols (hyperparameters and model sizes), the baseline models used for comparison, and the exact benchmarks and metrics on which state-of-the-art performance is reported. These additions will make the empirical claims fully evaluable.
Revision: yes
- Referee: [Generalization paragraph] The claim of generalization to biology, physics, and chemistry lacks any description of the evaluation tasks, quantitative results, or controls showing that gains arise from improved reasoning rather than domain-specific artifacts.
Authors: We agree that the generalization claims need supporting details. In the revised manuscript, we will expand the Generalization paragraph to describe the specific evaluation tasks in biology, physics, and chemistry, report the quantitative results, and include controls or analyses showing that the observed gains derive from improved reasoning rather than domain-specific artifacts.
Revision: yes
Circularity Check
No circularity: dataset release relies on external benchmarks and independent verification
full rationale
The paper presents a new dataset, DeepMath-103K, constructed via collection, decontamination, and verification steps, then evaluates models trained on it against external mathematical and cross-domain benchmarks. There is no derivation chain, equation set, parameter fitting, or load-bearing self-citation that would reduce the claims to their inputs by construction. The SOTA and generalization results are empirical outcomes from training and testing on held-out data, with decontamination presented as a procedural safeguard rather than a self-referential proof. This is a standard dataset contribution whose validity rests on the reproducibility of the data pipeline and on independent benchmark performance, not on internal redefinition.
Forward citations
Cited by 22 Pith papers
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
  MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
- Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
  OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
- Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
  EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
- On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
  Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
  BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
  DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
  OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving a 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- AIPO: Learning to Reason from Active Interaction
  AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
  LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
  A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
  A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
  CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
- SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
  SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperform fixed-dataset baselines with more stable long-term progress.
- Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
  Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
  RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
- Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
  Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
- PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
  PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
  The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.