D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

Chongyang Tao; Renda Li; Ru Zhang; Weijie Qiu; Xiangxiang Chu; Yong Wang; Ziyu Ma

arxiv: 2605.17037 · v1 · pith:XZVRNAW7new · submitted 2026-05-16 · 💻 cs.LG · cs.AI· cs.CL

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

Ru Zhang , Renda Li , Ziyu Ma , Weijie Qiu , Chongyang Tao , Yong Wang , Xiangxiang Chu This is my paper

Pith reviewed 2026-05-19 20:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords reinforcement learninglarge language modelsmathematical reasoningself-evolutiondata efficiencydifficulty adaptationco-evolution

0 comments

The pith

D²Evo achieves data-efficient RL for LLM reasoning by mining medium-difficulty anchors and jointly evolving a question generator with the solver.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a reinforcement learning approach that keeps generating useful training examples for large language models even as their reasoning ability improves. It does so by repeatedly selecting medium-difficulty problems based on the current solver's success rate, then training a separate questioner component to produce new problems at matching levels, and optimizing both together in each round. This setup directly tackles the problems of scarce medium-difficulty data and the fact that fixed datasets quickly become too easy. A sympathetic reader would care because the method reaches competitive mathematical reasoning performance while using fewer than two thousand real samples and transfers to broader reasoning tasks.

Core claim

In each iteration, the method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. This process addresses effective data scarcity and dynamic difficulty shifts that arise during RL training of LLMs for reasoning.

What carries the argument

The dual difficulty-aware self-evolution loop that mines anchors from the solver's current performance to guide the questioner in producing new training samples at matched difficulty levels.

If this is right

The approach outperforms prior methods on mathematical reasoning benchmarks while using fewer than 2K real samples.
Training produces stable progressive gains through repeated co-optimization of questioner and solver.
The resulting models exhibit strong generalization when evaluated on general reasoning benchmarks.
The framework mitigates both data scarcity and the loss of medium-difficulty signals as model capability rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring principle could be tested in non-math domains such as code generation or multi-step planning where optimal task difficulty also shifts with agent skill.
If the co-evolution loop proves stable at larger scales, the method would reduce reliance on human-curated datasets for applying RL to specialized reasoning tasks.
One could measure whether the generated questions maintain diversity across iterations or converge to a narrow set of problem types.

Load-bearing premise

The question generator will continue to produce useful training examples only if it remains aligned with the solver's improving ability and avoids drifting into questions that are consistently too easy or too hard.

What would settle it

After multiple iterations, measure the solver's success rate on the newly generated questions; if the rate moves close to zero or close to one instead of staying in the medium range, the progressive-gain claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17037 by Chongyang Tao, Renda Li, Ru Zhang, Weijie Qiu, Xiangxiang Chu, Yong Wang, Ziyu Ma.

**Figure 1.** Figure 1: Top: Performance on mathematical benchmarks when training on Math12K subsets of different difficulty levels. Bottom: Difficulty distributions of two commonly used datasets and acceptance rates of generated questions across difficulty levels. Difficulty measured by rollout average accuracy; acceptance rate evaluated using GPT-5.2 (see Sec. 4.2 for details). Experiments use Qwen3-4B-Base. However, GRPO is h… view at source ↗

**Figure 2.** Figure 2: Difficulty distribution before and after one epoch GRPO on OpenRs-7K and Math-12K dataset, conducted with Qwen-3- 8B-Base. • We conduct extensive evaluation demonstrating that D 2Evo achieves superior performance on mathematical tasks and shows generalization capability on general benchmarks, using fewer than 2K real samples. 2. Method 2.1. Motivation A recent study (Bae et al., 2025) demonstrates that the… view at source ↗

**Figure 3.** Figure 3: An overview of our framework. At each iteration, mid-difficulty anchors are mined from real data by a frozen Solver. Conditioned on these anchors, the Questioner is optimized with GRPO under a difficulty-aware reward to generate diverse questions within the target difficulty band. We then construct a hybrid buffer of anchors and filtered generations, and update the Solver with GRPO on this hybrid buffer, f… view at source ↗

**Figure 4.** Figure 4: Mathematical and general reasoning performance of the Questioner under independent training and in D2Evo across iterations. to facilitate knowledge transfer between question generation and problem solving. To further analyze this effect, we examine the mathematical and general reasoning capabilities of the Questioner under independent training and within D2Evo across iterations. As shown in [PITH_FULL_I… view at source ↗

**Figure 5.** Figure 5: Comparison of question generation capability between R-Zero and D2Evo. Left and middle: evolution of average difficulty and acceptance rate of generated questions across iterations. Right: difficulty distributions of generated questions with respect to the current Solver for R-Zero and D2Evo. ble 5 shows that at each iteration, the Solver’s acceptance rate on question generation is comparable to or higher … view at source ↗

read the original abstract

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D²Evo adds a joint Solver-Questioner loop that mines medium-difficulty anchors from the current model to generate its own training data, but the stability of that loop still needs checking.

read the letter

D²Evo's main contribution is a self-evolution setup where the Solver's performance on a small set of real samples defines medium-difficulty anchors, and a Questioner is trained to produce new questions at those levels while both improve together. This directly targets the scarcity of suitable RL data for LLM reasoning and the fact that samples get too easy as the model advances. The joint optimization and explicit anchoring look like a practical way to keep the difficulty matched without relying on fixed external curricula or pure anchor-free generation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes D²Evo, a Dual Difficulty-aware self-Evolution RL framework for data-efficient reasoning in LLMs. In each iteration it mines medium-difficulty anchors from the current Solver's performance on a small set of real samples, trains a Questioner to generate diverse questions at matching difficulty levels, and jointly optimizes both components. The central claim is that this co-evolution produces stable progressive gains, outperforming prior methods on mathematical reasoning benchmarks while using fewer than 2K real mathematical samples and exhibiting strong generalization on general reasoning benchmarks.

Significance. If the empirical results and stability of the co-evolution loop hold, the work would be significant for reducing reliance on large human-annotated datasets in RL for LLM reasoning. The explicit use of difficulty anchors derived from the evolving Solver and the joint training of Questioner and Solver represent a concrete mechanism for handling dynamic difficulty shifts, which could influence subsequent self-improvement and synthetic-data pipelines.

major comments (2)

[§3.2] §3.2 (Anchor Mining): the medium-difficulty threshold is defined relative to the Solver's accuracy on the same small real-sample set used for evaluation; this introduces a potential circularity because performance gains are measured against thresholds that are themselves updated by the evolving Solver, and no fixed external difficulty oracle or held-out validation set is used to decouple the two.
[§4.3] §4.3 (Ablation and Iteration Analysis): while per-iteration accuracy curves are reported, there is no monitoring of the difficulty distribution of generated questions across iterations or ablation on the anchor-selection threshold; without these, it is impossible to verify that the claimed progressive gains are not artifacts of persistent difficulty mismatch or mode collapse in the Questioner.

minor comments (2)

[§3.1] Notation for the difficulty estimator E_p is introduced without an explicit equation; adding a numbered equation would clarify how the medium-difficulty band is computed from Solver logits.
[Table 2] Table 2 caption does not state the number of random seeds or whether error bars reflect standard deviation across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the presentation of our method.

read point-by-point responses

Referee: [§3.2] §3.2 (Anchor Mining): the medium-difficulty threshold is defined relative to the Solver's accuracy on the same small real-sample set used for evaluation; this introduces a potential circularity because performance gains are measured against thresholds that are themselves updated by the evolving Solver, and no fixed external difficulty oracle or held-out validation set is used to decouple the two.

Authors: We appreciate the referee highlighting this design aspect. The adaptive threshold is intentionally computed from the Solver's current performance on the small real-sample set so that medium-difficulty anchors remain relevant as the model evolves; this is a core feature of the self-evolution framework rather than an unintended circularity. Crucially, all reported performance gains are measured on standard held-out mathematical reasoning benchmarks (MATH, GSM8K, etc.) that are completely disjoint from the mining set. We will revise §3.2 to explicitly state this separation, clarify the rationale for using an evolving internal reference instead of a fixed external oracle, and add a brief experiment that reserves a small held-out portion of the real samples solely for threshold validation. revision: yes
Referee: [§4.3] §4.3 (Ablation and Iteration Analysis): while per-iteration accuracy curves are reported, there is no monitoring of the difficulty distribution of generated questions across iterations or ablation on the anchor-selection threshold; without these, it is impossible to verify that the claimed progressive gains are not artifacts of persistent difficulty mismatch or mode collapse in the Questioner.

Authors: We agree that additional diagnostics would make the stability claims more robust. In the revised manuscript we will add (i) plots tracking the distribution of estimated difficulty levels of questions generated by the Questioner across iterations and (ii) an ablation varying the anchor-selection threshold (e.g., different accuracy percentiles used to define “medium” difficulty). These new results will be placed in §4.3 and will help demonstrate that the observed gains arise from effective difficulty matching rather than mismatch or collapse. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an iterative algorithmic framework (D²Evo) for co-evolving a Solver and Questioner via RL, with medium-difficulty anchors mined from the current Solver's performance on a fixed set of real samples. This process is presented as an empirical method rather than a closed mathematical derivation. No equations are shown that reduce a claimed prediction or result to a fitted parameter or self-defined quantity by construction. No load-bearing self-citations to uniqueness theorems or ansatzes from prior author work are invoked. Claims of data-efficient gains rest on external benchmark evaluations (mathematical and general reasoning tasks) using fewer than 2K real samples, which are independent of the internal difficulty thresholds. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone; full methods section would be needed to audit these.

pith-pipeline@v0.9.0 · 5725 in / 1083 out tokens · 42708 ms · 2026-05-19T20:17:53.748928+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

mines medium-difficulty anchors based on the current Solver’s capability... jointly optimizes both components
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

difficulty defined as (1 − correct/N) × 100 with thresholds low/high

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 21 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980
[3]

M. J. Kearns , title =

work page
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000
[6]

Suppressed for Anonymity , author=

work page
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959
[9]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback , author =. arXiv preprint arXiv:2203.02155 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. arXiv preprint arXiv:2204.05862 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Proximal Policy Optimization Algorithms

Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

Curriculum Learning , author =. Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

work page
[13]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Self-Paced Learning for Latent Variable Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[14]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , author =. arXiv preprint arXiv:1712.01815 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

Spice: Self-play in corpus environments improves reasoning , author=. arXiv preprint arXiv:2510.24684 , year=

work page arXiv
[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1. 5: Scaling reinforcement learning with llms , author=. arXiv preprint arXiv:2501.12599 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Online difficulty filtering for reasoning oriented reinforcement learning.arXiv preprint arXiv:2504.03380,

Online difficulty filtering for reasoning oriented reinforcement learning , author=. arXiv preprint arXiv:2504.03380 , year=

work page arXiv
[21]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

R-zero: Self-evolving reasoning llm from zero data , author=. arXiv preprint arXiv:2508.05004 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2504.05520 , year=

Efficient reinforcement finetuning via adaptive curriculum learning , author=. arXiv preprint arXiv:2504.05520 , year=

work page arXiv
[23]

Reinforcement learning for reasoning in small llms: What works and what doesn’t

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't , author=. arXiv preprint arXiv:2503.16219 , year=

work page arXiv
[24]

arXiv preprint arXiv:2509.24726 , year=

Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution , author=. arXiv preprint arXiv:2509.24726 , year=

work page arXiv
[25]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

work page
[26]

EMNLP , year=

HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation , author=. EMNLP , year=

work page
[27]

AAAI , year=

AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting , author=. AAAI , year=

work page
[28]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

work page
[29]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. arXiv preprint arXiv:2402.14008 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

work page
[32]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2009
[33]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Supergpqa: Scaling llm evaluation across 285 graduate disciplines , author=. arXiv preprint arXiv:2502.14739 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Big-bench extra hard , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[35]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Challenging big-bench tasks and whether chain-of-thought can solve them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023
[36]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

work page
[39]

ICLR , year=

Gpg: A simple and strong reinforcement learning baseline for model reasoning , author=. ICLR , year=

work page
[40]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Learning to reason with search for llms via reinforcement learning , author=. arXiv preprint arXiv:2503.19470 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization , author=. arXiv preprint arXiv:2501.03262 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Advances in Neural Information Processing Systems , volume=

Self-playing adversarial language game enhances llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page
[44]

Spell: Self-play reinforcement learning for evolving long-context language models.arXiv preprint arXiv:2509.23863, 2025

Spell: Self-play reinforcement learning for evolving long-context language models , author=. arXiv preprint arXiv:2509.23863 , year=

work page arXiv
[45]

Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

Language self-play for data-free training , author=. arXiv preprint arXiv:2509.07414 , year=

work page arXiv
[46]

arXiv preprint arXiv:2404.14387 , year=

A survey on self-evolution of large language models , author=. arXiv preprint arXiv:2404.14387 , year=

work page arXiv
[47]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

work page
[48]

arXiv preprint arXiv:2210.01240 , year=

Language models are greedy reasoners: A systematic formal analysis of chain-of-thought , author=. arXiv preprint arXiv:2210.01240 , year=

work page arXiv
[49]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page
[50]

arXiv preprint arXiv:2505.03469 , year=

Long-short chain-of-thought mixture supervised fine-tuning eliciting efficient reasoning in large language models , author=. arXiv preprint arXiv:2505.03469 , year=

work page arXiv
[51]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Large language models are reasoning teachers , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

work page
[52]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Llava-o1: Let vision language models reason step-by-step , author=. arXiv preprint arXiv:2411.10440 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

arXiv e-prints , pages=

Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models , author=. arXiv e-prints , pages=

work page
[54]

arXiv preprint arXiv:2505.23091 , year=

Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models , author=. arXiv preprint arXiv:2505.23091 , year=

work page arXiv
[55]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Learning like humans: Advancing llm reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[56]

arXiv preprint arXiv:2503.07065 , year=

Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning , author=. arXiv preprint arXiv:2503.07065 , year=

work page arXiv
[57]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

The unreasonable effectiveness of entropy minimization in llm reasoning , author=. arXiv preprint arXiv:2505.15134 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Maximizing confidence alone improves reasoning.arXiv preprint arXiv:2505.22660,

Maximizing Confidence Alone Improves Reasoning , author=. arXiv preprint arXiv:2505.22660 , year=

work page arXiv
[59]

arXiv preprint arXiv:2506.17219 , year=

No Free Lunch: Rethinking Internal Feedback for LLM Reasoning , author=. arXiv preprint arXiv:2506.17219 , year=

work page arXiv
[60]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mathfusion: Enhancing mathematical problem-solving of llm through instruction fusion , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[61]

Advances in Neural Information Processing Systems , volume=

Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page
[62]

Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

Harder is better: Boosting mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation , author=. arXiv preprint arXiv:2601.20614 , year=

work page arXiv
[63]

arXiv preprint arXiv:2507.13266 , year=

Questa: Expanding reasoning capacity in llms via question augmentation , author=. arXiv preprint arXiv:2507.13266 , year=

work page arXiv
[64]

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution , author=. arXiv preprint arXiv:2604.15840 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980

[3] [3]

M. J. Kearns , title =

work page

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000

[6] [6]

Suppressed for Anonymity , author=

work page

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959

[9] [9]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback , author =. arXiv preprint arXiv:2203.02155 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. arXiv preprint arXiv:2204.05862 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Proximal Policy Optimization Algorithms

Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

Curriculum Learning , author =. Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

work page

[13] [13]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Self-Paced Learning for Latent Variable Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[14] [14]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , author =. arXiv preprint arXiv:1712.01815 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

Spice: Self-play in corpus environments improves reasoning , author=. arXiv preprint arXiv:2510.24684 , year=

work page arXiv

[16] [16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1. 5: Scaling reinforcement learning with llms , author=. arXiv preprint arXiv:2501.12599 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Online difficulty filtering for reasoning oriented reinforcement learning.arXiv preprint arXiv:2504.03380,

Online difficulty filtering for reasoning oriented reinforcement learning , author=. arXiv preprint arXiv:2504.03380 , year=

work page arXiv

[21] [21]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

R-zero: Self-evolving reasoning llm from zero data , author=. arXiv preprint arXiv:2508.05004 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

arXiv preprint arXiv:2504.05520 , year=

Efficient reinforcement finetuning via adaptive curriculum learning , author=. arXiv preprint arXiv:2504.05520 , year=

work page arXiv

[23] [23]

Reinforcement learning for reasoning in small llms: What works and what doesn’t

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't , author=. arXiv preprint arXiv:2503.16219 , year=

work page arXiv

[24] [24]

arXiv preprint arXiv:2509.24726 , year=

Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution , author=. arXiv preprint arXiv:2509.24726 , year=

work page arXiv

[25] [25]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

work page

[26] [26]

EMNLP , year=

HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation , author=. EMNLP , year=

work page

[27] [27]

AAAI , year=

AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting , author=. AAAI , year=

work page

[28] [28]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

work page

[29] [29]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. arXiv preprint arXiv:2402.14008 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

work page

[32] [32]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2009

[33] [33]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Supergpqa: Scaling llm evaluation across 285 graduate disciplines , author=. arXiv preprint arXiv:2502.14739 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Big-bench extra hard , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[35] [35]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Challenging big-bench tasks and whether chain-of-thought can solve them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023

[36] [36]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

work page

[39] [39]

ICLR , year=

Gpg: A simple and strong reinforcement learning baseline for model reasoning , author=. ICLR , year=

work page

[40] [40]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Learning to reason with search for llms via reinforcement learning , author=. arXiv preprint arXiv:2503.19470 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization , author=. arXiv preprint arXiv:2501.03262 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Advances in Neural Information Processing Systems , volume=

Self-playing adversarial language game enhances llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page

[44] [44]

Spell: Self-play reinforcement learning for evolving long-context language models.arXiv preprint arXiv:2509.23863, 2025

Spell: Self-play reinforcement learning for evolving long-context language models , author=. arXiv preprint arXiv:2509.23863 , year=

work page arXiv

[45] [45]

Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

Language self-play for data-free training , author=. arXiv preprint arXiv:2509.07414 , year=

work page arXiv

[46] [46]

arXiv preprint arXiv:2404.14387 , year=

A survey on self-evolution of large language models , author=. arXiv preprint arXiv:2404.14387 , year=

work page arXiv

[47] [47]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

work page

[48] [48]

arXiv preprint arXiv:2210.01240 , year=

Language models are greedy reasoners: A systematic formal analysis of chain-of-thought , author=. arXiv preprint arXiv:2210.01240 , year=

work page arXiv

[49] [49]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page

[50] [50]

arXiv preprint arXiv:2505.03469 , year=

Long-short chain-of-thought mixture supervised fine-tuning eliciting efficient reasoning in large language models , author=. arXiv preprint arXiv:2505.03469 , year=

work page arXiv

[51] [51]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Large language models are reasoning teachers , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

work page

[52] [52]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Llava-o1: Let vision language models reason step-by-step , author=. arXiv preprint arXiv:2411.10440 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

arXiv e-prints , pages=

Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models , author=. arXiv e-prints , pages=

work page

[54] [54]

arXiv preprint arXiv:2505.23091 , year=

Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models , author=. arXiv preprint arXiv:2505.23091 , year=

work page arXiv

[55] [55]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Learning like humans: Advancing llm reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025

[56] [56]

arXiv preprint arXiv:2503.07065 , year=

Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning , author=. arXiv preprint arXiv:2503.07065 , year=

work page arXiv

[57] [57]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

The unreasonable effectiveness of entropy minimization in llm reasoning , author=. arXiv preprint arXiv:2505.15134 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Maximizing confidence alone improves reasoning.arXiv preprint arXiv:2505.22660,

Maximizing Confidence Alone Improves Reasoning , author=. arXiv preprint arXiv:2505.22660 , year=

work page arXiv

[59] [59]

arXiv preprint arXiv:2506.17219 , year=

No Free Lunch: Rethinking Internal Feedback for LLM Reasoning , author=. arXiv preprint arXiv:2506.17219 , year=

work page arXiv

[60] [60]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mathfusion: Enhancing mathematical problem-solving of llm through instruction fusion , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[61] [61]

Advances in Neural Information Processing Systems , volume=

Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page

[62] [62]

Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

Harder is better: Boosting mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation , author=. arXiv preprint arXiv:2601.20614 , year=

work page arXiv

[63] [63]

arXiv preprint arXiv:2507.13266 , year=

Questa: Expanding reasoning capacity in llms via question augmentation , author=. arXiv preprint arXiv:2507.13266 , year=

work page arXiv

[64] [64]

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution , author=. arXiv preprint arXiv:2604.15840 , year=

work page internal anchor Pith review Pith/arXiv arXiv