pith. sign in

arxiv: 2605.17037 · v1 · pith:XZVRNAW7new · submitted 2026-05-16 · 💻 cs.LG · cs.AI· cs.CL

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

Pith reviewed 2026-05-19 20:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords reinforcement learninglarge language modelsmathematical reasoningself-evolutiondata efficiencydifficulty adaptationco-evolution
0
0 comments X

The pith

D²Evo achieves data-efficient RL for LLM reasoning by mining medium-difficulty anchors and jointly evolving a question generator with the solver.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a reinforcement learning approach that keeps generating useful training examples for large language models even as their reasoning ability improves. It does so by repeatedly selecting medium-difficulty problems based on the current solver's success rate, then training a separate questioner component to produce new problems at matching levels, and optimizing both together in each round. This setup directly tackles the problems of scarce medium-difficulty data and the fact that fixed datasets quickly become too easy. A sympathetic reader would care because the method reaches competitive mathematical reasoning performance while using fewer than two thousand real samples and transfers to broader reasoning tasks.

Core claim

In each iteration, the method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. This process addresses effective data scarcity and dynamic difficulty shifts that arise during RL training of LLMs for reasoning.

What carries the argument

The dual difficulty-aware self-evolution loop that mines anchors from the solver's current performance to guide the questioner in producing new training samples at matched difficulty levels.

If this is right

  • The approach outperforms prior methods on mathematical reasoning benchmarks while using fewer than 2K real samples.
  • Training produces stable progressive gains through repeated co-optimization of questioner and solver.
  • The resulting models exhibit strong generalization when evaluated on general reasoning benchmarks.
  • The framework mitigates both data scarcity and the loss of medium-difficulty signals as model capability rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring principle could be tested in non-math domains such as code generation or multi-step planning where optimal task difficulty also shifts with agent skill.
  • If the co-evolution loop proves stable at larger scales, the method would reduce reliance on human-curated datasets for applying RL to specialized reasoning tasks.
  • One could measure whether the generated questions maintain diversity across iterations or converge to a narrow set of problem types.

Load-bearing premise

The question generator will continue to produce useful training examples only if it remains aligned with the solver's improving ability and avoids drifting into questions that are consistently too easy or too hard.

What would settle it

After multiple iterations, measure the solver's success rate on the newly generated questions; if the rate moves close to zero or close to one instead of staying in the medium range, the progressive-gain claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17037 by Chongyang Tao, Renda Li, Ru Zhang, Weijie Qiu, Xiangxiang Chu, Yong Wang, Ziyu Ma.

Figure 1
Figure 1. Figure 1: Top: Performance on mathematical benchmarks when training on Math12K subsets of different difficulty levels. Bot￾tom: Difficulty distributions of two commonly used datasets and acceptance rates of generated questions across difficulty levels. Difficulty measured by rollout average accuracy; acceptance rate evaluated using GPT-5.2 (see Sec. 4.2 for details). Experiments use Qwen3-4B-Base. However, GRPO is h… view at source ↗
Figure 2
Figure 2. Figure 2: Difficulty distribution before and after one epoch GRPO on OpenRs-7K and Math-12K dataset, conducted with Qwen-3- 8B-Base. • We conduct extensive evaluation demonstrating that D 2Evo achieves superior performance on mathematical tasks and shows generalization capability on general benchmarks, using fewer than 2K real samples. 2. Method 2.1. Motivation A recent study (Bae et al., 2025) demonstrates that the… view at source ↗
Figure 3
Figure 3. Figure 3: An overview of our framework. At each iteration, mid-difficulty anchors are mined from real data by a frozen Solver. Conditioned on these anchors, the Questioner is optimized with GRPO under a difficulty-aware reward to generate diverse questions within the target difficulty band. We then construct a hybrid buffer of anchors and filtered generations, and update the Solver with GRPO on this hybrid buffer, f… view at source ↗
Figure 4
Figure 4. Figure 4: Mathematical and general reasoning performance of the Questioner under independent training and in D2Evo across iterations. to facilitate knowledge transfer between question genera￾tion and problem solving. To further analyze this effect, we examine the mathematical and general reasoning capa￾bilities of the Questioner under independent training and within D2Evo across iterations. As shown in [PITH_FULL_I… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of question generation capability between R-Zero and D2Evo. Left and middle: evolution of average difficulty and acceptance rate of generated questions across iterations. Right: difficulty distributions of generated questions with respect to the current Solver for R-Zero and D2Evo. ble 5 shows that at each iteration, the Solver’s acceptance rate on question generation is comparable to or higher … view at source ↗
read the original abstract

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes D²Evo, a Dual Difficulty-aware self-Evolution RL framework for data-efficient reasoning in LLMs. In each iteration it mines medium-difficulty anchors from the current Solver's performance on a small set of real samples, trains a Questioner to generate diverse questions at matching difficulty levels, and jointly optimizes both components. The central claim is that this co-evolution produces stable progressive gains, outperforming prior methods on mathematical reasoning benchmarks while using fewer than 2K real mathematical samples and exhibiting strong generalization on general reasoning benchmarks.

Significance. If the empirical results and stability of the co-evolution loop hold, the work would be significant for reducing reliance on large human-annotated datasets in RL for LLM reasoning. The explicit use of difficulty anchors derived from the evolving Solver and the joint training of Questioner and Solver represent a concrete mechanism for handling dynamic difficulty shifts, which could influence subsequent self-improvement and synthetic-data pipelines.

major comments (2)
  1. [§3.2] §3.2 (Anchor Mining): the medium-difficulty threshold is defined relative to the Solver's accuracy on the same small real-sample set used for evaluation; this introduces a potential circularity because performance gains are measured against thresholds that are themselves updated by the evolving Solver, and no fixed external difficulty oracle or held-out validation set is used to decouple the two.
  2. [§4.3] §4.3 (Ablation and Iteration Analysis): while per-iteration accuracy curves are reported, there is no monitoring of the difficulty distribution of generated questions across iterations or ablation on the anchor-selection threshold; without these, it is impossible to verify that the claimed progressive gains are not artifacts of persistent difficulty mismatch or mode collapse in the Questioner.
minor comments (2)
  1. [§3.1] Notation for the difficulty estimator E_p is introduced without an explicit equation; adding a numbered equation would clarify how the medium-difficulty band is computed from Solver logits.
  2. [Table 2] Table 2 caption does not state the number of random seeds or whether error bars reflect standard deviation across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the presentation of our method.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Anchor Mining): the medium-difficulty threshold is defined relative to the Solver's accuracy on the same small real-sample set used for evaluation; this introduces a potential circularity because performance gains are measured against thresholds that are themselves updated by the evolving Solver, and no fixed external difficulty oracle or held-out validation set is used to decouple the two.

    Authors: We appreciate the referee highlighting this design aspect. The adaptive threshold is intentionally computed from the Solver's current performance on the small real-sample set so that medium-difficulty anchors remain relevant as the model evolves; this is a core feature of the self-evolution framework rather than an unintended circularity. Crucially, all reported performance gains are measured on standard held-out mathematical reasoning benchmarks (MATH, GSM8K, etc.) that are completely disjoint from the mining set. We will revise §3.2 to explicitly state this separation, clarify the rationale for using an evolving internal reference instead of a fixed external oracle, and add a brief experiment that reserves a small held-out portion of the real samples solely for threshold validation. revision: yes

  2. Referee: [§4.3] §4.3 (Ablation and Iteration Analysis): while per-iteration accuracy curves are reported, there is no monitoring of the difficulty distribution of generated questions across iterations or ablation on the anchor-selection threshold; without these, it is impossible to verify that the claimed progressive gains are not artifacts of persistent difficulty mismatch or mode collapse in the Questioner.

    Authors: We agree that additional diagnostics would make the stability claims more robust. In the revised manuscript we will add (i) plots tracking the distribution of estimated difficulty levels of questions generated by the Questioner across iterations and (ii) an ablation varying the anchor-selection threshold (e.g., different accuracy percentiles used to define “medium” difficulty). These new results will be placed in §4.3 and will help demonstrate that the observed gains arise from effective difficulty matching rather than mismatch or collapse. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an iterative algorithmic framework (D²Evo) for co-evolving a Solver and Questioner via RL, with medium-difficulty anchors mined from the current Solver's performance on a fixed set of real samples. This process is presented as an empirical method rather than a closed mathematical derivation. No equations are shown that reduce a claimed prediction or result to a fitted parameter or self-defined quantity by construction. No load-bearing self-citations to uniqueness theorems or ansatzes from prior author work are invoked. Claims of data-efficient gains rest on external benchmark evaluations (mathematical and general reasoning tasks) using fewer than 2K real samples, which are independent of the internal difficulty thresholds. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone; full methods section would be needed to audit these.

pith-pipeline@v0.9.0 · 5725 in / 1083 out tokens · 42708 ms · 2026-05-19T20:17:53.748928+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 21 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

  3. [3]

    M. J. Kearns , title =

  4. [4]

    Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

  5. [5]

    R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

  6. [6]

    Suppressed for Anonymity , author=

  7. [7]

    Newell and P

    A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

  8. [8]

    A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

  9. [9]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback , author =. arXiv preprint arXiv:2203.02155 , year =

  10. [10]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. arXiv preprint arXiv:2204.05862 , year =

  11. [11]

    Proximal Policy Optimization Algorithms

    Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

  12. [12]

    Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

    Curriculum Learning , author =. Proceedings of the 26th Annual International Conference on Machine Learning (ICML) , year =

  13. [13]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Self-Paced Learning for Latent Variable Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  14. [14]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , author =. arXiv preprint arXiv:1712.01815 , year =

  15. [15]

    Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

    Spice: Self-play in corpus environments improves reasoning , author=. arXiv preprint arXiv:2510.24684 , year=

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  17. [17]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

  18. [18]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  19. [19]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi k1. 5: Scaling reinforcement learning with llms , author=. arXiv preprint arXiv:2501.12599 , year=

  20. [20]

    Online difficulty filtering for reasoning oriented reinforcement learning.arXiv preprint arXiv:2504.03380,

    Online difficulty filtering for reasoning oriented reinforcement learning , author=. arXiv preprint arXiv:2504.03380 , year=

  21. [21]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    R-zero: Self-evolving reasoning llm from zero data , author=. arXiv preprint arXiv:2508.05004 , year=

  22. [22]

    Efficient reinforcement finetuning via adaptive curriculum learning.arXiv preprint arXiv:2504.05520, 2025

    Efficient reinforcement finetuning via adaptive curriculum learning , author=. arXiv preprint arXiv:2504.05520 , year=

  23. [23]

    Reinforcement learning for reasoning in small llms: What works and what doesn’t

    Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't , author=. arXiv preprint arXiv:2503.16219 , year=

  24. [24]

    arXiv preprint arXiv:2509.24726 , year=

    Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution , author=. arXiv preprint arXiv:2509.24726 , year=

  25. [25]

    Advances in neural information processing systems , volume=

    Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

  26. [26]

    EMNLP , year=

    HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation , author=. EMNLP , year=

  27. [27]

    AAAI , year=

    AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting , author=. AAAI , year=

  28. [28]

    The Twelfth International Conference on Learning Representations , year=

    Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

  29. [29]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. arXiv preprint arXiv:2402.14008 , year=

  30. [30]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

  31. [31]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  32. [32]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  33. [33]

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines , author=. arXiv preprint arXiv:2502.14739 , year=

  34. [34]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Big-bench extra hard , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  35. [35]

    Findings of the Association for Computational Linguistics: ACL 2023 , pages=

    Challenging big-bench tasks and whether chain-of-thought can solve them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

  36. [36]

    Qwen3 Technical Report

    Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

  37. [37]

    The Llama 3 Herd of Models

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  38. [38]

    Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

    Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

  39. [39]

    ICLR , year=

    Gpg: A simple and strong reinforcement learning baseline for model reasoning , author=. ICLR , year=

  40. [40]

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Learning to reason with search for llms via reinforcement learning , author=. arXiv preprint arXiv:2503.19470 , year=

  41. [41]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

  42. [42]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization , author=. arXiv preprint arXiv:2501.03262 , year=

  43. [43]

    Advances in Neural Information Processing Systems , volume=

    Self-playing adversarial language game enhances llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

  44. [44]

    Spell: Self-play reinforcement learning for evolving long-context language models.arXiv preprint arXiv:2509.23863, 2025

    Spell: Self-play reinforcement learning for evolving long-context language models , author=. arXiv preprint arXiv:2509.23863 , year=

  45. [45]

    Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

    Language self-play for data-free training , author=. arXiv preprint arXiv:2509.07414 , year=

  46. [46]

    arXiv preprint arXiv:2404.14387 , year=

    A survey on self-evolution of large language models , author=. arXiv preprint arXiv:2404.14387 , year=

  47. [47]

    Advances in neural information processing systems , volume=

    Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

  48. [48]

    arXiv preprint arXiv:2210.01240 , year=

    Language models are greedy reasoners: A systematic formal analysis of chain-of-thought , author=. arXiv preprint arXiv:2210.01240 , year=

  49. [49]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  50. [50]

    arXiv preprint arXiv:2505.03469 , year=

    Long-short chain-of-thought mixture supervised fine-tuning eliciting efficient reasoning in large language models , author=. arXiv preprint arXiv:2505.03469 , year=

  51. [51]

    Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

    Large language models are reasoning teachers , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

  52. [52]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Llava-o1: Let vision language models reason step-by-step , author=. arXiv preprint arXiv:2411.10440 , year=

  53. [53]

    arXiv e-prints , pages=

    Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models , author=. arXiv e-prints , pages=

  54. [54]

    arXiv preprint arXiv:2505.23091 , year=

    Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models , author=. arXiv preprint arXiv:2505.23091 , year=

  55. [55]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

    Learning like humans: Advancing llm reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

  56. [56]

    arXiv preprint arXiv:2503.07065 , year=

    Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning , author=. arXiv preprint arXiv:2503.07065 , year=

  57. [57]

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    The unreasonable effectiveness of entropy minimization in llm reasoning , author=. arXiv preprint arXiv:2505.15134 , year=

  58. [58]

    Maximizing confidence alone improves reasoning.arXiv preprint arXiv:2505.22660, 2025

    Maximizing Confidence Alone Improves Reasoning , author=. arXiv preprint arXiv:2505.22660 , year=

  59. [59]

    No free lunch: Rethinking internal feedback for llm reasoning.arXiv preprint arXiv:2506.17219, 2025

    No Free Lunch: Rethinking Internal Feedback for LLM Reasoning , author=. arXiv preprint arXiv:2506.17219 , year=

  60. [60]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Mathfusion: Enhancing mathematical problem-solving of llm through instruction fusion , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  61. [61]

    Advances in Neural Information Processing Systems , volume=

    Sws: Self-aware weakness-driven problem synthesis in reinforcement learning for llm reasoning , author=. Advances in Neural Information Processing Systems , volume=

  62. [62]

    Harder is better: Boost- ing mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation.arXiv preprint arXiv:2601.20614, 2026

    Harder is better: Boosting mathematical reasoning via difficulty-aware grpo and multi-aspect question reformulation , author=. arXiv preprint arXiv:2601.20614 , year=

  63. [63]

    arXiv preprint arXiv:2507.13266 , year=

    Questa: Expanding reasoning capacity in llms via question augmentation , author=. arXiv preprint arXiv:2507.13266 , year=

  64. [64]

    CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

    CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution , author=. arXiv preprint arXiv:2604.15840 , year=