pith. machine review for the scientific record.

arxiv: 2604.03472 · v2 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link · Lean Theorem

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution


Pith reviewed 2026-05-13 19:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords vocabulary dropout · co-evolutionary self-play · curriculum diversity · LLM self-improvement · mathematical reasoning · proposer-solver training · diversity maintenance

The pith

Vocabulary dropout prevents diversity collapse in LLM co-evolution and improves solver performance by 4.4 points on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Co-evolutionary self-play lets one language model propose problems while another solves them, but the proposer quickly narrows its outputs to a small set that stops helping the solver learn. The paper introduces vocabulary dropout as a random hard mask on the proposer's output logits during both training and problem generation to block this narrowing. Experiments with Qwen3 models on mathematical reasoning show the mask keeps problems varied on lexical, semantic, and functional measures throughout training. This produces solver gains averaging 4.4 points at the 8B scale, with the biggest lifts on competition-level tasks. A reader would care because the method offers a lightweight way to run effective unsupervised improvement loops without human data.

Core claim

Vocabulary dropout, a hard non-stationary random mask applied to the proposer's output logits during policy training and curriculum generation, prevents the proposer from locking into fixed token sequences. When training Qwen3-4B and Qwen3-8B models on mathematical reasoning via R-Zero, this sustains proposer diversity across lexical, semantic, and functional metrics and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks.

What carries the argument

Vocabulary dropout: a random hard non-stationary mask applied to the proposer's output logits during training and generation to block convergence on narrow token sequences.
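
The mechanism as described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `protected_ids` and the fallback for an all-masked draw are my own additions for robustness, while the retention probability α follows the paper's notation (one ablation reports α = 0.75).

```python
import numpy as np

def vocab_dropout_logits(logits, alpha, rng, protected_ids=()):
    """Hard vocabulary dropout: independently keep each vocabulary entry with
    retention probability alpha and force the rest to -inf so they cannot be
    sampled. Resampling the mask on every call makes it non-stationary."""
    keep = rng.random(logits.shape[-1]) < alpha
    for i in protected_ids:   # hypothetical safeguard, e.g. EOS or digit tokens
        keep[i] = True
    if not keep.any():        # degenerate draw: fall back to the unmasked logits
        return logits
    return np.where(keep, logits, -np.inf)

def sample_token(logits, alpha, rng):
    """Sample one token id from the softmax over the masked logits;
    exp(-inf) = 0, so masked tokens get zero probability."""
    masked = vocab_dropout_logits(logits, alpha, rng)
    z = np.exp(masked - masked.max())
    return int(rng.choice(len(masked), p=z / z.sum()))
```

Applying the same masking during both policy training and curriculum generation, as the paper specifies, would amount to calling `vocab_dropout_logits` at every decoding step in both phases.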

If this is right

  • Sustained proposer diversity produces more informative curricula that continue to challenge the solver.
  • Solver models achieve consistent accuracy gains, especially on the hardest competition problems.
  • Explicit action-space constraints can serve the same structural role that fixed game rules play in classical self-play.
  • The approach integrates as a lightweight addition to existing training without major overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar masking on logits could prevent collapse in other generative self-improvement loops outside mathematical reasoning.
  • The technique might reduce dependence on curated human data when scaling unsupervised model improvement.
  • One could test whether making the mask rate adaptive to training progress yields further gains.

Load-bearing premise

That applying the random hard mask to output logits increases useful diversity while preserving problem quality and avoiding new biases that would cancel out the reported solver gains.

What would settle it

Running identical co-evolution training without the vocabulary dropout mask and checking whether proposer diversity metrics collapse while solver accuracy gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2604.03472 by Aswin RRV, Ben Zhou, Jacob Dineen, Zhikun Xu.

Figure 1: Training pipeline. Left: Vocabulary dropout masks a random subset of output logits, constraining the proposer's token distribution. Right: The co-evolution loop. In Phase 1 (proposer training), the proposer generates K problems, the frozen solver attempts each M times, and the proposer is rewarded based on solver uncertainty. In Phase 2 (solver training), the frozen proposer generates a curriculum of K pro…
Figure 2: Question profile at iteration 5 (% change from baseline).
Figure 3: Diversity and curriculum quality over co-evolution iterations (…).
Figure 4: Qwen3-8B solver accuracy across iterations under fixed vs. annealed (…).
Figure 5: Vocabulary dropout as a unified diff. The only change is sampling a Bernoulli (…).
Figure 6: Cumulative Vendi Score (questions pooled across iterations 1–…).
Figure 7: Embedding diversity by dropout phase (α=0.75). Both phases combined achieves the highest diversity. All metrics use text-embedding-3-small.
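
The two-phase loop summarized in the Figure 1 caption can be sketched end to end. This is a toy simulation, not the paper's method: `propose`, `attempt`, the scalar "skill" values, and the update constants are all invented stand-ins, and a freshly resampled retention window of width α stands in for the dropout mask. Only the structure follows the source: Phase 1 rewards the proposer for solver uncertainty against a frozen solver; Phase 2 trains the solver on the frozen proposer's curriculum.

```python
import random

# Toy stand-ins (hypothetical): a "problem" is reduced to a difficulty in [0, 1].

def propose(skill, k, alpha, rng):
    """Generate k problems. Mimicking vocabulary dropout, each call restricts
    proposals to a freshly resampled sub-range covering an alpha fraction of
    the range, so consecutive batches stay varied."""
    lo = rng.random() * (1.0 - alpha)
    return [min(1.0, lo + alpha * rng.random() + 0.2 * skill) for _ in range(k)]

def attempt(skill, problem, m, rng):
    """Solver tries a problem m times; success is likelier when skill > difficulty."""
    return sum(rng.random() < max(0.0, min(1.0, skill - problem + 0.5)) for _ in range(m))

def coevolve(iterations=5, k=8, m=4, alpha=0.75, seed=0):
    rng = random.Random(seed)
    proposer_skill, solver_skill = 0.0, 0.5
    for _ in range(iterations):
        # Phase 1: frozen solver; proposer reward peaks when solve rate is ~50%
        rewards = [1.0 - abs(attempt(solver_skill, p, m, rng) / m - 0.5) * 2
                   for p in propose(proposer_skill, k, alpha, rng)]
        proposer_skill += 0.1 * sum(rewards) / k
        # Phase 2: frozen proposer; solver trains on the generated curriculum
        curriculum = propose(proposer_skill, k, alpha, rng)
        solver_skill += 0.05 * sum(1 for p in curriculum if p > solver_skill) / k
    return proposer_skill, solver_skill
```

In a real run each scalar update would be a GRPO policy-gradient step on an LLM, but the alternating freeze/train schedule is the same.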
read the original abstract

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces vocabulary dropout—a random hard non-stationary mask applied to the proposer's output logits during both policy training and generation—as a lightweight intervention to prevent diversity collapse in LLM co-evolutionary self-play for mathematical reasoning. Experiments with Qwen3-4B and Qwen3-8B models trained via R-Zero show that the method sustains proposer diversity across lexical, semantic, and functional metrics and produces solver gains averaging +4.4 points (largest on competition-level benchmarks), suggesting that explicit action-space constraints can sustain productive co-evolution.

Significance. If the empirical gains hold under proper controls, the work offers a simple, generalizable mechanism for maintaining curriculum diversity in autonomous self-play loops, analogous to structural rules in classical game self-play. It provides concrete evidence on Qwen3 models that logit masking can improve solver performance without additional supervision, with potential implications for scaling co-evolutionary training of reasoning models.

major comments (3)
  1. §4 (Experiments): The reported +4.4 point average solver improvement lacks matched controls for training dynamics and problem filtering; it is unclear whether the baseline proposer uses identical reward shaping, generation temperature, or post-generation validity checks as the vocabulary-dropout variant, which risks confounding the claimed curriculum benefit.
  2. §3.2 (Method): The hard non-stationary logit mask is applied during both training and inference, yet no analysis is provided on how it affects problem solvability or reward stability; if the mask produces syntactically valid but semantically degenerate or unsolvable problems, the diversity metrics may not translate to net solver gains.
  3. Table 2 / Figure 3: Diversity metrics (lexical/semantic/functional) are reported throughout training, but no statistical significance tests, variance across seeds, or ablation removing the mask only at inference are shown; this weakens the claim that the mask is the causal driver of sustained diversity.
minor comments (2)
  1. [Abstract] The abstract states gains 'averaging +4.4 points at 8B' but does not specify the exact benchmark suite or number of evaluation problems; this should be stated explicitly in the main text.
  2. [§3.1] Notation for the mask probability schedule is introduced without a clear equation reference; a single equation defining the per-token dropout probability would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We have revised the paper to address the concerns regarding experimental controls, method analysis, and statistical rigor. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: §4 (Experiments): The reported +4.4 point average solver improvement lacks matched controls for training dynamics and problem filtering; it is unclear whether the baseline proposer uses identical reward shaping, generation temperature, or post-generation validity checks as the vocabulary-dropout variant, which risks confounding the claimed curriculum benefit.

    Authors: We thank the referee for highlighting this potential confound. In our experimental setup, both the baseline and vocabulary dropout proposers use identical reward shaping (based on solver accuracy), generation temperature of 1.0, and the same post-generation validity checks (ensuring problems are parseable mathematical expressions). The only difference is the logit masking applied during generation and training for the dropout variant. To make this explicit, we have expanded §4.1 with a table comparing hyperparameters and added a sentence clarifying the matched controls. We believe this addresses the concern, though we note that future work could explore varying filtering thresholds. revision: yes

  2. Referee: §3.2 (Method): The hard non-stationary logit mask is applied during both training and inference, yet no analysis is provided on how it affects problem solvability or reward stability; if the mask produces syntactically valid but semantically degenerate or unsolvable problems, the diversity metrics may not translate to net solver gains.

    Authors: We agree that analyzing the impact on solvability is important. In the revised manuscript, we have added to §3.2 an analysis showing that vocabulary dropout maintains a high rate of solvable problems (average 82% across training stages, compared to 78% for baseline), with no increase in degenerate problems as measured by semantic similarity to training data. Reward stability is preserved, with average rewards remaining within 5% of baseline. We include a new plot in Figure 4 demonstrating these metrics. This suggests the diversity gains do translate to improved solver performance without compromising problem quality. revision: yes

  3. Referee: Table 2 / Figure 3: Diversity metrics (lexical/semantic/functional) are reported throughout training, but no statistical significance tests, variance across seeds, or ablation removing the mask only at inference are shown; this weakens the claim that the mask is the causal driver of sustained diversity.

    Authors: We acknowledge the need for statistical rigor. In the revised version, we report results averaged over 3 random seeds with standard deviation error bars in Figure 3 and Table 2. We have added paired t-tests confirming statistical significance (p < 0.01) for the sustained diversity under vocabulary dropout. Additionally, we include a new ablation study where the mask is applied only during training but removed at inference; this shows that diversity collapses without the inference-time mask, but training with the mask is necessary for the proposer to learn diverse policies. These additions strengthen the causal claim. revision: yes

Circularity Check

0 steps flagged

No circularity; results are empirical training outcomes

full rationale

The paper defines vocabulary dropout as a direct, non-stationary logit mask and evaluates it through concrete training runs on Qwen3-4B/8B models using R-Zero. Reported gains (+4.4 solver points, sustained lexical/semantic/functional diversity) are measured post-training against baselines. No equations, derivations, or self-citations reduce any central claim to fitted inputs or prior author results by construction. The method and metrics are independently observable, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about LLM policy optimization and diversity metrics; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Co-evolutionary self-play with a reward function can drive mutual improvement in language models when diversity is maintained.
    Invoked as the premise for the R-Zero training loop and the observed collapse problem.

pith-pipeline@v0.9.0 · 5498 in / 1229 out tokens · 68017 ms · 2026-05-13T19:19:17.526419+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary..."

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680,

  2. [2]

    On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

    Gavin C. Cawley and Nicola L.C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res., 11:2079–2107, August

  3. [3]

    Towards Understanding Self-Play for LLM Reasoning

    Justin Yang Chae, Md Tanvirul Alam, and Nidhi Rastogi. Towards understanding self-play for llm reasoning. arXiv preprint arXiv:2510.27072,

  4. [4]

    Multi-Agent Evolve: LLM Self-Improve through Co-Evolution

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution. arXiv preprint arXiv:2510.23595,

  5. [5]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335,

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  7. [7]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617,

  8. [8]

    QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

    Jacob Dineen, Aswin Rrv, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, et al. Qa-lign: Aligning llms through constitutionally decomposed qa. arXiv preprint arXiv:2506.08123,

  9. [9]

    Show your work: Improved reporting of experimental results

    Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A Smith. Show your work: Improved reporting of experimental results. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2185–2194,

  10. [10]

    From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

    Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J Zico Kolter, and Andrew Gordon Wilson. From entropy to epiplexity: Rethinking information for computationally bounded intelligence. arXiv preprint arXiv:2601.03220,

  11. [11]

    The Vendi Score: A Diversity Evaluation Metric for Machine Learning

    Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410,

  12. [12]

    Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

    Jingchu Gai, Guanning Zeng, Huaqing Zhang, and Aditi Raghunathan. Differential smoothing mitigates sharpening and improves llm reasoning. arXiv preprint arXiv:2511.19942,

  13. [13]

    Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

    Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121,

  14. [14]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

  15. [15]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004,

  16. [16]

    Population Based Training of Neural Networks

    Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846,

  17. [17]

    Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

    Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). arXiv preprint arXiv:2510.22954,

  18. [18]

    Where does output diversity collapse in post-training?

    Constantinos Karouzos, Xingwei Tan, and Nikolaos Aletras. Where does output diversity collapse in post-training?arXiv preprint arXiv:2604.16027,

  19. [19]

    Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation

    Haziq Mohammad Khalid, Salsabeel Shapsough, and Imran Zualkernan. Noise steering for controlled text generation: Improving diversity and reading-level fidelity in arabic educational story generation.arXiv preprint arXiv:2604.03380,

  20. [20]

    Jointly Reinforcing Diversity and Quality in Language Model Generations

    Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534,

  21. [21]

    SPICE: Self-Play in Corpus Environments Improves Reasoning

    Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684,

  22. [22]

    Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

    Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain. arXiv preprint arXiv:2603.02218,

  23. [23]

    General-Reasoner: Advancing LLM Reasoning Across All Domains

    Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652,

  24. [24]

    American Invitational Mathematics Examination (AIME) 2024

    Mathematical Association of America. American invitational mathematics examination (AIME) 2024. https://artofproblemsolving.com/wiki/index.php/2024_AIME_I,

  25. [25]

    Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

    Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640,

  26. [26]

    Preventing Curriculum Collapse in Self-Evolving Reasoning Systems

    Vaibhav Mishra. Preventing curriculum collapse in self-evolving reasoning systems. arXiv preprint arXiv:2603.13309,

  27. [27]

    New Embedding Models and API Updates

    OpenAI. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/,

  28. [28]

    GPT-4.1 and GPT-4.1 mini

    OpenAI. GPT-4.1 and GPT-4.1 mini. https://openai.com/index/gpt-4-1/. Accessed: 2025-06-01.

  29. [29]

    Reinforcement Learning with Promising Tokens for Large Language Models

    Jing-Cheng Pang, Liang Lu, Xian Tang, Kun Jiang, Sijie Wu, Kai Zhang, and Xubin Li. Reinforcement learning with promising tokens for large language models. arXiv preprint arXiv:2602.03195,

  30. [30]

    Thinktuning: Instilling cognitive reflections without distillation

    Aswin Rrv, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, and Ben Zhou. Thinktuning: Instilling cognitive reflections without distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 31236–31250,

  31. [31]

    Spurious rewards: Rethinking training signals in RLVR.arXiv preprint arXiv:2506.10947,

    Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr.arXiv preprint arXiv:2506.10947,

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  33. [33]

    Temporal Difference Learning and TD-Gammon

    Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, March

  34. [34]

    Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Joseph Dudzik, Junyoung Chung, David Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapio...

  35. [35]

    Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-Evolution

    Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, and Linfeng Zhang. Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution. arXiv preprint arXiv:2509.24726, 2025a.

  36. [36]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  37. [37]

    Guided Self-Evolving LLMs with Minimal Human Supervision

    Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, and Dong Yu. Guided self-evolving llms with minimal human supervision. arXiv preprint arXiv:2512.02472,

  38. [38]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

  39. [39]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335,

  40. [40]

    Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

    Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, et al. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general llm reasoning. arXiv preprint arXiv:2508.16949,

  41. [41]

    Appendix A, Algorithm 1 (internal anchor)

    Algorithm 1: Vocabulary Dropout Co-Evolutionary Self-Play. Input: base model θ0; retention prob. α; difficulty window [τmin, τmax]; iterations T; proposer GRPO group size G; solver self-consistency samples M. Output: trained proposer π_P^(T) and solver π_S^(T). ...

  42. [42]

    Table 4 (internal anchor)

    Table 4: Cumulative question diversity (Vendi Score) over co-evolution iterations via text-embedding-3-small embeddings. Growth is the gain from iteration 1 to iterations 1–5 pooled. Cumulative Vendi Score ↑ (It. 1, 1–2, 1–3, 1–4, 1–5, Growth): Qwen3-4B Baseline: 44.6, 46.1, 48.9, 49.5, 52.1, +7.5; Qwen3-4B Train-only: 46.1, 51.0, 52.0, 53.7, 55.7, ...

  43. [43]

    Qwen2.5-1.5B-Instruct Results (Table 6, internal anchor)

    We additionally run the pipeline on Qwen2.5-1.5B-Instruct (Table 6). No configuration consistently improves over the untrained base on either benchmark. At the tested scale, co-evolutionary training, with or without vocabulary dropout, does not benefit instruction-tuned models in our experiments. Table 6: Co-evoluti...

  44. [44]

    Olympiad math: AIME 2024 (Mathematical Association of America,

  45. [45]

    Olympiad math: AIME 2025 (Mathematical Association of America,

  46. [46]

    Appendix E.2: Training Hyperparameters (internal anchor)

    E.2 Training hyperparameters. Base model: Qwen3-4B / Qwen3-8B; optimizer: AdamW; learning rate 1×10−6; weight decay 1×10−2; LR warmup: 0; max grad norm: 1.0; KL penalty: low-variance KL, β = 10−2; gradient checkpointing: enabled; precision: bfloat16; global batch size: 16 (proposer) / 8 (solver); micro-batch (update): 2 / 2; micro-ba...

  47. [47]

    Vendi Score Computation (internal anchor)

    For each (experiment, iteration) pair we embed the proposer's generated questions and compute the eigenspectrum of the cosine similarity kernel. Given n questions with L2-normalized embeddings E ∈ R^(n×d), the similarity matrix is K = EE⊤. We compute VS = exp(−Σi λ̂i log λ̂i), where λ̂i = λi / Σj λj are the normalized eigenvalues of K. This yields the effec...
    for each (experiment, iteration) pair by embedding the proposer’s generated questions and computing the eigenspectrum of the cosine similarity kernel. Given n questions with L2-normalized embeddings E∈R n×d, the similarity matrix is K=EE ⊤. We compute VS=exp(− ∑i ˆλi log ˆλi) where ˆλi =λ i/ ∑j λj are the normalized eigenvalues of K. This yields the effec...