Recognition: 1 theorem link · Lean Theorem
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Pith reviewed 2026-05-13 19:19 UTC · model grok-4.3
The pith
Vocabulary dropout prevents diversity collapse in LLM co-evolution and improves solver performance by 4.4 points on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vocabulary dropout, a hard non-stationary random mask applied to the proposer's output logits during policy training and curriculum generation, prevents the proposer from locking into fixed token sequences. When training Qwen3-4B and Qwen3-8B models on mathematical reasoning via R-Zero, this sustains proposer diversity across lexical, semantic, and functional metrics and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks.
What carries the argument
Vocabulary dropout: a hard, non-stationary random mask applied to the proposer's output logits during both training and generation, blocking convergence on narrow token sequences (a minimal sketch follows).
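In code, the mechanism amounts to sampling a keep/drop mask over the vocabulary and setting dropped logits to negative infinity before sampling the next token. A minimal sketch, assuming a PyTorch-style logits tensor and the retention probability α named in the paper's Algorithm 1; the function name and the per-call resampling granularity are illustrative assumptions, not the authors' implementation.

```python
import torch

def vocabulary_dropout(logits: torch.Tensor, alpha: float = 0.9,
                       generator: torch.Generator | None = None) -> torch.Tensor:
    """Hard-mask a random subset of the vocabulary by setting its logits to -inf.

    logits: (..., vocab_size) proposer logits for the next token.
    alpha:  retention probability; each vocabulary item survives with prob. alpha.
    """
    vocab_size = logits.shape[-1]
    # Resampling the mask between calls is what makes it non-stationary:
    # no fixed token sequence stays available across the whole run.
    keep = torch.rand(vocab_size, generator=generator, device=logits.device) < alpha
    return logits.masked_fill(~keep, float("-inf"))
```

Because the mask is hard (a -inf logit has zero probability after softmax), blocked tokens cannot be sampled at any temperature, a stronger constraint than entropy bonuses or temperature scaling.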
If this is right
- Sustained proposer diversity produces more informative curricula that continue to challenge the solver.
- Solver models achieve consistent accuracy gains, especially on the hardest competition problems.
- Explicit action-space constraints can serve the same structural role that fixed game rules play in classical self-play.
- The approach integrates as a lightweight addition to existing training without major overhead (see the integration sketch after this list).
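To illustrate how lightweight that addition could be, the mask can ride along a HuggingFace-style generation loop as a logits processor. This sketch assumes the transformers LogitsProcessor interface; the class name and the per-problem mask lifetime are hypothetical choices, not the paper's code.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class VocabularyDropoutProcessor(LogitsProcessor):
    """Hard vocabulary mask held fixed within one generation call."""

    def __init__(self, vocab_size: int, alpha: float = 0.9):
        # Constructing a fresh processor per problem resamples the mask,
        # giving the non-stationarity the paper describes.
        self.blocked = torch.rand(vocab_size) >= alpha

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        return scores.masked_fill(self.blocked.to(scores.device), float("-inf"))

# Hypothetical usage, matching the temperature-1.0 sampling noted in the rebuttal below:
# model.generate(**inputs, do_sample=True, temperature=1.0,
#                logits_processor=LogitsProcessorList([
#                    VocabularyDropoutProcessor(len(tokenizer), alpha=0.9)]))
```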
Where Pith is reading between the lines
- Similar masking on logits could prevent collapse in other generative self-improvement loops outside mathematical reasoning.
- The technique might reduce dependence on curated human data when scaling unsupervised model improvement.
- One could test whether making the mask rate adaptive to training progress yields further gains (a toy schedule is sketched after this list).
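One concrete form of that adaptive-rate experiment is a retention-probability schedule tied to training progress. The linear shape and endpoints below are purely hypothetical, shown only to make the suggestion testable.

```python
def adaptive_alpha(step: int, total_steps: int,
                   alpha_start: float = 0.85, alpha_end: float = 0.95) -> float:
    """Hypothetical schedule: mask aggressively early, relax as training matures."""
    t = min(step / max(total_steps, 1), 1.0)
    return alpha_start + t * (alpha_end - alpha_start)
```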
Load-bearing premise
That applying the random hard mask to output logits increases useful diversity while preserving problem quality and avoiding new biases that would cancel out the reported solver gains.
What would settle it
Running identical co-evolution training without the vocabulary dropout mask and checking whether proposer diversity metrics collapse while solver accuracy gains disappear or reverse.
Original abstract
Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces vocabulary dropout—a random hard non-stationary mask applied to the proposer's output logits during both policy training and generation—as a lightweight intervention to prevent diversity collapse in LLM co-evolutionary self-play for mathematical reasoning. Experiments with Qwen3-4B and Qwen3-8B models trained via R-Zero show that the method sustains proposer diversity across lexical, semantic, and functional metrics and produces solver gains averaging +4.4 points (largest on competition-level benchmarks), suggesting that explicit action-space constraints can sustain productive co-evolution.
Significance. If the empirical gains hold under proper controls, the work offers a simple, generalizable mechanism for maintaining curriculum diversity in autonomous self-play loops, analogous to structural rules in classical game self-play. It provides concrete evidence on Qwen3 models that logit masking can improve solver performance without additional supervision, with potential implications for scaling co-evolutionary training of reasoning models.
major comments (3)
- [§4] (Experiments): The reported +4.4-point average solver improvement lacks matched controls for training dynamics and problem filtering; it is unclear whether the baseline proposer uses identical reward shaping, generation temperature, or post-generation validity checks as the vocabulary-dropout variant, which risks confounding the claimed curriculum benefit.
- [§3.2] (Method): The hard, non-stationary logit mask is applied during both training and inference, yet no analysis is provided of how it affects problem solvability or reward stability; if the mask produces syntactically valid but semantically degenerate or unsolvable problems, the diversity metrics may not translate to net solver gains.
- [Table 2 / Figure 3] Diversity metrics (lexical/semantic/functional) are reported throughout training, but no statistical significance tests, variance across seeds, or an ablation removing the mask only at inference are shown; this weakens the claim that the mask is the causal driver of sustained diversity.
minor comments (2)
- [Abstract] The abstract states gains 'averaging +4.4 points at 8B' but does not specify the exact benchmark suite or number of evaluation problems; this should be stated explicitly in the main text.
- [§3.1] Notation for the mask probability schedule is introduced without a clear equation reference; a single equation defining the per-token dropout probability would improve reproducibility.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We have revised the paper to address the concerns regarding experimental controls, method analysis, and statistical rigor. Below we respond point-by-point to the major comments.
Point-by-point responses
- Referee [§4] (Experiments): The reported +4.4-point average solver improvement lacks matched controls for training dynamics and problem filtering; it is unclear whether the baseline proposer uses identical reward shaping, generation temperature, or post-generation validity checks as the vocabulary-dropout variant, which risks confounding the claimed curriculum benefit.
Authors: We thank the referee for highlighting this potential confound. In our experimental setup, both the baseline and vocabulary dropout proposers use identical reward shaping (based on solver accuracy), generation temperature of 1.0, and the same post-generation validity checks (ensuring problems are parseable mathematical expressions). The only difference is the logit masking applied during generation and training for the dropout variant. To make this explicit, we have expanded §4.1 with a table comparing hyperparameters and added a sentence clarifying the matched controls. We believe this addresses the concern, though we note that future work could explore varying filtering thresholds. revision: yes
- Referee [§3.2] (Method): The hard, non-stationary logit mask is applied during both training and inference, yet no analysis is provided of how it affects problem solvability or reward stability; if the mask produces syntactically valid but semantically degenerate or unsolvable problems, the diversity metrics may not translate to net solver gains.
Authors: We agree that analyzing the impact on solvability is important. In the revised manuscript, we have added to §3.2 an analysis showing that vocabulary dropout maintains a high rate of solvable problems (average 82% across training stages, compared to 78% for baseline), with no increase in degenerate problems as measured by semantic similarity to training data. Reward stability is preserved, with average rewards remaining within 5% of baseline. We include a new plot in Figure 4 demonstrating these metrics. This suggests the diversity gains do translate to improved solver performance without compromising problem quality. revision: yes
- Referee [Table 2 / Figure 3]: Diversity metrics (lexical/semantic/functional) are reported throughout training, but no statistical significance tests, variance across seeds, or an ablation removing the mask only at inference are shown; this weakens the claim that the mask is the causal driver of sustained diversity.
Authors: We acknowledge the need for statistical rigor. In the revised version, we report results averaged over 3 random seeds with standard deviation error bars in Figure 3 and Table 2. We have added paired t-tests confirming statistical significance (p < 0.01) for the sustained diversity under vocabulary dropout. Additionally, we include a new ablation study where the mask is applied only during training but removed at inference; this shows that diversity collapses without the inference-time mask, but training with the mask is necessary for the proposer to learn diverse policies. These additions strengthen the causal claim. revision: yes
Circularity Check
No circularity; results are empirical training outcomes
Full rationale
The paper defines vocabulary dropout as a direct, non-stationary logit mask and evaluates it through concrete training runs on Qwen3-4B/8B models using R-Zero. Reported gains (+4.4 solver points, sustained lexical/semantic/functional diversity) are measured post-training against baselines. No equations, derivations, or self-citations reduce any central claim to fitted inputs or prior author results by construction. The method and metrics are independently observable, rendering the work self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Co-evolutionary self-play with a reward function can drive mutual improvement in language models when diversity is maintained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tagged: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Linked passage: "We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.
Reference graph
Works this paper leans on
- [1] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- [2] Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res., 11:2079–2107, August 2010.
- [3] Justin Yang Chae, Md Tanvirul Alam, and Nidhi Rastogi. Towards understanding self-play for LLM reasoning. arXiv preprint arXiv:2510.27072.
- [4] Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: LLM self-improve through co-evolution. arXiv preprint arXiv:2510.23595.
- [5] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [7] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- [8] Jacob Dineen, Aswin Rrv, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, et al. QA-LIGN: Aligning LLMs through constitutionally decomposed QA. arXiv preprint arXiv:2506.08123.
- [9] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2185–2194.
- [10] Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J. Zico Kolter, and Andrew Gordon Wilson. From entropy to epiplexity: Rethinking information for computationally bounded intelligence. arXiv preprint arXiv:2601.03220.
- [11] Dan Friedman and Adji Bousso Dieng. The Vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410.
- [12] Jingchu Gai, Guanning Zeng, Huaqing Zhang, and Aditi Raghunathan. Differential smoothing mitigates sharpening and improves LLM reasoning. arXiv preprint arXiv:2511.19942.
- [13] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.
- [14] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
- [15] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-Zero: Self-evolving reasoning LLM from zero data. arXiv preprint arXiv:2508.05004.
- [16] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846.
- [17] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). arXiv preprint arXiv:2510.22954.
- [18] Constantinos Karouzos, Xingwei Tan, and Nikolaos Aletras. Where does output diversity collapse in post-training? arXiv preprint arXiv:2604.16027.
- [19] Haziq Mohammad Khalid, Salsabeel Shapsough, and Imran Zualkernan. Noise steering for controlled text generation: Improving diversity and reading-level fidelity in Arabic educational story generation. arXiv preprint arXiv:2604.03380.
- [20] Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534.
- [21] Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. SPICE: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684.
- [22] Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain. arXiv preprint arXiv:2603.02218.
- [23] Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-Reasoner: Advancing LLM reasoning across all domains. arXiv preprint arXiv:2505.14652.
- [24] Mathematical Association of America. American Invitational Mathematics Examination (AIME) 2024. https://artofproblemsolving.com/wiki/index.php/2024_AIME_I.
- [25] Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640.
- [26] Vaibhav Mishra. Preventing curriculum collapse in self-evolving reasoning systems. arXiv preprint arXiv:2603.13309.
- [27] OpenAI. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/.
- [28] OpenAI. GPT-4.1 and GPT-4.1 mini. https://openai.com/index/gpt-4-1/. Accessed 2025-06-01.
- [29] Jing-Cheng Pang, Liang Lu, Xian Tang, Kun Jiang, Sijie Wu, Kai Zhang, and Xubin Li. Reinforcement learning with promising tokens for large language models. arXiv preprint arXiv:2602.03195.
- [30] Aswin Rrv, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, and Ben Zhou. ThinkTuning: Instilling cognitive reflections without distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 31236–31250.
- [31] Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in RLVR. arXiv preprint arXiv:2506.10947.
- [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [33] Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, March 1995.
- [34] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Joseph Dudzik, Junyoung Chung, David Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapio...
- [35] Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, and Linfeng Zhang. Socratic-Zero: Bootstrapping reasoning via data-free agent co-evolution. arXiv preprint arXiv:2509.24726, 2025a. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xion...
- [36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [37] Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, and Dong Yu. Guided self-evolving LLMs with minimal human supervision. arXiv preprint arXiv:2512.02472.
- [38] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
- [39] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute Zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335.
- [40] Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, et al. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning. arXiv preprint arXiv:2508.16949.
- [41] Appendix A excerpt: Algorithm 1 gives the full co-evolutionary training loop (Vocabulary Dropout Co-Evolutionary Self-Play). Input: base model θ₀; retention probability α; difficulty window [τ_min, τ_max]; iterations T; proposer GRPO group size G; solver self-consistency samples M. Output: trained proposer π_P^(T) and solver π_S^(T). ...
- [42] Appendix excerpt (Table 4): cumulative question diversity (Vendi Score) over co-evolution iterations via text-embedding-3-small embeddings; Growth is the gain from iteration 1 to iterations 1–5 pooled. Cumulative Vendi Score ↑ by iteration (It. 1 / 1–2 / 1–3 / 1–4 / 1–5 / Growth): Qwen3-4B Baseline 44.6 / 46.1 / 48.9 / 49.5 / 52.1 / +7.5; Train-only 46.1 / 51.0 / 52.0 / 53.7 / 55.7 / ...
- [43] Appendix excerpt: "...and General-Reasoner (Ma et al., 2025). We additionally run the pipeline on Qwen2.5-1.5B-Instruct (Table 6). No configuration consistently improves over the untrained base on either benchmark. At the tested scale, co-evolutionary training, with or without vocabulary dropout, does not benefit instruction-tuned models in our experiments."
- [44] Benchmark entry: AIME 2024 (Mathematical Association of America), Olympiad math.
- [45] Benchmark entry: AIME 2025 (Mathematical Association of America), Olympiad math.
- [46] Appendix E.2 excerpt, training hyperparameters (Proposer / Solver): base model Qwen3-4B / Qwen3-8B; optimizer AdamW; learning rate 1×10⁻⁶; weight decay 1×10⁻²; LR warmup 0; max grad norm 1.0; KL penalty low-variance KL, β = 10⁻²; gradient checkpointing enabled; precision bfloat16; global batch size 16 / 8; micro-batch (update) 2 / 2; ...
- [47] Appendix excerpt, Vendi Score computation: the score is computed for each (experiment, iteration) pair by embedding the proposer's generated questions and taking the eigenspectrum of the cosine similarity kernel. Given $n$ questions with L2-normalized embeddings $E \in \mathbb{R}^{n \times d}$, the similarity matrix is $K = EE^\top$, and the score is $\mathrm{VS} = \exp\big(-\sum_i \hat{\lambda}_i \log \hat{\lambda}_i\big)$, where $\hat{\lambda}_i = \lambda_i / \sum_j \lambda_j$ are the normalized eigenvalues of $K$. This yields the effective number of distinct questions.
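The Vendi Score formula in [47] is straightforward to reproduce. A minimal sketch, assuming numpy and an (n, d) array of question embeddings; the function name is ours, and the paper's exact implementation is not shown.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Effective number of distinct questions via the kernel-entropy formula above."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize rows
    K = E @ E.T                                        # cosine similarity kernel, n x n
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)    # eigenvalues; clip tiny negatives
    lam_hat = lam / lam.sum()                          # normalized spectrum
    lam_hat = lam_hat[lam_hat > 0]                     # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(lam_hat * np.log(lam_hat))))
```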