
arxiv: 2605.16727 · v1 · pith:F6QCW7GLnew · submitted 2026-05-16 · 💻 cs.AI

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

Pith reviewed 2026-05-19 21:33 UTC · model grok-4.3

classification 💻 cs.AI
keywords PopuLoRA · population self-play · LoRA adapters · asymmetric self-play · co-evolution · LLM reasoning · RLVR post-training · problem generation

The pith

A population of specialized LoRA adapters in asymmetric self-play creates a co-evolutionary arms race that improves LLM reasoning over single-agent baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PopuLoRA, a population-based asymmetric self-play method for RLVR post-training of LLMs. Teachers and students are distinct LoRA adapters on one frozen base model: teachers generate problems, matched students attempt them under a programmatic verifier, and cross-evaluation across sub-populations replaces the self-calibration that collapses single-agent loops. Weight-space mutation and crossover operators produce new same-rank population members in seconds. Built on top of Absolute Zero Reasoner, the setup produces an arms race of increasingly complex problems and oscillating solve rates that expands problem coverage. The population mean surpasses a per-adapter compute-matched single-agent baseline across three code and seven math benchmarks, and even the weakest member outperforms the baseline in aggregate.

Core claim

PopuLoRA places specialized LoRA adapters into asymmetric self-play where teachers propose problems solved by matched students under a programmatic verifier, with cross-evaluation between sub-populations replacing single-agent self-calibration. LoRA weight-space evolution operators (mutations and crossovers) produce new population members of the same rank as their parents. Compared with a per-adapter compute-matched single-agent baseline, the population avoids convergence on easy problems, enters a co-evolutionary arms race with rising problem complexity and oscillating solve rates, and delivers higher benchmark scores despite lower training-time reward.
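The reward coupling behind this claim can be sketched in a few lines. The Figure 1 caption states that the student's failure rate is the teacher's reward; everything else here (the function names, the boolean outcome list, and the zero reward for invalid problems) is an assumption made for illustration, not the paper's implementation.

```python
from statistics import mean

def student_solve_rate(outcomes):
    """Fraction of attempts the verifier accepted; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return mean(1.0 if ok else 0.0 for ok in outcomes)

def teacher_reward(outcomes, problem_is_valid=True):
    """Teacher reward = matched student's failure rate.

    Assumption (not from the source): an invalid problem earns zero reward.
    """
    if not problem_is_valid:
        return 0.0
    return 1.0 - student_solve_rate(outcomes)

outcomes = [True, False, False, True]  # four student attempts on one problem
print(student_solve_rate(outcomes))    # 0.5
print(teacher_reward(outcomes))        # 0.5
```

Under this coupling a teacher gains nothing from problems its matched student always solves, which is the pressure the page describes as pushing problem difficulty upward.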

What carries the argument

Asymmetric teacher-student roles among specialized LoRA adapters on a shared base, combined with cross-evaluation between sub-populations and LoRA weight-space mutations plus crossovers for population evolution.
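As a rough illustration of what a same-rank weight-space operator could look like, the sketch below perturbs or recombines the low-rank factors directly. The paper's actual operators (labelled M1–M6 and X1–X9 in its figures) are not described on this page, so `mutate`, `crossover`, the adapter layout, and the module names are all hypothetical stand-ins.

```python
import random

# Hypothetical adapter layout: {module_name: {"A": r x d_in, "B": d_out x r}},
# so each module's weight update is the low-rank product B @ A.

def mutate(adapter, sigma=0.01, rng=None):
    """Add Gaussian noise to every factor entry; shapes, and so the rank bound, are unchanged."""
    rng = rng or random.Random()
    return {
        name: {
            key: [[w + rng.gauss(0.0, sigma) for w in row] for row in matrix]
            for key, matrix in factors.items()
        }
        for name, factors in adapter.items()
    }

def crossover(parent_a, parent_b, rng=None):
    """Per-module crossover: each module copies its whole (A, B) pair from one parent."""
    rng = rng or random.Random()
    child = {}
    for name in parent_a:
        source = parent_a if rng.random() < 0.5 else parent_b
        child[name] = {key: [row[:] for row in matrix]
                       for key, matrix in source[name].items()}
    return child

def shape(matrix):
    return (len(matrix), len(matrix[0]))

rng = random.Random(0)

def toy_adapter(rank=2, dim=4):
    """Small random adapter used only for this demonstration."""
    return {m: {"A": [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)],
                "B": [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(dim)]}
            for m in ("q_proj", "v_proj")}

p1, p2 = toy_adapter(), toy_adapter()
child = crossover(mutate(p1, rng=rng), p2, rng=rng)
print({m: (shape(f["A"]), shape(f["B"])) for m, f in child.items()})
```

Copying each module's A and B together keeps the inherited product B @ A intact, whereas mixing A from one parent with B from the other would yield a product neither parent was trained to produce.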

If this is right

  • The population mean outperforms the single-agent baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench).
  • Even the weakest population member beats the baseline on aggregate across those benchmarks.
  • Teachers generate increasingly complex problems throughout training rather than easy ones the students already solve.
  • Student solve rates oscillate instead of converging to a stable high value.
  • Problem-space coverage continues to expand for the duration of training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Measuring the distribution of problem difficulty scores over training steps could directly test whether the arms race persists.
  • The oscillation pattern may indicate a dynamic equilibrium that keeps the population from overfitting to any fixed problem subset.
  • Extending the same teacher-student cross-evaluation structure to non-reasoning RLVR domains could sustain diversity in other verifiable tasks.
  • Population members could be periodically archived to create a growing library of increasingly capable specialist adapters.
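The first extension above, tracking difficulty over training steps, could be prototyped as follows. This is a minimal sketch assuming a log of `(step, solve_rate)` pairs per generated problem; the log format and function names are hypothetical. Difficulty is taken as 1 − solve rate, matching the definition in the Figure 3 caption.

```python
def mean_difficulty_by_step(log):
    """Group (step, solve_rate) records by step and average difficulty = 1 - solve_rate."""
    buckets = {}
    for step, solve_rate in log:
        buckets.setdefault(step, []).append(1.0 - solve_rate)
    steps = sorted(buckets)
    return steps, [sum(buckets[s]) / len(buckets[s]) for s in steps]

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

# Hypothetical log: (training step, student solve rate on one generated problem)
log = [(0, 0.9), (0, 0.7), (50, 0.6), (100, 0.4)]
steps, difficulty = mean_difficulty_by_step(log)
print(steps)
print(ols_slope(steps, difficulty))  # positive slope: mean difficulty is rising
```

A slope that stays positive when fitted over late-training windows would be consistent with a persisting arms race; a slope near zero would not.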

Load-bearing premise

Cross-evaluation between sub-populations reliably blocks self-calibration and maintains an expanding problem-space arms race instead of letting the group settle on a narrow set of solvable problems.

What would settle it

Observing that generated problem difficulty or diversity stops rising after early training steps, or that student solve rates flatten without continued oscillation, would indicate the claimed arms race has collapsed.
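One simple way to operationalise "flattening without continued oscillation" is to count direction reversals in a solve-rate series while ignoring small jitter. The sketch below is an illustrative heuristic, not a test the paper reports; the tolerance `tol` and the function name are assumptions.

```python
def count_reversals(series, tol=0.02):
    """Count direction reversals in a series, ignoring moves smaller than `tol`."""
    if len(series) < 2:
        return 0
    reversals = 0
    last_direction = 0   # 0 = no significant move seen yet
    anchor = series[0]   # last value that counted as a significant move
    for value in series[1:]:
        delta = value - anchor
        if abs(delta) < tol:
            continue
        direction = 1 if delta > 0 else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
        anchor = value
    return reversals

oscillating = [0.5, 0.8, 0.4, 0.9, 0.3]
converging = [0.2, 0.5, 0.8, 0.95, 0.96, 0.97]
print(count_reversals(oscillating))  # 3
print(count_reversals(converging))   # 0
```

Applied to sliding windows over training, a reversal count that drops to zero in later windows would point toward the collapse described above.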

Figures

Figures reproduced from arXiv: 2605.16727 by Augustine N. Mavor-Parker, Geoffrey Bradway, Lorenz Wolf, Matthew James Sargent, Maxwill Lin, Roger Creus Castanyer.

Figure 1. One PopuLoRA iteration. Matched teacher–student pairs generate and solve under a sandboxed verifier; the student’s failure rate is the teacher’s reward; every k steps, LoRA evolution replaces the weakest members.
Section 3.2 (Architecture): The population consists of N_T teacher and N_S student LoRA adapters attached to a single shared frozen code-LLM base. Every adapter has the same rank r and attaches to the same set…
Figure 2 reports greedy pass@1 across three code and seven math benchmarks, comparing the population’s mean and best teacher and best student against the per-adapter compute-matched baseline. The 8T+8S rows use their available 100-gradient-step checkpoint.
[Plot: pass@1 (y-axis, 0.0 to 1.0) per benchmark: Average, HumanEval+, MBPP+, LCB v5, AIME24, AIME25, AMC23, MATH-500, Minerva, GSM8K, Olympiad.]
Figure 3. Training dynamics. Left two panels: solver (solve rate, format rate). Right two panels: teacher (problem difficulty = 1 − solve rate, validity rate). Baseline in black, population mean in blue with per-member spread. Per-type breakdown in Appendix F.
The population’s dynamics are strikingly different. Student solve rates oscillate throughout training rather than monotonically rising. This pattern has a nat…
Figure 4. Program complexity over training. Baseline (black) trends downward on all four axes; population (blue) trends upward. Coverage analysis in Appendix E.
The difference is clear. In every panel the baseline curves trend downward: the single-agent teacher learns to produce progressively simpler programs along every axis, converging on the simplest programs it can…
Figure 5. TrueSkill μ and arms race. Left/centre: per-adapter ratings (light) and role mean (bold). Right: matchup outcome from student (blue) vs. teacher (orange) perspective; the lead alternates throughout training.
Individual adapters differentiate from the population mean as training progresses. Early on, all members cluster near the prior μ=25; by mid-training, distinct high and low performers have emerged in b…
Figure 6. LoRA operator retention (snapshot step 25). Top: mutations (parent in grey). Bottom: crossovers (two parents in grey; trained on different task types). All children recover to near-parent performance within ∼20 steps. Full operator grid in Appendix J.
The mutation results (top row) confirm that perturbed children start close to their parent and resume gradient updates without resetting to the frozen base, …
Figure 7. Population size ablation. Even a single teacher–student pair (1T+1S) avoids the baseline’s mode collapse. Co-evolutionary oscillations become more pronounced at 4T+4S and 8T+8S. The 8T+8S run shown here stops at 100 gradient steps.
Even at the smallest population size, 1T+1S, decoupling the teacher and student into separate adapters is enough to avoid the baseline’s mode collapse: the solver reward does no…
Figure 8 pairs one baseline-generated and one population-generated problem from matched training steps, drawn from the saved per-step problem archives. Picks are deterministic: at each step we take a median-complexity quality-1.0 problem, subject to a loose line-count bound so snippets fit the figure; at step 100 we additionally report the most trivial quality-1.0 baseline problem to illustrate the mode-collapse en…
Figure 9. Problem-space coverage. CVT archive grid coverage (percent of the 4096-cell budget). Baseline (black) vs population (blue).
[Plot residue: panels code_i, code_o, code_f; columns Solve rate and Validity rate (0.0 to 1.0); x-axis Training step (0 to 200).]
Figure 10. Per-type breakdown of …
Figure 11 isolates the solver’s solve rate for each of the three AZR task types. The baseline reaches near-perfect solve rate on all three types, consistent with self-calibration to easy problems. The population’s solve rate oscillates on each type, with the oscillation frequency varying across types (fastest on output prediction, slowest on induction), matching the per-type dynamics in …
Figure 12 shows per-student solve-rate profiles against each teacher at five equispaced training snapshots (steps 0, 48, 97, 146, 195).
[Plot: solve rate (y-axis, 0.0 to 1.0) against teachers T0–T3 (x-axis) for students S0–S3, one panel per snapshot.]
Figure 13. Pass@1 for each of the 4 teachers and 4 students from the 4T+4S population. The main text (…
Figure 14. Pass@1 for each of the 8 teachers and 8 students from the 8T+8S population.
Figure 15. Downstream pass@1 including the full-finetune Baseline AZR (300 gradient steps, non-LoRA). Compare with …
Figure 16. Mutation-operator retention across snapshot steps. Rows: mutation operators M1–M6 plus copy_parent control. Columns: snapshot steps (10, 25, 50, 100). Parent’s 100-step learning curve is drawn in grey, and the child’s 50-step retraining curve in colour, with the child’s x-axis offset by the snapshot step so both live on the same global-step scale.
Figure 17. Crossover-operator retention across snapshot steps. Same layout as the mutation figure: rows are X1–X9 plus the linear_0_5 plain-average control; columns are snapshot steps (10, 25, 50, 100). Parents from exp_c1 task-merging sweep in grey, child retraining in colour with the snapshot-step offset.
Figure 18. Training diagnostics. Gradient norm, policy-gradient loss, entropy. Baseline (black) vs population mean (blue) with per-member spread (light blue).
Figure 19. Response length over training. Baseline (black) collapses to short responses (∼250 tokens); population (blue) grows to ∼1000 tokens as problem complexity increases.
Original abstract

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
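The "replacement step of a population-based training loop" mentioned in the abstract might look roughly like this. It is a sketch under stated assumptions: the fitness scores, the rule of picking a random surviving parent, and the names `replace_weakest` and `make_child` are hypothetical; the source only says that LoRA evolution replaces the weakest members every k steps.

```python
import random

def replace_weakest(population, fitness, make_child, n_replace=1, rng=None):
    """Return a new population in which the n_replace lowest-fitness members
    are overwritten by children of randomly chosen surviving members.

    population: dict name -> adapter (any object)
    fitness:    dict name -> score, higher is better (hypothetical ranking signal)
    make_child: callable(parent_adapter) -> child adapter, e.g. a mutation operator
    """
    rng = rng or random.Random()
    if not 0 < n_replace < len(population):
        raise ValueError("n_replace must leave at least one survivor")
    ranked = sorted(population, key=lambda name: fitness[name])
    losers, survivors = ranked[:n_replace], ranked[n_replace:]
    new_population = dict(population)
    for name in losers:
        parent = rng.choice(survivors)
        new_population[name] = make_child(population[parent])
    return new_population

# Toy usage with strings standing in for adapters.
population = {"S0": "adapter0", "S1": "adapter1", "S2": "adapter2"}
fitness = {"S0": 27.1, "S1": 22.4, "S2": 25.0}
updated = replace_weakest(population, fitness,
                          make_child=lambda parent: parent + "+mutated",
                          rng=random.Random(0))
print(updated)
```

Because the child is built from existing adapter weights instead of being retrained from the frozen base, the replacement itself costs only the time of the weight-space operation, which is the property the abstract highlights.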

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PopuLoRA, a population-based asymmetric self-play framework for RLVR post-training of LLMs. Teachers and students are specialized LoRA adapters on a shared frozen base; teachers propose problems that matched students solve under a programmatic verifier, with cross-evaluation between sub-populations replacing self-calibration. A family of LoRA weight-space mutation and crossover operators enables population evolution at 7B scale. The central claim is that this induces a co-evolutionary arms race (increasingly complex problems, oscillating solve rates, expanding coverage) that yields superior benchmark performance: the population mean and even its weakest member outperform a compute-matched per-adapter single-agent baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), despite lower training-time reward.

Significance. If the claimed mechanism is substantiated, the work would demonstrate a practical route to scaling self-play for reasoning without the self-calibration trap that limits single-agent approaches, offering a population-level alternative to standard RLVR post-training.

major comments (2)
  1. [Abstract and experimental results] Abstract and experimental results section: the attribution of outperformance to a sustained co-evolutionary arms race with expanding problem-space coverage rests on qualitative descriptions of oscillating solve rates. No time-series statistics are reported on problem metrics (e.g., average solution length, verifier pass-rate on fixed hard subsets, or entropy of problem types) that would distinguish progressive difficulty growth from static diversity or ensemble effects.
  2. [Methods] Methods section on population loop and cross-evaluation: the claim that cross-evaluation between sub-populations reliably prevents self-calibration and sustains an expanding problem space lacks quantitative verification. The manuscript does not report how sub-population splits and matching are implemented or whether problem coverage continues to expand after initial exploration.
minor comments (2)
  1. [Results] Add error bars or multiple-run statistics to benchmark tables and training curves to support the reported outperformance claims.
  2. [Experimental setup] Clarify the exact values and sensitivity of free parameters (LoRA mutation/crossover rates, population size, sub-population split) in the experimental setup.
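Of the statistics requested in major comment 1, the entropy of problem types is straightforward to compute per training step. The sketch below uses Shannon entropy in bits; the label strings follow the `code_i`, `code_o`, `code_f` panel names that appear in the figure residue, and the per-step label lists are made up for illustration.

```python
from collections import Counter
from math import log2

def type_entropy(labels):
    """Shannon entropy (bits) of the empirical distribution over problem-type labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical batches of generated problems at two training steps.
early = ["code_i", "code_o", "code_f", "code_i", "code_o", "code_f"]
late = ["code_o", "code_o", "code_o", "code_o"]
print(type_entropy(early))  # uniform over three types: log2(3) bits
print(type_entropy(late))   # 0.0, all problems are of one type
```

Entropy over three coarse task types is a blunt instrument; a finer-grained label set would be needed to separate progressive difficulty growth from static diversity.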

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the empirical support for the co-evolutionary mechanism. We respond to each major comment below and will incorporate revisions to provide additional quantitative analyses where the current evidence is primarily qualitative.

point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and experimental results section: the attribution of outperformance to a sustained co-evolutionary arms race with expanding problem-space coverage rests on qualitative descriptions of oscillating solve rates. No time-series statistics are reported on problem metrics (e.g., average solution length, verifier pass-rate on fixed hard subsets, or entropy of problem types) that would distinguish progressive difficulty growth from static diversity or ensemble effects.

    Authors: We agree that the current presentation relies on qualitative descriptions of oscillating solve rates and expanding coverage. To better substantiate the distinction between progressive difficulty growth and alternative explanations such as static diversity or ensemble effects, we will add time-series statistics and plots in the revised experimental results section. These will include average solution length, verifier pass-rates on fixed hard subsets, and entropy of problem types over training steps. revision: yes

  2. Referee: [Methods] Methods section on population loop and cross-evaluation: the claim that cross-evaluation between sub-populations reliably prevents self-calibration and sustains an expanding problem space lacks quantitative verification. The manuscript does not report how sub-population splits and matching are implemented or whether problem coverage continues to expand after initial exploration.

    Authors: The methods section (Section 3) describes the population loop, including the division into sub-populations and the cross-evaluation matching procedure used to replace self-calibration. We acknowledge, however, that quantitative verification of continued expansion after initial exploration is not fully reported. In the revision we will add metrics and visualizations tracking problem coverage (e.g., unique problem-type counts and difficulty distributions) across training phases to confirm sustained expansion beyond the early stages. revision: yes
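A coverage metric of the kind promised here can be sketched as the share of archive cells that hold at least one problem, in the spirit of the CVT archive grid coverage reported in Figure 9. The descriptor vectors and centroids below are toy values; how the paper maps a problem to a descriptor is not stated on this page, so that mapping is left out.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def coverage_percent(descriptors, centroids):
    """Percent of cells (one per centroid) occupied by at least one descriptor."""
    if not centroids:
        raise ValueError("need at least one centroid")
    occupied = {nearest_centroid(d, centroids) for d in descriptors}
    return 100.0 * len(occupied) / len(centroids)

# Toy 2-D example with four cells; the paper's archive has a 4096-cell budget.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
descriptors = [(0.1, 0.1), (0.9, 0.2), (0.05, 0.0)]
print(coverage_percent(descriptors, centroids))  # 50.0
```

Evaluating this on the cumulative set of problems up to each training step gives a curve that can only rise or plateau, so sustained expansion shows up as a curve that keeps climbing late in training.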

Circularity Check

0 steps flagged

Empirical benchmark comparison with no self-referential derivation or fitted prediction

full rationale

The manuscript describes an algorithmic framework (PopuLoRA) instantiated on top of Absolute Zero Reasoner and evaluated via direct, compute-matched comparison to a single-adapter baseline on fixed external benchmarks (HumanEval+, MBPP+, LiveCodeBench, AIME, AMC, MATH-500, etc.). No equations, uniqueness theorems, or first-principles derivations are presented that reduce the reported performance gains to quantities defined by the method itself. The co-evolutionary narrative is supported by qualitative observations of oscillating solve rates and expanding coverage rather than any closed-loop mathematical reduction or self-citation chain. This is a standard empirical RLVR study whose central claims rest on external test sets and are therefore self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach inherits standard assumptions from RLVR and LoRA fine-tuning literature while introducing population-level operators whose effectiveness is demonstrated empirically rather than derived.

free parameters (2)
  • LoRA mutation and crossover rates
    Evolution operators require rate and selection hyperparameters that are not derived from first principles.
  • Population size and sub-population split
    Number of adapters and teacher/student ratio chosen to enable the arms race dynamic.
axioms (1)
  • domain assumption: Programmatic verifier supplies accurate and unbiased rewards for generated problems
    Central to the self-play loop; any systematic bias in verification would collapse the claimed co-evolution.
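To make the verifier axiom concrete, here is a minimal sketch of a programmatic check for an output-prediction task: the program is executed on the input and the student's predicted output is compared with the actual result. The entry-point name `f` and the three-way return value are assumptions, and plain `exec` is not a sandbox; the paper describes a sandboxed verifier, which this sketch does not reproduce.

```python
def run_program(source, arg):
    """Execute `source` and call its function `f` on `arg`.

    WARNING: exec() offers no isolation; a real verifier would run untrusted
    code in a sandbox with time and memory limits.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["f"](arg)

def verify_output_prediction(source, arg, predicted):
    """Return True/False for a correct/incorrect prediction, or None if the
    problem itself is invalid (the program fails to define or run `f`)."""
    try:
        expected = run_program(source, arg)
    except Exception:
        return None
    return predicted == expected

program = "def f(xs):\n    return sorted(xs)[::-1]\n"
print(verify_output_prediction(program, [3, 1, 2], [3, 2, 1]))  # True
print(verify_output_prediction(program, [3, 1, 2], [1, 2, 3]))  # False
print(verify_output_prediction("def f(x):\n    return 1 / 0\n", 1, 0))  # None
```

Separating "invalid problem" from "wrong answer" matters for the axiom above: if crashing programs were scored as student failures, teachers would be rewarded for generating broken problems.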

pith-pipeline@v0.9.0 · 5785 in / 1203 out tokens · 41416 ms · 2026-05-19T21:33:52.891464+00:00 · methodology



Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 17 internal anchors

  1. [1]

    and Mathieu, Micha

    Vinyals, Oriol and Babuschkin, Igor and Czarnecki, Wojciech M. and Mathieu, Micha. Grandmaster Level in. Nature , volume =. 2019 , doi =

  2. [2]

    International Conference on Learning Representations , year =

    Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play , author =. International Conference on Learning Representations , year =

  3. [3]

    Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , year =. doi:10.48550/arXiv.2402.03300 , url =. 2402.03300 , archivePrefix =

  4. [4]

    Proximal Policy Optimization Algorithms

    Proximal Policy Optimization Algorithms , author =. 2017 , eprint =. doi:10.48550/arXiv.1707.06347 , url =

  5. [5]

    Machine Learning , volume =

    Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , author =. Machine Learning , volume =. 1992 , doi =

  6. [6]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Hu, Jian and Liu, Jason Klein and Xu, Haotian and Shen, Wei , year =. doi:10.48550/arXiv.2501.03262 , url =. 2501.03262 , archivePrefix =

  7. [7]

    Advances in Neural Information Processing Systems , volume =

    Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  8. [8]

    Illuminating search spaces by mapping elites

    Illuminating Search Spaces by Mapping Elites , author =. 2015 , eprint =. doi:10.48550/arXiv.1504.04909 , url =

  9. [9]

    Using Centroidal

    Vassiliades, Vassilis and Chatzilygeroudis, Konstantinos and Mouret, Jean-Baptiste , journal =. Using Centroidal. 2018 , doi =

  10. [10]

    Advances in Neural Information Processing Systems , volume =

    Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

  11. [11]

    , booktitle =

    Wang, Rui and Lehman, Joel and Clune, Jeff and Stanley, Kenneth O. , booktitle =. 2019 , publisher =. doi:10.1145/3321707.3321799 , url =

  12. [12]

    Proceedings of the 39th International Conference on Machine Learning , pages =

    Evolving Curricula with Regret-Based Environment Design , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , volume =

  13. [13]

    International Conference on Learning Representations , year =

    Emergent Tool Use from Multi-Agent Autocurricula , author =. International Conference on Learning Representations , year =

  14. [14]

    Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

    Deep Reinforcement Learning from Self-Play in Imperfect-Information Games , author =. 2016 , eprint =. doi:10.48550/arXiv.1603.01121 , url =

  15. [15]

    Qwen2.5-Coder Technical Report

    Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and Dang, Kai and Fan, Yang and Zhang, Yichang and Yang, An and Men, Rui and Huang, Fei and Zheng, Bo and Miao, Yibo and Quan, Shanghaoran and Feng, Yunlong and Ren, Xingzhang and Ren, Xuancheng and Zhou...

  16. [16]

    Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with. 2023 , publisher =. doi:10.1145/3600006.3613165 , url =

  17. [17]

    and Stoica, Ion , booktitle =

    Sheng, Ying and Cao, Shiyi and Li, Dacheng and Hooper, Coleman and Lee, Nicholas and Yang, Shuo and Chou, Christopher and Zhu, Banghua and Zheng, Lianmin and Keutzer, Kurt and Gonzalez, Joseph E. and Stoica, Ion , booktitle =. 2024 , url =

  18. [18]

    Evaluating Large Language Models Trained on Code

    Evaluating Large Language Models Trained on Code , author =. 2021 , eprint =. doi:10.48550/arXiv.2107.03374 , url =

  19. [19]

    Is Your Code Generated by

    Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming , booktitle =. Is Your Code Generated by. 2023 , url =

  20. [20]

    Program Synthesis with Large Language Models

    Program Synthesis with Large Language Models , author =. 2021 , eprint =. doi:10.48550/arXiv.2108.07732 , url =

  21. [21]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Jain, Naman and Han, King and Gu, Alex and Li, Wen-Ding and Yan, Fanjia and Zhang, Tianjun and Wang, Sida and Solar-Lezama, Armando and Sen, Koushik and Stoica, Ion , year =. doi:10.48550/arXiv.2403.07974 , url =. 2403.07974 , archivePrefix =

  22. [22]

    Measuring Mathematical Problem Solving with the

    Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving with the. 2021 , url =

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Training Verifiers to Solve Math Word Problems , author =. 2021 , eprint =. doi:10.48550/arXiv.2110.14168 , url =

  24. [24]

    Advances in Neural Information Processing Systems , volume =

    Solving Quantitative Reasoning Problems with Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  25. [25]

    O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and Liu, Jie and Qi, Lei and Liu, Zhiyuan and Sun, Maosong , booktitle =. 2024 , address =. doi:10.18653/v1/2024.acl-long.211 , url =

  26. [26]

    International Conference on Learning Representations , year =

    Let's Verify Step by Step , author =. International Conference on Learning Representations , year =

  27. [27]

    Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H. , booktitle =. 2022 , url =

  28. [28]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data , author =. 2025 , eprint =. doi:10.48550/arXiv.2505.03335 , url =

  29. [29]

    Population Based Training of Neural Networks

    Population Based Training of Neural Networks , author =. 2017 , eprint =. doi:10.48550/arXiv.1711.09846 , url =

  30. [30]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =. 2022 , url =

  31. [31]

    2024 , volume =

    Yu, Le and Yu, Bowen and Yu, Haiyang and Huang, Fei and Li, Yongbin , booktitle =. 2024 , volume =

  32. [32]

    2023 , url =

    Yadav, Prateek and Tam, Derek and Choshen, Leshem and Raffel, Colin and Bansal, Mohit , booktitle =. 2023 , url =

  33. [33]

    International Conference on Learning Representations , year =

    Editing Models with Task Arithmetic , author =. International Conference on Learning Representations , year =

  34. [34]

    and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , booktitle =

    Jain, Neel and Chiang, Ping-Yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , booktitle =. 2024 , url =

  35. [35]

    2023 , address =

    Valipour, Mojtaba and Rezagholizadeh, Mehdi and Kobyzev, Ivan and Ghodsi, Ali , booktitle =. 2023 , address =. doi:10.18653/v1/2023.eacl-main.239 , url =

  36. [36]

    Della-merging: Reducing interference in model merging through magnitude-based sampling

    Deep, Pala Tej and Bhardwaj, Rishabh and Poria, Soujanya , year =. doi:10.48550/arXiv.2406.11617 , url =. 2406.11617 , archivePrefix =

  37. [37]

    Advances in Neural Information Processing Systems , volume =

    Merging Models with Fisher-Weighted Averaging , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  38. [38]

    Proceedings of the 41st International Conference on Machine Learning , pages =

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , volume =

  39. [39]

    2025 , doi =

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , journal =. 2025 , doi =

  40. [40]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and Huang, Shengyi and Ivison, Hamish and Brahman, Faeze and Miranda, Lester James V. and Liu, Alisa and Dziri, Nouha and Lyu, Shane and Gu, Yuling and Malik, Saumya and Graf, Victoria and Hwang, Jena D. and Yang, Jiangjiang and Le Bras, Ronan and Tafjord, Oyvind and Wilhelm, Chris and Soldaini, L...

  41. [41]

    , booktitle =

    Chen, Jiaqi and Zhang, Bang and Ma, Ruotian and Wang, Peisong and Liang, Xiaodan and Tu, Zhaopeng and Li, Xiaolong and Wong, Kwan-Yee K. , booktitle =. 2025 , url =

  42. [42]

    ICLR 2026 Workshop on AI with Recursive Self-Improvement , year =

    Jana, Swadesh and Sancaktar, Cansu and Dani. ICLR 2026 Workshop on AI with Recursive Self-Improvement , year =. 2603.15957 , archivePrefix =

  43. [43]

    Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

    Liu, Bo and Jin, Chuanyang and Kim, Seungone and Yuan, Weizhe and Zhao, Wenting and Kulikov, Ilia and Li, Xian and Sukhbaatar, Sainbayar and Lanchantin, Jack and Weston, Jason , year =. doi:10.48550/arXiv.2510.24684 , url =. 2510.24684 , archivePrefix =

  44. [44]

    doi:10.48550/arXiv.2602.05472 , url =

    Duan, Yiwen and Ye, Jing and Zhao, Xinpei , year =. doi:10.48550/arXiv.2602.05472 , url =. 2602.05472 , archivePrefix =

  45. [45]

    doi:10.48550/arXiv.2601.18292 , url =

    Tan, Zhewen and Yu, Wenhan and Si, Jianfeng and Liu, Tongxin and Guan, Kaiqi and Jin, Huiyan and Tao, Jiawen and Yuan, Xiaokun and Ma, Duohe and Zhang, Xiangzheng and Yang, Tong and Sun, Lin , year =. doi:10.48550/arXiv.2601.18292 , url =. 2601.18292 , archivePrefix =

  46. [46]

    Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

    Language Self-Play For Data-Free Training , author =. 2025 , eprint =. doi:10.48550/arXiv.2509.07414 , url =

  47. [47]

    2026 , doi =

    Chowdhury, Md Tahmid Ashraf and Ullah, Fasee and Hassan, Mohd Hilmi and Bhushan, Shashi and Kamal, Shahid and Khan, Arfat Ahmad , journal =. 2026 , doi =

  48. [48]

    Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability. 2026. doi:10.48550/arXiv.2601.18778.

  49. [49]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. 2025. doi:10.48550/arXiv.2507.17746.

  50. [50]

    Evolutionary Optimization of Model Merging Recipes. Nature Machine Intelligence, 2025.

  51. [51]

    Nature-Inspired Population-Based Evolution of Large Language Models. 2025. doi:10.48550/arXiv.2503.01155.

  52. [52]

    Evolution Strategies at the Hyperscale. 2025. doi:10.48550/arXiv.2511.16652.

  53. [53]

    Evolutionary Strategies for Scalable Alignment. 2025. doi:10.48550/arXiv.2507.04453.

  54. [54]

    Qiu, Xin and Gan, Yulu and Hayes, Conor F. and Liang, Qiyao and Xu, Yinggan and Dailey, Roberto and Meyerson, Elliot and Hodjat, Babak and Miikkulainen, Risto. Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning. arXiv preprint arXiv:2509.24372, 2025. doi:10.48550/arXiv.2509.24372.

  55. [55]

    Feng, Shangbin and Wang, Zifeng and Wang, Yike and Ebrahimi, Sayna and Palangi, Hamid and Miculicich, Lesly and Kulshrestha, Achin and Rauschmayr, Nathalie and Choi, Yejin and Tsvetkov, Yulia and Lee, Chen-Yu and Pfister, Tomas. Model Swarms: Collaborative Search to Adapt. 2025.

  56. [56]

    Feng, Shangbin and Wang, Zifeng and Goyal, Palash and Wang, Yike and Shi, Weijia and Xia, Huang and Palangi, Hamid and Zettlemoyer, Luke and Tsvetkov, Yulia and Lee, Chen-Yu and Pfister, Tomas. Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-. 2025.

  57. [57]

    Huang, Chengsong and Liu, Qian and Lin, Bill Yuchen and Pang, Tianyu and Du, Chao and Lin, Min. doi:10.48550/arXiv.2307.13269.

  58. [58]

    Buehler, Eric L. and Buehler, Markus J. doi:10.48550/arXiv.2402.07148.

  59. [59]

    Ye, Ziyu and Agarwal, Rishabh and Liu, Tianqi and Joshi, Rishabh and Velury, Sarmishta and Le, Quoc V. and Tan, Qijun and Liu, Yuan. doi:10.48550/arXiv.2411.00062.

  60. [60]

    Huang, Chengsong and Yu, Wenhao and Wang, Xiaoyang and Zhang, Hongming and Li, Zongxia and Li, Ruosen and Huang, Jiaxin and Mi, Haitao and Yu, Dong. R-Zero: Self-Evolving Reasoning LLM from Zero Data. doi:10.48550/arXiv.2508.05004.

  61. [61]

    Liu, Bo and Guertler, Leon and Yu, Simon and Liu, Zichen and Qi, Penghui and Balcells, Daniel and Liu, Mickel and Tan, Cheston and Shi, Weiyan and Lin, Min and Lee, Wee Sun and Jaques, Natasha. 2026.

  62. [62]

    Kwan, Wai-Chung and Leang, Joshua Ong Jun and Vougiouklis, Pavlos and Pan, Jeff Z. and Valentino, Marco and Minervini, Pasquale. doi:10.48550/arXiv.2511.00602.

  63. [63]

    Self-Rewarding Language Models. Proceedings of the 41st International Conference on Machine Learning, 2024.

  64. [64]

    Guan, Xinyu and Zhang, Li Lyna and Liu, Yifei and Shang, Ning and Sun, Youran and Zhu, Yi and Yang, Fan and Yang, Mao. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. doi:10.48550/arXiv.2501.04519.

  65. [65]

    Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. 2022.

  66. [66]

    Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution. Proceedings of the 41st International Conference on Machine Learning, 2024.

  67. [67]

    Herbrich, Ralf and Minka, Tom and Graepel, Thore. 2006.