Discovering Multiagent Learning Algorithms with Large Language Models

arXiv: 2602.16928 · v3 · submitted 2026-02-18 · 💻 cs.GT · cs.AI · cs.MA

Zun Li, John Schultz, Daniel Hennes, Marc Lanctot

Pith reviewed 2026-05-15 20:37 UTC · model grok-4.3

keywords multi-agent reinforcement learning, counterfactual regret minimization, policy space response oracles, large language models, algorithm discovery, generalization, ablation studies

The pith

LLM search over CFR and PSRO design spaces yields complex algorithms that distill to simpler cores with stronger generalization on unseen games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper deploys an LLM-powered evolutionary agent to explore algorithmic variants in two multi-agent game-solving paradigms. It produces Volatility-Adaptive Discounted CFR and Smoothed Hybrid Optimistic Regret PSRO, both competitive with hand-designed baselines across eighteen training games. Systematic ablations then isolate the minimal mechanisms responsible for performance, resulting in Warm-started Optimistic Predictive CFR and Projection Matching PSRO. These distilled versions retain or improve generalization while eliminating most of the original structural complexity. The work therefore supplies a concrete workflow for using language models to generate candidate algorithms and then reduce them to their essential components.
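The generate-and-select loop described above can be sketched in a few lines. This is a toy illustration, not AlphaEvolve's interface: `evolve`, `toy_fitness`, and `toy_mutate` are hypothetical names, a "candidate" is just a pair of numbers rather than a program, and a random perturbation stands in for the LLM-proposed code edit.

```python
import random

def evolve(fitness, mutate, seed_candidate, generations=50, population_size=8, rng=None):
    """Elitist evolutionary loop: propose variants, keep the best, repeat."""
    rng = rng or random.Random(0)
    population = [seed_candidate]
    for _ in range(generations):
        parents = [rng.choice(population) for _ in range(population_size)]
        children = [mutate(parent, rng) for parent in parents]
        # Lower fitness is better here (think of it as exploitability).
        population = sorted(population + children, key=fitness)[:population_size]
    return population[0]

def toy_fitness(candidate):
    # Made-up objective with its minimum at (1.5, 0.5).
    alpha, beta = candidate
    return (alpha - 1.5) ** 2 + (beta - 0.5) ** 2

def toy_mutate(candidate, rng):
    # Assumption: in an LLM-driven system this step would be a model-proposed edit.
    return tuple(x + rng.gauss(0.0, 0.2) for x in candidate)

best = evolve(toy_fitness, toy_mutate, seed_candidate=(0.0, 0.0))
```

Because the previous population is always included in the selection step, the best fitness can never get worse from one generation to the next.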

Core claim

Large language models can navigate the design spaces of counterfactual regret minimization and policy-space response oracles to produce competitive multi-agent solvers; however, the performance advantage on new games resides in a small algorithmic core that remains after the LLM-constructed synergistic mechanisms are removed by ablation.
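As background for the regret-minimization family named in the claim, here is a minimal sketch of plain regret matching, the per-decision update that CFR builds on, run in self-play on rock-paper-scissors. It is not VAD-CFR or WOP-CFR; the function names and the asymmetric initial regret are assumptions made for illustration.

```python
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row player's payoff in rock-paper-scissors

def regret_matching_policy(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

def self_play(iterations, init_regret=(1.0, 0.0, 0.0)):
    """Return the average strategy of each player after `iterations` rounds."""
    regrets = [list(init_regret), [0.0, 0.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        row, col = (regret_matching_policy(r) for r in regrets)
        # Expected payoff of each pure action against the opponent's current mix.
        row_values = [sum(PAYOFF[a][b] * col[b] for b in range(3)) for a in range(3)]
        col_values = [-sum(PAYOFF[a][b] * row[a] for a in range(3)) for b in range(3)]
        for policy, values, regret, total in zip(
                (row, col), (row_values, col_values), regrets, strategy_sums):
            expected = sum(p * v for p, v in zip(policy, values))
            for a in range(3):
                regret[a] += values[a] - expected   # instantaneous regret
                total[a] += policy[a]               # accumulate for the average
    return [[x / iterations for x in total] for total in strategy_sums]

average_row, average_col = self_play(20000)
```

In a two-player zero-sum game the average strategies, not the last iterates, are what approach equilibrium, which is why the sketch returns the running averages.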

What carries the argument

Distillation of LLM-generated algorithms via systematic ablation to recover minimal cores (Warm-started Optimistic Predictive CFR and Projection Matching PSRO) that carry the generalization benefit.
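One way to picture distillation by ablation is greedy backward elimination: switch off one component at a time and keep it off whenever the score does not get worse. The sketch below is an assumption about the general shape of such a procedure, not the paper's protocol; the component names and the scoring rule are invented for illustration.

```python
def distill(components, score, tolerance=0.0):
    """Greedy backward elimination; lower score is better."""
    kept = list(components)
    changed = True
    while changed:
        changed = False
        for component in list(kept):
            trial = [k for k in kept if k != component]
            # Drop the component if the solver is no worse without it.
            if score(trial) <= score(kept) + tolerance:
                kept = trial
                changed = True
    return kept

def toy_score(active):
    # Hypothetical scoring rule: two components help, one hurts, one is inert.
    value = 1.0
    if "core_a" in active:
        value -= 0.6
    if "core_b" in active:
        value -= 0.3
    if "extra_c" in active:
        value += 0.05
    return value

core = distill(["core_a", "core_b", "extra_c", "extra_d"], toy_score)
```

A caveat on the design: one-at-a-time removal can miss components that only matter jointly, which is relevant when mechanisms are described as tightly coupled.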

If this is right

Ablation after LLM search can routinely convert overfit, high-complexity solvers into leaner ones that generalize further.
The same discovery-plus-distillation pattern applies across both regret-minimization and oracle-response families of algorithms.
Manual refinement of MARL baselines can be partially replaced by an automated loop of generation followed by core extraction.
Structural complexity introduced to fit a training distribution often harms rather than helps performance on new distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same workflow could be applied to other imperfect-information domains such as negotiation or security games to surface previously unknown minimal mechanisms.
Over-reliance on synergistic but non-essential components may be a general property of LLM-generated algorithms, suggesting a default post-processing step of ablation in future automated discovery efforts.
If the minimal cores prove robust, they could serve as new human-readable starting points for further theoretical analysis of regret bounds or equilibrium convergence rates.

Load-bearing premise

That the performance edge of the distilled minimal cores, isolated on the eighteen-game training collection, will persist on games and rule variants never seen during search or ablation.

What would settle it

Run the distilled WOP-CFR and PM-PSRO on a fresh suite of imperfect-information games outside the original eighteen and observe whether they lose their advantage over the original complex LLM outputs or over standard human baselines.
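The check proposed above amounts to comparing a solver's margin over a baseline on games seen during search with its margin on fresh games. A minimal sketch, assuming per-game exploitability numbers are available in a dictionary; the function names, game labels, and all numbers are synthetic placeholders, not results from the paper.

```python
def mean_gap(results, games, solver, baseline):
    """Mean exploitability gap (baseline minus solver) over `games`.
    Positive values favour `solver`, since lower exploitability is better."""
    gaps = [results[g][baseline] - results[g][solver] for g in games]
    return sum(gaps) / len(gaps)

def advantage_report(results, seen_games, fresh_games, solver, baseline):
    seen = mean_gap(results, seen_games, solver, baseline)
    fresh = mean_gap(results, fresh_games, solver, baseline)
    return {"seen_gap": seen, "fresh_gap": fresh, "persists": fresh > 0}

# Synthetic numbers for illustration only.
results = {
    "seen_a": {"distilled": 0.10, "baseline": 0.20},
    "seen_b": {"distilled": 0.05, "baseline": 0.15},
    "fresh_a": {"distilled": 0.30, "baseline": 0.25},
    "fresh_b": {"distilled": 0.20, "baseline": 0.20},
}
report = advantage_report(results, ["seen_a", "seen_b"], ["fresh_a", "fresh_b"],
                          solver="distilled", baseline="baseline")
```

In this made-up case the margin is positive on seen games and vanishes on fresh ones, which is the pattern that would undercut the generalization claim.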

Original abstract

Much of the advancement in Multi-Agent Reinforcement Learning (MARL) for imperfect-information games has historically depended on the manual, iterative refinement of algorithmic baselines. Recently, evolutionary coding agents powered by Large Language Models (LLMs) have emerged as powerful tools to automate this discovery process. In this work, we deploy one of such agentic frameworks, AlphaEvolve, to navigate the design spaces of two distinct game-theoretic paradigms: counterfactual regret minimization (CFR) and policy-space response oracles (PSRO). This automated search yielded two algorithms: Volatility-Adaptive Discounted (VAD-) CFR and Smoothed Hybrid Optimistic Regret (SHOR-) PSRO, which are consistently competitive with state-of-the-art human-designed baselines across an 18-game evaluation suite spanning Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants. However, because the LLM optimizes for fitness on a specific training set, it often constructs highly synergistic, complex mechanisms tailored to those environments. Through systematic ablation studies, we demonstrate that while these mechanisms are tightly coupled, the true driver of generalization lies in a minimal algorithmic core. By distilling the LLM's discoveries down to their most fundamental principles, we produce two minimal solvers: Warm-started Optimistic Predictive (WOP-)CFR and Projection Matching (PM-)PSRO. These distilled versions achieve superior performance on generalization with greatly reduced structural complexity, providing a clear methodology for using LLMs in algorithmic discovery.
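For readers new to PSRO, the loop it refers to can be sketched on rock-paper-scissors: keep a population of policies per player, compute a meta-strategy over that population, and add a best response to it. The sketch assumes a uniform meta-strategy, which is a simple stand-in and not SHOR-PSRO or PM-PSRO; with that choice the loop reduces to fictitious play. All function names are hypothetical.

```python
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row player's payoff in rock-paper-scissors

def best_response(opponent_counts):
    """Pure best response to the opponent's population. Integer counts avoid
    floating-point ties; the game is symmetric, so one function serves both players."""
    values = [sum(PAYOFF[a][b] * opponent_counts[b] for b in range(3)) for a in range(3)]
    return values.index(max(values))

def exploitability(row_mix, col_mix):
    """Average gain available to a best-responding deviator (game value is 0)."""
    br_row = max(sum(PAYOFF[a][b] * col_mix[b] for b in range(3)) for a in range(3))
    br_col = max(-sum(PAYOFF[a][b] * row_mix[a] for a in range(3)) for b in range(3))
    return (br_row + br_col) / 2

def psro_uniform(iterations):
    # Each population starts with the single policy "always Rock".
    row_counts, col_counts = [1, 0, 0], [1, 0, 0]
    for _ in range(iterations):
        # Uniform meta-strategy: every population member is weighted equally.
        new_row = best_response(col_counts)
        new_col = best_response(row_counts)
        row_counts[new_row] += 1
        col_counts[new_col] += 1
    row_mix = [c / sum(row_counts) for c in row_counts]
    col_mix = [c / sum(col_counts) for c in col_counts]
    return row_mix, col_mix

row_mix, col_mix = psro_uniform(200)
final_exploitability = exploitability(row_mix, col_mix)
```

The meta-strategy solver is the part the paper's search modifies; swapping the uniform rule for another solver changes which opponents each new best response is trained against.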

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM search over CFR/PSRO spaces yields distillable algorithms, but same-suite ablations leave generalization open to question.

The punchline is that this work shows LLMs can automate the discovery of new multi-agent learning algorithms in the CFR and PSRO families, and that distilling those discoveries down to simpler cores can improve generalization while cutting complexity. What is new is the use of AlphaEvolve to search the design spaces for these two paradigms, resulting in VAD-CFR and SHOR-PSRO that hold their own against human-designed baselines on the 18-game suite. The distillation to WOP-CFR and PM-PSRO is a solid step, as it identifies the key mechanisms driving performance and strips away the rest. The paper does well at providing a practical methodology for using these tools in algorithmic discovery, with ablations to support the minimal core idea.

The main soft spot is the reliance on the same 18-game training suite for both the initial search and the ablations. This setup risks selecting for mechanisms that exploit regularities in Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants rather than principles that transfer to new games. The abstract does not mention a held-out test set or validation on entirely different game types, so the claim of superior generalization for the distilled versions rests on potentially circular evidence. Reproducibility would also benefit from released code and more detailed statistical analysis.

This paper is for people in MARL and computational game theory who want to explore automated ways to generate and refine solvers. It could give ideas to anyone trying to move beyond manual algorithm design. I would send it to peer review. The core idea is worth referee scrutiny even with the current limitations on the evaluation.

Referee Report

2 major / 2 minor

Summary. The paper deploys the LLM-powered evolutionary agent AlphaEvolve to search the design spaces of counterfactual regret minimization (CFR) and policy-space response oracles (PSRO). It reports two discovered algorithms (VAD-CFR and SHOR-PSRO) that are competitive with human-designed baselines on an 18-game suite (Poker, Goofspiel, Liar's Dice, Blotto, Battleship variants). Systematic ablations then distill these to minimal cores (WOP-CFR and PM-PSRO) that the authors claim achieve superior generalization performance with substantially reduced structural complexity.

Significance. If the generalization claims hold after proper validation, the work supplies a concrete methodology for LLM-assisted algorithmic discovery in multi-agent game theory, including a useful emphasis on distilling complex discovered mechanisms to minimal, interpretable cores. This could accelerate progress beyond purely manual refinement of MARL baselines.

major comments (2)

[Ablation Studies] Ablation Studies section: the identification of the 'minimal algorithmic core' (WOP-CFR, PM-PSRO) and the claim of superior generalization both rely on fitness evaluations performed exclusively on the same 18-game suite used for the original AlphaEvolve search. No held-out test partition, cross-suite validation, or explicit separation between training and evaluation games is described, so the reported performance advantage may reflect selection bias rather than discovery of transferable principles.
[Evaluation Suite and Results] Evaluation Suite and Results: the abstract asserts that the discovered algorithms are 'consistently competitive' and that the distilled solvers achieve 'superior performance on generalization,' yet the manuscript provides no statistical significance tests, confidence intervals, or variance measures across random seeds or game variants. This weakens the load-bearing claim that the minimal cores outperform baselines in a robust, generalizable manner.

minor comments (2)

[Algorithm Descriptions] The acronyms VAD-CFR, SHOR-PSRO, WOP-CFR, and PM-PSRO are introduced without a compact table that lists the precise modifications each makes to the respective baseline (CFR or PSRO) and the components removed during distillation.
[Figures] Figure captions and axis labels for the performance plots could be expanded to explicitly state the number of independent runs and the exact metric (e.g., exploitability or win rate) being plotted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting key aspects of our evaluation methodology. We address each major comment point by point below and commit to revisions that add statistical rigor and validation procedures.

Referee: [Ablation Studies] Ablation Studies section: the identification of the 'minimal algorithmic core' (WOP-CFR, PM-PSRO) and the claim of superior generalization both rely on fitness evaluations performed exclusively on the same 18-game suite used for the original AlphaEvolve search. No held-out test partition, cross-suite validation, or explicit separation between training and evaluation games is described, so the reported performance advantage may reflect selection bias rather than discovery of transferable principles.

Authors: We agree that the AlphaEvolve search and all reported fitness evaluations used the same 18-game suite, which leaves open the possibility of selection bias. The suite was chosen for its diversity across Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants precisely to encourage transferable mechanisms, and the ablations show that the distilled cores preserve performance while removing synergistic but non-essential components. To strengthen the claim, we will add an explicit held-out partition (reserving 4–5 games) and report separate training versus test performance in the revised manuscript.
revision: yes

Referee: [Evaluation Suite and Results] Evaluation Suite and Results: the abstract asserts that the discovered algorithms are 'consistently competitive' and that the distilled solvers achieve 'superior performance on generalization,' yet the manuscript provides no statistical significance tests, confidence intervals, or variance measures across random seeds or game variants. This weakens the load-bearing claim that the minimal cores outperform baselines in a robust, generalizable manner.

Authors: We acknowledge that the current manuscript reports only mean performance without variance, confidence intervals, or significance tests. In the revision we will rerun all experiments with at least 5 random seeds, report standard deviations and 95% confidence intervals, and include paired statistical tests (Wilcoxon signed-rank) comparing WOP-CFR and PM-PSRO against the strongest baselines on each game. These additions will directly support the generalization claims with quantitative evidence.
revision: yes
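The paired test named in the response can be computed exactly for a suite of this size. The sketch below is a standard-library implementation of the two-sided Wilcoxon signed-rank test, written for illustration under the assumption of one paired score per game; the function name is hypothetical and the sample data are synthetic, not taken from the paper.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (small n).
    Returns (W_plus, p_value). Zero differences are dropped; ties share average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # twice the rank, so averaged tie ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Null distribution: each rank carries a + or - sign with probability 1/2.
    counts = {0: 1}
    for r in ranks2:
        updated = {}
        for s, c in counts.items():
            updated[s] = updated.get(s, 0) + c          # sign is negative
            updated[s + r] = updated.get(s + r, 0) + c  # sign is positive
        counts = updated
    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(c for s, c in counts.items() if s <= w_min2)
    return w_plus2 / 2, min(1.0, 2 * tail / 2 ** n)

# Synthetic paired scores: six games where the first solver always scores higher.
statistic, p_value = wilcoxon_signed_rank([1, 2, 3, 4, 5, 6], [0, 0, 0, 0, 0, 0])
```

With only a handful of games the smallest attainable p-value is bounded by 2 / 2^n, so a per-family test on three or four games cannot reach conventional significance regardless of effect size.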

Circularity Check

0 steps flagged

No significant circularity; empirical discovery and ablation process is self-contained

The paper presents an LLM-driven evolutionary search (AlphaEvolve) over CFR and PSRO design spaces on a fixed 18-game training suite, followed by manual ablations to distill minimal cores (WOP-CFR, PM-PSRO) and empirical performance comparisons against baselines. No first-principles derivation, uniqueness theorem, or mathematical reduction is claimed. All central results are obtained via search, ablation, and direct evaluation on the described suite; there are no equations or parameters that are fitted to a subset and then re-presented as independent predictions, nor any load-bearing self-citations that close a definitional loop. The methodology is therefore data-driven and externally falsifiable on held-out games or variants, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that the LLM agent can productively explore the relevant design spaces and that the chosen game suite is representative for generalization testing.

axioms (2)

domain assumption: AlphaEvolve can effectively navigate the design spaces of CFR and PSRO.
Invoked when the paper states it deploys the agent to discover the algorithms.
domain assumption: Ablation studies on the training games isolate the minimal core responsible for generalization.
Central to the distillation step described in the abstract.

pith-pipeline@v0.9.0 · 5569 in / 1265 out tokens · 21964 ms · 2026-05-15T20:37:59.049046+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From I/O to Code with Discovery Agent
cs.LG 2026-05 unverdicted novelty 7.0

DIO-Agent frames IO2Code as LLM-driven evolutionary search over programs with a Transformation Priority Premise to favor simple hypotheses, outperforming baselines on a new IO2CodeBench.
