Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning

Daniel Hennes; Ian Gemp; Kate Larson; Kevin R. McKee; Luke Marris; Marc Lanctot; Michael P. Wellman; Paul Muller; Yoram Bachrach; Zun Li

arxiv: 2302.00797 · v4 · submitted 2023-02-01 · 💻 cs.AI · cs.GT· cs.LG· cs.MA

Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning

Zun Li , Marc Lanctot , Kevin R. McKee , Luke Marris , Ian Gemp , Daniel Hennes , Paul Muller , Kate Larson

show 2 more authors

Yoram Bachrach Michael P. Wellman

This is my paper

Pith reviewed 2026-05-24 10:01 UTC · model grok-4.3

classification 💻 cs.AI cs.GTcs.LGcs.MA

keywords opponent modelinggame-theoretic reinforcement learningMonte Carlo Tree Searchgenerative modelsNash bargainingPolicy Space Response Oraclesmulti-agent systemsimperfect information games

0 comments

The pith

Generative Best Response uses MCTS and a learned deep generative model to scale opponent modeling to large imperfect-information games and produce human-comparable negotiation agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a training regime that combines Monte Carlo Tree Search with a deep generative model to compute best responses without domain-specific heuristics. This Generative Best Response method is embedded in Policy Space Response Oracles to iteratively build an offline opponent population whose mixture is chosen via Nash bargaining solution concepts. During play the same search procedure maintains an online Bayesian belief over the current co-player and selects actions against it. In bilateral bargaining games the resulting agents reach social welfare and Nash bargaining scores comparable to those achieved when humans negotiate among themselves.

Core claim

Generative Best Response (GenBR) is a best-response algorithm based on Monte-Carlo Tree Search with a learned deep generative model that samples world states during planning; it scales to large imperfect-information domains, integrates into Policy Space Response Oracles to automate offline opponent-model construction via iterative game-theoretic reasoning and bargaining-based population mixtures, and supports online Bayesian co-player prediction that yields policies whose social welfare and Nash bargaining scores with humans match those of human-human play.

What carries the argument

Generative Best Response (GenBR): Monte-Carlo Tree Search whose state sampling is driven by a learned deep generative model instead of domain heuristics, used both inside PSRO for offline population construction and for online best-response play.

If this is right

Search with generative modeling produces stronger policies both during population training and at test time.
The same procedure enables online Bayesian updating of beliefs over the current co-player.
Agents reach social welfare and Nash bargaining scores with humans that are statistically comparable to human-human negotiation scores.
Bargaining-theory solution concepts can replace heuristic mixture rules when selecting which opponent policies to retain in the population.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend to other imperfect-information settings where an accurate generative model of states can be learned from self-play data.
Replacing the bargaining mixture with other solution concepts could trade off individual payoff against collective welfare in different proportions.
Because GenBR is described as plug-and-play, it could be substituted into other multi-agent algorithms that currently rely on exact best-response oracles.

Load-bearing premise

The deep generative model that supplies world states to the tree search must remain accurate when the game grows large or when the distribution of play shifts.

What would settle it

A controlled test in which the generative model is trained on one set of bargaining instances and then used for planning on a held-out set of larger or differently distributed instances; if the resulting policies lose their reported advantage over baselines, the claim fails.

Figures

Figures reproduced from arXiv: 2302.00797 by Daniel Hennes, Ian Gemp, Kate Larson, Kevin R. McKee, Luke Marris, Marc Lanctot, Michael P. Wellman, Paul Muller, Yoram Bachrach, Zun Li.

**Figure 1.** Figure 1: Example negotiation game in extensive-form. In “Deal or No Deal”, the game starts at the empty history (∅), chance samples a public pool of resources and private preferences for each player, then players alternate proposals for how to split the resources. 𝑢𝑖(𝜋 0 𝑖 , 𝜇) − 𝑢𝑖(𝜇) ≤ 0 for all 𝑖 ∈ N, 𝜋0 𝑖 ∈ 𝛱𝑖 . Similarly, define 𝑢𝑖(𝜋 0 𝑖 , 𝜇|𝜋 00 𝑖 ) to be the expected utility of deviating to 𝜋 0 𝑖 given that … view at source ↗

**Figure 3.** Figure 3: Three-Player Colored Trails. been actively studied by the AI community (Grosz et al., 2004; Ficici & Pfeffer, 2008b; Gal et al., 2010b). Colored Trails does not require search since the number of moves is small, so we use classical RL based oracles (DQN and Boltzmann DQN) to isolate the effects of the new metastrategy solvers. Furthermore, it has a property that most benchmark games do not: it is paramete… view at source ↗

**Figure 4.** Figure 4: Best response performance using different generative models, against (left) uniform random opponent, (middle) DQN response to uniform random, (right) self-play DQN opponent. Uniform samples a legal preference vector uniformly at random, bad1 always samples the first legal instance in the database, bad2 always samples the last legal instance in the database, cheat always samples the actual underlying world … view at source ↗

**Figure 5.** Figure 5: Empirical reduction in Pareto Gap on test game configurations, and example evolution toward Pareto front (right). agrams, see Appendix G.2). The best-performing MSS is NBS-joint, beating the next best by a full 3 points. The NBS meta-strategy solvers comprise five of the six best MSSs under this evaluation. An example of the evolution of the expected score over PSRO iterations is also shown, moving toward… view at source ↗

**Figure 6.** Figure 6: Convergence to Nash eq. and social welfare across game types. NASHCONV in 3P Leduc poker using (a) DQN oracles vs. (d) exact oracles. NASHCONV (b) and (e) social welfare in Sheriff. NASHCONV (c) and social welfare (f) in 2P Tiny Bridge. games, three 𝑛-player zero-sum games, two common-payoff games, and four general-sum games): instances of Kuhn and Leduc poker, Liar’s dice, Trade Comm, Tiny Bridge, Battles… view at source ↗

**Figure 7.** Figure 7: Empirical Convergence to Nash Equilibria using Exact vs. DQN Best Response versus in Two-Player Zero-Sum Benchmark Games. G.2. Colored Trails The full Pareto gap graphs as a function of training time is shown in [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Empirical Convergence to Nash Equilibria using Exact vs. DQN Best Response in 𝑛-Player Zero-Sum Benchmark Games. the roles (first-mover or second-mover in DoND) to our agents. Then we decide the selected agents based on this empirical game matrix. Specifically, we want to select: (1) the most competitive agents (2) the most collaborative agents and (3) the fairest agent in our pool in a principled way. Now… view at source ↗

**Figure 9.** Figure 9: Empirical Convergence to Nash Equilibria and Social Welfare in Common Payoff Benchmark Games [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Empirical Convergence to Nash Equilibria and Social Welfare in General-Sum Benchmark Games [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

**Figure 11.** Figure 11: Average Pareto gap using DQN best response (top: (a) and (b)) and Boltzmann DQN (bottom: (c) and (d)), training gap (left: (a) and (c)) and gap on held-out test boards (right: (b) and (d)) [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗

**Figure 12.** Figure 12: Evolution of the expected outcomes of the PSRO agents using the DQN best response type and social welfare MSS. Each diagram depicts the outcome of the agent for a single configuration of Colored Trails: circles represent the rational outcomes (pure joint strategies where players have non-negative gain). The outer surface of the convex hull represents the Pareto front/envelope. To make a 2D image, the prop… view at source ↗

**Figure 13.** Figure 13: Value of exploits found by ABR with uinform beliefs against search-enhanced PSRO agents. The y-axis here is an approximation of NashConv in the same sense as used in (Timbers et al., 2022) [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 14.** Figure 14: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant reads general study information. (b) The participant reads tutorial information about the game [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗

**Figure 15.** Figure 15: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant reads tutorial information about the game. (b) The participant reads tutorial information about the game [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗

**Figure 16.** Figure 16: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant reads tutorial information about the game. (b) The participant reads tutorial information about the game [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗

**Figure 17.** Figure 17: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant reads tutorial information about the game. (b) The participant reads tutorial information about the game [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗

**Figure 18.** Figure 18: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant reads tutorial information about the game. (b) The participant reads information about the comprehension test [PITH_FULL_IMAGE:figures/full_fig_p035_18.png] view at source ↗

**Figure 19.** Figure 19: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant takes the comprehension test. (b) The participant takes the comprehension test [PITH_FULL_IMAGE:figures/full_fig_p036_19.png] view at source ↗

**Figure 20.** Figure 20: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant takes the comprehension test. (b) The participant takes the comprehension test [PITH_FULL_IMAGE:figures/full_fig_p037_20.png] view at source ↗

**Figure 21.** Figure 21: Screenshots of the participant interface for the “Deal or No Deal” study. The participant sees their results for the comprehension test. If they answered all four questions correctly, the participant waits to be randomly assigned to a game session [PITH_FULL_IMAGE:figures/full_fig_p038_21.png] view at source ↗

**Figure 22.** Figure 22: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant proposes a deal to the other player. (b) The other player chooses to confirm the deal or proposes a new deal [PITH_FULL_IMAGE:figures/full_fig_p039_22.png] view at source ↗

**Figure 23.** Figure 23: Screenshots of the participant interface for the “Deal or No Deal” study. (a) The participant waits for the next game to start. (b) The participant sees their bonus and waits to begin the post-task questionnaire [PITH_FULL_IMAGE:figures/full_fig_p040_23.png] view at source ↗

read the original abstract

Opponent modeling methods typically involve two crucial steps: building a belief distribution over opponents' strategies, and exploiting this opponent model by playing a best response. However, existing approaches typically require domain-specific heurstics to come up with such a model, and algorithms for approximating best responses are hard to scale in large, imperfect information domains. In this work, we introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Respoonse (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS) with a learned deep generative model that samples world states during planning. This new method scales to large imperfect information domains and can be plug and play in a variety of multiagent algorithms. We use this new method under the framework of Policy Space Response Oracles (PSRO), to automate the generation of an \emph{offline opponent model} via iterative game-theoretic reasoning and population-based training. We propose using solution concepts based on bargaining theory to build up an opponent mixture, which we find identifying profiles that are near the Pareto frontier. Then GenBR keeps updating an \emph{online opponent model} and reacts against it during gameplay. We conduct behavioral studies where human participants negotiate with our agents in Deal-or-No-Deal, a class of bilateral bargaining games. Search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare and Nash bargaining score negotiating with humans as humans trading among themselves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GenBR uses a learned generative model inside MCTS for best-response approximation in PSRO plus bargaining for opponent mixtures, but the accuracy of that model is not shown and is load-bearing for the scaling and human results.

read the letter

The paper's core move is GenBR: Monte Carlo tree search that draws world-state samples from a trained deep generative model instead of relying on domain heuristics. They plug this into PSRO to build an offline population, then use Nash bargaining concepts to select opponent mixtures near the Pareto frontier. During play they keep an online Bayesian update over the co-player and respond with GenBR. The human study in Deal-or-No-Deal is the main empirical piece, claiming the resulting agents reach social welfare and Nash bargaining scores comparable to human-human trades.

Referee Report

2 major / 2 minor

Summary. The paper introduces Generative Best Response (GenBR), an MCTS-based best-response algorithm that employs a learned deep generative model to sample world states during planning in large imperfect-information games. This is embedded in a PSRO framework to iteratively construct an offline opponent population using Nash bargaining solution concepts for mixture selection, with an additional online Bayesian update for co-player prediction at test time. Behavioral experiments in the Deal-or-No-Deal bilateral bargaining domain are reported to show that the resulting agents achieve social welfare and Nash bargaining scores comparable to those obtained in human-human negotiations.

Significance. If the generative model is shown to be sufficiently accurate, the work provides a generic, heuristic-free route to scalable opponent modeling that combines generative modeling, MCTS, and game-theoretic population training. The explicit use of bargaining-theoretic solution concepts to select opponent mixtures and the demonstration of online Bayesian adaptation are distinctive contributions that could influence multi-agent RL in imperfect-information settings and AI negotiation systems.

major comments (2)

[GenBR algorithm description] The description of GenBR (abstract and the section presenting the algorithm) supplies no training objective, network architecture, or quantitative metrics (reconstruction error, state validity rate, or ablation on sampling quality) for the deep generative model. This is load-bearing: the central claim that “search with generative modeling finds stronger policies” and enables reliable online Bayesian prediction rests on the assumption that sampled states are sufficiently accurate; without these diagnostics the PSRO iterations and test-time results cannot be evaluated.
[Behavioral studies / experimental results] The behavioral studies section reports human negotiation results but provides no information on participant numbers, trial counts, statistical tests, or explicit baselines (e.g., standard PSRO without the generative component or existing opponent-modeling methods). This undermines assessment of the claim that the agents achieve “comparable social welfare and Nash bargaining score negotiating with humans as humans trading among themselves.”

minor comments (2)

[Abstract] Abstract contains two typographical errors: “heurstics” and “Respoonse.”
[PSRO and online update sections] The notation used for the opponent mixture weights and the online Bayesian update is introduced only in prose; explicit equations would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional detail will strengthen the manuscript. We address each major comment below and will incorporate revisions as noted.

read point-by-point responses

Referee: [GenBR algorithm description] The description of GenBR (abstract and the section presenting the algorithm) supplies no training objective, network architecture, or quantitative metrics (reconstruction error, state validity rate, or ablation on sampling quality) for the deep generative model. This is load-bearing: the central claim that “search with generative modeling finds stronger policies” and enables reliable online Bayesian prediction rests on the assumption that sampled states are sufficiently accurate; without these diagnostics the PSRO iterations and test-time results cannot be evaluated.

Authors: We agree that the current manuscript provides insufficient detail on the generative model component of GenBR. In the revised version we will add the training objective (including the specific loss function used), network architecture specifications, and quantitative diagnostics such as reconstruction error, state validity rate, and an ablation study on sampling quality. These additions will directly support the claims regarding policy strength and online prediction reliability. revision: yes
Referee: [Behavioral studies / experimental results] The behavioral studies section reports human negotiation results but provides no information on participant numbers, trial counts, statistical tests, or explicit baselines (e.g., standard PSRO without the generative component or existing opponent-modeling methods). This undermines assessment of the claim that the agents achieve “comparable social welfare and Nash bargaining score negotiating with humans as humans trading among themselves.”

Authors: We agree that the behavioral studies section requires additional methodological and comparative detail. The revised manuscript will report participant numbers, trial counts, the statistical tests performed, and will include explicit baselines (standard PSRO without the generative model and at least one existing opponent-modeling approach) to allow proper evaluation of the human-comparable performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity: method extends PSRO/MCTS with independent generative component

full rationale

The paper introduces GenBR as MCTS augmented by a learned generative model for state sampling, then embeds it in PSRO for offline population generation and online Bayesian updates. No equation or procedure reduces a claimed output to a fitted input by construction, nor does any load-bearing premise collapse to a self-citation whose validity is presupposed. PSRO is treated as an external, previously published framework; the generative model is trained on data and evaluated empirically rather than defined to reproduce its own training targets. The human negotiation results are presented as experimental outcomes, not as logical consequences of the method's own definitions. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim relies on the assumption that the generative model works well and that bargaining theory applies effectively to the opponent selection.

free parameters (1)

parameters of the generative model
The deep generative model is learned from data, introducing fitted parameters.

axioms (1)

domain assumption Nash bargaining solution concepts can be used to identify opponent mixtures near the Pareto frontier in the PSRO framework.
Invoked for building the opponent mixture.

pith-pipeline@v0.9.0 · 5863 in / 1231 out tokens · 24097 ms · 2026-05-24T10:01:58.186106+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper

[1]

Liar” ends the game, then both players reveal their dice. If the last bid is not satisﬁed, then the player who called “Liar

doi: 10.1038/s41586-019-1724-z. URL https: //doi.org/10.1038/s41586-019-1724-z . Wang, T. T., Gleave, A., Belrose, N., Tseng, T., Miller, J., Pelrine, K., Dennis, M. D., Duan, Y ., Pogrebniak, V ., Levine, S., and Russell, S. Adversarial policies beat superhuman Go AIs, 2022. URL https://arxiv. org/abs/2211.00241. Wang, Y ., Shi, Z., Yu, L., Wu, Y ., Sing...

work page doi:10.1038/s41586-019-1724-z 2022
[2]

Each player is independently dealt one of num items with uniform chance

work page
[3]

Player 1 makes one of num utterances utterances, which is observed by player 2

work page
[4]

Player 2 makes one of num utterances utterances, which is observed by player 1

work page
[5]

tournaments

Both players privately request one of the num items num items possible trades. The trade is successful if and only if both player 1 asks to trade its item for player 2’s item and player 2 asks to trade its item for player 1’s item. Both players receive a reward of 1 if the trade is successful and 0 otherwise. We use num items = num utterances = 10. E.1.5....

work page 2019
[6]

Read study instructions and gameplay tutorial (Figures 14–18)

work page
[7]

Take comprehension test (Figures 19 & 20)

work page
[8]

Wait for random assignment to a tournament with ﬁve other participants (HvH) or wait for agents to load for a tournament (HvA; Figure 21)

work page
[9]

Play episode of Deal or No Deal game (Figure 22)

work page
[10]

See score conﬁrmation for last episode and wait for next episode (Figure 23a)

work page
[11]

Repeat steps 4 and 5 for four additional episodes

work page
[12]

timed out

Note total earnings and transition to post-game questionnaire (Figure 23b). We required participants to answer all four questions in the comprehension test correctly to continue to the rest of the study. The majority of participants (71.4%) passed the test and were randomly sorted into tournaments in groups of 𝑛 = 6 (for the HvH condition) or 𝑛 = 1 (for t...

work page 2011

[1] [1]

Liar” ends the game, then both players reveal their dice. If the last bid is not satisﬁed, then the player who called “Liar

doi: 10.1038/s41586-019-1724-z. URL https: //doi.org/10.1038/s41586-019-1724-z . Wang, T. T., Gleave, A., Belrose, N., Tseng, T., Miller, J., Pelrine, K., Dennis, M. D., Duan, Y ., Pogrebniak, V ., Levine, S., and Russell, S. Adversarial policies beat superhuman Go AIs, 2022. URL https://arxiv. org/abs/2211.00241. Wang, Y ., Shi, Z., Yu, L., Wu, Y ., Sing...

work page doi:10.1038/s41586-019-1724-z 2022

[2] [2]

Each player is independently dealt one of num items with uniform chance

work page

[3] [3]

Player 1 makes one of num utterances utterances, which is observed by player 2

work page

[4] [4]

Player 2 makes one of num utterances utterances, which is observed by player 1

work page

[5] [5]

tournaments

Both players privately request one of the num items num items possible trades. The trade is successful if and only if both player 1 asks to trade its item for player 2’s item and player 2 asks to trade its item for player 1’s item. Both players receive a reward of 1 if the trade is successful and 0 otherwise. We use num items = num utterances = 10. E.1.5....

work page 2019

[6] [6]

Read study instructions and gameplay tutorial (Figures 14–18)

work page

[7] [7]

Take comprehension test (Figures 19 & 20)

work page

[8] [8]

Wait for random assignment to a tournament with ﬁve other participants (HvH) or wait for agents to load for a tournament (HvA; Figure 21)

work page

[9] [9]

Play episode of Deal or No Deal game (Figure 22)

work page

[10] [10]

See score conﬁrmation for last episode and wait for next episode (Figure 23a)

work page

[11] [11]

Repeat steps 4 and 5 for four additional episodes

work page

[12] [12]

timed out

Note total earnings and transition to post-game questionnaire (Figure 23b). We required participants to answer all four questions in the comprehension test correctly to continue to the rest of the study. The majority of participants (71.4%) passed the test and were randomly sorted into tournaments in groups of 𝑛 = 6 (for the HvH condition) or 𝑛 = 1 (for t...

work page 2011