Low-Variance and Zero-Variance Baselines for Extensive-Form Games

Martin Schmid; Michael Bowling; Trevor Davis

arxiv: 1907.09633 · v1 · pith:CLBIZD4Cnew · submitted 2019-07-22 · 💻 cs.GT · cs.AI

Low-Variance and Zero-Variance Baselines for Extensive-Form Games

Trevor Davis , Martin Schmid , Michael Bowling This is my paper

Pith reviewed 2026-05-24 17:17 UTC · model grok-4.3

classification 💻 cs.GT cs.AI

keywords extensive-form gamesvariance reductionbaselinesMonte Carlo CFRzero-variance estimatessampling schemescounterfactual regret minimization

0 comments

The pith

Predictive baselines enable zero-variance value estimates along sampled trajectories in extensive-form games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a general framework for baseline-corrected values that reduces the variance of Monte Carlo sampling in extensive-form games. Within this framework the authors introduce new baseline functions and prove that one choice, the predictive baseline, is optimal under particular sampling schemes. This optimality makes it possible to obtain exact value estimates from partial trajectories without enumerating the full game tree. A reader would care because current sampling algorithms for large imperfect-information games spend many iterations fighting variance; lowering that variance to zero changes how quickly those algorithms can converge.

Core claim

By defining a framework of baseline-corrected values in extensive-form games that generalizes earlier variance-reduction techniques, the authors show that new baseline functions produce substantially lower variance than prior methods, and that the predictive baseline is provably optimal for certain sampling schemes, allowing efficient computation of zero-variance value estimates even along sampled trajectories.

What carries the argument

The framework of baseline-corrected values, in which a chosen baseline function is subtracted from sampled payoffs to produce corrected estimates whose expectation remains unbiased.

If this is right

Monte Carlo Counterfactual Regret Minimization and similar sampling methods require fewer iterations to reach a given accuracy.
Value estimates for nodes reached by sampling can be treated as exact rather than noisy.
The same baseline framework can be applied to any sampling-based algorithm that estimates values along trajectories in extensive-form games.
Full tree traversal becomes unnecessary for obtaining precise values when the predictive baseline is used with compatible sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same corrective idea might be adapted to other imperfect-information settings such as partially observable Markov decision processes.
If the sampling distribution itself can be learned, the predictive baseline could be recomputed on the fly to maintain zero variance.
Combining the baseline with learned function approximation might further reduce the number of samples needed in very large games.

Load-bearing premise

The optimality proofs and zero-variance guarantee hold only for the specific sampling schemes and the particular way baselines are integrated into value estimation.

What would settle it

Implement the predictive baseline inside Monte Carlo CFR on a small, fully enumerated extensive-form game, draw many trajectories under the stated sampling scheme, and check whether the empirical variance of the corrected value estimates is numerically zero.

Figures

Figures reproduced from arXiv: 1907.09633 by Martin Schmid, Michael Bowling, Trevor Davis.

**Figure 2.** Figure 2: Log-log plots of POS MCCFR strategies with various baselines. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Exploitability of continual resolving strategies based on the maximum number of evaluations al [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

read the original abstract

Extensive-form games (EFGs) are a common model of multi-agent interactions with imperfect information. State-of-the-art algorithms for solving these games typically perform full walks of the game tree that can prove prohibitively slow in large games. Alternatively, sampling-based methods such as Monte Carlo Counterfactual Regret Minimization walk one or more trajectories through the tree, touching only a fraction of the nodes on each iteration, at the expense of requiring more iterations to converge due to the variance of sampled values. In this paper, we extend recent work that uses baseline estimates to reduce this variance. We introduce a framework of baseline-corrected values in EFGs that generalizes the previous work. Within our framework, we propose new baseline functions that result in significantly reduced variance compared to existing techniques. We show that one particular choice of such a function --- predictive baseline --- is provably optimal under certain sampling schemes. This allows for efficient computation of zero-variance value estimates even along sampled trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Generalizes prior baseline work into a framework for EFG sampling with a new predictive baseline proven optimal for zero variance under specific schemes, though the schemes may limit applicability.

read the letter

The paper's main advance is a framework for baseline-corrected values that extends earlier variance-reduction techniques in Monte Carlo sampling for extensive-form games, plus new baseline functions and a proof that the predictive baseline is optimal under certain sampling schemes. This lets them claim zero-variance estimates along sampled trajectories without full tree traversal. The generalization and the optimality result are the concrete new pieces; they give a cleaner way to think about subtracting baselines in EFGs than the prior ad-hoc approaches. The theoretical backing for zero variance is the part that stands out if the conditions hold. The soft spot is exactly the one the stress-test flags: the optimality proof appears to rest on sampling distributions that factor in a way allowing the baseline to equal the conditional expectation without already knowing the values. Standard outcome or external sampling in MCCFR does not obviously satisfy that without extra assumptions on the information sets or reach probabilities, and the abstract does not spell out whether those assumptions are needed or how restrictive they are. If the full theorems only cover contrived schemes, the practical payoff shrinks. This is aimed at people building sampling-based solvers for large imperfect-information games. It has enough formal content and a clear target problem to merit referee time, even if the experiments and assumption checks need tightening.

Referee Report

1 major / 1 minor

Summary. The paper extends baseline techniques for variance reduction in sampling-based solvers for extensive-form games (EFGs). It introduces a general framework for baseline-corrected values, proposes new baseline functions that reduce variance relative to prior methods, and identifies a predictive baseline that is provably optimal (zero variance) under certain sampling schemes, enabling efficient zero-variance estimates along sampled trajectories.

Significance. If the optimality result holds under realizable sampling distributions, the work would meaningfully advance Monte Carlo CFR variants by supplying both practical variance reduction and a theoretical guarantee of zero variance without full tree traversal. The generalization of earlier baseline work and the focus on EFG-specific information sets are positive features.

major comments (1)

[Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.

minor comments (1)

[Abstract] The abstract and introduction should explicitly list the sampling schemes (e.g., outcome sampling, external sampling) under which the predictive baseline is optimal, rather than referring only to “certain sampling schemes.”

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and the opportunity to clarify the scope of our optimality result. The concern about the conditional expectation and factorization is addressed below; we maintain that the result holds under the sampling schemes explicitly analyzed in the paper.

read point-by-point responses

Referee: [Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.

Authors: The predictive baseline is defined exactly as the conditional expectation E[V | I] at each information set I under the chosen sampling distribution. Theorem 1 proves zero variance precisely when this baseline can be realized from prior samples; the proof relies on the specific structure of the sampling schemes considered (outcome sampling and external sampling with the baseline computed via the product of reach probabilities along the sampled path). In these schemes the reach probabilities factor such that the conditional expectation is estimable without subtree traversal or knowledge of the full opposing strategy, because the sampling distribution is independent of the value function in the manner required by the martingale property used in the proof. We do not claim the result for arbitrary sampling distributions or for information-set structures that would induce non-factorizable dependencies; the paper restricts the claim to the schemes where the factorization is shown to hold. If the referee can exhibit a concrete counterexample within outcome or external sampling where the factorization fails, we would be happy to address it, but our analysis covers the standard MCCFR variants referenced in the manuscript. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivations and optimality proofs are independent of inputs

full rationale

The paper introduces a new framework for baseline-corrected values in EFGs that generalizes prior work, then derives new baseline functions including the predictive baseline. The claim of provable optimality under certain sampling schemes is presented as following from the framework's definitions and proofs (e.g., the zero-variance result along sampled trajectories). No quoted equations or steps reduce the result to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The sampling-scheme assumptions are stated explicitly as conditions for the result rather than smuggled in. This is the common case of a self-contained technical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on the abstract, no free parameters, new axioms, or invented entities are explicitly introduced; the contribution is in the framework and baseline functions.

axioms (1)

domain assumption Standard mathematical properties of extensive-form games and counterfactual regret minimization hold.
The work builds directly on EFG models and sampling methods without introducing new axioms.

pith-pipeline@v0.9.0 · 5696 in / 1124 out tokens · 25753 ms · 2026-05-24T17:17:41.192559+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

[1]

Natural actor–critic algorithms

Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009

work page 2009
[2]

Heads-up limit hold’em poker is solved

Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, January 2015

work page 2015
[3]

Safe and nested subgame solving for imperfect-information games

Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30, 2017

work page 2017
[4]

Superhuman AI for heads-up no-limit poker: Libratus beats top professionals

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018

work page 2018
[5]

Depth-limited solving for imperfect-information games

Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems 31, 2018

work page 2018
[6]

Solving imperfect information games using decomposition

Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artiﬁcial Intelligence, 2014

work page 2014
[7]

AIV AT: A new variance reduction technique for agent evaluation in imperfect information games

Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morrill, and Michael Bowling. AIV AT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artiﬁcial Intelligence, 2018

work page 2018
[8]

CFR+, 2014

Computer Poker Research Group, University of Alberta and Oskari Tammelin. CFR+, 2014. URL http://webdocs.cs.ualberta.ca/~games/poker/cfr_plus.html. [Online; accessed 23-May-2019]

work page 2014
[9]

Efﬁcient Monte Carlo counterfactual regret minimization in games with many player actions

Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron. Efﬁcient Monte Carlo counterfactual regret minimization in games with many player actions. In Proceedings of the Twenty-Sixth Conference on Advances in Neural Information Processing Systems, 2012

work page 2012
[10]

Generalized sampling and variance in counterfactual regret minimization

Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth AAAI Conference on Artiﬁcial Intelligence, 2012

work page 2012
[11]

Variance reduction techniques for gradient estimates in reinforcement learning

Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004

work page 2004
[12]

Jakobsen, Troels B

Sune K. Jakobsen, Troels B. Sørensen, and Vincent Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016

work page 2016
[13]

Doubly robust off-policy value evaluation for reinforcement learning

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016

work page 2016
[14]

Measuring the size of large no-limit poker games

Michael Johanson. Measuring the size of large no-limit poker games. Technical Report TR13-01, Department of Computing Science, University of Alberta, 2013

work page 2013
[15]

Accelerating best response calculation in large extensive games

Michael Johanson, Kevin Waugh, Michael Bowling, , and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artiﬁcial Intelligence, 2011

work page 2011
[16]

Faster ﬁrst-order methods for extensive-form game solving

Christian Kroer, Kevin Waugh, Fatma Kilinç-Karzan, and Tuomas Sandholm. Faster ﬁrst-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 2015

work page 2015
[17]

Monte Carlo sampling for regret minimization in extensive games

Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, 2009

work page 2009
[18]

Action-dependent control variates for policy optimization via Stein identity

Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. InInternational Conference on Learning Representations, 2018

work page 2018
[19]

Matej Moravˇcík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. Deepstack: Expert-level artiﬁcial intelligence in heads-up no-limit poker. Science, 356 6337:508–513, 2017

work page 2017
[20]

Osborne and Ariel Rubinstein

Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory . The MIT Press, 1994. ISBN 0-262-65040-1. 9

work page 1994
[21]

Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines

Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, and Michael Bowling. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the The Thirty-Third AAAI Conference on Artiﬁcial Intelligence, 2019

work page 2019
[22]

High-dimensional continuous control using generalized advantage estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016

work page 2016
[23]

Bayes’ bluff: Opponent modelling in poker

Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artﬁcial Intelligence, 2005

work page 2005
[24]

Actor-critic policy optimization in partially observable multiagent environments

Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems (NeurIPS), 2018

work page 2018
[25]

Solving heads-up limit Texas hold’em

Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence, 2015

work page 2015
[26]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efﬁcient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, 2016

work page 2016
[27]

The mirage of action-dependent baselines in reinforcement learning

George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th Interna- tional Conference on Machine Learning, 2018

work page 2018
[28]

Monte Carlo continual resolving for online strategy computation in imperfect information games

Michal Šustr, V ojtˇech Kovaˇrík, and Viliam Lisý. Monte Carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019

work page 2019
[29]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

work page 1992
[30]

Bayen, Sham M

Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018

work page 2018
[31]

Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information

Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, and Jun Zhu. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information. CoRR, abs/1810.04433, 2018. URL http: //arxiv.org/abs/1810.04433

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Regret minimization in games with incomplete information

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2008. 10 Appendices A MCCFR with baseline-corrected values Pseudocode for MCCFR with baseline-corrected values is given in Algorithm 1. Quantities of the form σt(h,·) ...

work page 2008

[1] [1]

Natural actor–critic algorithms

Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009

work page 2009

[2] [2]

Heads-up limit hold’em poker is solved

Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, January 2015

work page 2015

[3] [3]

Safe and nested subgame solving for imperfect-information games

Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30, 2017

work page 2017

[4] [4]

Superhuman AI for heads-up no-limit poker: Libratus beats top professionals

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018

work page 2018

[5] [5]

Depth-limited solving for imperfect-information games

Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems 31, 2018

work page 2018

[6] [6]

Solving imperfect information games using decomposition

Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artiﬁcial Intelligence, 2014

work page 2014

[7] [7]

AIV AT: A new variance reduction technique for agent evaluation in imperfect information games

Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morrill, and Michael Bowling. AIV AT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artiﬁcial Intelligence, 2018

work page 2018

[8] [8]

CFR+, 2014

Computer Poker Research Group, University of Alberta and Oskari Tammelin. CFR+, 2014. URL http://webdocs.cs.ualberta.ca/~games/poker/cfr_plus.html. [Online; accessed 23-May-2019]

work page 2014

[9] [9]

Efﬁcient Monte Carlo counterfactual regret minimization in games with many player actions

Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron. Efﬁcient Monte Carlo counterfactual regret minimization in games with many player actions. In Proceedings of the Twenty-Sixth Conference on Advances in Neural Information Processing Systems, 2012

work page 2012

[10] [10]

Generalized sampling and variance in counterfactual regret minimization

Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth AAAI Conference on Artiﬁcial Intelligence, 2012

work page 2012

[11] [11]

Variance reduction techniques for gradient estimates in reinforcement learning

Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004

work page 2004

[12] [12]

Jakobsen, Troels B

Sune K. Jakobsen, Troels B. Sørensen, and Vincent Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016

work page 2016

[13] [13]

Doubly robust off-policy value evaluation for reinforcement learning

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016

work page 2016

[14] [14]

Measuring the size of large no-limit poker games

Michael Johanson. Measuring the size of large no-limit poker games. Technical Report TR13-01, Department of Computing Science, University of Alberta, 2013

work page 2013

[15] [15]

Accelerating best response calculation in large extensive games

Michael Johanson, Kevin Waugh, Michael Bowling, , and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artiﬁcial Intelligence, 2011

work page 2011

[16] [16]

Faster ﬁrst-order methods for extensive-form game solving

Christian Kroer, Kevin Waugh, Fatma Kilinç-Karzan, and Tuomas Sandholm. Faster ﬁrst-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 2015

work page 2015

[17] [17]

Monte Carlo sampling for regret minimization in extensive games

Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, 2009

work page 2009

[18] [18]

Action-dependent control variates for policy optimization via Stein identity

Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. InInternational Conference on Learning Representations, 2018

work page 2018

[19] [19]

Matej Moravˇcík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. Deepstack: Expert-level artiﬁcial intelligence in heads-up no-limit poker. Science, 356 6337:508–513, 2017

work page 2017

[20] [20]

Osborne and Ariel Rubinstein

Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory . The MIT Press, 1994. ISBN 0-262-65040-1. 9

work page 1994

[21] [21]

Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines

Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, and Michael Bowling. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the The Thirty-Third AAAI Conference on Artiﬁcial Intelligence, 2019

work page 2019

[22] [22]

High-dimensional continuous control using generalized advantage estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016

work page 2016

[23] [23]

Bayes’ bluff: Opponent modelling in poker

Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artﬁcial Intelligence, 2005

work page 2005

[24] [24]

Actor-critic policy optimization in partially observable multiagent environments

Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems (NeurIPS), 2018

work page 2018

[25] [25]

Solving heads-up limit Texas hold’em

Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence, 2015

work page 2015

[26] [26]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efﬁcient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, 2016

work page 2016

[27] [27]

The mirage of action-dependent baselines in reinforcement learning

George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th Interna- tional Conference on Machine Learning, 2018

work page 2018

[28] [28]

Monte Carlo continual resolving for online strategy computation in imperfect information games

Michal Šustr, V ojtˇech Kovaˇrík, and Viliam Lisý. Monte Carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019

work page 2019

[29] [29]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

work page 1992

[30] [30]

Bayen, Sham M

Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018

work page 2018

[31] [31]

Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information

Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, and Jun Zhu. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information. CoRR, abs/1810.04433, 2018. URL http: //arxiv.org/abs/1810.04433

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Regret minimization in games with incomplete information

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2008. 10 Appendices A MCCFR with baseline-corrected values Pseudocode for MCCFR with baseline-corrected values is given in Algorithm 1. Quantities of the form σt(h,·) ...

work page 2008