pith. sign in

arxiv: 1907.09633 · v1 · pith:CLBIZD4Cnew · submitted 2019-07-22 · 💻 cs.GT · cs.AI

Low-Variance and Zero-Variance Baselines for Extensive-Form Games

Pith reviewed 2026-05-24 17:17 UTC · model grok-4.3

classification 💻 cs.GT cs.AI
keywords extensive-form gamesvariance reductionbaselinesMonte Carlo CFRzero-variance estimatessampling schemescounterfactual regret minimization
0
0 comments X

The pith

Predictive baselines enable zero-variance value estimates along sampled trajectories in extensive-form games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a general framework for baseline-corrected values that reduces the variance of Monte Carlo sampling in extensive-form games. Within this framework the authors introduce new baseline functions and prove that one choice, the predictive baseline, is optimal under particular sampling schemes. This optimality makes it possible to obtain exact value estimates from partial trajectories without enumerating the full game tree. A reader would care because current sampling algorithms for large imperfect-information games spend many iterations fighting variance; lowering that variance to zero changes how quickly those algorithms can converge.

Core claim

By defining a framework of baseline-corrected values in extensive-form games that generalizes earlier variance-reduction techniques, the authors show that new baseline functions produce substantially lower variance than prior methods, and that the predictive baseline is provably optimal for certain sampling schemes, allowing efficient computation of zero-variance value estimates even along sampled trajectories.

What carries the argument

The framework of baseline-corrected values, in which a chosen baseline function is subtracted from sampled payoffs to produce corrected estimates whose expectation remains unbiased.

If this is right

  • Monte Carlo Counterfactual Regret Minimization and similar sampling methods require fewer iterations to reach a given accuracy.
  • Value estimates for nodes reached by sampling can be treated as exact rather than noisy.
  • The same baseline framework can be applied to any sampling-based algorithm that estimates values along trajectories in extensive-form games.
  • Full tree traversal becomes unnecessary for obtaining precise values when the predictive baseline is used with compatible sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same corrective idea might be adapted to other imperfect-information settings such as partially observable Markov decision processes.
  • If the sampling distribution itself can be learned, the predictive baseline could be recomputed on the fly to maintain zero variance.
  • Combining the baseline with learned function approximation might further reduce the number of samples needed in very large games.

Load-bearing premise

The optimality proofs and zero-variance guarantee hold only for the specific sampling schemes and the particular way baselines are integrated into value estimation.

What would settle it

Implement the predictive baseline inside Monte Carlo CFR on a small, fully enumerated extensive-form game, draw many trajectories under the stated sampling scheme, and check whether the empirical variance of the corrected value estimates is numerically zero.

Figures

Figures reproduced from arXiv: 1907.09633 by Martin Schmid, Michael Bowling, Trevor Davis.

Figure 1
Figure 1. Figure 1: Log-log plots of convergence of MCCFR strategies with various baselines. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Log-log plots of POS MCCFR strategies with various baselines. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Exploitability of continual resolving strategies based on the maximum number of evaluations al [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
read the original abstract

Extensive-form games (EFGs) are a common model of multi-agent interactions with imperfect information. State-of-the-art algorithms for solving these games typically perform full walks of the game tree that can prove prohibitively slow in large games. Alternatively, sampling-based methods such as Monte Carlo Counterfactual Regret Minimization walk one or more trajectories through the tree, touching only a fraction of the nodes on each iteration, at the expense of requiring more iterations to converge due to the variance of sampled values. In this paper, we extend recent work that uses baseline estimates to reduce this variance. We introduce a framework of baseline-corrected values in EFGs that generalizes the previous work. Within our framework, we propose new baseline functions that result in significantly reduced variance compared to existing techniques. We show that one particular choice of such a function --- predictive baseline --- is provably optimal under certain sampling schemes. This allows for efficient computation of zero-variance value estimates even along sampled trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper extends baseline techniques for variance reduction in sampling-based solvers for extensive-form games (EFGs). It introduces a general framework for baseline-corrected values, proposes new baseline functions that reduce variance relative to prior methods, and identifies a predictive baseline that is provably optimal (zero variance) under certain sampling schemes, enabling efficient zero-variance estimates along sampled trajectories.

Significance. If the optimality result holds under realizable sampling distributions, the work would meaningfully advance Monte Carlo CFR variants by supplying both practical variance reduction and a theoretical guarantee of zero variance without full tree traversal. The generalization of earlier baseline work and the focus on EFG-specific information sets are positive features.

major comments (1)
  1. [Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.
minor comments (1)
  1. [Abstract] The abstract and introduction should explicitly list the sampling schemes (e.g., outcome sampling, external sampling) under which the predictive baseline is optimal, rather than referring only to “certain sampling schemes.”

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and the opportunity to clarify the scope of our optimality result. The concern about the conditional expectation and factorization is addressed below; we maintain that the result holds under the sampling schemes explicitly analyzed in the paper.

read point-by-point responses
  1. Referee: [Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.

    Authors: The predictive baseline is defined exactly as the conditional expectation E[V | I] at each information set I under the chosen sampling distribution. Theorem 1 proves zero variance precisely when this baseline can be realized from prior samples; the proof relies on the specific structure of the sampling schemes considered (outcome sampling and external sampling with the baseline computed via the product of reach probabilities along the sampled path). In these schemes the reach probabilities factor such that the conditional expectation is estimable without subtree traversal or knowledge of the full opposing strategy, because the sampling distribution is independent of the value function in the manner required by the martingale property used in the proof. We do not claim the result for arbitrary sampling distributions or for information-set structures that would induce non-factorizable dependencies; the paper restricts the claim to the schemes where the factorization is shown to hold. If the referee can exhibit a concrete counterexample within outcome or external sampling where the factorization fails, we would be happy to address it, but our analysis covers the standard MCCFR variants referenced in the manuscript. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivations and optimality proofs are independent of inputs

full rationale

The paper introduces a new framework for baseline-corrected values in EFGs that generalizes prior work, then derives new baseline functions including the predictive baseline. The claim of provable optimality under certain sampling schemes is presented as following from the framework's definitions and proofs (e.g., the zero-variance result along sampled trajectories). No quoted equations or steps reduce the result to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The sampling-scheme assumptions are stated explicitly as conditions for the result rather than smuggled in. This is the common case of a self-contained technical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on the abstract, no free parameters, new axioms, or invented entities are explicitly introduced; the contribution is in the framework and baseline functions.

axioms (1)
  • domain assumption Standard mathematical properties of extensive-form games and counterfactual regret minimization hold.
    The work builds directly on EFG models and sampling methods without introducing new axioms.

pith-pipeline@v0.9.0 · 5696 in / 1124 out tokens · 25753 ms · 2026-05-24T17:17:41.192559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    Natural actor–critic algorithms

    Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009

  2. [2]

    Heads-up limit hold’em poker is solved

    Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, January 2015

  3. [3]

    Safe and nested subgame solving for imperfect-information games

    Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30, 2017

  4. [4]

    Superhuman AI for heads-up no-limit poker: Libratus beats top professionals

    Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018

  5. [5]

    Depth-limited solving for imperfect-information games

    Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems 31, 2018

  6. [6]

    Solving imperfect information games using decomposition

    Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014

  7. [7]

    AIV AT: A new variance reduction technique for agent evaluation in imperfect information games

    Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morrill, and Michael Bowling. AIV AT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  8. [8]

    CFR+, 2014

    Computer Poker Research Group, University of Alberta and Oskari Tammelin. CFR+, 2014. URL http://webdocs.cs.ualberta.ca/~games/poker/cfr_plus.html. [Online; accessed 23-May-2019]

  9. [9]

    Efficient Monte Carlo counterfactual regret minimization in games with many player actions

    Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron. Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In Proceedings of the Twenty-Sixth Conference on Advances in Neural Information Processing Systems, 2012

  10. [10]

    Generalized sampling and variance in counterfactual regret minimization

    Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012

  11. [11]

    Variance reduction techniques for gradient estimates in reinforcement learning

    Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004

  12. [12]

    Jakobsen, Troels B

    Sune K. Jakobsen, Troels B. Sørensen, and Vincent Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016

  13. [13]

    Doubly robust off-policy value evaluation for reinforcement learning

    Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016

  14. [14]

    Measuring the size of large no-limit poker games

    Michael Johanson. Measuring the size of large no-limit poker games. Technical Report TR13-01, Department of Computing Science, University of Alberta, 2013

  15. [15]

    Accelerating best response calculation in large extensive games

    Michael Johanson, Kevin Waugh, Michael Bowling, , and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011

  16. [16]

    Faster first-order methods for extensive-form game solving

    Christian Kroer, Kevin Waugh, Fatma Kilinç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 2015

  17. [17]

    Monte Carlo sampling for regret minimization in extensive games

    Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, 2009

  18. [18]

    Action-dependent control variates for policy optimization via Stein identity

    Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. InInternational Conference on Learning Representations, 2018

  19. [19]

    Matej Moravˇcík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356 6337:508–513, 2017

  20. [20]

    Osborne and Ariel Rubinstein

    Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory . The MIT Press, 1994. ISBN 0-262-65040-1. 9

  21. [21]

    Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines

    Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, and Michael Bowling. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the The Thirty-Third AAAI Conference on Artificial Intelligence, 2019

  22. [22]

    High-dimensional continuous control using generalized advantage estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016

  23. [23]

    Bayes’ bluff: Opponent modelling in poker

    Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artficial Intelligence, 2005

  24. [24]

    Actor-critic policy optimization in partially observable multiagent environments

    Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  25. [25]

    Solving heads-up limit Texas hold’em

    Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015

  26. [26]

    Thomas and Emma Brunskill

    Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, 2016

  27. [27]

    The mirage of action-dependent baselines in reinforcement learning

    George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th Interna- tional Conference on Machine Learning, 2018

  28. [28]

    Monte Carlo continual resolving for online strategy computation in imperfect information games

    Michal Šustr, V ojtˇech Kovaˇrík, and Viliam Lisý. Monte Carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019

  29. [29]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

  30. [30]

    Bayen, Sham M

    Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018

  31. [31]

    Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information

    Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, and Jun Zhu. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information. CoRR, abs/1810.04433, 2018. URL http: //arxiv.org/abs/1810.04433

  32. [32]

    Regret minimization in games with incomplete information

    Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2008. 10 Appendices A MCCFR with baseline-corrected values Pseudocode for MCCFR with baseline-corrected values is given in Algorithm 1. Quantities of the form σt(h,·) ...