Low-Variance and Zero-Variance Baselines for Extensive-Form Games
Pith reviewed 2026-05-24 17:17 UTC · model grok-4.3
The pith
Predictive baselines enable zero-variance value estimates along sampled trajectories in extensive-form games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining a framework of baseline-corrected values in extensive-form games that generalizes earlier variance-reduction techniques, the authors show that new baseline functions produce substantially lower variance than prior methods, and that the predictive baseline is provably optimal for certain sampling schemes, allowing efficient computation of zero-variance value estimates even along sampled trajectories.
What carries the argument
The framework of baseline-corrected values, in which a chosen baseline function is subtracted from sampled payoffs to produce corrected estimates whose expectation remains unbiased.
If this is right
- Monte Carlo Counterfactual Regret Minimization and similar sampling methods require fewer iterations to reach a given accuracy.
- Value estimates for nodes reached by sampling can be treated as exact rather than noisy.
- The same baseline framework can be applied to any sampling-based algorithm that estimates values along trajectories in extensive-form games.
- Full tree traversal becomes unnecessary for obtaining precise values when the predictive baseline is used with compatible sampling.
Where Pith is reading between the lines
- The same corrective idea might be adapted to other imperfect-information settings such as partially observable Markov decision processes.
- If the sampling distribution itself can be learned, the predictive baseline could be recomputed on the fly to maintain zero variance.
- Combining the baseline with learned function approximation might further reduce the number of samples needed in very large games.
Load-bearing premise
The optimality proofs and zero-variance guarantee hold only for the specific sampling schemes and the particular way baselines are integrated into value estimation.
What would settle it
Implement the predictive baseline inside Monte Carlo CFR on a small, fully enumerated extensive-form game, draw many trajectories under the stated sampling scheme, and check whether the empirical variance of the corrected value estimates is numerically zero.
Figures
read the original abstract
Extensive-form games (EFGs) are a common model of multi-agent interactions with imperfect information. State-of-the-art algorithms for solving these games typically perform full walks of the game tree that can prove prohibitively slow in large games. Alternatively, sampling-based methods such as Monte Carlo Counterfactual Regret Minimization walk one or more trajectories through the tree, touching only a fraction of the nodes on each iteration, at the expense of requiring more iterations to converge due to the variance of sampled values. In this paper, we extend recent work that uses baseline estimates to reduce this variance. We introduce a framework of baseline-corrected values in EFGs that generalizes the previous work. Within our framework, we propose new baseline functions that result in significantly reduced variance compared to existing techniques. We show that one particular choice of such a function --- predictive baseline --- is provably optimal under certain sampling schemes. This allows for efficient computation of zero-variance value estimates even along sampled trajectories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends baseline techniques for variance reduction in sampling-based solvers for extensive-form games (EFGs). It introduces a general framework for baseline-corrected values, proposes new baseline functions that reduce variance relative to prior methods, and identifies a predictive baseline that is provably optimal (zero variance) under certain sampling schemes, enabling efficient zero-variance estimates along sampled trajectories.
Significance. If the optimality result holds under realizable sampling distributions, the work would meaningfully advance Monte Carlo CFR variants by supplying both practical variance reduction and a theoretical guarantee of zero variance without full tree traversal. The generalization of earlier baseline work and the focus on EFG-specific information sets are positive features.
major comments (1)
- [Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.
minor comments (1)
- [Abstract] The abstract and introduction should explicitly list the sampling schemes (e.g., outcome sampling, external sampling) under which the predictive baseline is optimal, rather than referring only to “certain sampling schemes.”
Simulated Author's Rebuttal
We thank the referee for their careful reading and the opportunity to clarify the scope of our optimality result. The concern about the conditional expectation and factorization is addressed below; we maintain that the result holds under the sampling schemes explicitly analyzed in the paper.
read point-by-point responses
-
Referee: [Section 4 and optimality proof] Section 4 and the optimality proof (presumably Theorem 1): the claim that the predictive baseline yields zero-variance estimates along sampled trajectories requires that the baseline equals the conditional expectation of the sampled value given the information at the decision point. For standard outcome-sampling or external-sampling MCCFR, this conditional expectation depends on the unknown strategy profile and reach probabilities; the factorization needed to compute the baseline from prior samples without traversing the subtree is not shown to hold in general EFGs and may fail when information partitions induce cycles or when the sampling distribution does not decouple from the values being estimated.
Authors: The predictive baseline is defined exactly as the conditional expectation E[V | I] at each information set I under the chosen sampling distribution. Theorem 1 proves zero variance precisely when this baseline can be realized from prior samples; the proof relies on the specific structure of the sampling schemes considered (outcome sampling and external sampling with the baseline computed via the product of reach probabilities along the sampled path). In these schemes the reach probabilities factor such that the conditional expectation is estimable without subtree traversal or knowledge of the full opposing strategy, because the sampling distribution is independent of the value function in the manner required by the martingale property used in the proof. We do not claim the result for arbitrary sampling distributions or for information-set structures that would induce non-factorizable dependencies; the paper restricts the claim to the schemes where the factorization is shown to hold. If the referee can exhibit a concrete counterexample within outcome or external sampling where the factorization fails, we would be happy to address it, but our analysis covers the standard MCCFR variants referenced in the manuscript. revision: no
Circularity Check
No significant circularity; derivations and optimality proofs are independent of inputs
full rationale
The paper introduces a new framework for baseline-corrected values in EFGs that generalizes prior work, then derives new baseline functions including the predictive baseline. The claim of provable optimality under certain sampling schemes is presented as following from the framework's definitions and proofs (e.g., the zero-variance result along sampled trajectories). No quoted equations or steps reduce the result to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain. The sampling-scheme assumptions are stated explicitly as conditions for the result rather than smuggled in. This is the common case of a self-contained technical derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard mathematical properties of extensive-form games and counterfactual regret minimization hold.
Reference graph
Works this paper leans on
-
[1]
Natural actor–critic algorithms
Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009
work page 2009
-
[2]
Heads-up limit hold’em poker is solved
Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, January 2015
work page 2015
-
[3]
Safe and nested subgame solving for imperfect-information games
Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30, 2017
work page 2017
-
[4]
Superhuman AI for heads-up no-limit poker: Libratus beats top professionals
Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018
work page 2018
-
[5]
Depth-limited solving for imperfect-information games
Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems 31, 2018
work page 2018
-
[6]
Solving imperfect information games using decomposition
Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014
work page 2014
-
[7]
AIV AT: A new variance reduction technique for agent evaluation in imperfect information games
Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morrill, and Michael Bowling. AIV AT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018
work page 2018
-
[8]
Computer Poker Research Group, University of Alberta and Oskari Tammelin. CFR+, 2014. URL http://webdocs.cs.ualberta.ca/~games/poker/cfr_plus.html. [Online; accessed 23-May-2019]
work page 2014
-
[9]
Efficient Monte Carlo counterfactual regret minimization in games with many player actions
Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron. Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In Proceedings of the Twenty-Sixth Conference on Advances in Neural Information Processing Systems, 2012
work page 2012
-
[10]
Generalized sampling and variance in counterfactual regret minimization
Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012
work page 2012
-
[11]
Variance reduction techniques for gradient estimates in reinforcement learning
Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004
work page 2004
-
[12]
Sune K. Jakobsen, Troels B. Sørensen, and Vincent Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016
work page 2016
-
[13]
Doubly robust off-policy value evaluation for reinforcement learning
Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016
work page 2016
-
[14]
Measuring the size of large no-limit poker games
Michael Johanson. Measuring the size of large no-limit poker games. Technical Report TR13-01, Department of Computing Science, University of Alberta, 2013
work page 2013
-
[15]
Accelerating best response calculation in large extensive games
Michael Johanson, Kevin Waugh, Michael Bowling, , and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011
work page 2011
-
[16]
Faster first-order methods for extensive-form game solving
Christian Kroer, Kevin Waugh, Fatma Kilinç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 2015
work page 2015
-
[17]
Monte Carlo sampling for regret minimization in extensive games
Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, 2009
work page 2009
-
[18]
Action-dependent control variates for policy optimization via Stein identity
Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. InInternational Conference on Learning Representations, 2018
work page 2018
-
[19]
Matej Moravˇcík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356 6337:508–513, 2017
work page 2017
-
[20]
Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory . The MIT Press, 1994. ISBN 0-262-65040-1. 9
work page 1994
-
[21]
Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, and Michael Bowling. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the The Thirty-Third AAAI Conference on Artificial Intelligence, 2019
work page 2019
-
[22]
High-dimensional continuous control using generalized advantage estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016
work page 2016
-
[23]
Bayes’ bluff: Opponent modelling in poker
Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artficial Intelligence, 2005
work page 2005
-
[24]
Actor-critic policy optimization in partially observable multiagent environments
Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems (NeurIPS), 2018
work page 2018
-
[25]
Solving heads-up limit Texas hold’em
Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015
work page 2015
-
[26]
Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, 2016
work page 2016
-
[27]
The mirage of action-dependent baselines in reinforcement learning
George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th Interna- tional Conference on Machine Learning, 2018
work page 2018
-
[28]
Monte Carlo continual resolving for online strategy computation in imperfect information games
Michal Šustr, V ojtˇech Kovaˇrík, and Viliam Lisý. Monte Carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019
work page 2019
-
[29]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992
work page 1992
-
[30]
Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018
work page 2018
-
[31]
Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information
Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, and Jun Zhu. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information. CoRR, abs/1810.04433, 2018. URL http: //arxiv.org/abs/1810.04433
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[32]
Regret minimization in games with incomplete information
Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2008. 10 Appendices A MCCFR with baseline-corrected values Pseudocode for MCCFR with baseline-corrected values is given in Algorithm 1. Quantities of the form σt(h,·) ...
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.