Entropy-Regularized Stochastic Games

Mohamadreza Ahmadi; Takashi Tanaka; Ufuk Topcu; Yagiz Savas

arxiv: 1907.11543 · v2 · pith:IM7MSN3Knew · submitted 2019-07-26 · 🧮 math.OC

Entropy-Regularized Stochastic Games

Yagiz Savas , Mohamadreza Ahmadi , Takashi Tanaka , Ufuk Topcu This is my paper

Pith reviewed 2026-05-24 15:39 UTC · model grok-4.3

classification 🧮 math.OC

keywords entropy-regularized stochastic gamesvalue existenceMarkovian strategiesstationary strategiesconvex optimizationzero-sum gamesdiscounted stochastic games

0 comments

The pith

Entropy-regularized stochastic games have values attained by Markovian and stationary mixed strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces entropy regularization into two-player zero-sum stochastic games so each player maximizes its expected payoff together with the causal entropy of its own strategy. This models the case where players must act without perfect knowledge of the environment's transition probabilities. The central result establishes that a value exists for both finite N-stage games and infinite discounted games. Markovian mixed strategies are sufficient to attain the value in the N-stage case, while stationary mixed strategies suffice in the discounted case. The optimal strategies and values are recovered by solving convex optimization problems.

Core claim

In entropy-regularized stochastic games, a value exists in both the N-stage and discounted settings. Markovian mixed strategies attain this value for N-stage games, and stationary mixed strategies attain it for discounted games. These quantities are obtained by solving convex optimization problems derived from the regularized payoff.

What carries the argument

The entropy-regularized objective that adds the causal entropy of a player's strategy to its expected payoff, balancing payoff maximization against uncertainty in the transition model.

If this is right

The value and optimal strategies of both game classes can be computed by solving convex programs.
The strategy space reduces to Markovian policies for finite-horizon problems.
The strategy space reduces to stationary policies for discounted problems.
Regularization produces strategies that remain well-defined when transition probabilities differ from the assumed model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization may allow value existence proofs for games with more than two players.
Numerical comparisons could measure how much the regularized strategies improve performance when the true transitions deviate from the model used at design time.
The convex programs may admit efficient first-order solvers that scale to large state spaces.

Load-bearing premise

The entropy term accurately represents each player's belief about the degree of misinformation in the transition model.

What would settle it

An explicit stochastic game instance in which, after adding the entropy term, no pair of strategies achieves a common value or in which only non-Markovian strategies attain the value.

Figures

Figures reproduced from arXiv: 1907.11543 by Mohamadreza Ahmadi, Takashi Tanaka, Ufuk Topcu, Yagiz Savas.

read the original abstract

In two-player zero-sum stochastic games, where two competing players make decisions under uncertainty, a pair of optimal strategies is traditionally described by Nash equilibrium and computed under the assumption that the players have perfect information about the stochastic transition model of the environment. However, implementing such strategies may make the players vulnerable to unforeseen changes in the environment. In this paper, we introduce entropy-regularized stochastic games where each player aims to maximize the causal entropy of its strategy in addition to its expected payoff. The regularization term balances each player's rationality with its belief about the level of misinformation about the transition model. We consider both entropy-regularized $N$-stage and entropy-regularized discounted stochastic games, and establish the existence of a value in both games. Moreover, we prove the sufficiency of Markovian and stationary mixed strategies to attain the value, respectively, in $N$-stage and discounted games. Finally, we present algorithms, which are based on convex optimization problems, to compute the optimal strategies. In a numerical example, we demonstrate the proposed method on a motion planning scenario and illustrate the effect of the regularization term on the expected payoff.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds causal entropy regularization to zero-sum stochastic games and shows existence plus Markov/stationary strategy sufficiency carry over directly from the unregularized case.

read the letter

The core contribution is straightforward: once you add the causal entropy terms to the payoffs, the standard dynamic-programming argument for N-stage games and the contraction-mapping argument for discounted games still go through, so a value exists and the usual strategy classes remain sufficient. They also recast the saddle-point problems as convex programs, which is useful for computation. The motion-planning example illustrates that higher regularization produces smoother, less brittle policies, as one would expect. That part is clean and the math checks out on the terms given in the abstract and stress-test note. The interpretive framing about players hedging against model misinformation is just motivation and does not enter the proofs, so it does not affect soundness. The regularization weight remains a free parameter with no guidance on how to set it from data. The single numerical example is illustrative rather than a stress test of robustness. Readers already working on stochastic games or entropy-regularized control will find the existence and algorithm pieces immediately usable; outsiders will see it as a modest, self-contained extension. The work is coherent on its own terms and the results are reproducible in principle from the stated reformulations, so it is worth sending to referees who can check the full derivations and any edge cases in the convex programs.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces entropy-regularized two-player zero-sum stochastic games in which each player augments its expected payoff with a causal-entropy regularization term on its strategy. It claims to prove existence of a value for both the finite-horizon N-stage case and the infinite-horizon discounted case, to establish sufficiency of Markovian mixed strategies (N-stage) and stationary mixed strategies (discounted), to supply convex-optimization reformulations that yield algorithms for computing the regularized equilibria, and to illustrate the effect of the regularization parameter on a motion-planning example.

Significance. If the existence and strategy-class results hold, the work supplies a mathematically grounded regularization device that may confer robustness to transition uncertainty while preserving tractable computation via convex programs. The explicit convex reformulations and the numerical demonstration constitute concrete strengths; the approach could be of interest to researchers working on robust stochastic control and regularized game-theoretic RL.

minor comments (3)

§2 (or wherever the regularized payoff is first defined): the precise definition of the causal-entropy term for a general stochastic game (including the dependence on the transition kernel) should be stated explicitly before the existence theorems, to make the subsequent dynamic-programming or fixed-point arguments self-contained.
The numerical example (motion-planning scenario) reports qualitative effects of the regularization strength but does not include quantitative robustness metrics (e.g., performance under perturbed transition probabilities); adding such a comparison would strengthen the claimed motivation without altering the theoretical claims.
Notation: the symbol used for the regularization parameter should be introduced once and used consistently; several passages appear to switch between λ and another symbol without explicit redefinition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its potential significance for robust stochastic control and regularized RL, and recommendation of minor revision. We are pleased that the existence results, strategy sufficiency, convex reformulations, and numerical example were viewed as strengths.

Circularity Check

0 steps flagged

Derivation is self-contained; no circular steps identified

full rationale

The paper defines entropy-regularized stochastic games by augmenting standard payoffs with an explicit causal-entropy regularization term. It then invokes standard dynamic-programming recursion (N-stage case) and contraction/fixed-point arguments (discounted case) to establish value existence and the sufficiency of Markovian/stationary strategies. These steps operate directly on the newly defined regularized game and do not reduce any claimed result to a fitted parameter, a self-referential definition, or a load-bearing self-citation. The interpretive motivation about “misinformation” is framing only and is not used in the formal statements or proofs. The convex-optimization algorithms are derived from the same well-posed regularized formulation and serve as computational tools rather than circular predictions.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The central claims rest on standard existence results from stochastic game theory plus the modeling choice of the entropy term; no free parameters are explicitly fitted in the abstract, but the regularization strength acts as a tunable hyperparameter.

free parameters (1)

regularization strength
The weight on the entropy term is a modeling choice that balances payoff and entropy; its value is not derived but selected for the numerical example.

axioms (2)

domain assumption Existence of value and optimal strategies in standard (non-regularized) stochastic games
The paper extends classical results; the abstract invokes this background without re-deriving it.
domain assumption Convexity of the regularized objective allowing convex optimization reformulation
Stated as the basis for the algorithms.

invented entities (1)

entropy-regularized value no independent evidence
purpose: To incorporate both payoff and strategy entropy in the game value
New quantity defined by the regularization; no independent evidence provided beyond the definition.

pith-pipeline@v0.9.0 · 5727 in / 1406 out tokens · 21234 ms · 2026-05-24T15:39:31.088067+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

[1]

Stochastic games,

L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences , vol. 39, no. 10, pp. 1095–1100, 1953

work page 1953
[2]

Non-cooperative games,

J. Nash, “Non-cooperative games,” Annals of Mathematics , vol. 54, no. 2, pp. 286–295, 1951

work page 1951
[3]

J. K. Goeree, C. A. Holt, and T. R. Palfrey, Quantal Response Equilibrium: A Stochastic Theory of Games . Princeton University Press, 2016

work page 2016
[4]

Information theory and statistical mechanics,

E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 4, p. 620, 1957

work page 1957
[5]

The principle of maximum causal entropy for estimating interacting processes,

B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “The principle of maximum causal entropy for estimating interacting processes,” IEEE Transactions on Information Theory , vol. 59, no. 4, pp. 1966–1980, 2013

work page 1966
[6]

Modeling interaction via the principle of maximum causal entropy,

——, “Modeling interaction via the principle of maximum causal entropy,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010, pp. 1255–1262

work page 2010
[7]

Entropy Maximization for Markov Decision Processes Under Temporal Logic Constraints

Y . Savas, M. Ornik, M. Cubuktepe, and U. Topcu, “Entropy maximiza- tion for Markov decision processes under temporal logic constraints,” arXiv:1807.03223v2 [math.OC], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Balancing two- player stochastic games with soft Q-learning,

J. Grau-Moya, F. Leibfried, and H. Bou-Ammar, “Balancing two- player stochastic games with soft Q-learning,” in Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelli- gence, 2018

work page 2018
[9]

Stochastic games with sensing costs,

M. Ahmadi, S. Bharadwaj, T. Tanaka, and U. Topcu, “Stochastic games with sensing costs,” in 56th Annual Allerton Conference on Communication, Control, and Computing , 2018

work page 2018
[10]

Learning in games via rein- forcement and regularization,

P. Mertikopoulos and W. H. Sandholm, “Learning in games via rein- forcement and regularization,” Mathematics of Operations Research , vol. 41, no. 4, pp. 1297–1324, 2016

work page 2016
[11]

Individual Q-learning in normal form games,

D. S. Leslie and E. J. Collins, “Individual Q-learning in normal form games,” SIAM Journal on Control and Optimization , vol. 44, no. 2, pp. 495–514, 2005

work page 2005
[12]

What game are we playing? Differentiably learning games from incomplete observations,

C. K. Ling, J. Z. Kolter, and F. Fang, “What game are we playing? Differentiably learning games from incomplete observations,” in Pro- ceedings of Conference on Neural Information Processing Systems , 2017

work page 2017
[13]

Fudenberg and D

D. Fudenberg and D. Levine, The Theory of Learning in Games . The MIT Press, 1998

work page 1998
[14]

Quantal response equilibria for normal form games,

R. D. McKelvey and T. R. Palfrey, “Quantal response equilibria for normal form games,” Games and Economic Behavior , vol. 10, no. 1, pp. 6–38, 1995

work page 1995
[15]

Discounted robust stochastic games and an application to queueing control,

E. Kardes ¸, F. Ord´o˜nez, and R. W. Hall, “Discounted robust stochastic games and an application to queueing control,” Operations Research, vol. 59, no. 2, pp. 365–382, 2011

work page 2011
[16]

Robust game theory,

M. Aghassi and D. Bertsimas, “Robust game theory,” Mathematical Programming, vol. 107, no. 1-2, pp. 231–273, 2006

work page 2006
[17]

Reinforcement learning with deep energy-based policies,

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in Proceedings of the Thirty-fourth International Conference on Machine Learning , 2017

work page 2017
[18]

Taming the noise in reinforce- ment learning via soft updates,

R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforce- ment learning via soft updates,” in Proceedings of the Thirty-Second Conference on Uncertainty in Artiﬁcial Intelligence , 2016, pp. 202– 211

work page 2016
[19]

Efﬁcient computation of optimal actions,

E. Todorov, “Efﬁcient computation of optimal actions,” Proceedings of the National Academy of Sciences, vol. 106, no. 28, pp. 11 478–11 483, 2009

work page 2009
[20]

The asymptotic theory of stochastic games,

T. Bewley and E. Kohlberg, “The asymptotic theory of stochastic games,” Mathematics of Operations Research , vol. 1, no. 3, pp. 197– 208, 1976

work page 1976
[21]

Discounted stochastic games: The ﬁnite case,

S. Sorin, “Discounted stochastic games: The ﬁnite case,” in Stochastic Games and Applications . Springer, 2003, pp. 51–55

work page 2003
[22]

Directed information for channels with feedback,

G. Kramer, “Directed information for channels with feedback,” Ph.D. dissertation, ETH Zurich, 1998

work page 1998
[23]

Maximum causal entropy correlated equilibria for Markov games,

B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “Maximum causal entropy correlated equilibria for Markov games,” in International Conference on Autonomous Agents and Multiagent Systems , 2011, pp. 207–214

work page 2011
[24]

Zur theorie der gesellschaftsspiele,

J. v. Neumann, “Zur theorie der gesellschaftsspiele,” Mathematische Annalen, vol. 100, no. 1, pp. 295–320, 1928

work page 1928
[25]

R. T. Rockafellar and R. J.-B. Wets, Variational analysis. Springer Science & Business Media, 2009, vol. 317

work page 2009
[26]

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientiﬁc, 1996

work page 1996
[27]

Neyman and S

A. Neyman and S. Sorin, Stochastic games and applications. Springer Science & Business Media, 2003, vol. 570

work page 2003
[28]

M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[29]

ECOS: An SOCP solver for embedded systems,

A. Domahidi, E. Chu, and S. Boyd, “ECOS: An SOCP solver for embedded systems,” in European Control Conference (ECC) , 2013, pp. 3071–3076

work page 2013
[30]

CVXPY: A Python-embedded modeling language for convex optimization,

S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016

work page 2016
[31]

Computing proper equilibria of zero-sum games,

P. B. Miltersen and T. B. Sørensen, “Computing proper equilibria of zero-sum games,” in International Conference on Computers and Games, 2006, pp. 200–211

work page 2006
[32]

T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2006. APPENDIX A Proof of Proposition 1: We show the sufﬁciency of Markovian strategies only for the maximin problem. The proof for the minimax formulation follows the same lines with the arguments provided below. The proof is based on backward induction on the stage number 1...

work page 2006

[1] [1]

Stochastic games,

L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences , vol. 39, no. 10, pp. 1095–1100, 1953

work page 1953

[2] [2]

Non-cooperative games,

J. Nash, “Non-cooperative games,” Annals of Mathematics , vol. 54, no. 2, pp. 286–295, 1951

work page 1951

[3] [3]

J. K. Goeree, C. A. Holt, and T. R. Palfrey, Quantal Response Equilibrium: A Stochastic Theory of Games . Princeton University Press, 2016

work page 2016

[4] [4]

Information theory and statistical mechanics,

E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 4, p. 620, 1957

work page 1957

[5] [5]

The principle of maximum causal entropy for estimating interacting processes,

B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “The principle of maximum causal entropy for estimating interacting processes,” IEEE Transactions on Information Theory , vol. 59, no. 4, pp. 1966–1980, 2013

work page 1966

[6] [6]

Modeling interaction via the principle of maximum causal entropy,

——, “Modeling interaction via the principle of maximum causal entropy,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010, pp. 1255–1262

work page 2010

[7] [7]

Entropy Maximization for Markov Decision Processes Under Temporal Logic Constraints

Y . Savas, M. Ornik, M. Cubuktepe, and U. Topcu, “Entropy maximiza- tion for Markov decision processes under temporal logic constraints,” arXiv:1807.03223v2 [math.OC], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Balancing two- player stochastic games with soft Q-learning,

J. Grau-Moya, F. Leibfried, and H. Bou-Ammar, “Balancing two- player stochastic games with soft Q-learning,” in Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelli- gence, 2018

work page 2018

[9] [9]

Stochastic games with sensing costs,

M. Ahmadi, S. Bharadwaj, T. Tanaka, and U. Topcu, “Stochastic games with sensing costs,” in 56th Annual Allerton Conference on Communication, Control, and Computing , 2018

work page 2018

[10] [10]

Learning in games via rein- forcement and regularization,

P. Mertikopoulos and W. H. Sandholm, “Learning in games via rein- forcement and regularization,” Mathematics of Operations Research , vol. 41, no. 4, pp. 1297–1324, 2016

work page 2016

[11] [11]

Individual Q-learning in normal form games,

D. S. Leslie and E. J. Collins, “Individual Q-learning in normal form games,” SIAM Journal on Control and Optimization , vol. 44, no. 2, pp. 495–514, 2005

work page 2005

[12] [12]

What game are we playing? Differentiably learning games from incomplete observations,

C. K. Ling, J. Z. Kolter, and F. Fang, “What game are we playing? Differentiably learning games from incomplete observations,” in Pro- ceedings of Conference on Neural Information Processing Systems , 2017

work page 2017

[13] [13]

Fudenberg and D

D. Fudenberg and D. Levine, The Theory of Learning in Games . The MIT Press, 1998

work page 1998

[14] [14]

Quantal response equilibria for normal form games,

R. D. McKelvey and T. R. Palfrey, “Quantal response equilibria for normal form games,” Games and Economic Behavior , vol. 10, no. 1, pp. 6–38, 1995

work page 1995

[15] [15]

Discounted robust stochastic games and an application to queueing control,

E. Kardes ¸, F. Ord´o˜nez, and R. W. Hall, “Discounted robust stochastic games and an application to queueing control,” Operations Research, vol. 59, no. 2, pp. 365–382, 2011

work page 2011

[16] [16]

Robust game theory,

M. Aghassi and D. Bertsimas, “Robust game theory,” Mathematical Programming, vol. 107, no. 1-2, pp. 231–273, 2006

work page 2006

[17] [17]

Reinforcement learning with deep energy-based policies,

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in Proceedings of the Thirty-fourth International Conference on Machine Learning , 2017

work page 2017

[18] [18]

Taming the noise in reinforce- ment learning via soft updates,

R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforce- ment learning via soft updates,” in Proceedings of the Thirty-Second Conference on Uncertainty in Artiﬁcial Intelligence , 2016, pp. 202– 211

work page 2016

[19] [19]

Efﬁcient computation of optimal actions,

E. Todorov, “Efﬁcient computation of optimal actions,” Proceedings of the National Academy of Sciences, vol. 106, no. 28, pp. 11 478–11 483, 2009

work page 2009

[20] [20]

The asymptotic theory of stochastic games,

T. Bewley and E. Kohlberg, “The asymptotic theory of stochastic games,” Mathematics of Operations Research , vol. 1, no. 3, pp. 197– 208, 1976

work page 1976

[21] [21]

Discounted stochastic games: The ﬁnite case,

S. Sorin, “Discounted stochastic games: The ﬁnite case,” in Stochastic Games and Applications . Springer, 2003, pp. 51–55

work page 2003

[22] [22]

Directed information for channels with feedback,

G. Kramer, “Directed information for channels with feedback,” Ph.D. dissertation, ETH Zurich, 1998

work page 1998

[23] [23]

Maximum causal entropy correlated equilibria for Markov games,

B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “Maximum causal entropy correlated equilibria for Markov games,” in International Conference on Autonomous Agents and Multiagent Systems , 2011, pp. 207–214

work page 2011

[24] [24]

Zur theorie der gesellschaftsspiele,

J. v. Neumann, “Zur theorie der gesellschaftsspiele,” Mathematische Annalen, vol. 100, no. 1, pp. 295–320, 1928

work page 1928

[25] [25]

R. T. Rockafellar and R. J.-B. Wets, Variational analysis. Springer Science & Business Media, 2009, vol. 317

work page 2009

[26] [26]

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientiﬁc, 1996

work page 1996

[27] [27]

Neyman and S

A. Neyman and S. Sorin, Stochastic games and applications. Springer Science & Business Media, 2003, vol. 570

work page 2003

[28] [28]

M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014

[29] [29]

ECOS: An SOCP solver for embedded systems,

A. Domahidi, E. Chu, and S. Boyd, “ECOS: An SOCP solver for embedded systems,” in European Control Conference (ECC) , 2013, pp. 3071–3076

work page 2013

[30] [30]

CVXPY: A Python-embedded modeling language for convex optimization,

S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016

work page 2016

[31] [31]

Computing proper equilibria of zero-sum games,

P. B. Miltersen and T. B. Sørensen, “Computing proper equilibria of zero-sum games,” in International Conference on Computers and Games, 2006, pp. 200–211

work page 2006

[32] [32]

T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2006. APPENDIX A Proof of Proposition 1: We show the sufﬁciency of Markovian strategies only for the maximin problem. The proof for the minimax formulation follows the same lines with the arguments provided below. The proof is based on backward induction on the stage number 1...

work page 2006