pith. machine review for the scientific record. sign in

arxiv: 1008.0530 · v1 · submitted 2010-08-03 · 💻 cs.GT

Recognition: unknown

Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor

Authors on Pith no claims yet
classification 💻 cs.GT
keywords algorithmgammaiterationfracdiscountfactorgameshoward
0
0 comments X
read the original abstract

Ye showed recently that the simplex method with Dantzig pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most $O(\frac{mn}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the number of states, $m$ is the total number of actions in the MDP, and $0<\gamma<1$ is the discount factor. We improve Ye's analysis in two respects. First, we improve the bound given by Ye and show that Howard's policy iteration algorithm actually terminates after at most $O(\frac{m}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard's policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long standing open problem.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Planning in entropy-regularized Markov decision processes and games

    cs.LG 2026-04 unverdicted novelty 7.0

    SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.