Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

Donghwan Lee; Narim Jeong

arxiv: 2604.04394 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.SY· eess.SY

Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

Narim Jeong , Donghwan Lee This is my paper

Pith reviewed 2026-05-10 19:45 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SY

keywords finite-time convergenceQ-value iterationStackelberg gamesgeneral-sum Markov gamesmulti-agent reinforcement learningswitching systemscomparison systemscontrol-theoretic analysis

0 comments

The pith

Stackelberg Q-value iteration converges with explicit finite-time error bounds in general-sum Markov games under a relaxed policy condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that Q-value iteration reaches the Stackelberg equilibrium values in finite time for two-player general-sum Markov games. It does so by defining a relaxed policy condition that fits the leader-follower structure and recasting the updates as a switching dynamical system. Upper and lower comparison systems then sandwich the Q-function errors, producing explicit bounds that shrink to zero. A reader cares because this supplies the first time-bounded guarantees for multi-agent RL in competitive, non-zero-sum settings where only asymptotic results existed before.

Core claim

Under the relaxed policy condition, the Q-value iteration dynamics in general-sum Stackelberg Markov games form a switching system that can be bounded above and below by comparison systems, which deliver finite-time error bounds on the Q-functions and characterize their convergence to the equilibrium.

What carries the argument

The switching system model together with its upper and lower comparison systems, which sandwich the iteration errors and yield explicit finite-time bounds.

If this is right

The Q-functions converge to their Stackelberg equilibrium values within a finite, explicitly bounded number of iterations.
Explicit error bounds become available for the first time in general-sum Stackelberg settings.
The control-theoretic comparison-system technique supplies a template for analyzing other iterative learners in leader-follower games.
Convergence properties are now characterized by concrete rates rather than only asymptotic statements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same comparison-system approach could be adapted to obtain finite-time results for policy iteration or actor-critic methods in Stackelberg games.
The error bounds could be turned into practical stopping rules or sample-complexity estimates when deploying Stackelberg RL in applications.
This work suggests that dynamical-systems tools may yield stability and rate results for wider classes of multi-agent RL beyond pure Stackelberg interaction.

Load-bearing premise

The relaxed policy condition holds and the Q-updates can be accurately represented as a switching system possessing valid upper and lower comparison systems.

What would settle it

A concrete two-state Stackelberg Markov game in which the observed Q-function error exceeds the derived upper comparison bound after the number of steps predicted by the analysis.

Figures

Figures reproduced from arXiv: 2604.04394 by Donghwan Lee, Narim Jeong.

**Figure 3.** Figure 3: Illustration of Q-function error of the follower and [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

read the original abstract

Reinforcement learning has been successful both empirically and theoretically in single-agent settings, but extending these results to multi-agent reinforcement learning in general-sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds for the Q-functions and characterize their convergence properties. Our results provide a novel control-theoretic perspective on Stackelberg learning. Moreover, to the best of the authors' knowledge, this paper offers the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first finite-time bounds for Q-value iteration in general-sum Stackelberg games by modeling the dynamics as a switching system with comparison bounds.

read the letter

The main point is that this work supplies the first explicit finite-time error bounds and convergence characterization for Q-value iteration in two-player general-sum Markov games under Stackelberg leader-follower interactions. It does so by defining a relaxed policy condition that fits the best-response structure and then recasting the iteration as a switched system whose trajectories are bounded above and below by constructed comparison systems. That control-theoretic move extends the single-agent and zero-sum finite-time results in a direct way and gives a clean way to track how the Q-functions behave when the follower switches responses. The abstract and the approach look technically coherent on that front, and the bounds appear to avoid overly strong contraction assumptions by leveraging the comparison lemma instead. If the proofs close the details, this is a useful addition for analyzing convergence rates in hierarchical multi-agent settings. The soft spot sits where the stress test points: the relaxed policy condition must actually be inherited by the iterates, and the comparison inequalities must hold at every switch, including when the follower's best response is discontinuous. The paper needs to show that these do not fail on positive-measure sets of states; without an a-priori argument or explicit verification in the proofs, the finite-time claims rest on that step. Minor gaps in tightness or example validation would be normal at this stage, but a load-bearing hole there would shrink the result. This is for theorists working on multi-agent RL with asymmetric roles or commitment, especially those who already use switched-system or comparison techniques. A reader who wants tools for designing algorithms with provable rates in Stackelberg games will find value. I would send it to peer review. The novelty in the setting and the method is clear enough that referees should check the technical steps and see whether the comparison argument holds without extra hidden assumptions.

Referee Report

1 major / 1 minor

Summary. The paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. It introduces a relaxed policy condition tailored to the Stackelberg setting, models the Q-value iteration dynamics as a switching system, constructs upper and lower comparison systems, and derives finite-time error bounds for the Q-functions while characterizing their convergence. The authors claim this yields the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.

Significance. If the bounds are valid, this work is significant for providing the first explicit finite-time analysis of Q-value iteration in general-sum Stackelberg games, an area where single-agent RL techniques do not directly extend. The control-theoretic framing via switching systems and comparison systems offers a structured way to obtain non-asymptotic bounds and could generalize to other multi-agent settings. The approach is technically distinctive and addresses a clear gap in the multi-agent RL literature.

major comments (1)

The finite-time error bounds rest on two load-bearing assumptions: (1) that the introduced relaxed policy condition is satisfied by the policies generated during iteration, and (2) that the explicitly constructed upper and lower comparison systems dominate the switched vector field in every mode. The manuscript does not supply an a-priori argument showing that the relaxed condition is automatically inherited from the Stackelberg best-response structure, nor does it verify that the comparison inequalities continue to hold when the follower’s best response changes discontinuously. If either fails on a positive-measure set of states, the derived bounds are invalid. This issue must be resolved with a concrete verification argument before the central claim can be accepted.

minor comments (1)

The abstract states the novelty claim with the qualifier 'to the best of the authors' knowledge'; this should be supported by a concise but explicit literature comparison in the introduction or related-work section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below and will revise the manuscript to incorporate a more explicit verification of the key assumptions.

read point-by-point responses

Referee: The finite-time error bounds rest on two load-bearing assumptions: (1) that the introduced relaxed policy condition is satisfied by the policies generated during iteration, and (2) that the explicitly constructed upper and lower comparison systems dominate the switched vector field in every mode. The manuscript does not supply an a-priori argument showing that the relaxed condition is automatically inherited from the Stackelberg best-response structure, nor does it verify that the comparison inequalities continue to hold when the follower’s best response changes discontinuously. If either fails on a positive-measure set of states, the derived bounds are invalid. This issue must be resolved with a concrete verification argument before the central claim can be accepted.

Authors: We agree that the manuscript would benefit from a more explicit, self-contained verification of these two assumptions to ensure the bounds are rigorously justified. In the revised version we will add a dedicated subsection that (i) proves the relaxed policy condition is inherited by construction from the Stackelberg best-response operator applied to the current Q-estimates, and (ii) shows that the comparison-system inequalities continue to hold across discontinuities of the follower’s best-response map. The argument for (ii) proceeds by establishing that the set of states at which the best response is discontinuous has measure zero under standard continuity assumptions on the reward functions, combined with a uniform bound on the jump size of the switched vector field. These additions will be placed immediately before the statement of the main finite-time theorem so that the logical chain is fully transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard comparison systems to explicitly modeled dynamics

full rationale

The paper defines a relaxed policy condition for the Stackelberg setting, represents Q-value iteration dynamics as a switching system, and constructs upper/lower comparison systems to obtain finite-time bounds. These steps are presented as independent modeling and bounding arguments drawn from control theory rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equation or claim reduces to its own input by construction; the result remains falsifiable against the validity of the modeling assumptions and comparison inequalities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on a new relaxed policy condition and switching-system modeling whose validity is assumed for the bounds to hold.

axioms (1)

domain assumption Relaxed policy condition tailored to the Stackelberg setting
Introduced to enable the switching-system model and comparison bounds.

pith-pipeline@v0.9.0 · 5431 in / 1032 out tokens · 60789 ms · 2026-05-10T19:45:03.987913+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds...
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

relaxed policy condition tailored to the Stackelberg setting

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Q-learning,

C. J. Watkins and P. Dayan, “Q-learning,”Machine learning, vol. 8, no. 3, pp. 279–292, 1992

work page 1992
[2]

A theoretical analysis of deep q-learning,

J. Fan, Z. Wang, Y . Xie, and Z. Yang, “A theoretical analysis of deep q-learning,” inLearning for dynamics and control. PMLR, 2020, pp. 486–489

work page 2020
[3]

Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,

G. Li, Y . Wei, Y . Chi, Y . Gu, and Y . Chen, “Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” arXiv preprint arXiv:2006.03041, 2020

work page arXiv 2006
[4]

arXiv preprint arXiv:2102.01567 , year=

Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, “A Lyapunov theory for finite-sample guarantees of asynchronous Q- learning and TD-learning variants,”arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021
[5]

Periodic Q-learning,

D. Lee and N. He, “Periodic Q-learning,” inLearning for dynamics and control, 2020, pp. 582–598

work page 2020
[6]

Multi-agent reinforcement learning: Independent vs. cooperative agents,

M. Tanet al., “Multi-agent reinforcement learning: Independent vs. cooperative agents,” inProceedings of the tenth international confer- ence on machine learning, 1993, pp. 330–337

work page 1993
[7]

Nash q-learning for general-sum stochastic games,

J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,”Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003

work page 2003
[8]

Markov games as a framework for multi-agent rein- forcement learning,

M. L. Littman, “Markov games as a framework for multi-agent rein- forcement learning,” inMachine learning proceedings 1994. Elsevier, 1994, pp. 157–163

work page 1994
[9]

Correlated q-learning,

A. Greenwald, K. Hall, R. Serranoet al., “Correlated q-learning,” in ICML, vol. 3, 2003, pp. 242–249

work page 2003
[10]

Value-function reinforcement learning in markov games,

M. L. Littman, “Value-function reinforcement learning in markov games,”Cognitive systems research, vol. 2, no. 1, pp. 55–66, 2001

work page 2001
[11]

Finite-sample guarantees for nash q-learning with linear function approximation,

P. Cisneros-Velarde and S. Koyejo, “Finite-sample guarantees for nash q-learning with linear function approximation,” inUncertainty in Artificial Intelligence. PMLR, 2023, pp. 424–432

work page 2023
[12]

arXiv preprint arXiv:2110.04184 , year=

Z. Song, S. Mei, and Y . Bai, “When can we learn general-sum markov games with a large number of players sample-efficiently?”arXiv preprint arXiv:2110.04184, 2021

work page arXiv 2021
[13]

Regret minimization and convergence to equilibria in general-sum markov games,

L. Erez, T. Lancewicki, U. Sherman, T. Koren, and Y . Mansour, “Regret minimization and convergence to equilibria in general-sum markov games,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 9343–9373

work page 2023
[14]

Cyclic equilibria in markov games,

M. Zinkevich, A. Greenwald, and M. Littman, “Cyclic equilibria in markov games,”Advances in neural information processing systems, vol. 18, 2005

work page 2005
[15]

On the convergence of policy gradient methods to nash equilibria in general stochastic games,

A. Giannou, K. Lotidis, P. Mertikopoulos, and E.-V . Vlatakis- Gkaragkounis, “On the convergence of policy gradient methods to nash equilibria in general stochastic games,”Advances in Neural Information Processing Systems, vol. 35, pp. 7128–7141, 2022

work page 2022
[16]

Planning for autonomous cars that leverage effects on human actions

D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for autonomous cars that leverage effects on human actions.” inRobotics: Science and systems, vol. 2. Ann Arbor, MI, USA, 2016, pp. 1–9

work page 2016
[17]

Game- theoretic planning for self-driving cars in multivehicle competitive scenarios,

M. Wang, Z. Wang, J. Talbot, J. C. Gerdes, and M. Schwager, “Game- theoretic planning for self-driving cars in multivehicle competitive scenarios,”IEEE Transactions on Robotics, vol. 37, no. 4, pp. 1313– 1325, 2021

work page 2021
[18]

Krishna,Auction theory

V . Krishna,Auction theory. Academic press, 2009

work page 2009
[19]

Stackelberg security games: Looking beyond a decade of success

A. Sinha, F. Fang, B. An, C. Kiekintveld, and M. Tambe, “Stackelberg security games: Looking beyond a decade of success.” IJCAI, 2018

work page 2018
[20]

Bas ¸ar and G

T. Bas ¸ar and G. J. Olsder,Dynamic noncooperative game theory. SIAM, 1998

work page 1998
[21]

Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study,

T. Fiez, B. Chasnov, and L. Ratliff, “Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study,” inInternational conference on machine learning. PMLR, 2020, pp. 3133–3144

work page 2020
[22]

arXiv preprint arXiv:1906.01217 , year=

T. Fiez, B. Chasnov, and L. J. Ratliff, “Convergence of learning dynamics in stackelberg games,”arXiv preprint arXiv:1906.01217, 2019

work page arXiv 1906
[23]

Stackelberg policy gradient: Evaluating the performance of leaders and followers,

Q.-L. Vu, Z. Alumbaugh, R. Ching, Q. Ding, A. Mahajan, B. Chasnov, S. Burden, and L. J. Ratliff, “Stackelberg policy gradient: Evaluating the performance of leaders and followers,” inICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022

work page 2022
[24]

A survey on bilevel optimization under uncertainty,

Y . Beck, I. Ljubi ´c, and M. Schmidt, “A survey on bilevel optimization under uncertainty,”European Journal of Operational Research, vol. 311, no. 2, pp. 401–426, 2023

work page 2023
[25]

Stack- elberg actor-critic: Game-theoretic reinforcement learning algorithms,

L. Zheng, T. Fiez, Z. Alumbaugh, B. Chasnov, and L. J. Ratliff, “Stack- elberg actor-critic: Game-theoretic reinforcement learning algorithms,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 9217–9224

work page 2022
[26]

Learning in stackelberg markov games,

J. He, A. L. Liu, and Y . Chen, “Learning in stackelberg markov games,”arXiv preprint arXiv:2509.16296, 2025

work page arXiv 2025
[27]

Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers?

H. Zhong, Z. Yang, Z. Wang, and M. I. Jordan, “Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers?”arXiv preprint arXiv:2112.13521, 2021

work page arXiv 2021
[28]

Learning correlated stackelberg equi- librium in general-sum multi-leader-single-follower games,

Y . Yu, H. Xu, and H. Chen, “Learning correlated stackelberg equi- librium in general-sum multi-leader-single-follower games,”arXiv preprint arXiv:2210.12470, 2022

work page arXiv 2022
[29]

Finding a multiple follower stackelberg equilibrium: A fully first-order method,

A. Niu, K. Wang, and J. Ziani, “Finding a multiple follower stackelberg equilibrium: A fully first-order method,”arXiv preprint arXiv:2509.08161, 2025

work page arXiv 2025
[30]

Liberzon,Switching in systems and control

D. Liberzon,Switching in systems and control. Springer, 2003, vol. 190

work page 2003
[31]

Convergence problems of general-sum multiagent rein- forcement learning,

M. Bowling, “Convergence problems of general-sum multiagent rein- forcement learning,” inICML, 2000, pp. 89–94

work page 2000
[32]

Sample-efficient learning of stackelberg equilibria in general-sum games,

Y . Bai, C. Jin, H. Wang, and C. Xiong, “Sample-efficient learning of stackelberg equilibria in general-sum games,”Advances in Neural Information Processing Systems, vol. 34, pp. 25 799–25 811, 2021

work page 2021
[33]

A unified switching system perspective and convergence analysis of Q-learning algorithms,

D. Lee and N. He, “A unified switching system perspective and convergence analysis of Q-learning algorithms,” in34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020
[34]

A discrete-time switching system analysis of Q-learning,

D. Lee, J. Hu, and N. He, “A discrete-time switching system analysis of Q-learning,”SIAM Journal on Control and Optimization (accepted), 2022

work page 2022
[35]

Finite-time error analysis of soft q-learning: Switching system approach,

N. Jeong and D. Lee, “Finite-time error analysis of soft q-learning: Switching system approach,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 3897–3904

work page 2024
[36]

Finite-time analysis of minimax q-learning,

——, “Finite-time analysis of minimax q-learning,” inReinforcement Learning Conference, 2025

work page 2025
[37]

Stochastic games,

L. S. Shapley, “Stochastic games,”Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953

work page 1953
[38]

A multi-agent q-learning with value function approximation based on single-leader multi-followers stack- elberg game,

C. Zhu, W. Yu, and H. Wang, “A multi-agent q-learning with value function approximation based on single-leader multi-followers stack- elberg game,” in2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 2023, pp. 1229–1234

work page 2023
[39]

Nonlinear systems third edition (2002),

H. K. Khalil, “Nonlinear systems third edition (2002),” 2002

work page 2002
[40]

Boundedness of iterates in Q-learning,

A. Gosavi, “Boundedness of iterates in Q-learning,”Systems & Control letters, vol. 55, no. 4, pp. 347–349, 2006

work page 2006

[1] [1]

Q-learning,

C. J. Watkins and P. Dayan, “Q-learning,”Machine learning, vol. 8, no. 3, pp. 279–292, 1992

work page 1992

[2] [2]

A theoretical analysis of deep q-learning,

J. Fan, Z. Wang, Y . Xie, and Z. Yang, “A theoretical analysis of deep q-learning,” inLearning for dynamics and control. PMLR, 2020, pp. 486–489

work page 2020

[3] [3]

Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,

G. Li, Y . Wei, Y . Chi, Y . Gu, and Y . Chen, “Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction,” arXiv preprint arXiv:2006.03041, 2020

work page arXiv 2006

[4] [4]

arXiv preprint arXiv:2102.01567 , year=

Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, “A Lyapunov theory for finite-sample guarantees of asynchronous Q- learning and TD-learning variants,”arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021

[5] [5]

Periodic Q-learning,

D. Lee and N. He, “Periodic Q-learning,” inLearning for dynamics and control, 2020, pp. 582–598

work page 2020

[6] [6]

Multi-agent reinforcement learning: Independent vs. cooperative agents,

M. Tanet al., “Multi-agent reinforcement learning: Independent vs. cooperative agents,” inProceedings of the tenth international confer- ence on machine learning, 1993, pp. 330–337

work page 1993

[7] [7]

Nash q-learning for general-sum stochastic games,

J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,”Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003

work page 2003

[8] [8]

Markov games as a framework for multi-agent rein- forcement learning,

M. L. Littman, “Markov games as a framework for multi-agent rein- forcement learning,” inMachine learning proceedings 1994. Elsevier, 1994, pp. 157–163

work page 1994

[9] [9]

Correlated q-learning,

A. Greenwald, K. Hall, R. Serranoet al., “Correlated q-learning,” in ICML, vol. 3, 2003, pp. 242–249

work page 2003

[10] [10]

Value-function reinforcement learning in markov games,

M. L. Littman, “Value-function reinforcement learning in markov games,”Cognitive systems research, vol. 2, no. 1, pp. 55–66, 2001

work page 2001

[11] [11]

Finite-sample guarantees for nash q-learning with linear function approximation,

P. Cisneros-Velarde and S. Koyejo, “Finite-sample guarantees for nash q-learning with linear function approximation,” inUncertainty in Artificial Intelligence. PMLR, 2023, pp. 424–432

work page 2023

[12] [12]

arXiv preprint arXiv:2110.04184 , year=

Z. Song, S. Mei, and Y . Bai, “When can we learn general-sum markov games with a large number of players sample-efficiently?”arXiv preprint arXiv:2110.04184, 2021

work page arXiv 2021

[13] [13]

Regret minimization and convergence to equilibria in general-sum markov games,

L. Erez, T. Lancewicki, U. Sherman, T. Koren, and Y . Mansour, “Regret minimization and convergence to equilibria in general-sum markov games,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 9343–9373

work page 2023

[14] [14]

Cyclic equilibria in markov games,

M. Zinkevich, A. Greenwald, and M. Littman, “Cyclic equilibria in markov games,”Advances in neural information processing systems, vol. 18, 2005

work page 2005

[15] [15]

On the convergence of policy gradient methods to nash equilibria in general stochastic games,

A. Giannou, K. Lotidis, P. Mertikopoulos, and E.-V . Vlatakis- Gkaragkounis, “On the convergence of policy gradient methods to nash equilibria in general stochastic games,”Advances in Neural Information Processing Systems, vol. 35, pp. 7128–7141, 2022

work page 2022

[16] [16]

Planning for autonomous cars that leverage effects on human actions

D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for autonomous cars that leverage effects on human actions.” inRobotics: Science and systems, vol. 2. Ann Arbor, MI, USA, 2016, pp. 1–9

work page 2016

[17] [17]

Game- theoretic planning for self-driving cars in multivehicle competitive scenarios,

M. Wang, Z. Wang, J. Talbot, J. C. Gerdes, and M. Schwager, “Game- theoretic planning for self-driving cars in multivehicle competitive scenarios,”IEEE Transactions on Robotics, vol. 37, no. 4, pp. 1313– 1325, 2021

work page 2021

[18] [18]

Krishna,Auction theory

V . Krishna,Auction theory. Academic press, 2009

work page 2009

[19] [19]

Stackelberg security games: Looking beyond a decade of success

A. Sinha, F. Fang, B. An, C. Kiekintveld, and M. Tambe, “Stackelberg security games: Looking beyond a decade of success.” IJCAI, 2018

work page 2018

[20] [20]

Bas ¸ar and G

T. Bas ¸ar and G. J. Olsder,Dynamic noncooperative game theory. SIAM, 1998

work page 1998

[21] [21]

Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study,

T. Fiez, B. Chasnov, and L. Ratliff, “Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study,” inInternational conference on machine learning. PMLR, 2020, pp. 3133–3144

work page 2020

[22] [22]

arXiv preprint arXiv:1906.01217 , year=

T. Fiez, B. Chasnov, and L. J. Ratliff, “Convergence of learning dynamics in stackelberg games,”arXiv preprint arXiv:1906.01217, 2019

work page arXiv 1906

[23] [23]

Stackelberg policy gradient: Evaluating the performance of leaders and followers,

Q.-L. Vu, Z. Alumbaugh, R. Ching, Q. Ding, A. Mahajan, B. Chasnov, S. Burden, and L. J. Ratliff, “Stackelberg policy gradient: Evaluating the performance of leaders and followers,” inICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022

work page 2022

[24] [24]

A survey on bilevel optimization under uncertainty,

Y . Beck, I. Ljubi ´c, and M. Schmidt, “A survey on bilevel optimization under uncertainty,”European Journal of Operational Research, vol. 311, no. 2, pp. 401–426, 2023

work page 2023

[25] [25]

Stack- elberg actor-critic: Game-theoretic reinforcement learning algorithms,

L. Zheng, T. Fiez, Z. Alumbaugh, B. Chasnov, and L. J. Ratliff, “Stack- elberg actor-critic: Game-theoretic reinforcement learning algorithms,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 9217–9224

work page 2022

[26] [26]

Learning in stackelberg markov games,

J. He, A. L. Liu, and Y . Chen, “Learning in stackelberg markov games,”arXiv preprint arXiv:2509.16296, 2025

work page arXiv 2025

[27] [27]

Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers?

H. Zhong, Z. Yang, Z. Wang, and M. I. Jordan, “Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers?”arXiv preprint arXiv:2112.13521, 2021

work page arXiv 2021

[28] [28]

Learning correlated stackelberg equi- librium in general-sum multi-leader-single-follower games,

Y . Yu, H. Xu, and H. Chen, “Learning correlated stackelberg equi- librium in general-sum multi-leader-single-follower games,”arXiv preprint arXiv:2210.12470, 2022

work page arXiv 2022

[29] [29]

Finding a multiple follower stackelberg equilibrium: A fully first-order method,

A. Niu, K. Wang, and J. Ziani, “Finding a multiple follower stackelberg equilibrium: A fully first-order method,”arXiv preprint arXiv:2509.08161, 2025

work page arXiv 2025

[30] [30]

Liberzon,Switching in systems and control

D. Liberzon,Switching in systems and control. Springer, 2003, vol. 190

work page 2003

[31] [31]

Convergence problems of general-sum multiagent rein- forcement learning,

M. Bowling, “Convergence problems of general-sum multiagent rein- forcement learning,” inICML, 2000, pp. 89–94

work page 2000

[32] [32]

Sample-efficient learning of stackelberg equilibria in general-sum games,

Y . Bai, C. Jin, H. Wang, and C. Xiong, “Sample-efficient learning of stackelberg equilibria in general-sum games,”Advances in Neural Information Processing Systems, vol. 34, pp. 25 799–25 811, 2021

work page 2021

[33] [33]

A unified switching system perspective and convergence analysis of Q-learning algorithms,

D. Lee and N. He, “A unified switching system perspective and convergence analysis of Q-learning algorithms,” in34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020

[34] [34]

A discrete-time switching system analysis of Q-learning,

D. Lee, J. Hu, and N. He, “A discrete-time switching system analysis of Q-learning,”SIAM Journal on Control and Optimization (accepted), 2022

work page 2022

[35] [35]

Finite-time error analysis of soft q-learning: Switching system approach,

N. Jeong and D. Lee, “Finite-time error analysis of soft q-learning: Switching system approach,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 3897–3904

work page 2024

[36] [36]

Finite-time analysis of minimax q-learning,

——, “Finite-time analysis of minimax q-learning,” inReinforcement Learning Conference, 2025

work page 2025

[37] [37]

Stochastic games,

L. S. Shapley, “Stochastic games,”Proceedings of the national academy of sciences, vol. 39, no. 10, pp. 1095–1100, 1953

work page 1953

[38] [38]

A multi-agent q-learning with value function approximation based on single-leader multi-followers stack- elberg game,

C. Zhu, W. Yu, and H. Wang, “A multi-agent q-learning with value function approximation based on single-leader multi-followers stack- elberg game,” in2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 2023, pp. 1229–1234

work page 2023

[39] [39]

Nonlinear systems third edition (2002),

H. K. Khalil, “Nonlinear systems third edition (2002),” 2002

work page 2002

[40] [40]

Boundedness of iterates in Q-learning,

A. Gosavi, “Boundedness of iterates in Q-learning,”Systems & Control letters, vol. 55, no. 4, pp. 347–349, 2006

work page 2006