pith. machine review for the scientific record.

arxiv: 2604.17457 · v4 · submitted 2026-04-19 · 🧮 math.OC · cs.AI · cs.SY · eess.SY

Recognition: unknown

Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 05:49 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.SY · eess.SY

keywords Q-value iteration · policy identification · switching systems · joint spectral radius · Bellman operator · optimal policy · discounted Markov decision processes · value iteration geometry

The pith

Q-value iteration enters a tube around the optimal Q-function in finite time, making the greedy policy optimal before full convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models discounted Q-value iteration as a switching dynamical system whose modes are the greedy policies with respect to the current Q-estimate. It identifies an affine subspace X1 consisting of the optimal Q-function plus any constant shift, and shows that an invariant tube around this subspace lies inside the set of Q-functions whose greedy policies are optimal. Once the iterates enter this tube, the induced policy is already optimal, and the distance to the tube contracts exponentially at a rate governed by the joint spectral radius of the family of projected operators transverse to the all-ones direction. A sympathetic reader cares because this separates the moment when good actions are identified from the slower remaining contraction toward exact optimal values, potentially allowing earlier termination or better algorithm design.
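
To make the switching-system picture concrete, here is a minimal runnable sketch, not the authors' code: the 2-state, 2-action MDP data below are invented for illustration. Each Q-VI step applies the operator of the greedy policy extracted from the current iterate (the active mode), and we track both the sup-norm error to Q* and the Euclidean distance to X1 = Q* + span(1).

```python
import numpy as np

gamma = 0.9
# Invented toy MDP: P[a][s, s'] is the transition probability under action a,
# R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.2], [0.0, 0.5]])
nA = P.shape[0]

def bellman(Q):
    """One Q-VI step; the max over actions selects the greedy-policy mode."""
    V = Q.max(axis=1)
    return R + gamma * np.stack([P[a] @ V for a in range(nA)], axis=1)

Q_star = np.zeros_like(R)          # run the gamma-contraction out to get Q*
for _ in range(5000):
    Q_star = bellman(Q_star)

def dist_to_X1(Q):
    """Euclidean distance from Q to the affine line X1 = Q* + span(1)."""
    d = (Q - Q_star).ravel()
    return np.linalg.norm(d - d.mean())   # remove the all-ones component

Q = np.zeros_like(R)
for k in range(25):
    print(f"k={k:2d}  sup-error={np.abs(Q - Q_star).max():.2e}  "
          f"dist(Q, X1)={dist_to_X1(Q):.2e}  greedy={Q.argmax(axis=1).tolist()}")
    Q = bellman(Q)
```

The point of the construction is visible in the printout: the greedy action profile can stabilize while the sup-norm error is still far from zero, and the transverse distance is the quantity that tracks that earlier event.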

Core claim

Q-VI reaches the optimal action class in finite time by entering an invariant tube around X1 = Q* + span(1), which is contained in the practically optimal solution set. For every ε > 0 the distance to X1 satisfies an exponential bound with rate (ρ-bar + ε)^k, where ρ-bar is the joint spectral radius of the projected switching family restricted to directions transverse to X1. When ρ-bar < γ this transverse convergence is faster than the classical contraction rate. The analysis separates fast policy identification from the subsequent convergence to Q*, which may still be governed by the all-ones mode.
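
In symbols, the claimed transverse bound (transcribed from the abstract and the extracted theorem statement; C_ε is an ε-dependent constant):

```latex
\[
  \operatorname{dist}\bigl(Q_k,\,\mathcal{X}_1\bigr)
  \;\le\; C_\varepsilon\,(\bar\rho+\varepsilon)^k\,
  \operatorname{dist}\bigl(Q_0,\,\mathcal{X}_1\bigr),
  \qquad k \ge 0,
  \qquad \mathcal{X}_1 = Q^* + \operatorname{span}(\mathbf{1}),
  \quad \bar\rho \le \gamma .
\]
```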

What carries the argument

The invariant tube around the affine subspace X1 = Q* + span(1) that is contained in the practically optimal solution set, together with the joint spectral radius of the projected switching family of greedy-policy operators restricted to the subspace transverse to the constants.
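
A rough numerical probe of this quantity, a sketch under the assumption that the mode matrices take the standard switching-system form (A_π Q)(s, a) = γ Σ_{s'} P(s'|s, a) Q(s', π(s')), so that A_π 1 = γ 1; the toy MDP data repeat the sketch above. For each word length L, the maximal norm of length-L products raised to 1/L upper-bounds the JSR and converges to it as L grows.

```python
import itertools
from functools import reduce
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # same invented toy MDP as above
nA, nS = P.shape[0], P.shape[1]
n = nS * nA

def mode_matrix(pi):
    """A_pi on Q viewed as a vector indexed by (s, a):
    (A_pi Q)(s, a) = gamma * sum_{s'} P(s'|s, a) * Q(s', pi(s'))."""
    A = np.zeros((n, n))
    for s in range(nS):
        for a in range(nA):
            for s2 in range(nS):
                A[s * nA + a, s2 * nA + pi[s2]] = gamma * P[a, s, s2]
    return A

Pi_perp = np.eye(n) - np.ones((n, n)) / n       # projector transverse to span(1)
modes = [Pi_perp @ mode_matrix(pi) @ Pi_perp
         for pi in itertools.product(range(nA), repeat=nS)]

for L in (1, 2, 4, 6):
    bound = max(np.linalg.norm(reduce(np.matmul, word), 2)
                for word in itertools.product(modes, repeat=L)) ** (1 / L)
    print(f"L={L}: product bound on transverse JSR = {bound:.4f} (gamma={gamma})")
```

If the printed bounds fall below gamma, the transverse rate on this instance is strictly faster than the classical contraction.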

If this is right

  • The greedy policy extracted from the current Q-estimate becomes optimal as soon as the iterate enters the tube, even while the distance to Q* is still large.
  • The rate at which the optimal action class is reached is governed by the transverse joint spectral radius and can be strictly smaller than the discount factor γ.
  • Spectral and graph-theoretic conditions on the MDP determine whether the strict inequality ρ-bar < γ holds (the shape of the test is displayed after this list).
  • After policy identification, further convergence inside the tube can still be limited by the all-ones eigenvector at the classical rate γ.
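
For concreteness, the shape of that test as it appears in the paper's Proposition 3 fragment: Θ^L is the set of length-L deterministic policy words, B_ω the corresponding product of policy transition matrices, and scrambling means the Dobrushin coefficient τ_D(B_ω) < 1. If every length-L product is scrambling, then

```latex
\[
  \eta_L := \min_{\omega\in\Theta^L}\ \min_{x,y}\ \sum_{\upsilon}
            \min\bigl\{(B_\omega)_{x\upsilon},\,(B_\omega)_{y\upsilon}\bigr\} > 0
  \qquad\text{and}\qquad
  \bar\rho \;\le\; \gamma\,(1-\eta_L)^{1/L} \;<\; \gamma .
\]
```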

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Early-stopping rules could monitor distance to the tube rather than the full Bellman residual (a sketch follows this list).
  • The same transverse-spectral analysis may extend to asynchronous or approximate value iteration variants.
  • MDP instances satisfying the graph-theoretic conditions for ρ-bar < γ could be preferentially chosen or constructed to accelerate policy learning.
  • The separation of timescales suggests that policy-gradient or actor-critic methods might also benefit from explicit tube geometry around optimal action sets.
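
A minimal sketch of the early-stopping idea in the first bullet, diagnostic only: it presumes Q* is available (e.g., precomputed on a toy problem), and the tube radius delta must be supplied; the paper's experiment reportedly ties it to the minimum optimality gap.

```python
import numpy as np

def qvi_until_tube(bellman, Q0, Q_star, delta, max_iter=10_000):
    """Iterate Q-VI until the transverse distance to X1 = Q* + span(1)
    drops below delta.  Inside the tube the greedy policy is claimed to be
    optimal even though the sup-norm error to Q* may still be large.
    Diagnostic use only: Q_star must be known in advance."""
    Q = Q0
    for k in range(max_iter):
        d = (Q - Q_star).ravel()
        if np.linalg.norm(d - d.mean()) <= delta:   # transverse component only
            return Q, k                             # entered the tube at step k
        Q = bellman(Q)
    return Q, max_iter
```

With the toy bellman and Q_star from the first sketch, qvi_until_tube(bellman, np.zeros_like(Q_star), Q_star, delta=0.1) returns the iterate and the first step inside the tube.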

Load-bearing premise

Discounted Q-value iteration can be modeled as a switching system whose modes are the greedy policies, and an invariant tube around X1 exists inside the set of Q-functions that induce optimal policies.

What would settle it

A finite discounted MDP in which the greedy policies extracted along the Q-VI iterates remain suboptimal at infinitely many steps. Since γ-contraction already forces Q_k → Q*, such an example would show that no invariant tube around Q* + span(1) inside the POSS captures the iterates in finite time, refuting the main claim.

Figures

Figures reproduced from arXiv: 2604.17457 by Donghwan Lee.

Figure 1. Geometric mechanism for finite-time POSS identification, shown schematically.

Figure 2. Normalized comparison of ‖Qk − Q*‖∞ and dist2(Qk, X1) for the toy MDP. The dashed curves indicate the reference rates γ^k and (γ|λ2|)^k. The plot shows that the iterate approaches the affine set X1 faster than it approaches the optimal Q-function Q*.

Figure 3. Orthogonal projection of a single full Q-VI trajectory …

Figure 4. Orthogonal projections of 12 full Q-VI trajectories onto the tilted plane, where the initial conditions are chosen uniformly on a circle of radius 2 in the plane. The dashed line is the slice of X1, the shaded strip is the slice of Tδ, and the dotted curve is the initial circle. The plot gives a global geometric view of trajectories being first attracted toward X1 and then converging to Q*.

Figure 5. Two-dimensional projection of tabular Q-learning trajectories onto the tilted plane Q* + span{1̂, d̂}, displayed in the rotated coordinates p = (u − v)/√2 and q = (u + v)/√2 …

Figure 6. Normalized comparison of ‖Qk − Q*‖∞ and dist2(Qk, X1), with the same reference rates γ^k and (γ|λ2|)^k.

Figure 7. Orthogonal projection of a single full Q-VI trajectory …

Figure 8. Orthogonal projections of 12 full Q-VI trajectories onto the tilted plane, with the same slices of X1 and Tδ and the same initial circle.

Figure 9. Two-dimensional projection of tabular Q-learning trajectories …
read the original abstract

Q-value iteration (Q-VI) is usually analyzed through the \(\gamma\)-contraction of the Bellman operator. This argument proves convergence to \(Q^*\), but it gives only a coarse account of when the induced greedy policy becomes optimal. We study discounted Q-VI as a switching system and focus on the practically optimal solution set (POSS), the set of \(Q\)-functions whose tie-broken greedy policies are optimal. The main result shows that Q-VI reaches the optimal action class in finite time by entering an invariant tube around \(\mathcal X_1=Q^*+\operatorname{span}(\mathbf 1)\), which is contained in the POSS. For every \(\varepsilon>0\), the distance to \(\mathcal X_1\) satisfies an exponential bound with rate \((\bar\rho+\varepsilon)^k\), where \(\bar\rho\) is the joint spectral radius of the projected switching family restricted to directions transverse to \(\mathcal X_1\). When \(\bar\rho<\gamma\), this transverse convergence is faster than the classical contraction rate. The analysis separates fast policy identification from the subsequent convergence to \(Q^*\), which may still be governed by the all-ones mode. We also give spectral and graph-theoretic conditions under which the strict inequality \(\bar\rho<\gamma\) holds or fails.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper models discounted Q-value iteration as a switching system whose modes are the greedy policies with respect to the current Q-function. It defines the practically optimal solution set (POSS) as the set of Q-functions whose tie-broken greedy policies are optimal, and shows that Q-VI enters an invariant tube of small transverse radius around the affine line X1 = Q* + span(1) in finite time; this tube lies inside the POSS. Inside the tube the active modes are optimal policies, each of which leaves X1 invariant, and the transverse dynamics on the quotient space are governed by a switching family whose joint spectral radius ρ-bar satisfies ρ-bar ≤ γ. Consequently the transverse distance to X1 contracts at rate (ρ-bar + ε)^k for any ε > 0, which is strictly faster than the classical γ-rate whenever ρ-bar < γ. Spectral and graph-theoretic conditions are supplied under which the strict inequality holds.

Significance. If the central claims hold, the work supplies a geometrically precise account of the finite-time policy-identification phase of Q-VI that is separated from the subsequent convergence to Q*. The decomposition into longitudinal and transverse components, the explicit construction of an invariant tube inside the POSS, and the use of the joint spectral radius to obtain a potentially tighter transverse rate constitute a genuine advance over the standard contraction argument. The provision of verifiable conditions for ρ-bar < γ is a concrete strength that makes the result falsifiable and potentially useful for algorithm analysis.

major comments (2)
  1. [§3.2] §3.2 (Tube invariance): The proof that any sufficiently small transverse perturbation keeps the argmax set inside the optimal actions relies on a uniform positive gap between optimal and suboptimal action values at Q*. The manuscript should state explicitly that this gap is positive and uniform because the state and action spaces are finite; without that sentence the invariance argument is incomplete for the general case.
  2. [§4.1] §4.1 (Definition of the projected switching family): The joint spectral radius ρ-bar is defined on the quotient space transverse to X1. The paper must verify that each optimal policy induces a well-defined linear map on this quotient (i.e., that T_π maps X1 to itself) and that the family is finite; both facts are used to bound ρ-bar ≤ γ and should be recorded as a short lemma.
minor comments (3)
  1. [§2] The notation X1, POSS, and the transverse projection operator are introduced without a small illustrative example (e.g., a two-state chain); adding one would make the geometric construction easier to follow.
  2. [Theorem 1] Theorem 1 states the exponential bound with rate (ρ-bar + ε)^k; the proof should cite the precise definition of joint spectral radius used (e.g., the spectral radius of the semigroup or the max-norm growth rate) so that readers can verify the ε-approximation step; the standard characterization is displayed after this list.
  3. [§5] The graph-theoretic conditions in §5 are stated for the case of a single optimal policy; the manuscript should note whether the same spectral test extends immediately to multiple optimal policies or requires an additional union-graph construction.
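
For reference, the Rota–Strang characterization [11] of the joint spectral radius of a finite family {Ā_1, …, Ā_M}, which matches the max-norm growth rate used in the extracted proof fragments (the limit is independent of the submultiplicative matrix norm chosen):

```latex
\[
  \bar\rho(\bar A_1,\dots,\bar A_M)
  \;=\; \lim_{k\to\infty}\ \max_{\sigma\in\{1,\dots,M\}^k}
        \bigl\lVert \bar A_{\sigma_k}\cdots\bar A_{\sigma_1}\bigr\rVert^{1/k}.
\]
```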

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The suggested clarifications strengthen the presentation, and we will incorporate them in the revised version.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Tube invariance): The proof that any sufficiently small transverse perturbation keeps the argmax set inside the optimal actions relies on a uniform positive gap between optimal and suboptimal action values at Q*. The manuscript should state explicitly that this gap is positive and uniform because the state and action spaces are finite; without that sentence the invariance argument is incomplete for the general case.

    Authors: We agree that the argument for tube invariance relies on a uniform positive gap, which holds because the finite state-action space implies a positive minimum gap between optimal and suboptimal Q*-values. We will add an explicit sentence in §3.2 stating this fact to make the invariance proof self-contained. revision: yes

  2. Referee: [§4.1] §4.1 (Definition of the projected switching family): The joint spectral radius ρ-bar is defined on the quotient space transverse to X1. The paper must verify that each optimal policy induces a well-defined linear map on this quotient (i.e., that T_π maps X1 to itself) and that the family is finite; both facts are used to bound ρ-bar ≤ γ and should be recorded as a short lemma.

    Authors: We will add a short lemma in §4.1 that records these two facts: (i) every optimal policy π* satisfies T_π*(Q* + c·1) = Q* + γc·1, so the induced map is well-defined on the quotient transverse to X1, and (ii) the switching family is finite because there are only finitely many deterministic policies. This lemma directly supports the subsequent bound ρ-bar ≤ γ. revision: yes
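
A one-line verification of fact (i), assuming the standard vector form T_π Q = R_π + γ P_π Q with row-stochastic P_π (so P_π 1 = 1) and the fixed-point property T_{π*} Q* = Q* for an optimal π*:

```latex
\[
  T_{\pi^*}(Q^* + c\mathbf{1})
  \;=\; R_{\pi^*} + \gamma P_{\pi^*} Q^* + \gamma c\,P_{\pi^*}\mathbf{1}
  \;=\; Q^* + \gamma c\,\mathbf{1},
\]
```

so X1 maps into itself and the induced map on the quotient transverse to X1 is well defined.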

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper analyzes Q-VI via switching systems and the joint spectral radius of the transverse projected family, a standard external concept from control theory. The invariant tube inside the POSS follows from the uniform action gap at Q* (finite actions) plus classical γ-contraction; once inside, optimal modes map the line X1 to itself and the transverse dynamics are bounded by the JSR by definition. No parameter is fitted to the target rate, no self-citation supplies a load-bearing uniqueness theorem, and no step renames or re-derives its own input. The claimed faster transverse rate when ρ-bar < γ is a direct consequence of the JSR definition applied to the quotient space, not a construction internal to the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on modeling Q-VI as a switching system, the existence of an invariant tube inside the POSS, and the well-definedness of the joint spectral radius for the transverse dynamics. These are introduced or assumed in the paper rather than derived from prior literature.

axioms (2)
  • domain assumption Discounted Q-value iteration can be represented as a finite switching system whose modes correspond to greedy policies
    Invoked to replace the single Bellman operator with a family of linear updates whose joint spectral radius governs transverse convergence.
  • ad hoc to paper There exists an invariant tube around X1 = Q* + span(1) that is contained in the practically optimal solution set
    This is the load-bearing geometric object whose invariance enables the finite-time policy identification claim.
invented entities (3)
  • Practically Optimal Solution Set (POSS) no independent evidence
    purpose: Set of Q-functions whose tie-broken greedy policies are optimal
    New set introduced to focus analysis on policy optimality rather than exact Q* convergence.
  • Invariant tube around X1 no independent evidence
    purpose: Geometric region guaranteeing optimal greedy policy
    Central construct whose finite-time entry is the main result.
  • Joint spectral radius of the projected switching family (transverse to X1) no independent evidence
    purpose: Quantity that bounds the exponential rate of approach to the tube
    New spectral object whose comparison to γ yields the faster-than-contraction claim.

pith-pipeline@v0.9.0 · 5537 in / 1911 out tokens · 41754 ms · 2026-05-10T05:49:45.098612+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Switching-Geometry Analysis of Deflated Q-Value Iteration

    math.OC 2026-05 unverdicted novelty 7.0

    Deflated Q-VI is algebraically equivalent to recentering standard Q-VI, yet its error dynamics are governed by the joint spectral radius of a projected switching system that can be strictly smaller than the discount factor γ.

Reference graph

Works this paper leans on

67 extracted references · 5 canonical work pages · cited by 1 Pith paper

  [1] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.

  [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.

  [3] D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed., vol. II. Athena Scientific, 2015.

  [4] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

  [5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

  [6] D. Liberzon, Switching in Systems and Control. Springer Science & Business Media, 2003.

  [7] H. Lin and P. J. Antsaklis, “Stability and stabilizability of switched linear systems: a survey of recent results,” IEEE Transactions on Automatic Control, vol. 54, no. 2, pp. 308–322, 2009.

  [8] J. Hu, J. Shen, and D. Lee, “Resilient stabilization of switched linear control systems against adversarial switching,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3820–3834, 2017.

  [9] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ: Prentice Hall, 2002.

  [10] J. N. Tsitsiklis and V. D. Blondel, “The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate,” Mathematics of Control, Signals and Systems, vol. 10, no. 1, pp. 31–40, 1997.

  [11] G.-C. Rota and G. Strang, “A note on the joint spectral radius,” Indag. Math., vol. 22, no. 4, pp. 379–381, 1960.

  [12] V. D. Blondel and Y. Nesterov, “Computationally efficient approximations of the joint spectral radius,” SIAM Journal on Matrix Analysis and Applications, vol. 27, no. 1, pp. 256–272, 2005.

  [13] R. L. Dobrushin, “Central limit theorem for nonstationary Markov chains. I,” Theory of Probability and Its Applications, vol. 1, no. 1, pp. 65–80, 1956.

  [14] J. Hajnal and M. S. Bartlett, “Weak ergodicity in non-homogeneous Markov chains,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 54, no. 2, pp. 233–246, 1958.

  [15] J. Wolfowitz, “Products of indecomposable, aperiodic, stochastic matrices,” Proceedings of the American Mathematical Society, vol. 14, no. 5, pp. 733–737, 1963.

  [16] E. Seneta, Non-negative Matrices and Markov Chains. New York: Springer Science & Business Media, 2006.


  [18] D. Lee and N. He, “A unified switching system perspective and convergence analysis of Q-learning algorithms,” in 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.

  [19] D. Lee, J. Hu, and N. He, “A discrete-time switching system analysis of Q-learning,” SIAM Journal on Control and Optimization, vol. 61, no. 3, pp. 1861–1880, 2023.

  [20] D. Lee, “Final iteration convergence bound of Q-learning: Switching system approach,” IEEE Transactions on Automatic Control, vol. 69, no. 7, pp. 4765–4772, 2024.

  [21] H.-D. Lim and D. Lee, “Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view,” IEEE Access, vol. 12, pp. 149916–149939, 2024.

  [22] D. Lee, “On some geometric behavior of value iteration on the orthant: Switching system perspective,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 4911–4916.

  [23] X. Guo and B. Hu, “Convex programs and Lyapunov functions for reinforcement learning: A unified perspective on the analysis of value-based methods,” in 2022 American Control Conference (ACC), 2022, pp. 3317–3322.

  [24] R. Iervolino, M. Tipaldi, and A. Forootani, “A Lyapunov-based version of the value iteration algorithm formulated as a discrete-time switched affine system,” International Journal of Control, vol. 96, no. 3, pp. 577–592, 2023.

  [25] M. Tipaldi, R. Iervolino, P. R. Massenio, and D. Naso, “A switching control strategy for policy selection in stochastic dynamic programming problems,” Automatica, vol. 171, p. 111884, 2025.

  [26] V. Goyal and J. Grand-Clement, “A first-order approach to accelerated value iteration,” Operations Research, vol. 71, no. 2, pp. 517–535, 2023.

  [27] R. Dadashi, A. A. Taiga, N. Le Roux, D. Schuurmans, and M. G. Bellemare, “The value function polytope in reinforcement learning,” in International Conference on Machine Learning, 2019, pp. 1486–1495.

  [28] Y. Wu and J. A. De Loera, “Geometric policy iteration for Markov decision processes,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2070–2078.

  [29] E. A. Feinberg and G. He, “Properties of turnpike functions for discounted finite MDPs,” arXiv preprint arXiv:2502.05375, 2025.

  [30] A. Mustafin, A. Olshevsky, and I. C. Paschalidis, “On value iteration convergence in connected MDPs,” arXiv preprint arXiv:2406.09592, 2024.

  [31] A. Mustafin, S. Colla, A. Olshevsky, and I. C. Paschalidis, “Analysis of value iteration through absolute probability sequences,” arXiv preprint arXiv:2502.03244, 2025.

  [32] A. Mustafin, A. Pakharev, A. Olshevsky, and I. C. Paschalidis, “Geometric re-analysis of classical MDP solving algorithms,” arXiv preprint arXiv:2503.04203, 2025.

  [33] M. Gargiani, A. Zanelli, D. Liao-McPherson, T. H. Summers, and J. Lygeros, “Dynamic programming through the lens of semismooth Newton-type methods,” IEEE Control Systems Letters, vol. 6, pp. 2996–3001, 2022.

  [34] Z. Chen, S. Zhang, Z. Zhang, S. U. Haque, and S. T. Maguluri, “A non-asymptotic theory of seminorm Lyapunov stability: From deterministic to stochastic iterative algorithms,” arXiv preprint arXiv:2502.14208, 2025.

  [35] S. Gaubert and Z. Qu, “Dobrushin’s ergodicity coefficient for Markov operators on cones,” Integral Equations and Operator Theory, vol. 81, no. 1, pp. 127–150, 2015.

  [36] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.

  [37] D. A. Levin and Y. Peres, Markov Chains and Mixing Times, vol. 107. American Mathematical Society, 2017.
