Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

Donghwan Lee; Hyukjun Yang

arxiv: 2601.18840 · v3 · submitted 2026-01-26 · 💻 cs.LG · cs.SY· eess.SY

Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

Donghwan Lee , Hyukjun Yang This is my paper

Pith reviewed 2026-05-16 10:57 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SY

keywords Bellman residualpolicy optimizationMarkov decision processconvergencestationarityfunction approximationreinforcement learning

0 comments

The pith

Bellman residual minimization for control establishes geometric properties, stationarity of solutions, and convergence guarantees for policy optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to lay the theoretical groundwork for applying Bellman residual minimization to policy optimization in Markov decision processes, rather than limiting it to policy evaluation. This involves directly minimizing the squared Bellman residual to find good policies. A reader might care because this method can offer more stable convergence when approximating value functions, which is common in practical reinforcement learning. The focus on control tasks addresses an area that has seen less study compared to evaluation.

Core claim

The authors show that the Bellman residual minimization objective for control has a specific geometry whose stationary points correspond to optimal solutions, and that iterative minimization converges to the optimal value function.

What carries the argument

The squared Bellman residual objective minimized jointly over value functions and policies.

If this is right

Stationary points of the residual objective satisfy the optimality conditions of the MDP.
The minimization converges reliably even with function approximation under regularity conditions.
It provides an alternative optimization path for control that avoids some instabilities of dynamic programming.
Foundational analysis enables further development of residual-based control algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework could be combined with deep neural networks for large-scale control problems.
It might explain why some residual methods perform well in practice despite theoretical gaps.
Extensions to continuous state spaces could be tested using the same geometric arguments.

Load-bearing premise

Standard assumptions on the MDP hold, such as finite state-action spaces or suitable regularity conditions for any function approximation used.

What would settle it

Demonstrating an MDP example where the residual minimizer is not a stationary policy or where the algorithm diverges.

Figures

Figures reproduced from arXiv: 2601.18840 by Donghwan Lee, Hyukjun Yang.

**Figure 1.** Figure 1: Illustration of the oblique projection Theorem 2. The stationary point ¯θ with 0 ∈ ∂f( ¯θ) satisfies the OP-CBE, Qθ¯ = ΓΦ|Ψβ¯ T Qθ¯, where Ψβ¯ = (γPΠβ¯ − I)Φ and β¯ ∈ conv(Λ(Qθ¯)) is given in (3). Since a stationary point always exists, the OP-CBE always admits at least one solution (Lemma A.25), although the solution is not unique in general. The solution exists even when the composite operator ΓΦ|Ψβ T i… view at source ↗

**Figure 2.** Figure 2: Comparison of errors between the gradient descent on the SCBR (blue) and the P [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: The surface of f(Q) Non-differentiability occurs precisely on the tie set {Q ∈ R |S×A| : Q(1, 1) = Q(1, 2)} because of the max operator. For a (possibly stochastic) policy β, we can represent Π β = β(·|1)⊤ ⊗ e ⊤ s = [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗

**Figure 4.** Figure 4: Convergence trajectory of Qk with the subgradient descent method [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗

**Figure 5.** Figure 5: CBR objective function value f(Qk) with Qk generated with the subgradient descent method. A.4 CBR with linear function approximation Let Θ denote the set of deterministic policies π : S → A, so |Θ| = |A||S| < ∞. Let us consider the CBR objective with LFA f(θ) := 1 2 ∥TΦθ − Φθ∥ 2 2 , Moreover, let us define the set Sπ := {θ ∈ R m : π(s) ∈ arg maxa∈AQθ(s, a), ∀(s, a) ∈ S × A} for each π ∈ Θ, where arg max is… view at source ↗

**Figure 6.** Figure 6: The graph of the CBR objective, f(θ) = 1 2 ∥T(Φθ) − Φθ∥ 2 2 . then we can represent Π β = β(·|1)⊤ ⊗ e ⊤ s = [PITH_FULL_IMAGE:figures/full_fig_p041_6.png] view at source ↗

**Figure 7.** Figure 7: The evolution of the iterates of the subgradient descent method converging to the [PITH_FULL_IMAGE:figures/full_fig_p044_7.png] view at source ↗

**Figure 8.** Figure 8: The evolution of the SCBR objective function value over iterations on a log scale. [PITH_FULL_IMAGE:figures/full_fig_p053_8.png] view at source ↗

**Figure 9.** Figure 9: The convergence of Qk(1, 1) and Qk(1, 2) to Q∗ λ (1, 1) and Q∗ λ (1, 2), respectively, where the dashed lines indicate the optimal Q-values. Proof. First of all, we have ∥Qθ − FλQθ∥∞ =∥Qθ − Q ∗ λ + FλQ ∗ λ − FλQθ∥∞ ≥∥Qθ − Q ∗ λ∥∞ − ∥FλQ ∗ λ − FλQθ∥∞ ≥∥Qθ − Q ∗ λ∥∞ − γ∥Q ∗ λ − Qθ∥∞ =(1 − γ)∥Qθ − Q ∗ λ∥∞ ≥(1 − γ) 1 p |S × A| ∥Qθ − Q ∗ λ∥2 , where the second line is due to the reverse triangle inequality and … view at source ↗

**Figure 10.** Figure 10: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = 0.2897 60 [PITH_FULL_IMAGE:figures/full_fig_p060_10.png] view at source ↗

**Figure 11.** Figure 11: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = −0.2897 [PITH_FULL_IMAGE:figures/full_fig_p061_11.png] view at source ↗

**Figure 12.** Figure 12: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = 0 The first plot of ?? shows the evolution of the objective function value f when gradient descent is applied to the objective function in this example. We observe that the objective value decreases and converges to a constant. The second plot of ?? illustrates the convergence of θk under gra… view at source ↗

**Figure 13.** Figure 13: (1) The evolution of the objective function value [PITH_FULL_IMAGE:figures/full_fig_p062_13.png] view at source ↗

**Figure 14.** Figure 14: Deep learning experiments across five different tasks from the [PITH_FULL_IMAGE:figures/full_fig_p063_14.png] view at source ↗

read the original abstract

Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper fills a small gap by working out geometry, stationarity, and convergence for Bellman residual minimization in the control setting.

read the letter

The core contribution is extending Bellman residual methods, already studied for policy evaluation, to policy optimization. The authors lay out the geometry of the residual objective, characterize stationary points, and give convergence arguments under standard MDP assumptions such as finite spaces or suitable regularity for approximation. That extension is new enough to be worth recording, since most prior work stayed with evaluation or stuck to dynamic programming for control. They avoid circularity by grounding everything in ordinary MDP theory rather than fitted quantities. The writing is straightforward and the positioning against dynamic programming is clear. On the soft side, the results appear to rest on the usual finite-state or Lipschitz-type conditions that keep the minimization well-posed; if those are violated the guarantees do not transfer, and the paper does not seem to supply new tools for the approximate or continuous case. Convergence is shown but the practical note in the abstract that the method is often slower than DP is left untouched, so the theory does not yet explain when the stability advantage outweighs the efficiency cost. No load-bearing contradictions show up in the derivations that are visible. This is for readers already working on optimization views of RL or on alternatives to value iteration. A theorist who wants to cite a clean reference for residual-based control would find it useful. I would send it to peer review; the claims are modest and the gap is real, so referees can check the proofs and tighten the assumptions without starting from scratch.

Referee Report

0 major / 2 minor

Summary. The manuscript establishes foundational results for Bellman residual minimization applied to policy optimization (control) in Markov decision processes. It analyzes the geometry of the squared Bellman residual objective, derives stationarity conditions linking residual minimization to optimality, and provides convergence guarantees, contrasting this approach with dynamic programming while noting advantages for stable convergence under function approximation.

Significance. If the derivations hold, the work supplies needed theoretical grounding for an under-explored method in control tasks, complementing policy evaluation results and potentially supporting more stable RL algorithms with approximation. The geometry and stationarity analysis offers concrete insight into the objective landscape that is not available from standard DP theory alone.

minor comments (2)

[§2] The transition from policy evaluation to control residual definitions in the early sections would benefit from an explicit side-by-side equation comparison to highlight the added policy dependence.
[Notation] Notation for the control Bellman operator T^π versus the optimal T^* is introduced clearly but could be reinforced with a short table of symbols for readers less familiar with the control setting.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our manuscript and for recommending acceptance. We are pleased that the foundational geometric, stationarity, and convergence results for Bellman residual minimization in control are recognized as providing useful theoretical grounding that complements existing dynamic programming theory.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper positions its foundational results for control Bellman residual minimization as building directly on standard MDP theory and dynamic programming, with geometry, stationarity, and convergence arguments derived under finite state-action spaces or regularity conditions for function approximation. No equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims retain independent content from the stated assumptions without renaming known results or smuggling ansatzes. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard MDP assumptions such as the existence of value functions and Bellman operators; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Standard Markov decision process assumptions including discounted infinite-horizon setting and existence of optimal policies
Invoked implicitly as background for all Bellman residual analysis.

pith-pipeline@v0.9.0 · 5422 in / 970 out tokens · 19138 ms · 2026-05-16T10:57:14.121274+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

f(θ)=½∥T Qθ−Qθ∥²₂ ... Clarke subdifferential ... oblique projected control Bellman equation (OP-CBE)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

piecewise quadratic ... homogeneous half-space convex cone ... generalized gradient descent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem
cs.LG 2026-04 unverdicted novelty 6.0

Soft Bellman residual minimization with weighted Lp-norm aligns the objective with Bellman contraction as p increases and yields performance error bounds.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper

[1]

In particular, ifg̸= 0, then−gis a strict descent direction. 21 Proof.Since∂f(x) is closed and convex, the projection optimality condition (Bertsekas et al., 2003, Proposition 2.2.1, page 88) gives ⟨v−g,0−g⟩ ≤0∀v∈∂f(x), equivalently, it can be rewritten as follows: ⟨v, g⟩ ≥ ∥g∥ 2 ∀v∈∂f(x). Letd=−g. Then for allv∈∂f(x), ⟨v, d⟩=−⟨v, g⟩ ≤ −∥g∥ 2. Using the s...

work page 2003
[2]

The following result establishes the convergence of the algorithm under the above assumptions

The initial sublevel setL c :={x:f(x)≤c}withc=f(x 0)is bounded (hence compact). The following result establishes the convergence of the algorithm under the above assumptions. Theorem A.1.Let(x k)k≥0 be generated by Algorithm 1 under Assumption A.1. Then, we guar- antee the following results:

work page
[3]

(Monotone decrease)f(x k+1)≤f(x k)for allk≥0, and(f(x k))k≥0 converges

work page
[4]

(Sufficient decrease) For allk≥0, f(x k+1)≤f(x k)−σα k∥gk∥2 2,(11) and hence P∞ k=0 αk∥gk∥2 2 <∞

work page
[5]

Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point

(Stationarity of limit points) The sequence{x k} ⊂ L c withc=f(x 0)admits at least one limit point. Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point. Proof.Let us fixk≥0. Ifg k = 0, then 0∈∂f(x k) and the algorithm stops at a stationary point. Therefore, let us assumeg k ̸= 0. By Lemma A.18,f ◦(xk;d k) =f ◦(xk;−g k) =−∥g k∥2 2 <0. Moreo...

work page 1976
[6]

The Clark subdifferential offgiven by∂f(Q) ={(γPΠ β −I) ⊤(T Q−Q) :β∈conv(Λ(Q))} is nonempty, convex, and bounded

work page
[7]

5.fis coercive

We have∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞, and∥T Q−Q∥ ∞ ≥(1−γ)∥Q∥ ∞ −R max. 5.fis coercive

work page
[8]

7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

Every sublevel set,L c :={Q∈R |S||A| :f(Q)≤c}, is bounded for anyc≥0. 7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

work page
[9]

There existsµ >0such that for allQ, f(Q)−f ∗ ≤ 1 2µ dist(0, ∂f(Q)) 2 , wheref ∗ := minQ∈R|S×A| f(Q)anddist(0, ∂f(Q)) := inf{∥g∥ 2 :g∈∂f(Q)}. Proof.1. Sincefis a nonnegative sum of squares. Therefore,fis bounded below by 0. 27

work page
[10]

Therefore, it is locally Lipschitz by Lemma A.19

Moreover, by Proposition A.1,fis piecewise quadratic with finitely many pieces. Therefore, it is locally Lipschitz by Lemma A.19

work page
[11]

Sincefis locally Lipschitz by the second statement, the desired conclusion is directly obtained using Lemma A.9

work page
[12]

This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

For any (s, a)∈ S × A, we can derive the following bounds: |(T Q)(s, a)| ≤ |R(s, a)|+γ X s′∈S P(s ′|s, a)max a′∈A |Q(s′, a′)| ≤R max +γ∥Q∥ ∞. This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞. Moreover, we have ∥T Q−Q∥ ∞ ≥ ∥Q∥ ∞ − ∥T Q∥∞ ≥(1−γ)∥Q∥ ∞ −R max, where the last inequality uses∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

work page
[13]

Using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Hencef(Q)→ ∞as∥Q∥ ∞ → ∞, i.e.,fis coercive

work page
[14]

The coercivity in the fifth statement implies that the level sets are bounded

work page
[15]

Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c

For the last statement, using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Fixc≥0 and letQ∈ L c, i.e.,f(Q)≤c. Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c. Taking square roots gives (1−γ)∥Q∥ ∞ −R max ≤ √ 2c. Therefore, one gets the desired result

work page
[16]

whereσ min denotes the minimum singular value

First of all, we can derive the following inequalities: dist(0, ∂f(Q)) = min β∈conv{Λ(Q)} (I−γPΠ β) ⊤ (T Q−Q) 2 ≥min β∈conv{Λ(Q)} σmin((I−γPΠ β)⊤)∥T Q−Q∥ 2. whereσ min denotes the minimum singular value. Then, we obtain f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≤ 1 σmin((I−γPΠ β)⊤) 2 dist(0, ∂f(Q)) 2. Since minQ∈R|S×A| f(Q) =f ∗ = 0, this is exactly the desired bound. 28 Th...

work page 2004
[17]

Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator

The surface off(Q) is illustrated in Figure 3. Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator. For a (possibly stochastic) policyβ, we can represent Πβ =β(·|1) ⊤ ⊗e ⊤ s = β(1|1) 1−β(1|1) , β(1|1)∈[0,1], wheree s ∈R |S| is the standard basis vector where itss-th eleme...

work page
[18]

Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ)

This completes the proof. Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ). Then, we have (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 1)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , 0≤f(θ ∗ 1)−f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 − (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ...

work page
[19]

= (1 +γ) 2|S × S| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S×A| Qθ∗ 2 −Q ∗ 2 2 =q1(θ∗ 2)≤f(θ ∗ 2)≤q 2(θ∗

work page
[20]

The first inequality above proves the first statement

= (1 +γ) 2|S × A| 2 Qθ∗ 2 −Q ∗ 2 2 , respectively. The first inequality above proves the first statement. Moreover, sincef(θ ∗ 2)≤f(θ ∗ 1) and ΓΦ|ΦQ∗ −Q ∗ 2 2 = Qθ∗ 1 −Q ∗ 2 2 ≤ Qθ∗ 2 −Q ∗ 2 2, it follows that (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤ (1−γ) 2 2|S × A| Qθ∗ 2 −Q ∗ 2 2 ≤f(θ ∗ 2) ≤f(θ ∗ 1) ≤(1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 ,(13) which leads to (1−γ...

work page
[21]

The Clark subdifferential offgiven by∂f(Q) ={Φ ⊤(γPΠ β −I) ⊤(T Qθ−Qθ) :β∈conv(Λ(Q θ))} is nonempty, convex, and bounded

work page
[22]

We have∥T Q θ∥∞ ≤R max +γ∥Q θ∥∞, and∥T Q θ −Q θ∥∞ ≥(1−γ)∥Q θ∥∞ −R max

work page
[23]

5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ

Every sublevel setL c :={θ∈R m :f(θ)≤c}is bounded for anyc≥0. 5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ . The proofs of the above properties are identical to those in the tabular case; therefore, we do not repeat them here. The PL-type inequality that holds in the tabular case does not, in general, hold when linear function approximation is used. Lemma A.2...

work page
[24]

Therefore, there exists a unique stationary point (minimizer) in Bε, and it is given byθ ∗ = arg minθ∈Rm 1 2 T π∗ Qθ −Q θ 2

work page
[25]

The last bound can be readily obtained using Lemma A.8. Theorem A.8.Let(θ k)k≥0 be generated by θk+1 =θ k −α kgk, whereg k ∈arg min g∈∂f(θ k) ∥g∥2 = Proj ∂f(θ k)(0)andα k >0a step-size generated by backtracking search with Armijo rule (Boyd and Vandenberghe, 2004). Then, the sequence(θ k)k≥0 ⊂ L c with c=f(Q 0)admits at least one limit point ¯θ. Every lim...

work page 2004
[26]

This completes the proof

Since (γPΠ π ¯Q −I) is invertible from Lemma A.3, we haveF λ ¯Q= ¯Q. This completes the proof. Lemma A.29.The Hessian offis given by ∇2 Qf(Q) := (γPΠ πQ −I)(γPΠ πQ −I) ⊤ + γ λ   Ω1(Q) 0· · ·0 0 Ω 2(Q) ... ... ... ... ... 0 0· · ·0 Ω |S|(Q)   where Ωs(Q) :=ms diag(πQ(·|s))−π Q(·|s)πQ(·|s)⊤ diag(πQ(·|s)) :=   πQ(1|s) 0· · ·0 0π Q(2|s) ... ...

work page 1976

[1] [1]

In particular, ifg̸= 0, then−gis a strict descent direction. 21 Proof.Since∂f(x) is closed and convex, the projection optimality condition (Bertsekas et al., 2003, Proposition 2.2.1, page 88) gives ⟨v−g,0−g⟩ ≤0∀v∈∂f(x), equivalently, it can be rewritten as follows: ⟨v, g⟩ ≥ ∥g∥ 2 ∀v∈∂f(x). Letd=−g. Then for allv∈∂f(x), ⟨v, d⟩=−⟨v, g⟩ ≤ −∥g∥ 2. Using the s...

work page 2003

[2] [2]

The following result establishes the convergence of the algorithm under the above assumptions

The initial sublevel setL c :={x:f(x)≤c}withc=f(x 0)is bounded (hence compact). The following result establishes the convergence of the algorithm under the above assumptions. Theorem A.1.Let(x k)k≥0 be generated by Algorithm 1 under Assumption A.1. Then, we guar- antee the following results:

work page

[3] [3]

(Monotone decrease)f(x k+1)≤f(x k)for allk≥0, and(f(x k))k≥0 converges

work page

[4] [4]

(Sufficient decrease) For allk≥0, f(x k+1)≤f(x k)−σα k∥gk∥2 2,(11) and hence P∞ k=0 αk∥gk∥2 2 <∞

work page

[5] [5]

Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point

(Stationarity of limit points) The sequence{x k} ⊂ L c withc=f(x 0)admits at least one limit point. Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point. Proof.Let us fixk≥0. Ifg k = 0, then 0∈∂f(x k) and the algorithm stops at a stationary point. Therefore, let us assumeg k ̸= 0. By Lemma A.18,f ◦(xk;d k) =f ◦(xk;−g k) =−∥g k∥2 2 <0. Moreo...

work page 1976

[6] [6]

The Clark subdifferential offgiven by∂f(Q) ={(γPΠ β −I) ⊤(T Q−Q) :β∈conv(Λ(Q))} is nonempty, convex, and bounded

work page

[7] [7]

5.fis coercive

We have∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞, and∥T Q−Q∥ ∞ ≥(1−γ)∥Q∥ ∞ −R max. 5.fis coercive

work page

[8] [8]

7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

Every sublevel set,L c :={Q∈R |S||A| :f(Q)≤c}, is bounded for anyc≥0. 7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

work page

[9] [9]

There existsµ >0such that for allQ, f(Q)−f ∗ ≤ 1 2µ dist(0, ∂f(Q)) 2 , wheref ∗ := minQ∈R|S×A| f(Q)anddist(0, ∂f(Q)) := inf{∥g∥ 2 :g∈∂f(Q)}. Proof.1. Sincefis a nonnegative sum of squares. Therefore,fis bounded below by 0. 27

work page

[10] [10]

Therefore, it is locally Lipschitz by Lemma A.19

Moreover, by Proposition A.1,fis piecewise quadratic with finitely many pieces. Therefore, it is locally Lipschitz by Lemma A.19

work page

[11] [11]

Sincefis locally Lipschitz by the second statement, the desired conclusion is directly obtained using Lemma A.9

work page

[12] [12]

This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

For any (s, a)∈ S × A, we can derive the following bounds: |(T Q)(s, a)| ≤ |R(s, a)|+γ X s′∈S P(s ′|s, a)max a′∈A |Q(s′, a′)| ≤R max +γ∥Q∥ ∞. This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞. Moreover, we have ∥T Q−Q∥ ∞ ≥ ∥Q∥ ∞ − ∥T Q∥∞ ≥(1−γ)∥Q∥ ∞ −R max, where the last inequality uses∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

work page

[13] [13]

Using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Hencef(Q)→ ∞as∥Q∥ ∞ → ∞, i.e.,fis coercive

work page

[14] [14]

The coercivity in the fifth statement implies that the level sets are bounded

work page

[15] [15]

Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c

For the last statement, using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Fixc≥0 and letQ∈ L c, i.e.,f(Q)≤c. Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c. Taking square roots gives (1−γ)∥Q∥ ∞ −R max ≤ √ 2c. Therefore, one gets the desired result

work page

[16] [16]

whereσ min denotes the minimum singular value

First of all, we can derive the following inequalities: dist(0, ∂f(Q)) = min β∈conv{Λ(Q)} (I−γPΠ β) ⊤ (T Q−Q) 2 ≥min β∈conv{Λ(Q)} σmin((I−γPΠ β)⊤)∥T Q−Q∥ 2. whereσ min denotes the minimum singular value. Then, we obtain f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≤ 1 σmin((I−γPΠ β)⊤) 2 dist(0, ∂f(Q)) 2. Since minQ∈R|S×A| f(Q) =f ∗ = 0, this is exactly the desired bound. 28 Th...

work page 2004

[17] [17]

Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator

The surface off(Q) is illustrated in Figure 3. Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator. For a (possibly stochastic) policyβ, we can represent Πβ =β(·|1) ⊤ ⊗e ⊤ s = β(1|1) 1−β(1|1) , β(1|1)∈[0,1], wheree s ∈R |S| is the standard basis vector where itss-th eleme...

work page

[18] [18]

Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ)

This completes the proof. Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ). Then, we have (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 1)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , 0≤f(θ ∗ 1)−f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 − (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ...

work page

[19] [19]

= (1 +γ) 2|S × S| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S×A| Qθ∗ 2 −Q ∗ 2 2 =q1(θ∗ 2)≤f(θ ∗ 2)≤q 2(θ∗

work page

[20] [20]

The first inequality above proves the first statement

= (1 +γ) 2|S × A| 2 Qθ∗ 2 −Q ∗ 2 2 , respectively. The first inequality above proves the first statement. Moreover, sincef(θ ∗ 2)≤f(θ ∗ 1) and ΓΦ|ΦQ∗ −Q ∗ 2 2 = Qθ∗ 1 −Q ∗ 2 2 ≤ Qθ∗ 2 −Q ∗ 2 2, it follows that (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤ (1−γ) 2 2|S × A| Qθ∗ 2 −Q ∗ 2 2 ≤f(θ ∗ 2) ≤f(θ ∗ 1) ≤(1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 ,(13) which leads to (1−γ...

work page

[21] [21]

The Clark subdifferential offgiven by∂f(Q) ={Φ ⊤(γPΠ β −I) ⊤(T Qθ−Qθ) :β∈conv(Λ(Q θ))} is nonempty, convex, and bounded

work page

[22] [22]

We have∥T Q θ∥∞ ≤R max +γ∥Q θ∥∞, and∥T Q θ −Q θ∥∞ ≥(1−γ)∥Q θ∥∞ −R max

work page

[23] [23]

5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ

Every sublevel setL c :={θ∈R m :f(θ)≤c}is bounded for anyc≥0. 5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ . The proofs of the above properties are identical to those in the tabular case; therefore, we do not repeat them here. The PL-type inequality that holds in the tabular case does not, in general, hold when linear function approximation is used. Lemma A.2...

work page

[24] [24]

Therefore, there exists a unique stationary point (minimizer) in Bε, and it is given byθ ∗ = arg minθ∈Rm 1 2 T π∗ Qθ −Q θ 2

work page

[25] [25]

The last bound can be readily obtained using Lemma A.8. Theorem A.8.Let(θ k)k≥0 be generated by θk+1 =θ k −α kgk, whereg k ∈arg min g∈∂f(θ k) ∥g∥2 = Proj ∂f(θ k)(0)andα k >0a step-size generated by backtracking search with Armijo rule (Boyd and Vandenberghe, 2004). Then, the sequence(θ k)k≥0 ⊂ L c with c=f(Q 0)admits at least one limit point ¯θ. Every lim...

work page 2004

[26] [26]

This completes the proof

Since (γPΠ π ¯Q −I) is invertible from Lemma A.3, we haveF λ ¯Q= ¯Q. This completes the proof. Lemma A.29.The Hessian offis given by ∇2 Qf(Q) := (γPΠ πQ −I)(γPΠ πQ −I) ⊤ + γ λ   Ω1(Q) 0· · ·0 0 Ω 2(Q) ... ... ... ... ... 0 0· · ·0 Ω |S|(Q)   where Ωs(Q) :=ms diag(πQ(·|s))−π Q(·|s)πQ(·|s)⊤ diag(πQ(·|s)) :=   πQ(1|s) 0· · ·0 0π Q(2|s) ... ...

work page 1976