Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence
Pith reviewed 2026-05-16 10:57 UTC · model grok-4.3
The pith
Bellman residual minimization for control establishes geometric properties, stationarity of solutions, and convergence guarantees for policy optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that the Bellman residual minimization objective for control has a specific geometry whose stationary points correspond to optimal solutions, and that iterative minimization converges to the optimal value function.
What carries the argument
The squared Bellman residual objective minimized jointly over value functions and policies.
If this is right
- Stationary points of the residual objective satisfy the optimality conditions of the MDP.
- The minimization converges reliably even with function approximation under regularity conditions.
- It provides an alternative optimization path for control that avoids some instabilities of dynamic programming.
- Foundational analysis enables further development of residual-based control algorithms.
Where Pith is reading between the lines
- This framework could be combined with deep neural networks for large-scale control problems.
- It might explain why some residual methods perform well in practice despite theoretical gaps.
- Extensions to continuous state spaces could be tested using the same geometric arguments.
Load-bearing premise
Standard assumptions on the MDP hold, such as finite state-action spaces or suitable regularity conditions for any function approximation used.
What would settle it
Demonstrating an MDP example where the residual minimizer is not a stationary policy or where the algorithm diverges.
Figures
read the original abstract
Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes foundational results for Bellman residual minimization applied to policy optimization (control) in Markov decision processes. It analyzes the geometry of the squared Bellman residual objective, derives stationarity conditions linking residual minimization to optimality, and provides convergence guarantees, contrasting this approach with dynamic programming while noting advantages for stable convergence under function approximation.
Significance. If the derivations hold, the work supplies needed theoretical grounding for an under-explored method in control tasks, complementing policy evaluation results and potentially supporting more stable RL algorithms with approximation. The geometry and stationarity analysis offers concrete insight into the objective landscape that is not available from standard DP theory alone.
minor comments (2)
- [§2] The transition from policy evaluation to control residual definitions in the early sections would benefit from an explicit side-by-side equation comparison to highlight the added policy dependence.
- [Notation] Notation for the control Bellman operator T^π versus the optimal T^* is introduced clearly but could be reinforced with a short table of symbols for readers less familiar with the control setting.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of our manuscript and for recommending acceptance. We are pleased that the foundational geometric, stationarity, and convergence results for Bellman residual minimization in control are recognized as providing useful theoretical grounding that complements existing dynamic programming theory.
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper positions its foundational results for control Bellman residual minimization as building directly on standard MDP theory and dynamic programming, with geometry, stationarity, and convergence arguments derived under finite state-action spaces or regularity conditions for function approximation. No equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims retain independent content from the stated assumptions without renaming known results or smuggling ansatzes. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard Markov decision process assumptions including discounted infinite-horizon setting and existence of optimal policies
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
f(θ)=½∥T Qθ−Qθ∥²₂ ... Clarke subdifferential ... oblique projected control Bellman equation (OP-CBE)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
piecewise quadratic ... homogeneous half-space convex cone ... generalized gradient descent
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem
Soft Bellman residual minimization with weighted Lp-norm aligns the objective with Bellman contraction as p increases and yields performance error bounds.
Reference graph
Works this paper leans on
-
[1]
In particular, ifg̸= 0, then−gis a strict descent direction. 21 Proof.Since∂f(x) is closed and convex, the projection optimality condition (Bertsekas et al., 2003, Proposition 2.2.1, page 88) gives ⟨v−g,0−g⟩ ≤0∀v∈∂f(x), equivalently, it can be rewritten as follows: ⟨v, g⟩ ≥ ∥g∥ 2 ∀v∈∂f(x). Letd=−g. Then for allv∈∂f(x), ⟨v, d⟩=−⟨v, g⟩ ≤ −∥g∥ 2. Using the s...
work page 2003
-
[2]
The following result establishes the convergence of the algorithm under the above assumptions
The initial sublevel setL c :={x:f(x)≤c}withc=f(x 0)is bounded (hence compact). The following result establishes the convergence of the algorithm under the above assumptions. Theorem A.1.Let(x k)k≥0 be generated by Algorithm 1 under Assumption A.1. Then, we guar- antee the following results:
-
[3]
(Monotone decrease)f(x k+1)≤f(x k)for allk≥0, and(f(x k))k≥0 converges
-
[4]
(Sufficient decrease) For allk≥0, f(x k+1)≤f(x k)−σα k∥gk∥2 2,(11) and hence P∞ k=0 αk∥gk∥2 2 <∞
-
[5]
Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point
(Stationarity of limit points) The sequence{x k} ⊂ L c withc=f(x 0)admits at least one limit point. Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point. Proof.Let us fixk≥0. Ifg k = 0, then 0∈∂f(x k) and the algorithm stops at a stationary point. Therefore, let us assumeg k ̸= 0. By Lemma A.18,f ◦(xk;d k) =f ◦(xk;−g k) =−∥g k∥2 2 <0. Moreo...
work page 1976
-
[6]
The Clark subdifferential offgiven by∂f(Q) ={(γPΠ β −I) ⊤(T Q−Q) :β∈conv(Λ(Q))} is nonempty, convex, and bounded
-
[7]
We have∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞, and∥T Q−Q∥ ∞ ≥(1−γ)∥Q∥ ∞ −R max. 5.fis coercive
-
[8]
7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ
Every sublevel set,L c :={Q∈R |S||A| :f(Q)≤c}, is bounded for anyc≥0. 7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ
-
[9]
There existsµ >0such that for allQ, f(Q)−f ∗ ≤ 1 2µ dist(0, ∂f(Q)) 2 , wheref ∗ := minQ∈R|S×A| f(Q)anddist(0, ∂f(Q)) := inf{∥g∥ 2 :g∈∂f(Q)}. Proof.1. Sincefis a nonnegative sum of squares. Therefore,fis bounded below by 0. 27
-
[10]
Therefore, it is locally Lipschitz by Lemma A.19
Moreover, by Proposition A.1,fis piecewise quadratic with finitely many pieces. Therefore, it is locally Lipschitz by Lemma A.19
-
[11]
Sincefis locally Lipschitz by the second statement, the desired conclusion is directly obtained using Lemma A.9
-
[12]
This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞
For any (s, a)∈ S × A, we can derive the following bounds: |(T Q)(s, a)| ≤ |R(s, a)|+γ X s′∈S P(s ′|s, a)max a′∈A |Q(s′, a′)| ≤R max +γ∥Q∥ ∞. This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞. Moreover, we have ∥T Q−Q∥ ∞ ≥ ∥Q∥ ∞ − ∥T Q∥∞ ≥(1−γ)∥Q∥ ∞ −R max, where the last inequality uses∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞
-
[13]
Using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Hencef(Q)→ ∞as∥Q∥ ∞ → ∞, i.e.,fis coercive
-
[14]
The coercivity in the fifth statement implies that the level sets are bounded
-
[15]
Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c
For the last statement, using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Fixc≥0 and letQ∈ L c, i.e.,f(Q)≤c. Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c. Taking square roots gives (1−γ)∥Q∥ ∞ −R max ≤ √ 2c. Therefore, one gets the desired result
-
[16]
whereσ min denotes the minimum singular value
First of all, we can derive the following inequalities: dist(0, ∂f(Q)) = min β∈conv{Λ(Q)} (I−γPΠ β) ⊤ (T Q−Q) 2 ≥min β∈conv{Λ(Q)} σmin((I−γPΠ β)⊤)∥T Q−Q∥ 2. whereσ min denotes the minimum singular value. Then, we obtain f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≤ 1 σmin((I−γPΠ β)⊤) 2 dist(0, ∂f(Q)) 2. Since minQ∈R|S×A| f(Q) =f ∗ = 0, this is exactly the desired bound. 28 Th...
work page 2004
-
[17]
The surface off(Q) is illustrated in Figure 3. Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator. For a (possibly stochastic) policyβ, we can represent Πβ =β(·|1) ⊤ ⊗e ⊤ s = β(1|1) 1−β(1|1) , β(1|1)∈[0,1], wheree s ∈R |S| is the standard basis vector where itss-th eleme...
-
[18]
Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ)
This completes the proof. Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ). Then, we have (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 1)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , 0≤f(θ ∗ 1)−f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 − (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ...
-
[19]
= (1 +γ) 2|S × S| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S×A| Qθ∗ 2 −Q ∗ 2 2 =q1(θ∗ 2)≤f(θ ∗ 2)≤q 2(θ∗
-
[20]
The first inequality above proves the first statement
= (1 +γ) 2|S × A| 2 Qθ∗ 2 −Q ∗ 2 2 , respectively. The first inequality above proves the first statement. Moreover, sincef(θ ∗ 2)≤f(θ ∗ 1) and ΓΦ|ΦQ∗ −Q ∗ 2 2 = Qθ∗ 1 −Q ∗ 2 2 ≤ Qθ∗ 2 −Q ∗ 2 2, it follows that (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤ (1−γ) 2 2|S × A| Qθ∗ 2 −Q ∗ 2 2 ≤f(θ ∗ 2) ≤f(θ ∗ 1) ≤(1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 ,(13) which leads to (1−γ...
-
[21]
The Clark subdifferential offgiven by∂f(Q) ={Φ ⊤(γPΠ β −I) ⊤(T Qθ−Qθ) :β∈conv(Λ(Q θ))} is nonempty, convex, and bounded
-
[22]
We have∥T Q θ∥∞ ≤R max +γ∥Q θ∥∞, and∥T Q θ −Q θ∥∞ ≥(1−γ)∥Q θ∥∞ −R max
-
[23]
5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ
Every sublevel setL c :={θ∈R m :f(θ)≤c}is bounded for anyc≥0. 5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ . The proofs of the above properties are identical to those in the tabular case; therefore, we do not repeat them here. The PL-type inequality that holds in the tabular case does not, in general, hold when linear function approximation is used. Lemma A.2...
-
[24]
Therefore, there exists a unique stationary point (minimizer) in Bε, and it is given byθ ∗ = arg minθ∈Rm 1 2 T π∗ Qθ −Q θ 2
-
[25]
The last bound can be readily obtained using Lemma A.8. Theorem A.8.Let(θ k)k≥0 be generated by θk+1 =θ k −α kgk, whereg k ∈arg min g∈∂f(θ k) ∥g∥2 = Proj ∂f(θ k)(0)andα k >0a step-size generated by backtracking search with Armijo rule (Boyd and Vandenberghe, 2004). Then, the sequence(θ k)k≥0 ⊂ L c with c=f(Q 0)admits at least one limit point ¯θ. Every lim...
work page 2004
-
[26]
Since (γPΠ π ¯Q −I) is invertible from Lemma A.3, we haveF λ ¯Q= ¯Q. This completes the proof. Lemma A.29.The Hessian offis given by ∇2 Qf(Q) := (γPΠ πQ −I)(γPΠ πQ −I) ⊤ + γ λ Ω1(Q) 0· · ·0 0 Ω 2(Q) ... ... ... ... ... 0 0· · ·0 Ω |S|(Q) where Ωs(Q) :=ms diag(πQ(·|s))−π Q(·|s)πQ(·|s)⊤ diag(πQ(·|s)) := πQ(1|s) 0· · ·0 0π Q(2|s) ... ...
work page 1976
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.