pith. sign in

arxiv: 2601.18840 · v3 · submitted 2026-01-26 · 💻 cs.LG · cs.SY· eess.SY

Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

Pith reviewed 2026-05-16 10:57 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SY
keywords Bellman residualpolicy optimizationMarkov decision processconvergencestationarityfunction approximationreinforcement learning
0
0 comments X

The pith

Bellman residual minimization for control establishes geometric properties, stationarity of solutions, and convergence guarantees for policy optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to lay the theoretical groundwork for applying Bellman residual minimization to policy optimization in Markov decision processes, rather than limiting it to policy evaluation. This involves directly minimizing the squared Bellman residual to find good policies. A reader might care because this method can offer more stable convergence when approximating value functions, which is common in practical reinforcement learning. The focus on control tasks addresses an area that has seen less study compared to evaluation.

Core claim

The authors show that the Bellman residual minimization objective for control has a specific geometry whose stationary points correspond to optimal solutions, and that iterative minimization converges to the optimal value function.

What carries the argument

The squared Bellman residual objective minimized jointly over value functions and policies.

If this is right

  • Stationary points of the residual objective satisfy the optimality conditions of the MDP.
  • The minimization converges reliably even with function approximation under regularity conditions.
  • It provides an alternative optimization path for control that avoids some instabilities of dynamic programming.
  • Foundational analysis enables further development of residual-based control algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be combined with deep neural networks for large-scale control problems.
  • It might explain why some residual methods perform well in practice despite theoretical gaps.
  • Extensions to continuous state spaces could be tested using the same geometric arguments.

Load-bearing premise

Standard assumptions on the MDP hold, such as finite state-action spaces or suitable regularity conditions for any function approximation used.

What would settle it

Demonstrating an MDP example where the residual minimizer is not a stationary policy or where the algorithm diverges.

Figures

Figures reproduced from arXiv: 2601.18840 by Donghwan Lee, Hyukjun Yang.

Figure 1
Figure 1. Figure 1: Illustration of the oblique projection Theorem 2. The stationary point ¯θ with 0 ∈ ∂f( ¯θ) satisfies the OP-CBE, Qθ¯ = ΓΦ|Ψβ¯ T Qθ¯, where Ψβ¯ = (γPΠβ¯ − I)Φ and β¯ ∈ conv(Λ(Qθ¯)) is given in (3). Since a stationary point always exists, the OP-CBE always admits at least one solution (Lemma A.25), although the solution is not unique in general. The solution exists even when the composite oper￾ator ΓΦ|Ψβ T i… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of errors between the gradient descent on the SCBR (blue) and the P [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The surface of f(Q) Non-differentiability occurs precisely on the tie set {Q ∈ R |S×A| : Q(1, 1) = Q(1, 2)} because of the max operator. For a (possibly stochastic) policy β, we can represent Π β = β(·|1)⊤ ⊗ e ⊤ s = [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Convergence trajectory of Qk with the subgradient descent method [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: CBR objective function value f(Qk) with Qk generated with the subgradient descent method. A.4 CBR with linear function approximation Let Θ denote the set of deterministic policies π : S → A, so |Θ| = |A||S| < ∞. Let us consider the CBR objective with LFA f(θ) := 1 2 ∥TΦθ − Φθ∥ 2 2 , Moreover, let us define the set Sπ := {θ ∈ R m : π(s) ∈ arg maxa∈AQθ(s, a), ∀(s, a) ∈ S × A} for each π ∈ Θ, where arg max is… view at source ↗
Figure 6
Figure 6. Figure 6: The graph of the CBR objective, f(θ) = 1 2 ∥T(Φθ) − Φθ∥ 2 2 . then we can represent Π β = β(·|1)⊤ ⊗ e ⊤ s = [PITH_FULL_IMAGE:figures/full_fig_p041_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The evolution of the iterates of the subgradient descent method converging to the [PITH_FULL_IMAGE:figures/full_fig_p044_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The evolution of the SCBR objective function value over iterations on a log scale. [PITH_FULL_IMAGE:figures/full_fig_p053_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The convergence of Qk(1, 1) and Qk(1, 2) to Q∗ λ (1, 1) and Q∗ λ (1, 2), respectively, where the dashed lines indicate the optimal Q-values. Proof. First of all, we have ∥Qθ − FλQθ∥∞ =∥Qθ − Q ∗ λ + FλQ ∗ λ − FλQθ∥∞ ≥∥Qθ − Q ∗ λ∥∞ − ∥FλQ ∗ λ − FλQθ∥∞ ≥∥Qθ − Q ∗ λ∥∞ − γ∥Q ∗ λ − Qθ∥∞ =(1 − γ)∥Qθ − Q ∗ λ∥∞ ≥(1 − γ) 1 p |S × A| ∥Qθ − Q ∗ λ∥2 , where the second line is due to the reverse triangle inequality and … view at source ↗
Figure 10
Figure 10. Figure 10: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = 0.2897 60 [PITH_FULL_IMAGE:figures/full_fig_p060_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = −0.2897 [PITH_FULL_IMAGE:figures/full_fig_p061_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: This figure illustrates R(Φ), Fλ(Φ¯θ), R(Ψβ¯), and the corresponding projected point for the stationary point ¯θ = 0 The first plot of ?? shows the evolution of the objective function value f when gradient descent is applied to the objective function in this example. We observe that the objective value decreases and converges to a constant. The second plot of ?? illustrates the convergence of θk under gra… view at source ↗
Figure 13
Figure 13. Figure 13: (1) The evolution of the objective function value [PITH_FULL_IMAGE:figures/full_fig_p062_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Deep learning experiments across five different tasks from the [PITH_FULL_IMAGE:figures/full_fig_p063_14.png] view at source ↗
read the original abstract

Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript establishes foundational results for Bellman residual minimization applied to policy optimization (control) in Markov decision processes. It analyzes the geometry of the squared Bellman residual objective, derives stationarity conditions linking residual minimization to optimality, and provides convergence guarantees, contrasting this approach with dynamic programming while noting advantages for stable convergence under function approximation.

Significance. If the derivations hold, the work supplies needed theoretical grounding for an under-explored method in control tasks, complementing policy evaluation results and potentially supporting more stable RL algorithms with approximation. The geometry and stationarity analysis offers concrete insight into the objective landscape that is not available from standard DP theory alone.

minor comments (2)
  1. [§2] The transition from policy evaluation to control residual definitions in the early sections would benefit from an explicit side-by-side equation comparison to highlight the added policy dependence.
  2. [Notation] Notation for the control Bellman operator T^π versus the optimal T^* is introduced clearly but could be reinforced with a short table of symbols for readers less familiar with the control setting.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our manuscript and for recommending acceptance. We are pleased that the foundational geometric, stationarity, and convergence results for Bellman residual minimization in control are recognized as providing useful theoretical grounding that complements existing dynamic programming theory.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper positions its foundational results for control Bellman residual minimization as building directly on standard MDP theory and dynamic programming, with geometry, stationarity, and convergence arguments derived under finite state-action spaces or regularity conditions for function approximation. No equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims retain independent content from the stated assumptions without renaming known results or smuggling ansatzes. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard MDP assumptions such as the existence of value functions and Bellman operators; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Standard Markov decision process assumptions including discounted infinite-horizon setting and existence of optimal policies
    Invoked implicitly as background for all Bellman residual analysis.

pith-pipeline@v0.9.0 · 5422 in / 970 out tokens · 19138 ms · 2026-05-16T10:57:14.121274+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem

    cs.LG 2026-04 unverdicted novelty 6.0

    Soft Bellman residual minimization with weighted Lp-norm aligns the objective with Bellman contraction as p increases and yields performance error bounds.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper

  1. [1]

    In particular, ifg̸= 0, then−gis a strict descent direction. 21 Proof.Since∂f(x) is closed and convex, the projection optimality condition (Bertsekas et al., 2003, Proposition 2.2.1, page 88) gives ⟨v−g,0−g⟩ ≤0∀v∈∂f(x), equivalently, it can be rewritten as follows: ⟨v, g⟩ ≥ ∥g∥ 2 ∀v∈∂f(x). Letd=−g. Then for allv∈∂f(x), ⟨v, d⟩=−⟨v, g⟩ ≤ −∥g∥ 2. Using the s...

  2. [2]

    The following result establishes the convergence of the algorithm under the above assumptions

    The initial sublevel setL c :={x:f(x)≤c}withc=f(x 0)is bounded (hence compact). The following result establishes the convergence of the algorithm under the above assumptions. Theorem A.1.Let(x k)k≥0 be generated by Algorithm 1 under Assumption A.1. Then, we guar- antee the following results:

  3. [3]

    (Monotone decrease)f(x k+1)≤f(x k)for allk≥0, and(f(x k))k≥0 converges

  4. [4]

    (Sufficient decrease) For allk≥0, f(x k+1)≤f(x k)−σα k∥gk∥2 2,(11) and hence P∞ k=0 αk∥gk∥2 2 <∞

  5. [5]

    Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point

    (Stationarity of limit points) The sequence{x k} ⊂ L c withc=f(x 0)admits at least one limit point. Every limit point¯xsatisfies 0∈∂f(¯x), i.e.,¯xis a stationary point. Proof.Let us fixk≥0. Ifg k = 0, then 0∈∂f(x k) and the algorithm stops at a stationary point. Therefore, let us assumeg k ̸= 0. By Lemma A.18,f ◦(xk;d k) =f ◦(xk;−g k) =−∥g k∥2 2 <0. Moreo...

  6. [6]

    The Clark subdifferential offgiven by∂f(Q) ={(γPΠ β −I) ⊤(T Q−Q) :β∈conv(Λ(Q))} is nonempty, convex, and bounded

  7. [7]

    5.fis coercive

    We have∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞, and∥T Q−Q∥ ∞ ≥(1−γ)∥Q∥ ∞ −R max. 5.fis coercive

  8. [8]

    7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

    Every sublevel set,L c :={Q∈R |S||A| :f(Q)≤c}, is bounded for anyc≥0. 7.Q∈ L c implies∥Q∥ ∞ ≤ Rmax+ √ 2c 1−γ

  9. [9]

    There existsµ >0such that for allQ, f(Q)−f ∗ ≤ 1 2µ dist(0, ∂f(Q)) 2 , wheref ∗ := minQ∈R|S×A| f(Q)anddist(0, ∂f(Q)) := inf{∥g∥ 2 :g∈∂f(Q)}. Proof.1. Sincefis a nonnegative sum of squares. Therefore,fis bounded below by 0. 27

  10. [10]

    Therefore, it is locally Lipschitz by Lemma A.19

    Moreover, by Proposition A.1,fis piecewise quadratic with finitely many pieces. Therefore, it is locally Lipschitz by Lemma A.19

  11. [11]

    Sincefis locally Lipschitz by the second statement, the desired conclusion is directly obtained using Lemma A.9

  12. [12]

    This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

    For any (s, a)∈ S × A, we can derive the following bounds: |(T Q)(s, a)| ≤ |R(s, a)|+γ X s′∈S P(s ′|s, a)max a′∈A |Q(s′, a′)| ≤R max +γ∥Q∥ ∞. This implies∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞. Moreover, we have ∥T Q−Q∥ ∞ ≥ ∥Q∥ ∞ − ∥T Q∥∞ ≥(1−γ)∥Q∥ ∞ −R max, where the last inequality uses∥T Q∥ ∞ ≤R max +γ∥Q∥ ∞

  13. [13]

    Using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Hencef(Q)→ ∞as∥Q∥ ∞ → ∞, i.e.,fis coercive

  14. [14]

    The coercivity in the fifth statement implies that the level sets are bounded

  15. [15]

    Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c

    For the last statement, using the bounds in the fourth statement, we have f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≥ 1 2 ∥T Q−Q∥ 2 ∞ ≥ 1 2 {(1−γ)∥Q∥ ∞ −R max}2 Fixc≥0 and letQ∈ L c, i.e.,f(Q)≤c. Then, it follows that 1 2 {(1−γ)∥Q∥ ∞ −R max}2 ≤c. Taking square roots gives (1−γ)∥Q∥ ∞ −R max ≤ √ 2c. Therefore, one gets the desired result

  16. [16]

    whereσ min denotes the minimum singular value

    First of all, we can derive the following inequalities: dist(0, ∂f(Q)) = min β∈conv{Λ(Q)} (I−γPΠ β) ⊤ (T Q−Q) 2 ≥min β∈conv{Λ(Q)} σmin((I−γPΠ β)⊤)∥T Q−Q∥ 2. whereσ min denotes the minimum singular value. Then, we obtain f(Q) = 1 2 ∥T Q−Q∥ 2 2 ≤ 1 σmin((I−γPΠ β)⊤) 2 dist(0, ∂f(Q)) 2. Since minQ∈R|S×A| f(Q) =f ∗ = 0, this is exactly the desired bound. 28 Th...

  17. [17]

    Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator

    The surface off(Q) is illustrated in Figure 3. Figure 3: The surface off(Q) Non-differentiability occurs precisely on the tie set{Q∈R |S×A| :Q(1,1) =Q(1,2)}because of the max operator. For a (possibly stochastic) policyβ, we can represent Πβ =β(·|1) ⊤ ⊗e ⊤ s = β(1|1) 1−β(1|1) , β(1|1)∈[0,1], wheree s ∈R |S| is the standard basis vector where itss-th eleme...

  18. [18]

    Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ)

    This completes the proof. Proposition A.5.Let us define θ∗ 1 := arg minθ∈Rm ∥Qθ −Q ∗∥2, θ ∗ 2 := arg minθ∈Rm f(θ). Then, we have (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 1)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , 0≤f(θ ∗ 1)−f(θ ∗ 2)≤ (1 +γ) 2|S × A| 2 − (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ...

  19. [19]

    = (1 +γ) 2|S × S| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 , (1−γ) 2 2|S×A| Qθ∗ 2 −Q ∗ 2 2 =q1(θ∗ 2)≤f(θ ∗ 2)≤q 2(θ∗

  20. [20]

    The first inequality above proves the first statement

    = (1 +γ) 2|S × A| 2 Qθ∗ 2 −Q ∗ 2 2 , respectively. The first inequality above proves the first statement. Moreover, sincef(θ ∗ 2)≤f(θ ∗ 1) and ΓΦ|ΦQ∗ −Q ∗ 2 2 = Qθ∗ 1 −Q ∗ 2 2 ≤ Qθ∗ 2 −Q ∗ 2 2, it follows that (1−γ) 2 2|S × A| ΓΦ|ΦQ∗ −Q ∗ 2 2 ≤ (1−γ) 2 2|S × A| Qθ∗ 2 −Q ∗ 2 2 ≤f(θ ∗ 2) ≤f(θ ∗ 1) ≤(1 +γ) 2|S × A| 2 ΓΦ|ΦQ∗ −Q ∗ 2 2 ,(13) which leads to (1−γ...

  21. [21]

    The Clark subdifferential offgiven by∂f(Q) ={Φ ⊤(γPΠ β −I) ⊤(T Qθ−Qθ) :β∈conv(Λ(Q θ))} is nonempty, convex, and bounded

  22. [22]

    We have∥T Q θ∥∞ ≤R max +γ∥Q θ∥∞, and∥T Q θ −Q θ∥∞ ≥(1−γ)∥Q θ∥∞ −R max

  23. [23]

    5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ

    Every sublevel setL c :={θ∈R m :f(θ)≤c}is bounded for anyc≥0. 5.θ∈ L c implies ∥Qθ∥∞ ≤ Rmax + √ 2c 1−γ . The proofs of the above properties are identical to those in the tabular case; therefore, we do not repeat them here. The PL-type inequality that holds in the tabular case does not, in general, hold when linear function approximation is used. Lemma A.2...

  24. [24]

    Therefore, there exists a unique stationary point (minimizer) in Bε, and it is given byθ ∗ = arg minθ∈Rm 1 2 T π∗ Qθ −Q θ 2

  25. [25]

    The last bound can be readily obtained using Lemma A.8. Theorem A.8.Let(θ k)k≥0 be generated by θk+1 =θ k −α kgk, whereg k ∈arg min g∈∂f(θ k) ∥g∥2 = Proj ∂f(θ k)(0)andα k >0a step-size generated by backtracking search with Armijo rule (Boyd and Vandenberghe, 2004). Then, the sequence(θ k)k≥0 ⊂ L c with c=f(Q 0)admits at least one limit point ¯θ. Every lim...

  26. [26]

    This completes the proof

    Since (γPΠ π ¯Q −I) is invertible from Lemma A.3, we haveF λ ¯Q= ¯Q. This completes the proof. Lemma A.29.The Hessian offis given by ∇2 Qf(Q) := (γPΠ πQ −I)(γPΠ πQ −I) ⊤ + γ λ   Ω1(Q) 0· · ·0 0 Ω 2(Q) ... ... ... ... ... 0 0· · ·0 Ω |S|(Q)   where Ωs(Q) :=ms diag(πQ(·|s))−π Q(·|s)πQ(·|s)⊤ diag(πQ(·|s)) :=   πQ(1|s) 0· · ·0 0π Q(2|s) ... ...