Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

Lars van der Laan; Nathan Kallus

arxiv: 2512.23927 · v2 · submitted 2025-12-30 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-16 19:50 UTC · model grok-4.3

keywords soft FQI · stationary reweighting · local linear convergence · offline reinforcement learning · Bellman completeness · function approximation · finite sample bounds · softmax policy

The pith

Stationary reweighting in soft fitted Q-iteration produces local linear convergence to the projected fixed point without Bellman completeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard soft fitted Q-iteration can become unstable under function approximation because the regression step projects in the behavior-distribution norm while the soft Bellman operator contracts in a different norm. The paper shows that, near the soft-optimal fixed point, the soft Bellman operator agrees to first order with the policy-evaluation operator for the soft-optimal policy, which contracts in that policy's stationary state-action norm. To exploit this, stationary-reweighted soft FQI reweights each regression step toward the stationary distribution induced by the current softmax policy. Under approximate realizability and controlled weighting error, this yields finite-sample local linear convergence, with a bound that separates statistical error from geometrically damped weight-estimation error. The analysis also shows that plain soft FQI is locally stable under on-policy stationary sampling and frames temperature annealing as a way to reach the contraction region.
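The reweighting step described above can be sketched at the population level. This is a minimal illustration, not the paper's estimator: it assumes a known tabular model (`P`, `R`), linear features `Phi`, and an exact stationary distribution obtained by power iteration, whereas the offline setting would estimate the weights from behavior data. All function names are hypothetical.

```python
import numpy as np

def soft_value(Q, tau):
    """Soft state value tau * log sum_a exp(Q(s, a) / tau), computed stably."""
    m = Q.max(axis=1)
    return m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))

def softmax_policy(Q, tau):
    """Softmax policy pi(a | s) proportional to exp(Q(s, a) / tau)."""
    z = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    return z / z.sum(axis=1, keepdims=True)

def stationary_state_action(P, pi, iters=500):
    """Stationary distribution over (s, a) under pi, by power iteration.
    Assumption: the induced chain mixes, so power iteration converges."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P_pi
    return d[:, None] * pi                   # mu(s, a) = d(s) * pi(a | s)

def reweighted_soft_fqi_step(theta, Phi, P, R, gamma, tau):
    """One step: soft Bellman target, then mu-weighted least squares onto Phi."""
    S, A = R.shape
    Q = (Phi @ theta).reshape(S, A)
    target = R + gamma * (P @ soft_value(Q, tau))            # shape (S, A)
    mu = stationary_state_action(P, softmax_policy(Q, tau)).ravel()
    sw = np.sqrt(mu)                                         # sqrt-weights for weighted LS
    theta_new, *_ = np.linalg.lstsq(sw[:, None] * Phi, sw * target.ravel(), rcond=None)
    return theta_new
```

With identity features the weighted projection is exact, so iterating the step reduces to soft value iteration; the reweighting only matters once `Phi` is a restricted function class.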

Core claim

Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy and therefore contracts in the policy's stationary state-action norm. Stationary-reweighted soft FQI reweights regression targets toward this stationary distribution at each step. Under approximate realizability and controlled weighting error the algorithm converges linearly in finite samples to the projected fixed point, with the weight-estimation error damped geometrically by the contraction factor.
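The first-order alignment in this claim can be checked with a short derivation. The sketch below assumes the standard entropy-regularized form of the soft Bellman operator; the symbols $\mathcal{T}_\tau$, $\pi_Q$, $P^{\pi}$ and $\mu_\pi$ are notation chosen here for illustration and may differ from the paper's.

```latex
% Assumed notation (illustrative, not taken verbatim from the paper):
%   \mathcal{T}_\tau : soft Bellman operator at temperature \tau
%   \pi_Q            : softmax policy induced by Q
%   P^{\pi}          : state-action transition operator under \pi
%   \mu_\pi          : stationary state-action distribution of \pi
\begin{align*}
(\mathcal{T}_\tau Q)(s,a)
  &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
     \Big[ \tau \log \sum_{a'} \exp\big(Q(s',a')/\tau\big) \Big], \\
\frac{\partial}{\partial Q(s',a')}\, \tau \log \sum_{b} \exp\big(Q(s',b)/\tau\big)
  &= \frac{\exp\big(Q(s',a')/\tau\big)}{\sum_{b} \exp\big(Q(s',b)/\tau\big)}
   = \pi_Q(a' \mid s'), \\
D\mathcal{T}_\tau(Q)[h]
  &= \gamma\, P^{\pi_Q} h,
  \qquad (P^{\pi} h)(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
     \sum_{a'} \pi(a' \mid s')\, h(s',a'), \\
\| P^{\pi} h \|_{2,\mu_\pi}^2
  &\le \mathbb{E}_{(s,a) \sim \mu_\pi}\big[ (P^{\pi} h^2)(s,a) \big]
   = \| h \|_{2,\mu_\pi}^2 .
\end{align*}
```

The last line uses Jensen's inequality followed by stationarity of $\mu_\pi$, so the linearization $\gamma P^{\pi_Q}$ has modulus at most $\gamma$ in $L^2(\mu_{\pi_Q})$. No such bound holds in a general behavior norm.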

What carries the argument

Stationary norm alignment: near the fixed point, the soft Bellman operator contracts in the stationary state-action norm of the softmax policy, and the algorithm exploits this by reweighting the fitted regression step toward that stationary distribution.
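A small numerical check makes the norm mismatch concrete. The two-state chain and the `behavior` distribution below are made up for illustration, and the chain is over states only, a simplification of the state-action setting. The helper computes the operator norm of `h -> P h` on a weighted L2 space as the spectral norm of `D^{1/2} P D^{-1/2}`.

```python
import numpy as np

def weighted_l2_operator_norm(P, nu):
    """Operator norm of h -> P h on L2(nu): sup ||P h||_{2,nu} / ||h||_{2,nu}."""
    s = np.sqrt(np.asarray(nu, dtype=float))
    # Spectral norm of D^{1/2} P D^{-1/2} with D = diag(nu).
    return float(np.linalg.norm((s[:, None] * P) / s[None, :], ord=2))

# Hypothetical two-state chain whose stationary distribution is (0.1, 0.9).
P = np.array([[0.1, 0.9],
              [0.1, 0.9]])
stationary = np.array([0.1, 0.9])
behavior = np.array([0.9, 0.1])   # assumed mismatched sampling distribution
gamma = 0.9

norm_stationary = weighted_l2_operator_norm(P, stationary)
norm_behavior = weighted_l2_operator_norm(P, behavior)
print(gamma * norm_stationary, gamma * norm_behavior)
```

In the stationary norm the discounted operator has modulus at most `gamma`; in the mismatched norm the modulus exceeds one for this example, so the same linear map is expansive.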

If this is right

Ordinary soft FQI is locally stable under on-policy stationary sampling even without Bellman completeness.
Temperature annealing acts as a continuation strategy to reach the local contraction region.
The finite-sample bound separates statistical error from geometrically damped weight-estimation error.
Local linear convergence holds to the projected fixed point under the stated conditions.
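The continuation idea can be illustrated with a warm-started temperature schedule. This sketch uses the exact tabular soft Bellman operator, which contracts globally in the sup norm, so it shows only the mechanics of the schedule and not the fitted setting where annealing matters. The geometric schedule, stage count, and function names are assumptions made here.

```python
import numpy as np

def soft_bellman(Q, P, R, gamma, tau):
    """Exact soft Bellman backup on a tabular MDP with P of shape (S, A, S)."""
    m = Q.max(axis=1)
    V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    return R + gamma * (P @ V)

def anneal(P, R, gamma, tau_start, tau_target, n_stages=8, inner_iters=200):
    """Solve at a high temperature first, then warm-start each cooler stage."""
    taus = np.geomspace(tau_start, tau_target, n_stages)  # assumed geometric schedule
    Q = np.zeros_like(R)
    for tau in taus:
        for _ in range(inner_iters):
            Q = soft_bellman(Q, P, R, gamma, tau)
    return Q, taus
```

Each stage starts from the previous stage's solution, so the iterate stays close to the current fixed point as the temperature decreases.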

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar norm-alignment ideas could stabilize other offline RL algorithms by iteratively adjusting the projection distribution to match the contraction norm.
The results suggest that global Bellman completeness can be replaced by local conditions around the solution in many value-based methods.
Practical implementations might benefit from adaptive reweighting schedules that improve as the policy approaches optimality.

Load-bearing premise

That weighting error remains controlled and approximate realizability holds in a neighborhood of the soft-optimal fixed point.

What would settle it

A numerical experiment in which the observed convergence rate deviates from linear when either the function class violates realizability near the fixed point or the reweighting error does not decrease geometrically.
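One way to run such a check is to fit a line to the log-errors and read off the per-iteration rate. The helper below is a hypothetical diagnostic, not something from the paper. It assumes the errors are strictly positive, and the fit should be restricted to iterations before any statistical error floor is reached.

```python
import numpy as np

def empirical_linear_rate(errors, burn_in=0):
    """Estimate rho in e_k ~ C * rho**k by least squares on log(e_k)."""
    e = np.asarray(errors, dtype=float)[burn_in:]
    k = np.arange(e.size)
    slope, _intercept = np.polyfit(k, np.log(e), 1)
    return float(np.exp(slope))

# Synthetic example: an exactly geometric error sequence.
errors = 2.0 * 0.7 ** np.arange(20)
rate = empirical_linear_rate(errors)
```

A fitted rate that drifts upward as more iterations are included, or that exceeds one, would be the kind of deviation from linear convergence the experiment is meant to detect.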

Figures

Figures reproduced from arXiv: 2512.23927 by Lars van der Laan, Nathan Kallus.

**Figure 2.** Soft fitted Q-iteration under severe norm mismatch on a Garnet MDP. Left: direct training at τ. Middle and right: temperature homotopy. Shaded regions denote the 25%–75% quantiles across random seeds, and the dashed vertical line marks when τ reaches its target value.

Original abstract

Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identify the stability mechanism that replaces it: local stationary norm alignment. Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy. This operator contracts in the policy's stationary state-action norm, whereas standard fitted regression projects Bellman targets in the behavior norm. This mismatch explains instability under distribution shift. We use this insight to develop stationary-reweighted soft FQI, which reweights each regression step toward the stationary distribution of the current softmax policy. Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. Our results also show that ordinary soft FQI is locally stable under on-policy stationary sampling, even without Bellman completeness, and explain temperature annealing as a continuation strategy for reaching a contraction region.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Reweighting regression targets to the current softmax policy's stationary distribution gives local linear convergence for soft FQI without Bellman completeness.

The main takeaway is that stationary reweighting aligns the fitted regression norm with the contraction norm of the soft Bellman operator near the fixed point. This produces a clean local linear rate that separates statistical error from geometrically damped weight-estimation error, and it holds even when the function class is not Bellman complete. The paper also shows that plain soft FQI is already locally stable under on-policy stationary sampling and frames temperature annealing as a practical continuation method for reaching the contraction region. These pieces are new relative to earlier FQI analyses, and the argument is self-contained enough to check.

Referee Report

2 major / 2 minor

Summary. The paper analyzes soft fitted Q-iteration (FQI) in offline RL without Bellman completeness. It identifies local stationary norm alignment as the key stability mechanism: near the soft-optimal fixed point, the soft Bellman operator matches the policy-evaluation operator for the soft-optimal policy and contracts in the policy's stationary state-action norm. The authors introduce stationary-reweighted soft FQI, which reweights regression steps to the current softmax policy's stationary distribution. Under approximate realizability and controlled weighting error, they prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. They also show local stability of ordinary soft FQI under on-policy stationary sampling and interpret temperature annealing as a continuation strategy to enter the contraction region.

Significance. If the local convergence result holds, the work offers a meaningful theoretical contribution to offline RL by replacing the strong Bellman-completeness requirement with a local alignment condition and providing an explicit error decomposition that isolates geometrically damped weight-estimation error. The insight into the norm mismatch between behavior and stationary distributions, together with the stability guarantee for on-policy soft FQI, supplies practical guidance for algorithm design and annealing schedules. The separation of error sources is a clear strength.

major comments (2)

[main convergence theorem / finite-sample analysis] In the main convergence theorem (likely Theorem 4.1 or equivalent in the finite-sample analysis), the geometric damping of weight-estimation error is established only under the assumption that the reweighting operator remains contractive near the soft-optimal fixed point. No explicit quantitative bound is supplied on the radius of the neighborhood (in policy or weight space) within which the damping constant stays strictly less than 1; without this, it is unclear whether a given continuation/annealing path is guaranteed to remain inside the basin.
[error decomposition / proof of local linear convergence] The separation of statistical error from weight-estimation error in the local linear rate relies on the weighting error remaining controlled independently of the realizability error. Under approximate realizability, however, the two error sources may couple through the current policy's stationary distribution; the manuscript does not derive a joint bound showing that this coupling preserves the claimed damping rate (see the error decomposition preceding the main theorem).

minor comments (2)

[preliminaries] The notation distinguishing the behavior distribution, the current softmax stationary distribution, and the soft-optimal stationary distribution could be made more uniform and introduced earlier in the preliminaries.
[figures] Figure 1 (schematic of norm alignment) would benefit from an explicit legend indicating which norm is used for each operator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important aspects of the local convergence analysis that we address below. We have revised the manuscript to strengthen the discussion of the contraction neighborhood and to clarify the error coupling in the decomposition.

Referee: In the main convergence theorem, the geometric damping of weight-estimation error is established only under the assumption that the reweighting operator remains contractive near the soft-optimal fixed point. No explicit quantitative bound is supplied on the radius of the neighborhood within which the damping constant stays strictly less than 1.

Authors: We agree that an explicit radius would be desirable for guaranteeing specific annealing paths. Deriving a fully quantitative, instance-independent bound requires stronger assumptions on the MDP and function class than our local analysis assumes. In the revision we have added a remark after Theorem 4.1 that characterizes the neighborhood size in terms of the temperature parameter, the Lipschitz constant of the softmax, and the realizability gap; this supplies practical guidance for continuation methods while acknowledging that the precise radius remains problem-dependent.

revision: partial

Referee: The separation of statistical error from weight-estimation error relies on the weighting error remaining controlled independently of the realizability error. Under approximate realizability the two may couple through the current policy's stationary distribution; the manuscript does not derive a joint bound showing that this coupling preserves the claimed damping rate.

Authors: The proof of local linear convergence (Section 4.2) already accounts for the coupling by expressing the weighting error as a function of the distance of the current policy to the fixed point. Because the stationary distribution changes continuously with the policy and the contraction holds inside the local ball, the realizability error is absorbed into the O(1) term of the linear rate without degrading the geometric damping factor. We have expanded the error decomposition paragraph preceding Theorem 4.1 to make this dependence explicit and to verify that the damping constant remains strictly less than one whenever the total error lies inside the neighborhood.

revision: yes

Circularity Check

0 steps flagged

No circularity: convergence to independently defined projected fixed point under external assumptions

The derivation establishes finite-sample local linear convergence of stationary-reweighted soft FQI to the projected fixed point under the stated assumptions of approximate realizability and controlled weighting error. The projected fixed point is defined via the soft Bellman operator independently of the reweighting parameters. The separation of statistical error from geometrically damped weight-estimation error follows from the local contraction property in the policy stationary norm, which is derived from the first-order equivalence to policy evaluation rather than from any fitted quantity or self-referential definition. No load-bearing step reduces the claimed rate or basin to a parameter estimated from the same data or to a prior self-citation. The analysis is self-contained against the external benchmarks of Bellman completeness and distribution shift.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions that are not derived in the paper: approximate realizability near the fixed point and bounded weighting error that damps geometrically. No free parameters or invented entities are introduced beyond standard RL objects.

axioms (2)

domain assumption: Approximate realizability of the soft Q-function in a neighborhood of the soft-optimal fixed point.
Invoked to ensure the projected fixed point exists and the local contraction holds.
domain assumption: Controlled weighting error whose effect damps geometrically across iterations.
Required to separate statistical error from weight-estimation error in the finite-sample bound.

pith-pipeline@v0.9.0 · 5508 in / 1436 out tokens · 30097 ms · 2026-05-16T19:50:17.473348+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1]

Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075.
[2]

Amortila, P., Foster, D. J., Jiang, N., Sekhari, A., and Xie, T. Harnessing density ratios for online reinforcement learning. arXiv preprint arXiv:2401.09681.
[3]

Bibaut, A., Petersen, M., Vlassis, N., Dimakopoulou, M., and van der Laan, M. Sequential causal inference in a single world of connected units. arXiv preprint arXiv:2101.07380.
[4]

Bibaut, A. F. and van der Laan, M. J. Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244.
[5]

Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
[6]

Glynn, P. W. and Henderson, S. G. Estimation of stationary densities for Markov chains. In 1998 Winter Simulation Conference. Proceedings (Cat. No. 98CH36274), volume 1, pp. 647–652. IEEE, 1998.
[7]

Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268. Elsevier, 1995.
[8]

Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
[9]

Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999,

[10]

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019a.

Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b...
[11]

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
[12]

van der Laan, L., Kallus, N., and Bibaut, A. Nonparametric instrumental variable inference with many weak instruments. arXiv preprint arXiv:2505.07729.
[13]

Van Der Vaart, A. and Wellner, J. A. A local maximal inequality under uniform entropy. Electronic Journal of Statistics, 5:192, 2011.
[14]

Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp. 1954–1964. PMLR.
[15]

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072.
[16]

A Towards Global Convergence via Temperature Homotopy. A.1 Overview of challenges. Global convergence via temperature homotopy. Our local linear convergence guarantees require that the initialization lie sufficiently inside the contraction region. A natural way to ensure this is to begin with a large softmax temperature $\tau$, for which the contraction radius $r_0(\tau)$ grows as $\tau^{1/\alpha}$ while the soft–hard bias scales only as $O(\tau)$...
[17]

This restores the viability of the homotopy path down to small temperatures. The next subsection formalizes the structural assumptions under which these refinements hold and sketches how our local theory can, in principle, be extended toward global behavior. A full theoretical and empirical development of temperature–homotopy schemes remains an interestin...
[18]

Under the assumption $Q^\star_{\mathcal F} = Q^\star$ we have $\varepsilon_{\mathcal F} = 0$, so Theorem 2 yields for any $Q^{(0)} \in B_{\mathcal F}(r, Q^\star)$ and all $k \ge 0$, $e_k = \|Q^{(k)} - Q^\star\|_{2,\mu^\star} \le \rho^k e_0$, where $\rho := \gamma + \beta_{\mathrm{loc}} r^\alpha$. By the local contraction bound (applied at $Q^{(k)}$), $e_{k+1} \le (\gamma + \beta_{\mathrm{loc}} e_k^\alpha)\, e_k$, so the per-iteration modulus $\rho_k := e_{k+1}/e_k$ satisfies $\rho_k - \gamma \le \beta_{\mathrm{loc}} e_k^\alpha$. Using the bound from Theorem 2, $e_k^\alpha \le (\rho^k e_0)^\alpha = e_0^\alpha \rho^{\alpha k}$. Comb...
[19]

Fix a bounded weight function $w$ with $\|w\|_\infty \le M$. Define the function class $\mathcal G_w := \{ g_{Q_1,Q_2,V_1} : (s,a,r,s') \mapsto w(s,a)\,\{Q_1(s,a) - Q_2(s,a)\} \times \{r + \gamma V_1(s') - Q_2(s,a)\} : Q_1, Q_2, V_1 \in \mathcal F_{\mathrm{ext}} \}$. By Condition C7, every $Q \in \mathcal F_{\mathrm{ext}}$ and $R$ are uniformly bounded by $M$. Thus, for any $g_{Q_1,Q_2,V_1} \in \mathcal G_w$, $|g_{Q_1,Q_2,V_1}(s,a,r,s')| \le \|w\|_\infty\, |Q_1(s,a) - Q_2(s,a)| \times \big(|r| + \gamma |V_1(s')| + |Q_2(s, a \ldots$
