Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration
Pith reviewed 2026-05-16 19:50 UTC · model grok-4.3
The pith
Stationary reweighting in soft fitted Q-iteration produces local linear convergence to the projected fixed point without Bellman completeness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy and therefore contracts in the policy's stationary state-action norm. Stationary-reweighted soft FQI reweights regression targets toward this stationary distribution at each step. Under approximate realizability and controlled weighting error, the algorithm converges linearly in finite samples to the projected fixed point, with the weight-estimation error damped geometrically by the contraction factor.
What carries the argument
Stationary norm alignment: near the fixed point, the soft Bellman operator contracts in the norm induced by the stationary distribution of the softmax policy. The algorithm exploits this by reweighting the fitted regression targets toward that distribution.
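The norm-alignment idea can be checked numerically on a toy problem. The sketch below relies on two standard facts: the derivative of the soft Bellman operator at $Q$ is $\gamma P^{\pi_Q}$, where $\pi_Q$ is the softmax policy, and a row-stochastic matrix is non-expansive in the weighted 2-norm of its own stationary distribution. The MDP, the Q-values, the skewed behavior distribution `nu`, and all function names are hypothetical and chosen only for illustration; nothing here is taken from the paper's experiments.

```python
import math
import random

def softmax(q_row, tau):
    """Softmax over actions at temperature tau (max-shifted for stability)."""
    m = max(q_row)
    exps = [math.exp((q - m) / tau) for q in q_row]
    z = sum(exps)
    return [e / z for e in exps]

def state_action_chain(P, pi):
    """Row-stochastic matrix over (s, a) pairs: P[s][a][s2] * pi[s2][a2]."""
    n_s, n_a = len(P), len(P[0])
    return [[P[s][a][s2] * pi[s2][a2] for s2 in range(n_s) for a2 in range(n_a)]
            for s in range(n_s) for a in range(n_a)]

def stationary(M, iters=2000):
    """Stationary distribution of M via power iteration (assumes ergodicity)."""
    n = len(M)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * M[i][j] for i in range(n)) for j in range(n)]
    return mu

def ratio(M, v, w, gamma):
    """||gamma * M v||_{2,w} / ||v||_{2,w} for a weight vector w."""
    Mv = [sum(m * x for m, x in zip(row, v)) for row in M]
    num = math.sqrt(sum(wi * (gamma * y) ** 2 for wi, y in zip(w, Mv)))
    den = math.sqrt(sum(wi * x ** 2 for wi, x in zip(w, v)))
    return num / den

# Hypothetical 2-state, 2-action MDP and Q-values (illustrative only).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
Q = [[1.0, 0.0], [0.0, 1.0]]
gamma, tau = 0.9, 1.0

pi = [softmax(row, tau) for row in Q]
M = state_action_chain(P, pi)   # linearized operator is gamma * M
mu = stationary(M)

# Worst observed amplification in the stationary norm over random directions.
rng = random.Random(0)
vs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1000)]
worst_stationary = max(ratio(M, v, mu, gamma) for v in vs)

# A skewed, hypothetical behavior distribution and a direction it under-covers.
nu = [0.01, 0.97, 0.01, 0.01]
behavior_ratio = ratio(M, [0.0, 0.0, 0.0, 1.0], nu, gamma)
```

In this toy setting `worst_stationary` stays at or below `gamma`, while `behavior_ratio` exceeds 1, which is the kind of norm mismatch the analysis attributes to projecting in the behavior norm.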
If this is right
- Ordinary soft FQI is locally stable under on-policy stationary sampling even without Bellman completeness.
- Temperature annealing acts as a continuation strategy to reach the local contraction region.
- The finite-sample bound separates statistical error from geometrically damped weight-estimation error.
- Local linear convergence holds to the projected fixed point under the stated conditions.
Where Pith is reading between the lines
- Similar norm-alignment ideas could stabilize other offline RL algorithms by iteratively adjusting the projection distribution to match the contraction norm.
- The results suggest that global Bellman completeness can be replaced by local conditions around the solution in many value-based methods.
- Practical implementations might benefit from adaptive reweighting schedules that improve as the policy approaches optimality.
Load-bearing premise
That weighting error remains controlled and approximate realizability holds in a neighborhood of the soft-optimal fixed point.
What would settle it
A numerical experiment in which the observed convergence rate deviates from linear when either the function class violates realizability near the fixed point or the reweighting error does not decrease geometrically.
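One possible way to run such a check is to fit a straight line to the logarithm of the observed errors: under linear convergence, $\log e_k$ is approximately affine in $k$ with slope $\log \rho$. The sketch below is a generic diagnostic under that assumption, not a procedure from the paper; the function name `fit_log_linear` and the two synthetic error sequences are hypothetical.

```python
import math

def fit_log_linear(errors):
    """Least-squares fit of log(e_k) = log(c) + k * log(rho).

    Returns (rho, max_abs_residual). A large residual suggests the
    sequence is not well described by a single linear rate.
    """
    ks = list(range(len(errors)))
    logs = [math.log(e) for e in errors]
    n = len(ks)
    k_bar = sum(ks) / n
    l_bar = sum(logs) / n
    slope = (sum((k - k_bar) * (l - l_bar) for k, l in zip(ks, logs))
             / sum((k - k_bar) ** 2 for k in ks))
    intercept = l_bar - slope * k_bar
    resid = max(abs(l - (intercept + slope * k)) for k, l in zip(ks, logs))
    return math.exp(slope), resid

# Synthetic sequences (illustrative only): geometric decay vs. 1/(k+1) decay.
geometric = [0.8 ** k for k in range(20)]
sublinear = [1.0 / (k + 1) for k in range(20)]

rho_geo, resid_geo = fit_log_linear(geometric)
rho_sub, resid_sub = fit_log_linear(sublinear)
```

For the geometric sequence the fitted rate recovers the true factor with negligible residual, whereas the sublinear sequence leaves a clearly nonzero residual, which is the signature the proposed experiment would look for.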
Original abstract
Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identify the stability mechanism that replaces it: local stationary norm alignment. Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy. This operator contracts in the policy's stationary state-action norm, whereas standard fitted regression projects Bellman targets in the behavior norm. This mismatch explains instability under distribution shift. We use this insight to develop stationary-reweighted soft FQI, which reweights each regression step toward the stationary distribution of the current softmax policy. Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. Our results also show that ordinary soft FQI is locally stable under on-policy stationary sampling, even without Bellman completeness, and explain temperature annealing as a continuation strategy for reaching a contraction region.
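A single reweighted regression step of the kind the abstract describes can be sketched as a weighted least-squares fit onto soft Bellman targets $r + \gamma\,\tau \log \sum_{a'} \exp(Q(s',a')/\tau)$. This is a minimal illustration under stated assumptions: a linear function class, a user-supplied `weight(s, a)` standing in for an estimate of the stationary-to-behavior density ratio (the paper's own weight estimator is not described on this page), and hypothetical names throughout.

```python
import math

def soft_value(q_values, tau):
    """Soft (log-sum-exp) value: tau * log sum_a exp(q_a / tau), stabilized."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def reweighted_soft_fqi_step(theta, data, phi, actions, weight, gamma, tau):
    """One weighted regression step onto soft Bellman targets.

    data: list of (s, a, r, s_next); phi(s, a) -> list of features;
    weight(s, a) -> nonnegative importance weight (hypothetical placeholder
    for an estimated stationary-to-behavior density ratio).
    """
    d = len(theta)
    q = lambda s, a: sum(t * f for t, f in zip(theta, phi(s, a)))
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for s, a, r, s_next in data:
        y = r + gamma * soft_value([q(s_next, a2) for a2 in actions], tau)
        f, w = phi(s, a), weight(s, a)
        for i in range(d):
            b[i] += w * f[i] * y
            for j in range(d):
                A[i][j] += w * f[i] * f[j]
    return solve(A, b)  # weighted normal equations

# Toy usage: one state, two actions, one-hot features (illustrative only).
phi = lambda s, a: [1.0 if a == 0 else 0.0, 1.0 if a == 1 else 0.0]
data = [(0, 0, 1.0, 0), (0, 1, 0.0, 0)]
theta = reweighted_soft_fqi_step([0.0, 0.0], data, phi, [0, 1],
                                 weight=lambda s, a: 2.0 if a == 0 else 0.5,
                                 gamma=0.5, tau=1.0)
```

With one-hot features the class is exactly realizable, so the weights do not change the fit; they matter only when the function class cannot represent the targets, which is the regime the analysis targets.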
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes soft fitted Q-iteration (FQI) in offline RL without Bellman completeness. It identifies local stationary norm alignment as the key stability mechanism: near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy and contracts in the policy's stationary state-action norm. The authors introduce stationary-reweighted soft FQI, which reweights each regression step toward the current softmax policy's stationary distribution. Under approximate realizability and controlled weighting error, they prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. They also show local stability of ordinary soft FQI under on-policy stationary sampling and interpret temperature annealing as a continuation strategy to enter the contraction region.
Significance. If the local convergence result holds, the work offers a meaningful theoretical contribution to offline RL by replacing the strong Bellman-completeness requirement with a local alignment condition and providing an explicit error decomposition that isolates geometrically damped weight-estimation error. The insight into the norm mismatch between behavior and stationary distributions, together with the stability guarantee for on-policy soft FQI, supplies practical guidance for algorithm design and annealing schedules. The separation of error sources is a clear strength.
Major comments (2)
- [main convergence theorem / finite-sample analysis] In the main convergence theorem (likely Theorem 4.1 or equivalent in the finite-sample analysis), the geometric damping of weight-estimation error is established only under the assumption that the reweighting operator remains contractive near the soft-optimal fixed point. No explicit quantitative bound is supplied on the radius of the neighborhood (in policy or weight space) within which the damping constant stays strictly less than 1; without this, it is unclear whether a given continuation/annealing path is guaranteed to remain inside the basin.
- [error decomposition / proof of local linear convergence] The separation of statistical error from weight-estimation error in the local linear rate relies on the weighting error remaining controlled independently of the realizability error. Under approximate realizability, however, the two error sources may couple through the current policy's stationary distribution; the manuscript does not derive a joint bound showing that this coupling preserves the claimed damping rate (see the error decomposition preceding the main theorem).
Minor comments (2)
- [preliminaries] The notation distinguishing the behavior distribution, the current softmax stationary distribution, and the soft-optimal stationary distribution could be made more uniform and introduced earlier in the preliminaries.
- [figures] Figure 1 (schematic of norm alignment) would benefit from an explicit legend indicating which norm is used for each operator.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight important aspects of the local convergence analysis that we address below. We have revised the manuscript to strengthen the discussion of the contraction neighborhood and to clarify the error coupling in the decomposition.
Point-by-point responses
- Referee: In the main convergence theorem, the geometric damping of weight-estimation error is established only under the assumption that the reweighting operator remains contractive near the soft-optimal fixed point. No explicit quantitative bound is supplied on the radius of the neighborhood within which the damping constant stays strictly less than 1.
  Authors: We agree that an explicit radius would be desirable for guaranteeing specific annealing paths. Deriving a fully quantitative, instance-independent bound requires stronger assumptions on the MDP and function class than our local analysis assumes. In the revision we have added a remark after Theorem 4.1 that characterizes the neighborhood size in terms of the temperature parameter, the Lipschitz constant of the softmax, and the realizability gap; this supplies practical guidance for continuation methods while acknowledging that the precise radius remains problem-dependent.
  Revision: partial
- Referee: The separation of statistical error from weight-estimation error relies on the weighting error remaining controlled independently of the realizability error. Under approximate realizability the two may couple through the current policy's stationary distribution; the manuscript does not derive a joint bound showing that this coupling preserves the claimed damping rate.
  Authors: The proof of local linear convergence (Section 4.2) already accounts for the coupling by expressing the weighting error as a function of the distance of the current policy to the fixed point. Because the stationary distribution changes continuously with the policy and the contraction holds inside the local ball, the realizability error is absorbed into the O(1) term of the linear rate without degrading the geometric damping factor. We have expanded the error decomposition paragraph preceding Theorem 4.1 to make this dependence explicit and to verify that the damping constant remains strictly less than one whenever the total error lies inside the neighborhood.
  Revision: yes
Circularity Check
No circularity: convergence to independently defined projected fixed point under external assumptions
The derivation establishes finite-sample local linear convergence of stationary-reweighted soft FQI to the projected fixed point under the stated assumptions of approximate realizability and controlled weighting error. The projected fixed point is defined via the soft Bellman operator independently of the reweighting parameters. The separation of statistical error from geometrically damped weight-estimation error follows from the local contraction property in the policy stationary norm, which is derived from the first-order equivalence to policy evaluation rather than from any fitted quantity or self-referential definition. No load-bearing step reduces the claimed rate or basin to a parameter estimated from the same data or to a prior self-citation. The analysis is self-contained against the external benchmarks of Bellman completeness and distribution shift.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Approximate realizability of the soft Q-function in a neighborhood of the soft-optimal fixed point
- Domain assumption: Controlled weighting error whose effect damps geometrically across iterations
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075.
- [2] Amortila, P., Foster, D. J., Jiang, N., Sekhari, A., and Xie, T. Harnessing density ratios for online reinforcement learning. arXiv preprint arXiv:2401.09681.
- [3] Bibaut, A., Petersen, M., Vlassis, N., Dimakopoulou, M., and van der Laan, M. Sequential causal inference in a single world of connected units. arXiv preprint arXiv:2101.07380.
- [4]
- [5] Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
- [6] Glynn, P. W. and Henderson, S. G. Estimation of stationary densities for Markov chains. In 1998 Winter Simulation Conference. Proceedings (Cat. No. 98CH36274), volume 1, pp. 647–652. IEEE.
- [7] Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268. Elsevier.
- [8] Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
- [9] Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
[10]
Algaedice: Policy gradient from arbitrary experience
Nachum, O., Chow, Y., Dai, B., and Li, L. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections.Advances in neural information processing systems, 32, 2019a. Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. Algaedice: Policy gradient from arbitrary experience.arXiv preprint arXiv:1912.02074, 2019b...
-
[11]
A unified view of entropy-regularized Markov decision processes
Neu, G., Jonsson, A., and G´ omez, V. A unified view of entropy-regularized markov decision processes.arXiv preprint arXiv:1705.07798,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
van der Laan, L., Kallus, N., and Bibaut, A. Nonparametric instrumental variable inference with many weak instruments.arXiv preprint arXiv:2505.07729,
-
[13]
Van Der Vaart, A. and Wellner, J. A. A local maximal inequality under uniform entropy. Electronic Journal of Statistics, 5(2011):192,
work page 2011
-
[14]
Frequentist regret bounds for randomized least-squares value iteration
Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. InInternational Conference on Artificial Intelligence and Statistics, pp. 1954–1964. PMLR,
work page 1954
-
[15]
Gendice: Generalized offline estimation of stationary values.arXiv preprint arXiv:2002.09072,
Zhang, R., Dai, B., Li, L., and Schuurmans, D. Gendice: Generalized offline estimation of stationary values.arXiv preprint arXiv:2002.09072,
- [16] A Towards Global Convergence via Temperature Homotopy. A.1 Overview of challenges. Global convergence via temperature homotopy. Our local linear convergence guarantees require that the initialization lie sufficiently inside the contraction region. A natural way to ensure this is to begin with a large softmax temperature $\tau$, for which the contraction radiu...
- [17] This restores the viability of the homotopy path down to small temperatures. The next subsection formalizes the structural assumptions under which these refinements hold and sketches how our local theory can, in principle, be extended toward global behavior. A full theoretical and empirical development of temperature–homotopy schemes remains an interestin...
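- [18] Under the assumption $Q^\star_{\mathcal F} = Q^\star$ we have $\varepsilon_{\mathcal F} = 0$, so Theorem 2 yields, for any $Q^{(0)} \in B_{\mathcal F}(r, Q^\star)$ and all $k \ge 0$, $e_k = \|Q^{(k)} - Q^\star\|_{2,\mu^\star} \le \rho^k e_0$, where $\rho := \gamma + \beta_{\mathrm{loc}} r^{\alpha}$. By the local contraction bound (applied at $Q^{(k)}$), $e_{k+1} \le (\gamma + \beta_{\mathrm{loc}} e_k^{\alpha})\, e_k$, so the per-iteration modulus $\rho_k := e_{k+1}/e_k$ satisfies $\rho_k - \gamma \le \beta_{\mathrm{loc}} e_k^{\alpha}$. Using the bound from Theorem 2, $e_k^{\alpha} \le (\rho^k e_0)^{\alpha} = e_0^{\alpha} \rho^{\alpha k}$. Comb...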
- [19] Fix a bounded weight function $w$ with $\|w\|_\infty \le M$. Define the function class $\mathcal G_w := \{ g_{Q_1,Q_2,V_1} : (s, a, r, s') \mapsto w(s, a)\,(Q_1(s, a) - Q_2(s, a)) \times (r + \gamma V_1(s') - Q_2(s, a)) : Q_1, Q_2, V_1 \in \mathcal F_{\mathrm{ext}} \}$. By Condition C7, every $Q \in \mathcal F_{\mathrm{ext}}$ and $R$ are uniformly bounded by $M$. Thus, for any $g_{Q_1,Q_2,V_1} \in \mathcal G_w$, $|g_{Q_1,Q_2,V_1}(s, a, r, s')| \le \|w\|_\infty\, |Q_1(s, a) - Q_2(s, a)| \times (|r| + \gamma |V_1(s')| + |Q_2(s, a$...
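The recursion quoted in excerpt [18], $e_{k+1} \le (\gamma + \beta_{\mathrm{loc}} e_k^{\alpha})\, e_k$, can be illustrated by iterating it with equality, which gives the slowest decay the bound allows. The constants below are hypothetical and chosen so that $\rho = \gamma + \beta_{\mathrm{loc}} e_0^{\alpha} < 1$ (taking the radius $r$ equal to $e_0$); they are not values from the paper.

```python
def worst_case_errors(e0, gamma, beta_loc, alpha, steps):
    """Iterate e_{k+1} = (gamma + beta_loc * e_k**alpha) * e_k (bound held with equality)."""
    errs = [e0]
    for _ in range(steps):
        e = errs[-1]
        errs.append((gamma + beta_loc * e ** alpha) * e)
    return errs

# Hypothetical constants (illustrative only).
e0, gamma, beta_loc, alpha = 0.5, 0.9, 0.1, 1.0
rho = gamma + beta_loc * e0 ** alpha  # global modulus with radius r = e0

errs = worst_case_errors(e0, gamma, beta_loc, alpha, steps=50)

# Per-iteration modulus rho_k = e_{k+1} / e_k; its excess over gamma shrinks
# as the iterates approach the fixed point.
moduli = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
excess = [m - gamma for m in moduli]
```

In this sketch every `moduli[k]` stays at or below `rho`, and `excess[k]` is bounded by `beta_loc * e0**alpha * rho**(alpha * k)`, matching the chain of inequalities in the excerpt: the effective contraction factor drifts from $\rho$ toward $\gamma$ as the error decays.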