Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
Pith reviewed 2026-05-16 19:02 UTC · model grok-4.3
The pith
Reweighting FQE regressions by the target stationary density ratio yields linear convergence without Bellman completeness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stationary-weighted FQE reweights each Bellman regression by the stationary target-to-behavior density ratio. This preserves the supervised-learning form of FQE while ensuring that the fitted projection is taken in the $L^2$ norm induced by the target policy's stationary state-action distribution. The result is finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The convergence bound separates iteration, statistical, approximation, and weight-estimation errors.
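As a rough illustration of the claim above, here is a minimal sketch of stationary-weighted FQE with a linear function class. The function name `weighted_fqe`, the linear features, the small ridge term, and the use of expected next-step features under the target policy are all assumptions made for this example; none of them are taken from the paper.

```python
import numpy as np

def weighted_fqe(phi, phi_next, rewards, weights, gamma, num_iters=200, ridge=1e-8):
    """Hypothetical sketch of stationary-weighted FQE with linear features.

    phi      : (n, d) features of the observed (s, a) pairs
    phi_next : (n, d) expected features of (s', a') with a' drawn from the target policy
    rewards  : (n,) observed rewards
    weights  : (n,) estimated stationary target-to-behavior density ratios
    """
    n, d = phi.shape
    theta = np.zeros(d)
    # Weighted Gram matrix; it does not change across iterations.
    gram = phi.T @ (weights[:, None] * phi) + ridge * np.eye(d)
    for _ in range(num_iters):
        # Bellman regression targets built from the current iterate.
        targets = rewards + gamma * (phi_next @ theta)
        # Weighted least squares: each squared residual is scaled by its ratio.
        theta = np.linalg.solve(gram, phi.T @ (weights * targets))
    return theta
```

Setting `weights` to a vector of ones recovers ordinary least-squares FQE in the behavior norm, so the weighting is the only change relative to the standard procedure in this sketch.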
What carries the argument
The stationary-weighted Bellman regression, which projects Bellman targets onto the function class in the norm induced by the target policy's stationary distribution.
If this is right
- The method reduces value error when standard FQE overemphasizes behavior-distribution regions rarely visited by the target.
- Ratio estimation error is attenuated in the bound when inherent Bellman error is small.
- Convergence holds linearly in finite samples under the stated conditions.
- The approach remains modular and compatible with standard supervised learning tools.
Where Pith is reading between the lines
- Similar reweighting could improve stability in other off-policy RL algorithms like fitted Q-iteration.
- Accurate estimation of the density ratio becomes a bottleneck in high-dimensional or continuous spaces.
- This highlights the importance of choosing the right norm for regression in approximate dynamic programming.
Load-bearing premise
The target policy must induce a well-defined stationary distribution, and the density ratio must be estimable with bounded error.
What would settle it
Observe the value error in a setting where the estimated ratio has large error but the inherent Bellman error is controlled to be small; if the bound fails to hold as predicted, the claim is falsified.
Original abstract
Fitted $Q$-evaluation (FQE) is a standard regression-based tool for off-policy evaluation, but existing stability guarantees often rely on Bellman completeness, a strong closure condition that can fail under function approximation. We study an alternative route: changing the norm used in the regression step. The policy-evaluation Bellman operator is contractive in the $L^2$ norm induced by the target policy's stationary state-action distribution, whereas standard off-policy FQE projects Bellman targets in the behavior-distribution norm. We propose stationary-weighted FQE, which reweights each Bellman regression by the stationary target-to-behavior density ratio. The method preserves FQE's modular supervised-learning form while aligning the fitted projection with that contractive norm. We prove finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The bound separates finite-iteration, statistical, approximation, and weight-estimation errors, and shows that ratio-estimation error is attenuated when the inherent Bellman error is small. Controlled experiments show that stationary weighting can stabilize FQE and reduce value error when behavior-norm regression overemphasizes regions rarely visited by the target policy.
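The norm argument in the abstract can be checked numerically on a toy chain. The sketch below is an assumption-laden illustration, not the paper's experiment: the two-state transition matrix `P_pi`, the behavior distribution `d_b`, and the helper `weighted_operator_norm` are all made up for this example. It compares the operator norm of the discounted transition operator in the stationary norm and in a mismatched behavior norm.

```python
import numpy as np

def weighted_operator_norm(P, d):
    """Operator norm of P on L2(d), i.e. the spectral norm of D^{1/2} P D^{-1/2}."""
    s = np.sqrt(d)
    return np.linalg.norm((s[:, None] * P) / s[None, :], 2)

# Toy target-policy chain (hypothetical); mu_pi satisfies mu_pi @ P_pi == mu_pi.
P_pi = np.array([[0.1, 0.9],
                 [0.1, 0.9]])
mu_pi = np.array([0.1, 0.9])
# Hypothetical behavior distribution that concentrates on the other state.
d_b = np.array([0.9, 0.1])
gamma = 0.9

norm_target = gamma * weighted_operator_norm(P_pi, mu_pi)
norm_behavior = gamma * weighted_operator_norm(P_pi, d_b)
```

In this toy case `norm_target` is at most `gamma`, while `norm_behavior` exceeds 1, so the discounted operator is a contraction in the stationary norm but not in the behavior norm.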
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces stationary-weighted Fitted Q-Evaluation (FQE), which reweights Bellman regression targets by the stationary target-to-behavior density ratio to align the projection with the L2 norm induced by the target policy's stationary distribution. It claims finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without Bellman completeness, via an error bound that decomposes iteration, statistical, approximation, and weight-estimation terms, with the ratio-estimation error attenuated by small inherent Bellman error. Controlled experiments indicate improved stability and lower value error relative to behavior-norm FQE.
Significance. If the finite-sample bound and attenuation property hold, the work offers a modular, supervised-learning-compatible route to stable off-policy evaluation that relaxes Bellman completeness, a common failure mode under function approximation. The explicit separation of error sources and the potential robustness to weight estimation error could inform practical FQE implementations and analysis of projected Bellman operators in misspecified regimes.
major comments (3)
- §4 (main convergence theorem): the advertised attenuation of ratio-estimation error by the inherent Bellman error relies on a multiplicative factor in the error recursion that approaches 1 as the projected Bellman residual grows; under the large-misspecification regimes that motivate the method, this factor does not shrink, so the weight term can dominate and the bound reduces to a standard importance-sampling form without the claimed robustness.
- §3.2 (weight estimation procedure): the assumption that the target policy induces a well-defined stationary distribution and that the density ratio can be estimated with error small enough not to dominate the bound is load-bearing, yet the paper provides no quantitative conditions under which the estimation error remains controlled when the behavior and target distributions differ substantially.
- §5 (experiments): the reported stabilization and value-error reduction are shown only for controlled synthetic settings; without ablation on the magnitude of inherent Bellman error or on the accuracy of the ratio estimator, it is unclear whether the empirical gains persist precisely in the misspecification regimes where the theoretical attenuation is weakest.
minor comments (2)
- Notation for the stationary density ratio (e.g., w(s,a)) is introduced without an explicit definition of the support or integrability conditions needed for the reweighting to be well-defined.
- The abstract and introduction use 'linear convergence' without clarifying whether the rate is with respect to iteration count, sample size, or both; a short clarifying sentence would help.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments. Below we address each major comment, offering clarifications and committing to revisions where the manuscript can be strengthened.
- Referee: §4 (main convergence theorem): the advertised attenuation of ratio-estimation error by the inherent Bellman error relies on a multiplicative factor in the error recursion that approaches 1 as the projected Bellman residual grows; under the large-misspecification regimes that motivate the method, this factor does not shrink, so the weight term can dominate and the bound reduces to a standard importance-sampling form without the claimed robustness.
Authors: We agree with the referee's observation regarding the behavior of the multiplicative factor. The attenuation of the ratio-estimation error term is indeed stronger when the inherent Bellman error is small. In regimes of large misspecification, the bound does revert to a form similar to standard importance sampling. However, even in such cases, the stationary-weighted projection aligns the fixed point with the target distribution's norm, which can still improve stability relative to behavior-norm FQE. We will revise the statement of the main theorem and the surrounding discussion in §4 to make the dependence on the inherent Bellman error explicit and to avoid overstating the robustness in high-misspecification settings.
Revision: partial
- Referee: §3.2 (weight estimation procedure): the assumption that the target policy induces a well-defined stationary distribution and that the density ratio can be estimated with error small enough not to dominate the bound is load-bearing, yet the paper provides no quantitative conditions under which the estimation error remains controlled when the behavior and target distributions differ substantially.
Authors: The referee correctly identifies that the paper assumes the ratio estimation error is sufficiently small without providing explicit quantitative conditions for when this holds under substantial distribution shift. To address this, we will add a new subsection or paragraph in §3.2 that provides sufficient conditions for the ratio estimator, drawing on existing results from the density ratio estimation literature (such as bounds under Lipschitz assumptions or using minimax estimators). This will include a quantitative bound on the allowable estimation error in terms of other problem parameters to ensure it does not dominate the overall bound. We plan to make this a full revision to the section.
Revision: yes
- Referee: §5 (experiments): the reported stabilization and value-error reduction are shown only for controlled synthetic settings; without ablation on the magnitude of inherent Bellman error or on the accuracy of the ratio estimator, it is unclear whether the empirical gains persist precisely in the misspecification regimes where the theoretical attenuation is weakest.
Authors: We acknowledge the limitation in the experimental section. The current experiments use synthetic environments to isolate the effects, but lack targeted ablations varying the inherent Bellman error (e.g., by changing the function class capacity) and the ratio estimation accuracy (e.g., by simulating noisy ratio estimates). In the revised version, we will include additional figures and tables in §5 with such ablations, showing performance as a function of these quantities. This will help validate the theory in the relevant regimes. We will make this revision.
Revision: yes
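One way the proposed ablation on ratio accuracy could be simulated is sketched below. This is purely a hypothetical illustration: the helper `perturb_ratio`, the multiplicative log-normal noise model, and the renormalization to empirical mean one are assumptions of this example and are not described in the paper or the rebuttal.

```python
import numpy as np

def perturb_ratio(weights, noise_scale, rng):
    """Hypothetical noise model for a ratio-accuracy ablation.

    Multiplies each density-ratio value by log-normal noise, then rescales
    so the perturbed weights have empirical mean one over the sample.
    """
    noise = rng.normal(0.0, noise_scale, size=weights.shape)
    noisy = weights * np.exp(noise)  # multiplicative noise keeps weights positive
    return noisy / noisy.mean()
```

Sweeping `noise_scale` from zero upward and recording the value error at each level would give a curve of error against ratio accuracy, which is the kind of plot the rebuttal describes.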
Circularity Check
No significant circularity; derivation self-contained
The paper defines the target as the fixed point of the stationary-weighted Bellman projection under the target policy's stationary distribution, which is constructed independently of any fitted value function. The finite-sample linear convergence bound is obtained via standard contraction-mapping arguments in the reweighted L2 norm, with an explicit error decomposition into iteration, statistical, approximation, and weight-estimation terms. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and the central result does not rename a known empirical pattern. The analysis remains self-contained against external benchmarks such as the contraction property of the Bellman operator in the appropriate norm.
Axiom & Free-Parameter Ledger
free parameters (1)
- density ratio estimator
axioms (2)
- domain assumption: Existence of a unique stationary distribution under the target policy
- standard math: Standard MDP transition and reward structure
Reference graph
Works this paper leans on
- [1] Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075.
- [2] Amortila, P., Foster, D. J., Jiang, N., Sekhari, A., and Xie, T. Harnessing density ratios for online reinforcement learning. arXiv preprint arXiv:2401.09681.
- [3] Bibaut, A., Petersen, M., Vlassis, N., Dimakopoulou, M., and van der Laan, M. Sequential causal inference in a single world of connected units. arXiv preprint arXiv:2101.07380.
- [4]
- [5] Di, Q., Zhao, H., He, J., and Gu, Q. Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. arXiv preprint arXiv:2310.01380.
- [6] Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
- [7] Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062. PMLR.
- [8] Glynn, P. W. and Henderson, S. G. Estimation of stationary densities for Markov chains. In 1998 Winter Simulation Conference. Proceedings (Cat. No. 98CH36274), volume 1, pp. 647–652. IEEE.
- [9] Gordon, G. J. Stable function approximation in dynamic programming. In Machine learning proceedings 1995, pp. 261–268. Elsevier.
- [10] Hines, O. J. and Miles, C. H. Learning density ratios in causal inference using Bregman-Riesz regression. arXiv preprint arXiv:2510.16127.
- [11] Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
- [12] Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
- [13] Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in neural information processing systems, 32, 2019a. Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b...
- [14] van der Laan, L., Hubbard, D., Tran, A., Kallus, N., and Bibaut, A. Semiparametric double reinforcement learning with applications to long-term causal inference. arXiv preprint arXiv:2501.06926, 2025a. van der Laan, L., Kallus, N., and Bibaut, A. Inverse reinforcement learning using just classification and a few regressions. arXiv preprint arXiv:2509.21172,...
- [15] Yu, F., Mehta, R., Luedtke, A., and Harchaoui, Z. Stochastic gradients under nuisances. arXiv preprint arXiv:2508.20326.
- [16] Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072.
- [17] Experimental setup. We use a modified Baird-style 7-state MDP (Baird et al., 1995). There is a hub state 0 and six spokes 1, ..., 6. The solid action deterministically transitions to state 0 from any state, while the dashed action transitions uniformly among the spokes. The target policy always chooses the solid action, giving a stationary distribution µ th...
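The transition structure described in this excerpt can be sketched as follows. This is a minimal reconstruction under stated assumptions: the action indices `SOLID` and `DASHED`, the array layout, and the power-iteration helper `stationary_distribution` are choices made for this example, and rewards, features, and the behavior policy are not modeled because the excerpt does not specify them.

```python
import numpy as np

NUM_STATES = 7          # hub state 0 plus spokes 1..6
SOLID, DASHED = 0, 1    # hypothetical action indices

def build_baird_style_mdp():
    """Transition tensor P[action, state, next_state] for the described MDP."""
    P = np.zeros((2, NUM_STATES, NUM_STATES))
    P[SOLID, :, 0] = 1.0            # solid: every state moves to the hub
    P[DASHED, :, 1:] = 1.0 / 6.0    # dashed: uniform over the six spokes
    return P

def stationary_distribution(P_pi, num_steps=1000):
    """Power iteration on a row-stochastic matrix, started from uniform."""
    mu = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])
    for _ in range(num_steps):
        mu = mu @ P_pi
    return mu

P = build_baird_style_mdp()
# The target policy always takes the solid action, so its chain is P[SOLID].
mu_target = stationary_distribution(P[SOLID])
```

Under this reconstruction the target policy's stationary state distribution places all of its mass on the hub, which is the kind of mismatch with a spoke-heavy behavior distribution that the experiment appears designed to expose.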
- [18] For each state–action pair, 5 distinct next states are sampled uniformly without replacement and assigned transition probabilities via a Dirichlet distribution. Rewards are drawn i.i.d. from N(0, 1). For each MDP instance, the target policy π is sampled independently for each state from a Dirichlet distribution over actions. The ground-truth value functions...
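The sampling procedure described in this excerpt might be implemented along the following lines. The function name `sample_random_mdp`, the state and action counts, the Dirichlet concentration of all ones, and drawing one reward per state-action pair are assumptions of this sketch; the excerpt does not state these details.

```python
import numpy as np

def sample_random_mdp(num_states, num_actions, branching=5, rng=None):
    """Hypothetical sampler for the random MDPs described in the excerpt."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            # Distinct next states, sampled uniformly without replacement.
            next_states = rng.choice(num_states, size=branching, replace=False)
            # Transition probabilities from a Dirichlet (all-ones concentration assumed).
            P[s, a, next_states] = rng.dirichlet(np.ones(branching))
    # Rewards drawn i.i.d. from a standard normal, one per state-action pair (assumed).
    R = rng.normal(0.0, 1.0, size=(num_states, num_actions))
    # Target policy: an independent Dirichlet draw over actions for each state.
    pi = rng.dirichlet(np.ones(num_actions), size=num_states)
    return P, R, pi
```

Passing an explicit `rng` such as `np.random.default_rng(seed)` makes each MDP instance reproducible, which matters when results are averaged over many seeds.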
- [19] We sweep the discount factor $\gamma \in \{0.90, 0.925, 0.95, 0.96, 0.97, 0.98, 0.99\}$ and run $K = 200$ fitted iterations. All learning curves and final errors are averaged over $M = 50$ independent random seeds for the MDP, policies, and dataset generation. Errors are reported in the stationary norm $\|Q^{(K)} - Q^\star\|_{2,\mu_\pi}$. A.2.2 Additional experiments. In this experiment, sta...
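The stationary-norm error reported in this excerpt can be computed as in the short sketch below. The function name `stationary_norm_error` and the assumption that the Q-functions and the stationary distribution are stored as arrays of matching shape are choices made for this example.

```python
import numpy as np

def stationary_norm_error(q_hat, q_star, mu_pi):
    """Weighted L2 error: sqrt of the sum over (s, a) of mu_pi * (q_hat - q_star)^2."""
    diff = np.asarray(q_hat, dtype=float) - np.asarray(q_star, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(mu_pi) * diff ** 2)))
```

Because the error is weighted by `mu_pi`, mistakes on state-action pairs the target policy rarely visits contribute little, which is consistent with the norm used in the convergence analysis.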
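- [20] C.2 Misspecification error bound. Proof of Lemma 1. Write $Q^\star_{\mathcal{F}} - Q^\star = \mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T} Q^\star = \mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star + \mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star$. Taking $L^2(\mu)$ norms and applying the triangle inequality, $\|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu} \le \|\mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star\|_{2,\mu} + \|\mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star\|_{2,\mu}$. By the contraction property of Lemma 5, $\|\mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star\|_{2,\mu} \le \gamma \|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu}$, and by definition of $\mathcal{T}_{\mathcal{F}}$, $\mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star = \Pi_{\mathcal{F}}$ ...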
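The excerpt is truncated before the argument finishes. The following is a sketch of how it might conclude, under two assumptions that are not visible in the excerpt: that the projected operator is $\mathcal{T}_{\mathcal{F}} = \Pi_{\mathcal{F}} \mathcal{T}$, and that $Q^\star$ is the fixed point of $\mathcal{T}$, so $\mathcal{T} Q^\star = Q^\star$. The paper's own statement of Lemma 1 may differ in constants or form.

```latex
\begin{align*}
\mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star
  &= \Pi_{\mathcal{F}} Q^\star - Q^\star, \\
\|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu}
  &\le \gamma \,\|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu}
     + \|\Pi_{\mathcal{F}} Q^\star - Q^\star\|_{2,\mu}, \\
\|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu}
  &\le \frac{1}{1-\gamma}\,\|\Pi_{\mathcal{F}} Q^\star - Q^\star\|_{2,\mu}.
\end{align*}
```

The last line follows by moving the $\gamma$ term to the left-hand side and dividing by $1 - \gamma$, which is positive for any discount factor below one.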