Fitted $Q$ Evaluation Without Bellman Completeness via Stationary Weighting

Lars van der Laan; Nathan Kallus

arxiv: 2512.23805 · v3 · submitted 2025-12-29 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-16 19:02 UTC · model grok-4.3

keywords: fitted Q-evaluation · off-policy evaluation · Bellman operator · density ratio · function approximation · reinforcement learning · stationary distribution

The pith

Reweighting FQE regressions by the target stationary density ratio yields linear convergence without Bellman completeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fitted Q-evaluation can be made stable by reweighting each regression step with the density ratio between the target policy's stationary distribution and the behavior distribution. This change aligns the projection norm with the contraction property of the Bellman operator under the target measure. A reader would care because it removes the need for Bellman completeness, a condition often violated in practice with function approximation. The resulting finite-sample bound decomposes errors cleanly and attenuates the impact of ratio estimation errors when the Bellman residual is small. Experiments confirm reduced value error in regions the target policy visits infrequently.

Core claim

Stationary-weighted FQE reweights the Bellman regression targets by the stationary target-to-behavior density ratio, preserving the supervised learning form of FQE while ensuring the fitted projection is with respect to the L2 norm induced by the target policy's stationary state-action distribution. This yields finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The convergence bound separates iteration, statistical, approximation, and weight-estimation errors.
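The iteration described above can be sketched for a linear function class as a weighted ridge regression repeated over Bellman targets. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `weighted_fqe`, its arguments, and the linear feature representation are hypothetical, and the weights are assumed to be supplied (exact or estimated) stationary target-to-behavior density ratios. Setting all weights to 1 recovers unweighted FQE.

```python
import numpy as np

def weighted_fqe(phi, phi_next, rewards, weights, gamma, n_iters=200, ridge=1e-6):
    """Hypothetical sketch of stationary-weighted FQE with linear features.

    phi      : (n, d) features of the observed (s_i, a_i)
    phi_next : (n, d) features of (s'_i, a'_i), with a'_i drawn from the target policy
               (or the target-policy expectation of the next-step features)
    rewards  : (n,) observed rewards
    weights  : (n,) density ratios w(s_i, a_i); all ones gives unweighted FQE
    """
    n, d = phi.shape
    theta = np.zeros(d)  # initialize at zero
    # The weighted Gram matrix does not change across iterations.
    gram = phi.T @ (weights[:, None] * phi) / n + ridge * np.eye(d)
    for _ in range(n_iters):
        # Bellman regression targets under the current iterate.
        targets = rewards + gamma * (phi_next @ theta)
        # Weighted least-squares fit of the targets.
        theta = np.linalg.solve(gram, phi.T @ (weights * targets) / n)
    return theta
```

Because the only change from standard FQE is the per-sample weight in the regression, the same idea carries over to any supervised learner that accepts sample weights.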

What carries the argument

The stationary-weighted Bellman regression, which projects Bellman targets onto the function class in the L2 norm induced by the target policy's stationary distribution.

If this is right

The method reduces value error when standard FQE overemphasizes behavior-distribution regions rarely visited by the target.
Ratio estimation error is attenuated in the bound when inherent Bellman error is small.
Convergence holds linearly in finite samples under the stated conditions.
The approach remains modular and compatible with standard supervised learning tools.
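One way to read the linear-convergence point is through a generic contraction recursion. The sketch below is a standard argument, not the paper's exact bound: $e_k$ denotes the distance of the $k$-th iterate to the projected fixed point in the stationary norm, and $\varepsilon$ is an illustrative symbol lumping the per-iteration statistical, approximation, and weight-estimation errors.

```latex
e_{k+1} \le \gamma\, e_k + \varepsilon
\quad\Longrightarrow\quad
e_K \le \gamma^K e_0 + \varepsilon \sum_{j=0}^{K-1} \gamma^j
      = \gamma^K e_0 + \frac{1-\gamma^K}{1-\gamma}\,\varepsilon
      \le \gamma^K e_0 + \frac{\varepsilon}{1-\gamma}.
```

Under this reading, "linear" refers to the geometric decay $\gamma^K$ in the iteration count, while the remaining term is a floor set by the per-step errors.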

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reweighting could improve stability in other off-policy RL algorithms like fitted Q-iteration.
Accurate estimation of the density ratio becomes a bottleneck in high-dimensional or continuous spaces.
This highlights the importance of choosing the right norm for regression in approximate dynamic programming.

Load-bearing premise

The target policy must induce a well-defined stationary distribution, and the density ratio must be estimable with bounded error.
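For a small tabular problem, both parts of this premise can be checked directly. The sketch below assumes a known row-stochastic state-action transition matrix under the target policy and a known behavior distribution; the function names `stationary_distribution` and `stationary_ratio` are hypothetical, and in practice the ratio would have to be estimated from data rather than computed exactly.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary row vector mu of a row-stochastic matrix P with a unique
    stationary distribution: solves mu P = mu together with sum(mu) = 1."""
    n = P.shape[0]
    # Stack the stationarity equations (P^T - I) mu = 0 with the normalization row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def stationary_ratio(P, nu_b):
    """Target-stationary to behavior density ratio; nu_b must be positive
    wherever the stationary distribution puts mass."""
    return stationary_distribution(P) / nu_b
```

If the chain has more than one stationary distribution, the linear system is underdetermined and the returned vector is not meaningful, which is one concrete way the premise can fail.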

What would settle it

Observe the value error in a setting where the estimated ratio has large error but the inherent Bellman error is controlled to be small; if the bound fails to hold as predicted, the claim is falsified.

Figures

Figures reproduced from arXiv: 2512.23805 by Lars van der Laan, Nathan Kallus.

**Figure 1.** Norm mismatch: $Q \in \mathcal{F}$ (red), $\mathcal{T}Q$ leaves $\mathcal{F}$ but lies in $L^2(\mu)$ (blue), and $\Pi^{\nu_b}_{\mathcal{F}}$ projects it back under the behavior norm $L^2(\nu_b)$. The resulting composite map $U_{\nu_b} = \Pi^{\nu_b}_{\mathcal{F}} \mathcal{T}$ need not be contractive. Adjacent source text (§2.3, Standard FQE and Norm Mismatch): Standard fitted Q-evaluation (FQE) constructs iterates $\hat{Q}^{(k+1)} \in \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}$ …

**Figure 2.** Moderate-overlap regime ($\gamma = 0.95$): final error vs. $\kappa$ (left) and iteration curves (right). Adjacent source text: … exactly but is not Bellman-complete, and compare unweighted FQE to stationary-weighted FQE using the exact stationary density ratios. We sweep $\gamma \in \{0.90, 0.925, 0.95, 0.96, 0.97, 0.98, 0.99\}$, run $K = 200$ fitted iterations, and average over $M = 200$ random seeds. Full details are in Appendix A …

**Figure 3.** Severe norm-mismatch regime: final error vs. …

**Figure 4.** Final stationary-norm error versus stationary overlap for …

**Figure 5.** High-discount, low-overlap regime ($\gamma = 0.999$). Left: final error vs. $\kappa$. Right: iteration curves at $\kappa = 0.05$. Adjacent source text: … weighted FQE reweights each regression step using the exact stationary density ratio $w(s, a) = \frac{d_{\pi}(s)\,\pi(a \mid s)}{\hat{\rho}_S(s)\,\mu(a \mid s)}$, where $\hat{\rho}_S$ is the empirical state marginal induced by the dataset. Both methods use ridge regularization with $\lambda = 10^{-6}$ and are initialized at $\theta_0 = 0$. We sweep the discount factor …

**Figure 6.** Final error versus stationary overlap for …

**Figure 7.** Final error versus stationary overlap for …

**Figure 8.** Severe norm-mismatch regime with estimated stationary density ratios: final error versus $\gamma$ (left) and iteration curves (right). Adjacent source text: Integrating with respect to $\mu$ and using stationarity $\mu = \mu(\pi P)$, $\|\pi P h\|_{2,\mu}^2 = \int (\pi P h)^2 \, d\mu \le \int \pi P(h^2) \, d\mu = \int h^2 \, d\mu = \|h\|_{2,\mu}^2$. Thus $\pi P$ is nonexpansive: $\|\pi P h\|_{2,\mu} \le \|h\|_{2,\mu}$. For the Bellman operator, $\mathcal{T}Q_1 - \mathcal{T}Q_2 = \gamma\, \pi P (Q_1 - Q_2) = \gamma\, \pi P h$. Taking norms and applying nonexpansiven…

Original abstract

Fitted $Q$-evaluation (FQE) is a standard regression-based tool for off-policy evaluation, but existing stability guarantees often rely on Bellman completeness, a strong closure condition that can fail under function approximation. We study an alternative route: changing the norm used in the regression step. The policy-evaluation Bellman operator is contractive in the $L^2$ norm induced by the target policy's stationary state-action distribution, whereas standard off-policy FQE projects Bellman targets in the behavior-distribution norm. We propose stationary-weighted FQE, which reweights each Bellman regression by the stationary target-to-behavior density ratio. The method preserves FQE's modular supervised-learning form while aligning the fitted projection with that contractive norm. We prove finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The bound separates finite-iteration, statistical, approximation, and weight-estimation errors, and shows that ratio-estimation error is attenuated when the inherent Bellman error is small. Controlled experiments show that stationary weighting can stabilize FQE and reduce value error when behavior-norm regression overemphasizes regions rarely visited by the target policy.
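The norm-mismatch issue in the abstract can be seen in a tiny worked example. The sketch below uses a textbook-style two-state, single-action chain with one scalar feature per state; it is an illustration chosen for this note, not an experiment from the paper, and every name in it (`projected_bellman_factor`, `phi`, `d_behavior`, `d_stationary`) is hypothetical. With zero rewards, one projected Bellman step multiplies the parameter by a scalar, so it is enough to compare that scalar under the two weightings.

```python
import numpy as np

def projected_bellman_factor(phi, P, d, gamma):
    """Multiplier applied to theta by one step theta -> Pi_d T(phi * theta),
    for scalar features, zero rewards, and regression weights d."""
    numerator = phi @ (d * (P @ phi))   # <phi, P phi> in the d-weighted inner product
    denominator = phi @ (d * phi)       # <phi, phi> in the d-weighted inner product
    return gamma * numerator / denominator

phi = np.array([1.0, 2.0])                   # one scalar feature per state
P = np.array([[0.0, 1.0], [0.05, 0.95]])     # transitions under the target policy
gamma = 0.95

d_behavior = np.array([0.5, 0.5])            # uniform behavior distribution (assumed)
d_stationary = np.array([0.05, 1.0]) / 1.05  # stationary distribution of P

factor_behavior = projected_bellman_factor(phi, P, d_behavior, gamma)
factor_stationary = projected_bellman_factor(phi, P, d_stationary, gamma)
print(factor_behavior, factor_stationary)    # roughly 1.12 and 0.94
```

In this example the behavior-weighted step expands the parameter at every iteration, while the stationary-weighted step shrinks it by a factor no larger than $\gamma$, as the nonexpansiveness argument predicts.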

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Stationary reweighting aligns FQE regression with the target distribution to drop the Bellman completeness assumption, but the bound's error attenuation weakens precisely under large misspecification.

The main takeaway is that reweighting each Bellman regression step by the target policy's stationary density ratio lets them prove linear convergence to the projected fixed point without Bellman completeness. The paper keeps the usual supervised-learning form of FQE but swaps the norm so the projection matches the contractive operator under the target measure. That is the concrete change from prior FQE work. They also give an explicit finite-sample bound that splits iteration error, statistical error, approximation error, and weight-estimation error, which is cleaner than many existing analyses. The controlled experiments show the reweighting can reduce over-emphasis on regions the target rarely visits. That part is useful and worth testing in practice.

The soft spot sits in the attenuation claim for the ratio-estimation term. The recursion multiplies that error by a factor that approaches 1 as the inherent Bellman residual grows, so the advertised robustness shrinks exactly when misspecification is large and completeness fails. In those regimes the bound starts to look more like standard importance sampling without extra protection. Estimating the density ratio accurately is also left as a practical requirement that could dominate in high dimensions.

This paper is for people already working on off-policy evaluation who keep running into function-approximation instability. A reader who wants a modular tweak with some theory backing will get value from the error decomposition and the reweighting idea.

The argument is internally consistent on its own terms, so I would send it to a serious referee rather than desk-reject it. Referees can check the full derivation and push on how often the attenuation actually helps in realistic misspecification regimes.

Referee Report

3 major / 2 minor

Summary. The paper introduces stationary-weighted Fitted Q-Evaluation (FQE), which reweights Bellman regression targets by the stationary target-to-behavior density ratio to align the projection with the L2 norm induced by the target policy's stationary distribution. It claims finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without Bellman completeness, via an error bound that decomposes iteration, statistical, approximation, and weight-estimation terms, with the ratio-estimation error attenuated by small inherent Bellman error. Controlled experiments indicate improved stability and lower value error relative to behavior-norm FQE.

Significance. If the finite-sample bound and attenuation property hold, the work offers a modular, supervised-learning-compatible route to stable off-policy evaluation that relaxes Bellman completeness, a common failure mode under function approximation. The explicit separation of error sources and the potential robustness to weight estimation error could inform practical FQE implementations and analysis of projected Bellman operators in misspecified regimes.

major comments (3)

§4 (main convergence theorem): the advertised attenuation of ratio-estimation error by the inherent Bellman error relies on a multiplicative factor in the error recursion that approaches 1 as the projected Bellman residual grows; under the large-misspecification regimes that motivate the method, this factor does not shrink, so the weight term can dominate and the bound reduces to a standard importance-sampling form without the claimed robustness.
§3.2 (weight estimation procedure): the assumption that the target policy induces a well-defined stationary distribution and that the density ratio can be estimated with error small enough not to dominate the bound is load-bearing, yet the paper provides no quantitative conditions under which the estimation error remains controlled when the behavior and target distributions differ substantially.
§5 (experiments): the reported stabilization and value-error reduction are shown only for controlled synthetic settings; without ablation on the magnitude of inherent Bellman error or on the accuracy of the ratio estimator, it is unclear whether the empirical gains persist precisely in the misspecification regimes where the theoretical attenuation is weakest.

minor comments (2)

Notation for the stationary density ratio (e.g., w(s,a)) is introduced without an explicit definition of the support or integrability conditions needed for the reweighting to be well-defined.
The abstract and introduction use 'linear convergence' without clarifying whether the rate is with respect to iteration count, sample size, or both; a short clarifying sentence would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's insightful comments. Below we address each major comment, offering clarifications and committing to revisions where the manuscript can be strengthened.

Referee: §4 (main convergence theorem): the advertised attenuation of ratio-estimation error by the inherent Bellman error relies on a multiplicative factor in the error recursion that approaches 1 as the projected Bellman residual grows; under the large-misspecification regimes that motivate the method, this factor does not shrink, so the weight term can dominate and the bound reduces to a standard importance-sampling form without the claimed robustness.

Authors: We agree with the referee's observation regarding the behavior of the multiplicative factor. The attenuation of the ratio-estimation error term is indeed stronger when the inherent Bellman error is small. In regimes of large misspecification, the bound does revert to a form similar to standard importance sampling. However, even in such cases, the stationary-weighted projection aligns the fixed point with the target distribution's norm, which can still provide benefits in terms of stability compared to behavior-norm FQE. We will revise the statement of the main theorem and the surrounding discussion in §4 to make the dependence on the inherent Bellman error explicit and to avoid overstating the robustness in high-misspecification settings. Revision: partial.

Referee: §3.2 (weight estimation procedure): the assumption that the target policy induces a well-defined stationary distribution and that the density ratio can be estimated with error small enough not to dominate the bound is load-bearing, yet the paper provides no quantitative conditions under which the estimation error remains controlled when the behavior and target distributions differ substantially.

Authors: The referee correctly identifies that the paper assumes the ratio estimation error is sufficiently small without providing explicit quantitative conditions for when this holds under substantial distribution shift. To address this, we will add a new subsection or paragraph in §3.2 that provides sufficient conditions for the ratio estimator, drawing on existing results from the density ratio estimation literature (such as bounds under Lipschitz assumptions or using minimax estimators). This will include a quantitative bound on the allowable estimation error in terms of other problem parameters to ensure it does not dominate the overall bound. We plan to make this a full revision to the section. Revision: yes.

Referee: §5 (experiments): the reported stabilization and value-error reduction are shown only for controlled synthetic settings; without ablation on the magnitude of inherent Bellman error or on the accuracy of the ratio estimator, it is unclear whether the empirical gains persist precisely in the misspecification regimes where the theoretical attenuation is weakest.

Authors: We acknowledge the limitation in the experimental section. The current experiments use synthetic environments to isolate the effects, but lack targeted ablations varying the inherent Bellman error (e.g., by changing the function class capacity) and the ratio estimation accuracy (e.g., by simulating noisy ratio estimates). In the revised version, we will include additional figures and tables in §5 with such ablations, showing performance as a function of these quantities. This will help validate the theory in the relevant regimes. Revision: yes.
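An ablation on ratio accuracy of the kind mentioned here could, for instance, start from exact ratios and corrupt them with controlled noise. The helper below is one hypothetical way to do that, not a procedure described in the paper: the name `perturb_ratios`, the multiplicative log-normal noise model, and the renormalization to empirical mean 1 are all assumptions made for illustration.

```python
import numpy as np

def perturb_ratios(w, sigma, rng):
    """Return noisy density ratios: multiplicative log-normal noise with scale
    sigma, rescaled so the perturbed ratios average to 1 over the sample."""
    noisy = w * np.exp(sigma * rng.standard_normal(w.shape))
    return noisy / noisy.mean()

rng = np.random.default_rng(0)
exact = np.array([0.5, 1.0, 1.5])           # toy exact ratios with mean 1
for sigma in (0.0, 0.1, 0.5):
    noisy = perturb_ratios(exact, sigma, rng)
    # Relative L2 error of the perturbed ratios, a natural x-axis for the ablation.
    print(sigma, np.sqrt(np.mean((noisy - exact) ** 2)))
```

Sweeping `sigma` then gives a family of ratio estimators of decreasing accuracy against which the final value error can be plotted.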

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper defines the target as the fixed point of the stationary-weighted Bellman projection under the target policy's stationary distribution, which is constructed independently of any fitted value function. The finite-sample linear convergence bound is obtained via standard contraction-mapping arguments in the reweighted L2 norm, with an explicit error decomposition into iteration, statistical, approximation, and weight-estimation terms. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and the central result does not rename a known empirical pattern. The analysis remains self-contained against external benchmarks such as the contraction property of the Bellman operator in the appropriate norm.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim rests on standard MDP assumptions plus the practical ability to estimate density ratios; no new entities are postulated and no parameters are fitted directly to the value target.

free parameters (1)

density ratio estimator
The method requires an estimator for the stationary target-to-behavior ratio; its accuracy enters the bound but is not fitted to the value function itself.

axioms (2)

domain assumption: Existence of a unique stationary distribution under the target policy
Invoked to define the L2 norm in which the Bellman operator is contractive.
standard math: Standard MDP transition and reward structure
Used to ensure the Bellman operator is well-defined and contractive in the chosen norm.
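The role of the first axiom can be checked numerically on a random chain. The sketch below is an illustrative check under assumed names (`P`, `mu`, `norm_mu`, `h`), not code from the paper: it draws a strictly positive row-stochastic matrix, computes its stationary distribution, and compares the stationary-norm size of a random function before and after applying the transition operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Rows drawn from a Dirichlet distribution: row-stochastic with positive entries,
# so the chain has a unique stationary distribution.
P = rng.dirichlet(np.ones(n), size=n)

# Stationary distribution: left eigenvector of P for the eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def norm_mu(h):
    """L2 norm weighted by the stationary distribution."""
    return np.sqrt(np.sum(mu * h ** 2))

h = rng.standard_normal(n)
lhs = norm_mu(P @ h)   # size of the one-step expectation of h
rhs = norm_mu(h)       # size of h itself
print(lhs <= rhs)
```

The inequality follows from Jensen's inequality together with stationarity, so multiplying by $\gamma < 1$ gives a contraction in this norm; the same check under a non-stationary weighting carries no such guarantee.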


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1] Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075.
[2] Amortila, P., Foster, D. J., Jiang, N., Sekhari, A., and Xie, T. Harnessing density ratios for online reinforcement learning. arXiv preprint arXiv:2401.09681.
[3] Bibaut, A., Petersen, M., Vlassis, N., Dimakopoulou, M., and van der Laan, M. Sequential causal inference in a single world of connected units. arXiv preprint arXiv:2101.07380.
[4] Bibaut, A. F. and van der Laan, M. J. Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244.
[5] Di, Q., Zhao, H., He, J., and Gu, Q. Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. arXiv preprint arXiv:2310.01380.
[6] Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
[7] Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR.
[8] Glynn, P. W. and Henderson, S. G. Estimation of stationary densities for Markov chains. In 1998 Winter Simulation Conference. Proceedings (Cat. No. 98CH36274), volume 1, pp. 647–652. IEEE.
[9] Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268. Elsevier.
[10] Hines, O. J. and Miles, C. H. Learning density ratios in causal inference using Bregman-Riesz regression. arXiv preprint arXiv:2510.16127.
[11] Lee, J., Paduraru, C., Mankowitz, D. J., Heess, N., Precup, D., Kim, K.-E., and Guez, A. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957.
[12] Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
[13] Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019a. Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b...
[14] van der Laan, L., Hubbard, D., Tran, A., Kallus, N., and Bibaut, A. Semiparametric double reinforcement learning with applications to long-term causal inference. arXiv preprint arXiv:2501.06926, 2025a. van der Laan, L., Kallus, N., and Bibaut, A. Inverse reinforcement learning using just classification and a few regressions. arXiv preprint arXiv:2509.21172,...
[15] Yu, F., Mehta, R., Luedtke, A., and Harchaoui, Z. Stochastic gradients under nuisances. arXiv preprint arXiv:2508.20326.
[16] Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072.
[17] Experimental setup. We use a modified Baird-style 7-state MDP (Baird et al., 1995). There is a hub state 0 and six spokes 1, …, 6. The solid action deterministically transitions to state 0 from any state, while the dashed action transitions uniformly among the spokes. The target policy always chooses the solid action, giving a stationary distribution µ th...
[18] For each state–action pair, 5 distinct next states are sampled uniformly without replacement and assigned transition probabilities via a Dirichlet distribution. Rewards are drawn i.i.d. from $N(0,1)$. For each MDP instance, the target policy $\pi$ is sampled independently for each state from a Dirichlet distribution over actions. The ground-truth value functions...
[19] We sweep the discount factor $\gamma \in \{0.90, 0.925, 0.95, 0.96, 0.97, 0.98, 0.99\}$ and run $K = 200$ fitted iterations. All learning curves and final errors are averaged over $M = 50$ independent random seeds for the MDP, policies, and dataset generation. Errors are reported in the stationary norm $\|Q^{(K)} - Q^\star\|_{2,\mu_\pi}$. A.2.2 Additional experiments. In this experiment, sta...
[20] C.2 Misspecification error bound. Proof of Lemma 1. Write $Q^\star_{\mathcal{F}} - Q^\star = \mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T} Q^\star = \mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star + \mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star$. Taking $L^2(\mu)$ norms and applying the triangle inequality, $\|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu} \le \|\mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star\|_{2,\mu} + \|\mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star\|_{2,\mu}$. By the contraction property of Lemma 5, $\|\mathcal{T}_{\mathcal{F}} Q^\star_{\mathcal{F}} - \mathcal{T}_{\mathcal{F}} Q^\star\|_{2,\mu} \le \gamma \|Q^\star_{\mathcal{F}} - Q^\star\|_{2,\mu}$, and by definition of $\mathcal{T}_{\mathcal{F}}$, $\mathcal{T}_{\mathcal{F}} Q^\star - \mathcal{T} Q^\star = \Pi_{\mathcal{F}}$ ...
