Bellman Calibration for $V$-Learning in Offline Reinforcement Learning

Lars van der Laan; Nathan Kallus

arXiv: 2512.23694 · v2 · submitted 2025-12-29 · 📊 stat.ML · cs.LG · econ.EM

Pith reviewed 2026-05-16 19:06 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · econ.EM

keywords offline reinforcement learning · value function estimation · Bellman calibration · doubly robust estimation · post-hoc recalibration · nonparametric rates · model-agnostic correction

The pith

Bellman calibration error is controlled at one-dimensional nonparametric rates without Bellman completeness or value-function realizability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline reinforcement learning struggles with reliable long-horizon value prediction because bootstrapping, function approximation, and distribution shift interact in ways that standard guarantees cannot handle without strong assumptions. The paper defines Bellman calibration as the requirement that states assigned similar predicted values must have average Bellman targets that match those predictions, yielding a scalar diagnostic of systematic numerical miscalibration. This error is estimated from off-policy data via doubly robust Bellman target estimates, after which a one-dimensional recalibration map (histogram or isotonic) is fitted to the original predictions. The resulting procedure, Iterated Bellman Calibration, delivers finite-sample value-error bounds that separate statistical estimation, finite-iteration, and approximation errors. A sympathetic reader would care because the approach works model-agnostically and avoids the usual demands of Bellman completeness or realizability.

Core claim

Bellman calibration is a weak reliability criterion requiring that states with similar predicted values have average Bellman targets agreeing with those predictions; the associated scalar calibration error can be estimated doubly robustly from off-policy data, and any learned value predictor can be post-hoc recalibrated by fitting a one-dimensional map of its outputs, achieving control of calibration error at one-dimensional nonparametric rates without Bellman completeness or value-function realizability.
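
One way to write the criterion in symbols is sketched below. The notation is an assumption made here for illustration: $T_\pi$ for the Bellman operator of the target policy $\pi$, $\gamma$ for the discount factor, and $\Gamma_0$ for the map that averages Bellman targets over states sharing a predicted value (chosen to match the $\Gamma_0$ and $\hat v$ that appear in the proof excerpts further down the page). The norm is left unspecified; an $L^2$ norm under the state distribution would be a natural choice.

```latex
\[
  (T_\pi v)(s) = \mathbb{E}_\pi\!\left[ R + \gamma\, v(S') \mid S = s \right],
  \qquad
  \Gamma_0(v)(s) = \mathbb{E}\!\left[ (T_\pi v)(S) \mid v(S) = v(s) \right].
\]
\[
  v \text{ is Bellman calibrated if } \Gamma_0(v) = v,
  \qquad
  \text{calibration error: } \lVert \Gamma_0(v) - v \rVert .
\]
```

Read this way, calibration asks only for agreement on average within level sets of $v$, which is weaker than requiring $T_\pi v = v$ state by state.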

What carries the argument

Bellman calibration error, a scalar measure of systematic miscalibration between predicted values and doubly robust Bellman targets, corrected by a one-dimensional recalibration map (histogram or isotonic regression).
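
A minimal sketch of how such a scalar error might be estimated, assuming an AIPW-style doubly robust pseudo-outcome and a quantile-binned plug-in. The function names, the inputs `rho`, `q_sa`, `q_pi`, and the binning rule are all hypothetical; the page does not give the paper's exact estimator, so this is the generic construction rather than the authors' method.

```python
import numpy as np

def dr_bellman_targets(r, v_next, rho, q_sa, q_pi, gamma):
    """AIPW-style pseudo-outcome for (T_pi v)(s) on each transition.

    r      : observed rewards
    v_next : predictor evaluated at next states
    rho    : importance ratios pi(a|s) / b(a|s)            (assumed given)
    q_sa   : estimate of E[r + gamma * v(s') | s, a]       (assumed given)
    q_pi   : q_sa averaged over actions drawn from pi(.|s) (assumed given)
    """
    # outcome-model prediction plus an importance-weighted residual correction
    return q_pi + rho * (r + gamma * v_next - q_sa)

def binned_calibration_error(v_pred, targets, n_bins=10):
    """Root-mean-square gap between bin-average target and bin-average prediction."""
    edges = np.quantile(v_pred, np.linspace(0.0, 1.0, n_bins + 1))
    # use interior edges only, so every point lands in a bin 0..n_bins-1
    bins = np.searchsorted(edges[1:-1], v_pred, side="right")
    sq_gap = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():  # ties in v_pred can leave a bin empty
            gap = targets[mask].mean() - v_pred[mask].mean()
            sq_gap += mask.mean() * gap ** 2  # weight by the bin's share of points
    return float(np.sqrt(sq_gap))
```

With targets that equal the predictions the error is zero, and a constant offset of 0.5 between targets and predictions yields an error of 0.5 regardless of the number of bins.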

If this is right

Any learned value predictor can be post-hoc recalibrated model-agnostically to reduce systematic miscalibration.
Finite-sample guarantees separate statistical estimation error from approximation error in the original predictor.
Calibration improves value prediction when the original predictor has sufficient coverage but insufficient expressivity.
Histogram and isotonic regression serve as practical recalibration methods with explicit convergence rates.
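
To make the isotonic option concrete, here is a small sketch of a monotone one-dimensional recalibration map fitted with the pool-adjacent-violators algorithm, a standard way to compute an isotonic least-squares fit. The names `pava` and `fit_isotonic_map` are hypothetical, and the sketch assumes positive weights and does not pre-pool tied predictions; it illustrates the kind of map involved, not the paper's implementation.

```python
import numpy as np

def pava(y, w=None):
    """Weighted least-squares non-decreasing fit to y (pool-adjacent-violators)."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge backwards while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

def fit_isotonic_map(v_pred, targets):
    """Return a step function theta that is non-decreasing in the prediction."""
    order = np.argsort(v_pred, kind="stable")
    x_sorted = v_pred[order]
    fitted = pava(targets[order])

    def theta(v):
        # right-continuous step function, clamped outside the training range
        idx = np.searchsorted(x_sorted, v, side="right") - 1
        return fitted[np.clip(idx, 0, len(fitted) - 1)]

    return theta
```

For example, `pava([1, 3, 2, 4])` pools the violating pair into its mean and returns `[1, 2.5, 2.5, 4]`. The recalibrated predictor is then `theta(v_pred)`, a function of the original prediction alone.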

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The one-dimensional recalibration idea may apply to other sequential prediction settings where miscalibration is low-dimensional.
Doubly robust target estimation could be replaced by other robust estimators if coverage is weaker than assumed.
Connections to calibration methods in supervised learning suggest new diagnostics for detecting when recalibration will succeed or fail.

Load-bearing premise

That doubly robust Bellman target estimates can be formed from the given off-policy data and that a one-dimensional map suffices to capture the dominant miscalibration.

What would settle it

An experiment in which the recalibrated value function fails to reduce out-of-sample value error even when the estimated calibration error is driven to the claimed nonparametric rate.

Figures

Figures reproduced from arXiv: 2512.23694 by Lars van der Laan, Nathan Kallus.

Original abstract

Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Bellman completeness or realizability. We introduce Bellman calibration, a weak reliability criterion requiring that states assigned similar predicted values have average Bellman targets that agree with those predictions. This criterion yields a scalar calibration error for diagnosing systematic numerical miscalibration, which we estimate from off-policy data using doubly robust Bellman target estimates. We then propose Iterated Bellman Calibration, a model-agnostic post-hoc procedure that recalibrates any learned value predictor by fitting a one-dimensional map of its original prediction, with histogram and isotonic variants. We prove finite-sample guarantees showing that Bellman calibration error is controlled at one-dimensional nonparametric rates without Bellman completeness or value-function realizability. Our value-error bounds separate statistical estimation, finite-iteration, and approximation errors, clarifying when calibration improves value prediction and when its gains are limited by the information in the original predictor or insufficient coverage.
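
The iterated procedure described in the abstract can be pictured with the following sketch of a histogram variant. It is an illustration under stated assumptions, not the authors' algorithm: the function name, the quantile binning, the plain temporal-difference target `r + gamma * v(s')`, and the optional importance weights `rho` (assumed positive) are all hypothetical choices, and a doubly robust target could be substituted for the plain one.

```python
import numpy as np

def iterated_histogram_calibration(v_s, v_next, r, gamma, rho=None,
                                   n_bins=10, n_iter=50):
    """Return (edges, theta); a state with base prediction v gets value theta[bin(v)].

    v_s, v_next : base predictor evaluated at current and next states
    r           : observed rewards
    rho         : optional importance weights (defaults to 1, i.e. on-policy data)
    """
    rho = np.ones_like(r) if rho is None else rho
    # interior quantile edges of the base predictions define the bins
    edges = np.quantile(v_s, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    bin_s = np.searchsorted(edges, v_s, side="right")
    bin_next = np.searchsorted(edges, v_next, side="right")
    cal_next = v_next.copy()  # iteration 0 bootstraps from the base predictor
    theta = np.zeros(n_bins)
    for _ in range(n_iter):
        targets = r + gamma * cal_next
        for b in range(n_bins):
            mask = bin_s == b
            if mask.any():  # empty bins keep their previous value
                theta[b] = np.average(targets[mask], weights=rho[mask])
        cal_next = theta[bin_next]  # recalibrated values at next states
    return edges, theta
```

On a toy problem with a single bin, reward 1 on every transition, and `gamma = 0.5`, the iterates follow `theta <- 1 + 0.5 * theta` and converge to 2, which is the exact discounted value.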

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Bellman calibration gives a post-hoc 1D recalibration fix for value estimates in offline RL that avoids Bellman completeness and realizability.

The core contribution is a calibration criterion where states with similar predicted values must have matching average Bellman targets, plus an iterated one-dimensional recalibration procedure (histogram or isotonic) that corrects any existing value predictor. They estimate the scalar calibration error from off-policy data via doubly robust Bellman targets and prove finite-sample bounds that control this error at one-dimensional nonparametric rates. The value-error decomposition cleanly separates statistical estimation, finite-iteration, and approximation terms, which is useful for seeing when the fix actually helps versus when the original predictor or data coverage limits gains. This is new relative to the usual offline RL literature on completeness assumptions.

The model-agnostic post-hoc nature is a practical plus; it can be layered on top of existing V-learning methods without retraining. The argument structure holds together internally once the calibration definition and doubly robust estimates are granted.

A soft spot is that the univariate map may miss structured miscalibration that depends on state features beyond the scalar prediction, and forming reliable doubly robust targets still requires decent coverage and nuisance estimation rates that could dominate in sparse data regimes. The paper is aimed at offline RL researchers who want lighter alternatives to strong realizability assumptions. It shows clear thinking on the error sources and deserves a serious referee to check the proof details and the practical scope of the one-dimensional correction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Bellman calibration, a weak reliability criterion for value predictors in offline RL requiring that states with similar predicted values have average Bellman targets matching those predictions. It defines a scalar calibration error estimated via doubly robust Bellman targets from off-policy data, then proposes Iterated Bellman Calibration, a post-hoc recalibration procedure fitting a univariate map (histogram or isotonic variants) to any learned value predictor. Finite-sample guarantees are claimed showing calibration error controlled at one-dimensional nonparametric rates without Bellman completeness or value-function realizability; value-error bounds separate statistical estimation, finite-iteration, and approximation errors.

Significance. If the finite-sample results hold, the work is significant for relaxing strong assumptions (Bellman completeness, realizability) that limit applicability in offline RL. The model-agnostic post-hoc nature and explicit error decomposition clarify when recalibration yields gains versus when it is limited by predictor information or coverage. The reduction to 1D nonparametric rates for the calibration step is a notable technical relaxation, and the use of doubly robust estimation for the scalar error supports practical deployment.

major comments (2)

[Abstract] Abstract and the finite-sample guarantee section: the central claim that Bellman calibration error is controlled at one-dimensional nonparametric rates (without completeness or realizability) is load-bearing for the value bounds; the manuscript should state the precise rate (e.g., the exact exponent for histogram binning or isotonic regression) and the minimal sample-size condition under which it holds, as the current sketch leaves the dependence on the univariate map's complexity implicit.
[Value-error bounds] Value-error bounds paragraph: the separation into statistical, iteration, and approximation errors follows from the calibration control, but the approximation-error term must be shown not to dominate under the paper's own assumptions on the original predictor; an explicit comparison to the corresponding term in standard FQI bounds would verify the claimed improvement.

minor comments (3)

Clarify the exact definition of the scalar calibration error (how the one-dimensional map is fitted to the doubly robust targets) in the main text, as the abstract description is high-level.
Add a short paragraph contrasting the proposed method with existing calibration techniques in supervised learning (e.g., Platt scaling, isotonic regression) to highlight the RL-specific doubly robust construction.
The weakest assumption—that doubly robust targets can be formed from the given off-policy data—should be stated as an explicit coverage/overlap condition with a reference to the relevant proposition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the constructive comments on our manuscript. We address each major comment below and will incorporate the suggested clarifications in the revised version.

Referee: [Abstract] Abstract and the finite-sample guarantee section: the central claim that Bellman calibration error is controlled at one-dimensional nonparametric rates (without completeness or realizability) is load-bearing for the value bounds; the manuscript should state the precise rate (e.g., the exact exponent for histogram binning or isotonic regression) and the minimal sample-size condition under which it holds, as the current sketch leaves the dependence on the univariate map's complexity implicit.

Authors: We agree that the rates and sample-size conditions should be stated explicitly rather than left implicit. In the revised manuscript we will add the precise convergence rates for the calibration error (n^{-1/3} for histogram binning with appropriately chosen bins and n^{-2/3} for isotonic regression) together with the minimal sample-size requirements under which these rates hold, both in the abstract and in the finite-sample guarantee section. This will make the dependence on the complexity of the univariate map fully transparent. revision: yes
Referee: [Value-error bounds] Value-error bounds paragraph: the separation into statistical, iteration, and approximation errors follows from the calibration control, but the approximation-error term must be shown not to dominate under the paper's own assumptions on the original predictor; an explicit comparison to the corresponding term in standard FQI bounds would verify the claimed improvement.

Authors: We thank the referee for this suggestion. We will expand the value-error bounds paragraph to include an explicit comparison of the approximation-error term with the corresponding term appearing in standard Fitted Q-Iteration bounds. Under the paper's assumptions the approximation error remains controlled by the inherent bias of the original predictor and does not dominate provided the predictor satisfies the mild coverage condition already stated; the added comparison will clarify the improvement relative to FQI. revision: yes
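
As a quick arithmetic check on the exponents quoted in the first response above, taken at face value (the response does not say in which norm each rate is measured, so the comparison is illustrative only):

```latex
\[
  n = 10^{6}: \qquad n^{-1/3} = 10^{-2}, \qquad n^{-2/3} = 10^{-4}.
\]
\[
  \text{Halving an error of order } n^{-1/3} \text{ takes } 2^{3} = 8 \text{ times more data;}
  \quad
  \text{for order } n^{-2/3} \text{ it takes } 2^{3/2} \approx 2.83 \text{ times more.}
\]
```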

Circularity Check

0 steps flagged

No significant circularity detected

The derivation begins with the definition of Bellman calibration error as the discrepancy between predicted values and doubly-robust Bellman targets averaged over states with similar predictions. This scalar error is estimated from off-policy data using standard doubly robust estimators whose finite-sample rates are one-dimensional nonparametric and do not invoke Bellman completeness or value-function realizability. The subsequent Iterated Bellman Calibration step fits a univariate (histogram or isotonic) map to the original predictor; the value-error bounds then decompose additively into statistical estimation error (controlled at the 1D rate), finite-iteration error, and approximation error from the original predictor. None of these steps reduces by the paper's own equations to a quantity defined solely in terms of fitted parameters, nor does any load-bearing claim rest on a self-citation chain. The argument is therefore self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the ability to form doubly robust Bellman targets from off-policy data and on the assumption that miscalibration is largely one-dimensional.

axioms (2)

domain assumption: Off-policy data permits construction of doubly robust Bellman target estimates with controlled variance.
Required for estimating the calibration error from available trajectories.
domain assumption: A one-dimensional map captures the dominant systematic miscalibration.
Underpins the histogram and isotonic recalibration procedures.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Calibeating Prediction-Powered Inference
stat.ML 2026-04 unverdicted novelty 7.0

Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

A variant of the Wang-Foster-Kakade lower bound for the discounted setting

Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075,

work page arXiv 2011
[2]

On risk bounds in isotonic and other shape restricted regression problems

Chatterjee, S., Guntuboyina, A., and Sen, B. Improved risk bounds in isotonic regression. arXiv preprint arXiv:1311.3765,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Confidence intervals for multiple isotonic regression and other monotone models. The Annals of Statistics, 49(4):2021–2052,

Deng, H., Han, Q., and Zhang, C.-H. Confidence intervals for multiple isotonic regression and other monotone models. The Annals of Statistics, 49(4):2021–2052,

work page 2021
[4]

Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. arXiv preprint arXiv:2310.01380,

Di, Q., Zhao, H., He, J., and Gu, Q. Pessimistic nonlinear least-squares value iteration for offline reinforcement learning. arXiv preprint arXiv:2310.01380,

work page arXiv
[5]

Offline reinforcement learning: Fundamental barriers for value function approximation

Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919,

work page arXiv
[6]

Off-policy deep reinforcement learning without exploration

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052–2062. PMLR,

work page 2052
[7]

Gordon, G. J. Stable function approximation in dynamic programming. In Machine learning proceedings 1995, pp. 261–268. Elsevier,

work page 1995
[8]

Lichtenstein, S., Fischhoff, B., and Phillips, L. D. Calibration of probabilities: The state of the art. In Decision Making and Change in Human Affairs: Proceedings of the Fifth Research Conference on Subjective Probability, Utility, and Decision Making, Darmstadt, 1–4 September, 1975, pp. 275–324. Springer,

work page 1975
[9]

Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999,

work page 1999
[10]

and Roth, A

Noarov, G. and Roth, A. The scope of multicalibration: Characterizing multicalibration via property elicitation. arXiv preprint arXiv:2302.08507,

work page arXiv
[11]

and Schwartz, A

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 connectionist models summer school, pp. 255–263. Psychology Press,

work page 1993
[12]

Evaluating diagnostic tests and prediction models

ISSN 1741-7015. On behalf of Topic Group “Evaluating diagnostic tests and prediction models” of the STRATOS initiative. van der Laan, L. and Alaa, A. Generalized Venn and Venn-Abers calibration with applications in conformal prediction. arXiv preprint arXiv:2502.05676,

work page arXiv
[13]

Stabilized inverse probability weighting via isotonic calibration. arXiv preprint arXiv:2411.06342, 2024a

van der Laan, L., Lin, Z., Carone, M., and Luedtke, A. Stabilized inverse probability weighting via isotonic calibration. arXiv preprint arXiv:2411.06342, 2024a. van der Laan, L., Luedtke, A., and Carone, M. Doubly robust inference via calibration. arXiv preprint arXiv:2411.02771, 2024b. van der Laan, L., Hubbard, D., Tran, A., Kallus, N., and Bibaut, A. S...

work page arXiv 2011
[14]

Whitehouse, J., Jung, C., Syrgkanis, V., Wilder, B., and Wu, Z. S. Orthogonal causal calibration. arXiv preprint arXiv:2406.01933,

work page arXiv
[15]

Auro: Reinforcement learning for adaptive user retention optimization in recommender systems

Xue, Z., Cai, Q., Yang, B., Hu, L., Jiang, P., Gai, K., and An, B. Auro: Reinforcement learning for adaptive user retention optimization in recommender systems. In Proceedings of the ACM on Web Conference 2025, pp. 391–401,

work page 2025
[16]

Bellman conformal inference: Calibrating prediction intervals for time series. arXiv preprint arXiv:2402.05203,

Yang, Z., Candès, E., and Lei, L. Bellman conformal inference: Calibrating prediction intervals for time series. arXiv preprint arXiv:2402.05203,

work page arXiv
[17]

Gendice: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072,

Zhang, R., Dai, B., Li, L., and Schuurmans, D. Gendice: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072,

work page arXiv 2002
[18]

By assumption, both $\hat v$ and $\widehat T_\pi$ are fixed (non-random) operators conditional on the training data, which is independent of the calibration sample $C_n$

... $\widehat T_\pi(f_2) - f_2 : f_1, f_2 \in \mathcal{F}_{B,\hat v}\}$. By assumption, both $\hat v$ and $\widehat T_\pi$ are fixed (non-random) operators conditional on the training data, which is independent of the calibration sample $C_n$. Consequently, the classes $\mathcal{F}_{B,\hat v}$ and $\widehat{\mathcal{G}}$ are non-random conditional on the training dataset. For any distribution $Q$ and any uniformly bounded function class $\mathcal{F}$, let $N(\varepsilon, \mathcal{F}, L^2(Q))$ ...

work page 1996
[19]

Then there exists a universal constant $C > 0$ such that, for all $u \ge 1$, with probability at least $1 - e^{-u^2}$, every $f \in \mathcal{F}$ satisfies $\frac{1}{n}\sum_{i=1}^{n} \{ f(O_i) - E[f(O_i)] \} \le C \left( \delta^2 + \delta \|f\| + \frac{u \|f\|}{\sqrt{n}} + \frac{M u^2}{n} \right)$

Suppose further that $n^{-1/2} \sqrt{\log\log(1/\delta)} = o(\delta)$. Then there exists a universal constant $C > 0$ such that, for all $u \ge 1$, with probability at least $1 - e^{-u^2}$, every $f \in \mathcal{F}$ satisfies $\frac{1}{n}\sum_{i=1}^{n} \{ f(O_i) - E[f(O_i)] \} \le C \left( \delta^2 + \delta \|f\| + \frac{u \|f\|}{\sqrt{n}} + \frac{M u^2}{n} \right)$. The following lemma bounds the localized Rademacher complexity in terms of the uniform entropy integral and is a direct conseque...

work page 2011
[20]

(2023) and van der Laan et al

See, for example, the proof of Lemma C.1 in Van Der Laan et al. (2023) and van der Laan et al. (2024a). Choosing $f$ appropriately yields $0 = \frac{1}{n}\sum_{i=1}^{n} \{ \Gamma_0(\hat v^{(K)})(S_i) - \hat v^{(K)}(S_i) \} \times \{ \widehat T_\pi(\hat v^{(K-1)})(O_i) - \hat v^{(K)}(S_i) \}$, which is the same basic equality as (10) in the proof of Theorem

work page 2023
[21]

The remainder of the argument proceeds along the same lines with minor modifications. Specifically, let $\mathcal{F}_{TV}$ denote the union of $\mathcal{F}_{\mathrm{iso}}$, which is uniformly bounded by $2M$ under Condition C1, with all functions of bounded total variation bounded by the constant $C$ in Condition C5. By van der Vaart & Wellner (1996), this class satisfies the uniform entropy int...

work page 1996
