Cost-optimal Sequential Testing via Doubly Robust Q-learning

Dian Jin; Doudou Zhou; Lu Tian; Tianxi Cai; Yingye Zheng; Yiran Zhang

arXiv: 2604.11165 · v2 · submitted 2026-04-13 · stat.ML · cs.AI · cs.LG · math.ST · stat.TH

Doudou Zhou, Yiran Zhang, Dian Jin, Yingye Zheng, Lu Tian, Tianxi Cai

Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3

keywords: sequential testing · cost-optimal policies · doubly robust estimation · Q-learning · inverse probability weighting · missing at random · policy learning · clinical decision making

The pith

A doubly robust Q-learning framework learns cost-optimal sequential testing policies from retrospective data with informative missingness using path-specific weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical decisions often require choosing which costly or invasive tests to perform and when to stop, yet retrospective records contain a test result only when that test was ordered, and ordering depends on earlier results, which creates dependent missingness. The paper builds a doubly robust Q-learning method that introduces path-specific inverse probability weights for each possible test trajectory; these weights normalize conditionally on the observed history. The weights are then paired with auxiliary contrast models to form orthogonal pseudo-outcomes, so that the resulting policy-value estimator remains unbiased whenever either the test-acquisition model or the contrast model is correctly specified. Oracle inequalities, convergence rates, regret bounds, and misclassification rates are derived for the stage-wise estimators and the final policy. Simulations and a prostate-cancer cohort application show lower cost-adjusted regret than standard weighted or complete-case baselines while preserving predictive accuracy.

Core claim

Under a sequential missing-at-random mechanism, path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a conditional normalization property are combined with auxiliary contrast models to produce orthogonal pseudo-outcomes. These pseudo-outcomes enable unbiased estimation of the value of any sequential testing policy when either the acquisition model or the contrast model is correctly specified. The resulting Q-learning procedure yields oracle inequalities for the contrast estimators together with convergence rates, regret bounds, and misclassification rates for the learned policy.
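The "either model" guarantee can be illustrated with the textbook single-stage augmented inverse probability weighting identity. The display below is a generic sketch, not the paper's stage-wise construction; the symbols are assumptions introduced here: H is the observed history, R indicates that the outcome Y is observed, \pi(H) and m(H) are working models for P(R = 1 | H) and E[Y | H], and \pi_0, \mu_0 are their true counterparts.

```latex
% Generic single-stage doubly robust pseudo-outcome (illustrative notation)
\tilde{Y} = m(H) + \frac{R}{\pi(H)}\,\bigl(Y - m(H)\bigr)

% Under missing at random given H, the conditional bias factorizes:
\mathbb{E}\bigl[\tilde{Y} \mid H\bigr] - \mu_0(H)
  = \Bigl(\frac{\pi_0(H)}{\pi(H)} - 1\Bigr)\,\bigl(\mu_0(H) - m(H)\bigr)
```

The bias is a product of the two model errors, so it vanishes when either \pi = \pi_0 or m = \mu_0. Presumably the paper's orthogonal pseudo-outcomes extend this product structure across stages and test paths.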

What carries the argument

Path-specific inverse probability weights combined with auxiliary contrast models to generate orthogonal pseudo-outcomes inside a Q-learning recursion.
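As a rough illustration of what a path-specific weight with conditional normalization might look like, the sketch below assumes a monotone testing sequence in which each test can be ordered only after the previous one. The function names `path_probabilities` and `path_weight` and the stage-wise probabilities are hypothetical and are not taken from the paper, whose weights may be defined differently.

```python
import numpy as np

def path_probabilities(p_stage):
    """Probabilities of each terminal path under a monotone testing sequence.

    p_stage[t] is the (assumed) probability that test t+1 is acquired, given
    the observed history and that tests 1..t were all acquired.
    Entry k of the result is the probability of stopping after exactly k tests.
    """
    p_stage = np.asarray(p_stage, dtype=float)
    n_tests = len(p_stage)
    probs = np.empty(n_tests + 1)
    reach = 1.0  # probability of reaching the current stage
    for t in range(n_tests):
        probs[t] = reach * (1.0 - p_stage[t])  # stop before acquiring test t+1
        reach *= p_stage[t]                    # continue to the next stage
    probs[n_tests] = reach                     # every test acquired
    return probs

def path_weight(p_stage, k):
    """Inverse probability weight for a subject whose terminal path has k tests."""
    return 1.0 / path_probabilities(p_stage)[k]

# Hypothetical history: 80% chance of test 1, then 50% chance of test 2.
probs = path_probabilities([0.8, 0.5])
print(probs, probs.sum())
print(path_weight([0.8, 0.5], 2))
```

Because the path probabilities sum to one for any history, the indicator of a given path multiplied by its weight has conditional expectation one, which is the simplest form of the normalization property described above.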

If this is right

Unbiased policy learning holds if the acquisition model is correct, regardless of whether the contrast model is correct.
Unbiased policy learning also holds if the contrast model is correct, regardless of whether the acquisition model is correct.
Stage-wise contrast estimators satisfy oracle inequalities that translate into finite-sample regret bounds and misclassification rates for the learned policy.
In the reported simulations, the method attains lower cost-adjusted regret than inverse-probability-weighted or complete-case Q-learning.
In a prostate cancer cohort the procedure selects testing sequences that reduce total cost while maintaining the same level of predictive accuracy as more expensive policies.
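The first two implications can be checked numerically in a toy single-stage setting. The simulation below is a minimal sketch under assumptions made here (a Gaussian covariate, a logistic acquisition mechanism, and true nuisance functions plugged in rather than estimated); it estimates a mean rather than a policy value and is not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true mean of y is 1
pi_true = expit(x)                        # true acquisition probability
r = rng.binomial(1, pi_true)              # 1 if y is observed (MAR given x)
y_obs = np.where(r == 1, y, 0.0)          # placeholder where y is missing

def aipw_mean(y_obs, r, pi, m):
    """Augmented IPW estimate of E[y] from working models pi and m."""
    return np.mean(m + r / pi * (y_obs - m))

m_true = 1.0 + 2.0 * x
m_wrong = np.zeros(n)          # deliberately misspecified outcome model
pi_wrong = np.full(n, 0.5)     # deliberately misspecified acquisition model

est_pi_ok = aipw_mean(y_obs, r, pi_true, m_wrong)       # acquisition model correct
est_m_ok = aipw_mean(y_obs, r, pi_wrong, m_true)        # outcome model correct
est_both_wrong = aipw_mean(y_obs, r, pi_wrong, m_wrong)
complete_case = y[r == 1].mean()

print(est_pi_ok, est_m_ok, est_both_wrong, complete_case)
```

With this setup the two estimates that have one correct model should land near the true mean of 1, while the doubly misspecified and complete-case estimates should be visibly biased upward, since subjects with larger x are both more likely to be observed and have larger outcomes.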

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weighting-and-contrast construction could be applied to other sequential decision problems that involve costly actions and history-dependent missingness, such as dynamic treatment regimes.
The conditional normalization property of the path-specific weights may simplify variance estimation and enable scalable computation when the number of possible test sequences grows large.
Combining the orthogonal pseudo-outcomes with modern function approximators such as neural networks could yield a doubly robust deep Q-learning variant for high-dimensional clinical state spaces.

Load-bearing premise

Test missingness follows a sequential missing-at-random process that depends only on the observed history up to each stage.
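One common way to formalize this premise is shown below. The notation is an assumption introduced here and may differ from the paper's Assumption 1: A_t is the indicator that the stage-t test is acquired, H_t is the history observed before stage t, and U collects everything not yet observed, including results of tests not yet ordered.

```latex
% Sequential missing at random (illustrative notation)
P\bigl(A_t = 1 \mid H_t,\, U\bigr) = P\bigl(A_t = 1 \mid H_t\bigr),
\qquad t = 1, \dots, T
```

In words, once the observed history is accounted for, the decision to order the next test carries no further information about unobserved quantities.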

What would settle it

A simulation in which missingness depends on unobserved factors shows that the policy regret of the doubly robust estimator does not converge to zero at the stated rate even when one of the two models is correctly specified.
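A much smaller version of such a check can be sketched in a single-stage mean-estimation problem. The example below is an assumption-laden toy, not the paper's setting and not a regret calculation: acquisition depends on a binary unobserved factor u, and both working models are the true functions of the observed covariate x, yet the doubly robust estimate stays biased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=n)
u = rng.choice([-1.0, 1.0], size=n)            # unobserved factor
y = 1.0 + 2.0 * x + u + rng.normal(size=n)     # true mean of y is 1
r = rng.binomial(1, expit(x + u))              # acquisition depends on u
y_obs = np.where(r == 1, y, 0.0)               # placeholder where y is missing

# Both working models are correct as functions of the observed covariate x.
pi_x = 0.5 * (expit(x + 1.0) + expit(x - 1.0))  # true P(r = 1 | x)
m_x = 1.0 + 2.0 * x                              # true E[y | x]
est_observed_only = np.mean(m_x + r / pi_x * (y_obs - m_x))

# Oracle that is allowed to condition on u, for comparison only.
est_oracle = np.mean(m_x + r / expit(x + u) * (y_obs - m_x))

print(est_observed_only, est_oracle)
```

The oracle weights, which use the unobserved factor, should recover a value near 1, while the estimate built only from observed history should not. This is consistent with the premise that double robustness protects against model misspecification but not against a failure of sequential missing at random.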

Figures

Figures reproduced from arXiv: 2604.11165 by Dian Jin, Doudou Zhou, Lu Tian, Tianxi Cai, Yingye Zheng, Yiran Zhang.

**Figure 2.** Scenario 2. Violin plots of average total loss across 50 repetitions at representative…

**Figure 3.** Distribution of COST-Q terminal testing paths across quintiles of the baseline…

Original abstract

Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a doubly robust Q-learning method with path-specific weights for cost-optimal sequential testing under sequential MAR, with decent empirical checks but no escape from the missingness assumption.

The core contribution is a framework that learns policies for when to order costly tests by combining path-specific inverse probability weights with auxiliary contrast models. The weights are normalized conditional on observed history, and the orthogonal pseudo-outcomes are meant to deliver unbiased policy value estimates if either the acquisition model or the contrast model is correct. The authors derive oracle inequalities, convergence rates, regret bounds, and misclassification rates, then show gains over weighted and complete-case baselines in simulations plus a prostate cancer cohort where testing costs drop without hurting accuracy.

Referee Report

2 major / 2 minor

Summary. The paper develops a doubly robust Q-learning framework for estimating cost-optimal sequential testing policies from retrospective data with informative missingness induced by prior test results. Under a sequential missing-at-random mechanism, it introduces path-specific inverse probability weights that account for heterogeneous trajectories and satisfy a normalization property conditional on observed history; these are combined with auxiliary contrast models to form orthogonal pseudo-outcomes enabling unbiased policy learning when either the acquisition or contrast model is correctly specified. Theoretical contributions include oracle inequalities for stage-wise contrast estimators, convergence rates, regret bounds, and misclassification rates for the learned policy, with supporting simulations and an application to a prostate cancer cohort.

Significance. If the central doubly robust construction and associated bounds hold, the work provides a practically relevant extension of Q-learning and doubly robust estimation to sequential testing problems with cost considerations and missingness. The path-specific weights and normalization property address a key challenge in heterogeneous trajectories, and the provision of regret and misclassification guarantees strengthens the case for deployment in clinical decision support. Simulations and the real-data illustration offer concrete evidence of improved cost-adjusted performance over baselines.

major comments (2)

[Assumptions and Method] The unbiasedness and orthogonality of the pseudo-outcomes (central to all subsequent oracle inequalities, regret bounds, and misclassification rates) are derived under the sequential missing-at-random assumption. No sensitivity analysis or robustness result is provided for violations where missingness depends on unobserved factors at any stage; this is load-bearing because, under such violations, weights built from observed-history acquisition probabilities no longer correct for selection, and the double-robustness property can fail even if the contrast model is correctly specified for the observed data.
[Theoretical Results] The normalization property of the path-specific IPW conditional on observed history is invoked to construct the orthogonal pseudo-outcomes, but the theoretical analysis should explicitly verify that this property propagates through the stage-wise recursion to guarantee that the expectation of the pseudo-outcome equals the true Q-function (or contrast) under correct specification of either model.

minor comments (2)

[Simulations] The simulation section would benefit from additional detail on the data-generating process for heterogeneous test trajectories and the specific forms of the acquisition and contrast models used in the baselines.
[Application] In the prostate cancer application, reporting quantitative cost reductions alongside confidence intervals or p-values for the predictive accuracy comparison would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight important aspects of our doubly robust framework. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Assumptions and Method] The unbiasedness and orthogonality of the pseudo-outcomes (central to all subsequent oracle inequalities, regret bounds, and misclassification rates) are derived under the sequential missing-at-random assumption. No sensitivity analysis or robustness result is provided for violations where missingness depends on unobserved factors at any stage; this is load-bearing because, under such violations, weights built from observed-history acquisition probabilities no longer correct for selection, and the double-robustness property can fail even if the contrast model is correctly specified for the observed data.

Authors: We agree that the sequential missing-at-random assumption is foundational to the unbiasedness and double robustness of the pseudo-outcomes. The manuscript derives all theoretical guarantees under this assumption and does not include sensitivity analyses for violations involving unobserved factors. In the revised version, we will add a new subsection in the Discussion that explicitly acknowledges this limitation, explains the potential impact on the acquisition model and weights, and outlines possible future extensions such as sensitivity parameters or alternative robust estimation strategies.
Revision: yes
Referee: [Theoretical Results] The normalization property of the path-specific IPW conditional on observed history is invoked to construct the orthogonal pseudo-outcomes, but the theoretical analysis should explicitly verify that this property propagates through the stage-wise recursion to guarantee that the expectation of the pseudo-outcome equals the true Q-function (or contrast) under correct specification of either model.

Authors: We appreciate this observation. While the normalization property is used to establish the form of the orthogonal pseudo-outcomes and is implicitly relied upon in the stage-wise derivations, the manuscript does not contain an explicit recursive verification across stages. In the revision, we will expand the relevant lemma and proof (in the section on orthogonal pseudo-outcomes and oracle inequalities) to include a direct inductive argument showing that the normalization holds conditionally on the observed history at each stage, thereby confirming that the expectation of the pseudo-outcome recovers the true contrast when either model is correctly specified.
Revision: yes

Circularity Check

0 steps flagged

No circularity: standard doubly robust construction with independent theory

The paper adapts established doubly robust Q-learning to sequential testing under a sequential missing-at-random assumption. Path-specific IPW and orthogonal pseudo-outcomes are constructed from auxiliary models in the usual way; the resulting oracle inequalities, convergence rates, and regret bounds are derived from standard concentration and empirical process arguments rather than by re-expressing the target policy value as a fitted quantity. No self-citation chain, self-definitional step, or fitted-input-renamed-as-prediction appears in the derivation. The framework is self-contained against external benchmarks once the MAR mechanism is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the sequential missing-at-random assumption and standard regularity conditions for Q-learning and doubly robust estimation. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: sequential missing-at-random mechanism
Invoked to justify unbiased estimation via the path-specific weights and orthogonal pseudo-outcomes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework... path-specific inverse probability weights... orthogonal pseudo-outcomes... E[E_s] < ∞ and treat C_s as pre-calibrated

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Assumption 1 (Sequential MAR)... Assumption 2 (Positivity)... oracle inequalities... regret bounds

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
