Detecting critical treatment effect bias in small subgroups

Fanny Yang; Javier Abad; Konstantin Donhauser; Piersilvio De Bartolomeis

arxiv: 2404.18905 · v3 · submitted 2024-04-29 · stat.ME · cs.LG · stat.ML

Pith reviewed 2026-05-24 02:07 UTC · model grok-4.3

classification stat.ME · cs.LG · stat.ML

keywords treatment effect bias · observational studies · randomized trials · subgroup analysis · bias lower bound · conditional effects · statistical testing

The pith

A statistical test yields a lower bound on the largest bias in any subgroup's treatment effect when an observational study is benchmarked against a randomized trial.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to benchmark observational studies against randomized trials on treatment effects within subgroups rather than only on the overall average. It tests whether conditional effects differ by more than a chosen tolerance and then derives an asymptotically valid lower bound on the largest bias present in any subgroup. This matters because observational data reaches more patients but can hide biases that hit particular groups hardest, and a bound on the worst-case bias helps judge when the data remains usable for medical decisions. The approach is validated on real data where its conclusions match known medical patterns.
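The point that a subgroup bias can hide behind a matching average can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not the paper's procedure: the two-subgroup setup, the effect size of 1.0, the biases of +1 and -1, and the helper names `simulate` and `diff_in_means` are all hypothetical, and the bias is injected directly into the treated outcomes rather than produced by confounding.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, bias_by_group):
    """Draw a binary subgroup g, a binary treatment t, and an outcome y.

    The true effect is 1.0 in both subgroups; bias_by_group[g] is added to the
    effect to mimic a subgroup-specific bias in the study's estimate.
    """
    g = rng.integers(0, 2, size=n)
    t = rng.integers(0, 2, size=n)
    effect = 1.0 + np.asarray(bias_by_group)[g]
    y = effect * t + rng.normal(0.0, 1.0, size=n)
    return g, t, y

def diff_in_means(t, y):
    """Treated-minus-control difference in mean outcomes."""
    return y[t == 1].mean() - y[t == 0].mean()

n = 20_000
# Stand-in for the randomized trial: no bias in either subgroup.
g_rct, t_rct, y_rct = simulate(n, bias_by_group=[0.0, 0.0])
# Stand-in for the observational study: opposite biases that cancel on average.
g_os, t_os, y_os = simulate(n, bias_by_group=[+1.0, -1.0])

ate_gap = diff_in_means(t_os, y_os) - diff_in_means(t_rct, y_rct)
subgroup_gaps = [
    diff_in_means(t_os[g_os == k], y_os[g_os == k])
    - diff_in_means(t_rct[g_rct == k], y_rct[g_rct == k])
    for k in (0, 1)
]

print(f"ATE gap: {ate_gap:+.2f}")
print("subgroup gaps:", [f"{v:+.2f}" for v in subgroup_gaps])
```

In this construction the gap in average effects is close to zero while each subgroup gap is close to one in magnitude, which is the situation an average-only benchmark would miss.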

Core claim

We design a statistical test for the null hypothesis that treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study.

What carries the argument

The asymptotically valid lower bound on the maximum bias strength across all subgroups, obtained after testing conditional treatment effect differences up to tolerance.
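One way to write this down is the standard test-inversion construction below. It is a sketch in assumed notation, not the paper's own definitions: the symbols $b$, $\tau_{\mathrm{os}}$, $\tau_{\mathrm{rct}}$, $\delta$ and $\hat{\delta}_\alpha$ are introduced here for illustration, and the paper's tolerance and test statistic may be defined differently.

```latex
\begin{align*}
b(x) &= \tau_{\mathrm{os}}(x) - \tau_{\mathrm{rct}}(x), \\
H_0(\delta) &: \; |b(x)| \le \delta \quad \text{for all } x \text{ in the common support}, \\
\hat{\delta}_\alpha &= \sup\{\delta \ge 0 : \text{the level-}\alpha \text{ test rejects } H_0(\delta)\}, \\
\liminf_{n \to \infty} \Pr\Big(\sup_x |b(x)| \ge \hat{\delta}_\alpha\Big) &\ge 1 - \alpha .
\end{align*}
```

Here $\hat{\delta}_\alpha$ is taken to be $0$ when no tolerance is rejected. The last line follows if the test has asymptotic level $\alpha$ for every $\delta$ and its rejections are monotone, meaning that rejecting $H_0(\delta)$ implies rejecting $H_0(\delta')$ for every $\delta' < \delta$.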

If this is right

If the estimated lower bound exceeds the tolerance, at least one subgroup carries more bias than is acceptable, so the observational study should not be used for decisions without further adjustment.
The method extends benchmarking from the population average to every subgroup defined by the conditioning features.
Validation on real medical data produces conclusions consistent with established clinical knowledge.
The bound is asymptotically valid, so its reliability improves with larger sample sizes in both studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bounding technique could be applied when multiple observational sources are available instead of one.
If the bound proves tight in practice, it could help prioritize which additional features to collect in future studies.
The approach might generalize to checking consistency across different randomized trials that share the same conditioning features.

Load-bearing premise

The chosen set of features is enough to make the conditional treatment effects identifiable and comparable between the randomized trial and the observational study.

What would settle it

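A dataset in which the computed lower bound exceeds the largest treatment effect difference that independent evidence supports in any subgroup defined by the same features would contradict the bound's validity. A bound that stays low while such evidence shows large differences would not contradict validity, since a lower bound may be loose, but it would indicate that the test lacks power.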

Figures

Figures reproduced from arXiv: 2404.18905 by Fanny Yang, Javier Abad, Konstantin Donhauser, Piersilvio De Bartolomeis.

**Figure 4.** Comparison between the estimated and true bias models for Scenario 2. Our estimates of the bias …

**Figure 5.** Heatmap visualizations of the bias for (a) Scenario 2 based on 12 subgroups with different biases …

Original abstract

Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a conditional test plus a lower bound on the maximum subgroup bias when benchmarking observational studies against RCTs, but the identifiability of those conditional effects is the weakest link.

The main takeaway is a method that tests whether conditional treatment effects in an observational study and an RCT differ by at most a tolerance, then produces an asymptotically valid lower bound on the strongest bias present in any subgroup. This moves the benchmarking exercise past the average treatment effect, which matters in settings where decisions turn on patient subgroups. The real-data example that lines up with medical knowledge is a concrete check that the outputs can be sensible. That combination of test and bound is not standard in the ATE-focused benchmarking papers cited in the abstract, so the contribution is real on that narrow point. The asymptotic validity claim is the part that would need the most scrutiny in a review.

The soft spot is exactly the one the stress-test note flags: the features used for conditioning have to be rich enough that the conditional effects are identifiable and comparable across the two studies. If they are not, then any detected difference is not cleanly interpretable as bias, and the lower bound loses its meaning. The abstract states the null and the bound without mentioning overlap diagnostics, sensitivity checks, or omitted-variable probes, so a referee would want to see those in the full text. The tolerance parameter is also free and its practical selection rule is not obvious from the summary.

This is aimed at statisticians working on transportability or validation of observational data in medicine, especially when subgroup effects matter. A reader who already works with RCT-obs comparisons would get value from the subgroup extension. The work is coherent enough on its own terms to deserve referee time rather than a desk reject, even though the identifiability step will probably require extra justification or diagnostics. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper proposes a statistical test of the null that conditional treatment effects estimated from an RCT and an observational study differ by at most a user-specified tolerance, then derives an asymptotically valid lower bound on the maximum bias strength across any subgroup in the observational study. The method is illustrated and validated on real-world medical data where the resulting conclusions align with established clinical knowledge.

Significance. If the derivation and identifiability assumptions hold, the work supplies a concrete, asymptotically justified tool for benchmarking subgroup-level treatment-effect estimates from observational data against RCTs. This addresses a practical need in medical statistics where RCTs often lack generalizability. The explicit asymptotic validity claim and the real-data validation that matches domain knowledge are positive features.

major comments (2)

[methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.
[methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.

minor comments (2)

[methods] Notation for the conditioning feature set and the tolerance parameter should be introduced once with a clear table or list of symbols to avoid later ambiguity.
[real-world validation] The real-world validation section would benefit from an explicit statement of the exact feature set used and a brief overlap or positivity diagnostic, even if informal.
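An informal overlap check of the kind requested here could be as simple as counting units per subgroup and treatment arm in each study and flagging thin cells. The sketch below assumes discrete subgroups and binary treatment; the function name `thin_cells`, the threshold of 20 units, and the example records are hypothetical choices made for illustration, not anything taken from the paper.

```python
from collections import Counter

def thin_cells(subgroups, treatments, min_count=20):
    """Return (subgroup, arm) cells that contain fewer than min_count units."""
    counts = Counter(zip(subgroups, treatments))
    groups = sorted(set(subgroups))
    return [(g, a) for g in groups for a in (0, 1) if counts[(g, a)] < min_count]

# Hypothetical study: 50 "young" units split evenly across arms,
# 30 "old" units of which only 3 are untreated.
subgroups = ["young"] * 50 + ["old"] * 30
treatments = [0, 1] * 25 + [1] * 27 + [0] * 3

flagged = thin_cells(subgroups, treatments)
print(flagged)
```

Running the same check on both the trial and the observational study shows which subgroups have too few treated or control units for a conditional effect to be estimated in either source.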

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important considerations for the practical application of our method. We address each major comment below.

Referee: [methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.

Authors: We agree that the lower bound is interpretable as a bound on bias strength only under the assumption that the selected conditioning features suffice for identifiability and comparability of conditional treatment effects. This assumption is stated explicitly in the paper's setup. While domain knowledge typically guides feature selection in such benchmarking exercises, we acknowledge the value of diagnostics. In the revision we will add a dedicated subsection on this assumption, recommend overlap diagnostics and sensitivity checks for omitted covariates, and include an empirical sensitivity analysis to alternative feature sets in the real-data example.

revision: yes

Referee: [methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.

Authors: The tolerance is a user-specified parameter reflecting the maximum acceptable difference in conditional effects, to be chosen according to clinical or substantive criteria. We agree that explicit guidance and sensitivity analysis would improve usability. In the revised manuscript we will expand the methods section with recommendations for tolerance selection (e.g., linking to minimal clinically important differences) and will report a sensitivity analysis in the real-data application that quantifies how the estimated lower bound changes across a range of tolerance values.

revision: yes

Circularity Check

0 steps flagged

No circularity: external RCT benchmark and explicit identifiability assumption keep derivation self-contained

full rationale

The paper's core procedure compares conditional treatment effects from the observational study against an external RCT benchmark via a statistical test of the null that the effects differ by at most a tolerance; the lower bound on maximum bias is then derived from that comparison. This structure uses independent external data and states the sufficiency assumption for identifiability explicitly rather than deriving it from the fitted quantities or the same dataset. No equation reduces by construction to a parameter fitted on the target data, no self-citation chain bears the central claim, and the method does not rename or smuggle in prior results from the same authors. The derivation is therefore not circular: its conclusions are not built into its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only view yields limited visibility into parameters or axioms; the tolerance parameter in the null hypothesis is likely a free choice, and the feature set is treated as given.

free parameters (1)

tolerance parameter
The null allows effects to differ up to some tolerance; its value is chosen by the analyst.
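How the test decision depends on the tolerance can be sketched with a deliberately simple stand-in: a Bonferroni-corrected z-test over a handful of discrete subgroups, inverted to give a lower bound. This is not the paper's test. The functions `rejects` and `lower_bound`, the normal approximation, and the subgroup gaps and standard errors below are all assumptions made for illustration.

```python
from statistics import NormalDist

def critical_value(n_groups, alpha):
    """Bonferroni z critical value over n_groups two-sided comparisons."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_groups))

def rejects(gaps, ses, delta, alpha=0.05):
    """Test H0: |gap_k| <= delta for every subgroup k."""
    z = critical_value(len(gaps), alpha)
    return any(abs(gap) - delta > z * se for gap, se in zip(gaps, ses))

def lower_bound(gaps, ses, alpha=0.05):
    """Supremum of the tolerances that the test above rejects (0 if none)."""
    z = critical_value(len(gaps), alpha)
    return max(0.0, max(abs(gap) - z * se for gap, se in zip(gaps, ses)))

# Hypothetical estimated effect gaps (observational minus trial) per subgroup,
# with hypothetical standard errors.
gaps = [0.05, -0.60, 0.10]
ses = [0.05, 0.10, 0.08]

for delta in (0.0, 0.2, 0.4, 0.6):
    print(f"tolerance {delta:.1f}: reject = {rejects(gaps, ses, delta)}")

bound = lower_bound(gaps, ses)
print(f"lower bound on the largest subgroup bias: {bound:.2f}")
```

Scanning the tolerance this way makes the analyst's choice visible: small tolerances are rejected, large ones are not, and the switch point is the lower bound that the inversion reports.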

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1]

Implementation of the Women’s Health Initiative study design

Garnet Anderson, Joann Manson, Robert Wallace, Bernedine Lund, Dallas Hall, Scott Davis, Sally Shumaker, Ching-Yun Wang, Evan Stein, and Ross Prentice. Implementation of the Women’s Health Initiative study design. Annals of Epidemiology, 13(9):S5–S17, 2003

work page 2003
[2]

Adaptive combination of randomized and observational data

David Cheng and Tianxi Cai. Adaptive combination of randomized and observational data. arXiv preprint arXiv:2111.15012, 2021

work page arXiv 2021
[3]

Enhancing treatment effect estimation: A model robust ap- proach integrating randomized experiments and external controls using the double penalty integration estimator

Yuwen Cheng, Lili Wu, and Shu Yang. Enhancing treatment effect estimation: A model robust ap- proach integrating randomized experiments and external controls using the double penalty integration estimator. Conference on Uncertainty in Artificial Intelligence , 2023

work page 2023
[4]

Benchmarking observational methods by comparing randomized trials and their emulations

Issa Dahabreh, James Robins, and Miguel Hern´ an. Benchmarking observational methods by comparing randomized trials and their emulations. Epidemiology, 31(5):614–619, 2020

work page 2020
[5]

Global sensitivity analysis for studies extending inferences from a randomized trial to a target population

Issa Dahabreh, James Robins, Sebastien Haneuse, Sarah Robertson, Jon Steingrimsson, and Miguel Hern´ an. Global sensitivity analysis for studies extending inferences from a randomized trial to a target population. arXiv preprint arXiv:2207.09982 , 2022

work page arXiv 2022
[6]

Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population

Issa Dahabreh, James Robins, Sebastien Haneuse, Iman Saeed, Sarah Robertson, Elizabeth Stuart, and Miguel Hern´ an. Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population. Statistics in Medicine , 42(13):2029–2043, 2023

work page 2029
[7]

Hidden yet quantifi- able: A lower bound for confounding strength using randomized trials

Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, and Fanny Yang. Hidden yet quantifi- able: A lower bound for confounding strength using randomized trials. International Conference on Artificial Intelligence and Statistics , 2024

work page 2024
[8]

Testing for the unconfoundedness assumption using an instrumental assumption

Xavier De Luna and Per Johansson. Testing for the unconfoundedness assumption using an instrumental assumption. Journal of Causal Inference , 2(2):187–199, 2014. 13

work page 2014
[9]

Testing the equality of nonparametric regression curves.Statistics & Probability Letters, 17(3):199–204, 1993

Miguel Delgado. Testing the equality of nonparametric regression curves.Statistics & Probability Letters, 17(3):199–204, 1993

work page 1993
[10]

Benchmarking observational studies with experimental data under right-censoring.International Conference on Artificial Intelligence and Statistics , 2024

Ilker Demirel, Edward De Brouwer, Zeshan Hussain, Michael Oberst, Anthony Philippakis, and David Sontag. Benchmarking observational studies with experimental data under right-censoring.International Conference on Artificial Intelligence and Statistics , 2024

work page 2024
[11]

Testing the unconfoundedness assumption via inverse probability weighted estimators of (L) ATT

Stephen Donald, Yu-Chin Hsu, and Robert Lieli. Testing the unconfoundedness assumption via inverse probability weighted estimators of (L) ATT. Journal of Business & Economic Statistics , 32(3):395–415, 2014

work page 2014
[12]

Representation of minorities and women in oncology clinical trials: review of the past 14 years

Narjust Duma, Jesus Vera Aguilera, Jonas Paludo, Candace Haddox, Miguel Gonzalez Velez, Yucai Wang, Konstantinos Leventakos, Joleen Hubbard, Aaron Mansfield, Ronald Go, et al. Representation of minorities and women in oncology clinical trials: review of the past 14 years. Journal of Oncology Practice, 14(1):e1–e10, 2018

work page 2018
[13]

Benchmarking observational analyses against randomized trials: a review of studies assessing propensity score methods

Shaun Forbes and Issa Dahabreh. Benchmarking observational analyses against randomized trials: a review of studies assessing propensity score methods. Journal of General Internal Medicine , 35:1396– 1404, 2020

work page 2020
[14]

Evaluating the use of nonran- domized real-world data analyses for regulatory decision making.Clinical Pharmacology & Therapeutics, 105(4):867–877, 2019

Jessica Franklin, Robert Glynn, David Martin, and Sebastian Schneeweiss. Evaluating the use of nonran- domized real-world data analyses for regulatory decision making.Clinical Pharmacology & Therapeutics, 105(4):867–877, 2019

work page 2019
[15]

Pretest estimation in combining probability and non-probability samples

Chenyin Gao and Shu Yang. Pretest estimation in combining probability and non-probability samples. Electronic Journal of Statistics , 17(1):1492–1546, 2023

work page 2023
[16]

Hormone therapy to prevent disease and prolong life in postmenopausal women

Deborah Grady, Susan Rubin, Diana Petitti, Cary Fox, Dennis Black, Bruce Ettinger, Virginia Ernster, and Steven Cummings. Hormone therapy to prevent disease and prolong life in postmenopausal women. Annals of Internal Medicine , 117(12):1016–1037, 1992

work page 1992
[17]

A prospective, observational study of postmenopausal hormone therapy and primary prevention of cardiovascular disease

Francine Grodstein, JoAnn Manson, Graham Colditz, Walter Willett, Frank Speizer, and Meir Stampfer. A prospective, observational study of postmenopausal hormone therapy and primary prevention of cardiovascular disease. Annals of Internal Medicine , 133(12):933–941, 2000

work page 2000
[18]

Clinical trial generalizability assessment in the big data era: a review

Zhe He, Xiang Tang, Xi Yang, Yi Guo, Thomas George, Neil Charness, Kelsa Bartley Quan Hem, William Hogan, and Jiang Bian. Clinical trial generalizability assessment in the big data era: a review. Clinical and Translational Science, 13(4):675–684, 2020

work page 2020
[19]

Decreased mortality in users of estrogen replacement therapy

Brian Henderson, Annlia Paganini-Hill, and Ronald Ross. Decreased mortality in users of estrogen replacement therapy. Archives of Internal Medicine , 151(1):75–78, 1991

work page 1991
[20]

Using big data to emulate a target trial when a randomized trial is not available

Miguel Hern´ an and James Robins. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology , 183(8):758–764, 2016

work page 2016
[21]

The MineThatData e-mail analytics and data mining challenge, 2008

Kevin Hillstrom. The MineThatData e-mail analytics and data mining challenge, 2008

work page 2008
[22]

Vascular effects of early versus late postmenopausal treatment with estradiol

Howard Hodis, Wendy Mack, Victor Henderson, Donna Shoupe, Matthew Budoff, Juliana Hwang- Levine, Yanjie Li, Mei Feng, Laurie Dustin, Naoko Kono, et al. Vascular effects of early versus late postmenopausal treatment with estradiol. New England Journal of Medicine , 374(13):1221–1231, 2016

work page 2016
[23]

Consistency of the generalized bootstrap for degenerate U-statistics

Marie Huskova and Paul Janssen. Consistency of the generalized bootstrap for degenerate U-statistics. The Annals of Statistics , pages 1811–1823, 1993

work page 1993
[24]

Falsification before extrapolation in causal effect estimation

Zeshan Hussain, Michael Oberst, Ming-Chieh Shih, and David Sontag. Falsification before extrapolation in causal effect estimation. Advances in Neural Information Processing Systems , 2022. 14

work page 2022
[25]

Falsification of internal and external validity in observational studies via conditional moment restrictions

Zeshan Hussain, Ming-Chieh Shih, Michael Oberst, Ilker Demirel, and David Sontag. Falsification of internal and external validity in observational studies via conditional moment restrictions. International Conference on Artificial Intelligence and Statistics , 2023

work page 2023
[26]

Removing hidden confounding by experimental grounding

Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems , 2018

work page 2018
[27]

Detecting hidden confounding in observational data using multiple environments

Rickard Karlsson and Jesse Krijthe. Detecting hidden confounding in observational data using multiple environments. Advances in Neural Information Processing Systems , 2023

work page 2023
[28]

Towards optimal doubly robust estimation of heterogeneous causal effects

Edward Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics , 17(2):3008–3049, 2023

work page 2023
[29]

Dimension-agnostic inference using cross U-statistics

Ilmun Kim and Aaditya Ramdas. Dimension-agnostic inference using cross U-statistics. Bernoulli, 30 (1):683–711, 2024

work page 2024
[30]

The new FDA real-world evidence program to support development of drugs and bio- logics

David Klonoff. The new FDA real-world evidence program to support development of drugs and bio- logics. Journal of Diabetes Science and Technology , 14(2):345–349, 2020

work page 2020
[31]

The 2020 menopausal hormone therapy guidelines

Sa Ra Lee, Moon Kyoung Cho, Yeon Jean Cho, Sungwook Chun, Seung-Hwa Hong, Kyu Ri Hwang, Gyun-Ho Jeon, Jong Kil Joo, Seul Ki Kim, Dong Ock Lee, et al. The 2020 menopausal hormone therapy guidelines. Journal of Menopausal Medicine , 26(2):69, 2020

work page 2020
[32]

Negative controls: a tool for detecting con- founding and bias in observational studies

Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen. Negative controls: a tool for detecting con- founding and bias in observational studies. Epidemiology, 21(3):383, 2010

work page 2010
[33]

An omnibus non-parametric test of equality in distribution for unknown functions

Alex Luedtke, Marco Carone, and Mark van der Laan. An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(1):75–99, 2019

work page 2019
[34]

A double machine learning approach to combining experimental and observational data

Marco Morucci, Vittorio Orlandi, Harsh Parikh, Sudeepa Roy, Cynthia Rudin, and Alexander Volfovsky. A double machine learning approach to combining experimental and observational data. arXiv preprint arXiv:2307.01449, 2023

work page arXiv 2023
[35]

Kernel conditional moment test via max- imum moment restriction

Krikamol Muandet, Wittawat Jitkrittum, and Jonas K¨ ubler. Kernel conditional moment test via max- imum moment restriction. Conference on Uncertainty in Artificial Intelligence , 2020

work page 2020
[36]

Nonparametric comparison of regression curves: an empirical process approach

Natalie Neumeyer and Holger Dette. Nonparametric comparison of regression curves: an empirical process approach. The Annals of Statistics , 31(3):880–920, 2003

work page 2003
[37]

Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects

Trang Quynh Nguyen, Cyrus Ebnesajjad, Stephen Cole, and Elizabeth Stuart. Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects. The Annals of Applied Statistics , pages 225–247, 2017

work page 2017
[38]

Trang Quynh Nguyen, Benjamin Ackerman, Ian Schmid, Stephen Cole, and Elizabeth Stuart. Sensi- tivity analyses for effect modifiers not observed in the target population when generalizing treatment effects from a randomized controlled trial: Assumptions, models, effect scales, data scenarios, and implementation details. PloS One, 13(12):e0208795, 2018

work page 2018
[39]

The FDA Sentinel Initiative—an evolving national resource

Richard Platt, Jeffrey Brown, Melissa Robb, Mark McClellan, Robert Ball, Michael Nguyen, and Rachel Sherman. The FDA Sentinel Initiative—an evolving national resource. New England Jour- nal of Medicine , 379(22):2091–2093, 2018

work page 2091
[40]

Combined postmenopausal hormone therapy and cardiovascular disease: toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial

Ross Prentice, Robert Langer, Marcia Stefanick, Barbara Howard, Mary Pettinger, Garnet Anderson, David Barad, David Curb, Jane Kotchen, Lewis Kuller, et al. Combined postmenopausal hormone therapy and cardiovascular disease: toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial. American Journal of...

work page 2005
[41]

Testing the significance of categorical predictor variables in nonparametric regression models

Jeffery Racine, Jeffrey Hart, and Qi Li. Testing the significance of categorical predictor variables in nonparametric regression models. Econometric Reviews, 25(4):523–544, 2006

work page 2006
[42]

Combining observational and exper- imental datasets using shrinkage estimators

Evan Rosenman, Guillaume Basse, Art Owen, and Mike Baiocchi. Combining observational and exper- imental datasets using shrinkage estimators. Biometrics, 79(4):2961–2973, 2023

work page 2023
[43]

to whom do the results of this trial apply?

Peter Rothwell. External validity of randomised controlled trials:“to whom do the results of this trial apply?”. The Lancet, 365(9453):82–93, 2005

work page 2005
[44]

Bayesian inference for causal effects: The role of randomization

Donald Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34–58, 1978

work page 1978
[45]

The framework for FDA’s real-world evidence program

Beth Schurman. The framework for FDA’s real-world evidence program. Applied Clinical Trials, 28(4), 2019

work page 2019
[46]

Approximation Theorems of Mathematical Statistics

Robert Serfling. Approximation Theorems of Mathematical Statistics . Wiley, 1980

work page 1980
[47]

On negative outcome control of unobserved confounding as a generalization of difference-in-differences

Tamar Sofer, David Richardson, Elena Colicino, Joel Schwartz, and Eric Tchetgen Tchetgen. On negative outcome control of unobserved confounding as a generalization of difference-in-differences. Statistical Science: a Review Journal of the Institute of Mathematical Statistics , 31(3):348, 2016

work page 2016
[48]

Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence

Meir Stampfer and Graham Colditz. Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence. Preventive Medicine, 20(1):47–63, 1991

work page 1991
[49]

On the influence of the kernel on the consistency of support vector machines

Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(11):67–93, 2001

work page 2001
[50]

Updated IMS recommendations on postmenopausal hormone therapy and preventive strategies for midlife health

David Sturdee and Pines. Updated IMS recommendations on postmenopausal hormone therapy and preventive strategies for midlife health. Climacteric, 14(3):302–320, 2011

work page 2011
[51]

A distributional approach for causal inference using propensity scores

Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006

work page 2006
[52]

Effects of oral vs transdermal estrogen therapy on sexual func- tion in early postmenopause: ancillary study of the Kronos Early Estrogen Prevention Study (KEEPS)

Hugh Taylor, Aya Tal, Lubna Pal, Fangyong Li, Dennis Black, Eliot Brinton, Matthew Budoff, Marcelle Cedars, Wei Du, Howard Hodis, et al. Effects of oral vs transdermal estrogen therapy on sexual func- tion in early postmenopause: ancillary study of the Kronos Early Estrogen Prevention Study (KEEPS). JAMA Internal Medicine , 177(10):1471–1479, 2017

work page 2017
[53]

The HRT controversy: observational studies and RCTs fall in line

Jan Vandenbroucke. The HRT controversy: observational studies and RCTs fall in line. The Lancet, 373(9671):1233–1235, 2009

work page 2009
[54]

Use of historical control data for assessing treatment effects in clinical trials

Kert Viele, Scott Berry, Beat Neuenschwander, Billy Amzal, Fang Chen, Nathan Enas, Brian Hobbs, Joseph Ibrahim, Nelson Kinnersley, Stacy Lindborg, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharmaceutical Statistics, 13(1):41–54, 2014

work page 2014
[55]

Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies

Lili Wu and Shu Yang. Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. Conference on Causal Learning and Reasoning , 2022

work page 2022
[56]

Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding

Shu Yang, Donglin Zeng, and Xiaofei Wang. Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922 , 2020

work page arXiv 2007
[57]

newbie”, “mens

Shu Yang, Chenyin Gao, Donglin Zeng, and Xiaofei Wang. Elastic integrative analysis of randomised trial and real-world data for treatment heterogeneity estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology , 85(3):575–596, 04 2023. 16 Appendices The following appendices provide deferred proofs, experiment details, and abla...

work page 2023
[58]

witness function

When |X J | = 3, we select the features that capture the bias between rct and os datasets ( newbie, mens, channel), and hence we achieve the highest power. Intuitively, if the feature set is smaller, some of the bias averages out, and the test loses power. On the other hand, when increasing the feature set, the test loses power due to the curse of dimensi...

[1]

Implementation of the Women’s Health Initiative study design

Garnet Anderson, Joann Manson, Robert Wallace, Bernedine Lund, Dallas Hall, Scott Davis, Sally Shumaker, Ching-Yun Wang, Evan Stein, and Ross Prentice. Implementation of the Women’s Health Initiative study design. Annals of Epidemiology, 13(9):S5–S17, 2003

[2]

Adaptive combination of randomized and observational data

David Cheng and Tianxi Cai. Adaptive combination of randomized and observational data. arXiv preprint arXiv:2111.15012, 2021

[3]

Enhancing treatment effect estimation: A model robust approach integrating randomized experiments and external controls using the double penalty integration estimator

Yuwen Cheng, Lili Wu, and Shu Yang. Enhancing treatment effect estimation: A model robust approach integrating randomized experiments and external controls using the double penalty integration estimator. Conference on Uncertainty in Artificial Intelligence, 2023

[4]

Benchmarking observational methods by comparing randomized trials and their emulations

Issa Dahabreh, James Robins, and Miguel Hernán. Benchmarking observational methods by comparing randomized trials and their emulations. Epidemiology, 31(5):614–619, 2020

[5]

Global sensitivity analysis for studies extending inferences from a randomized trial to a target population

Issa Dahabreh, James Robins, Sebastien Haneuse, Sarah Robertson, Jon Steingrimsson, and Miguel Hernán. Global sensitivity analysis for studies extending inferences from a randomized trial to a target population. arXiv preprint arXiv:2207.09982, 2022

[6]

Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population

Issa Dahabreh, James Robins, Sebastien Haneuse, Iman Saeed, Sarah Robertson, Elizabeth Stuart, and Miguel Hernán. Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population. Statistics in Medicine, 42(13):2029–2043, 2023

[7]

Hidden yet quantifiable: A lower bound for confounding strength using randomized trials

Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, and Fanny Yang. Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. International Conference on Artificial Intelligence and Statistics, 2024

[8]

Testing for the unconfoundedness assumption using an instrumental assumption

Xavier De Luna and Per Johansson. Testing for the unconfoundedness assumption using an instrumental assumption. Journal of Causal Inference, 2(2):187–199, 2014

[9]

Testing the equality of nonparametric regression curves

Miguel Delgado. Testing the equality of nonparametric regression curves. Statistics & Probability Letters, 17(3):199–204, 1993

[10]

Benchmarking observational studies with experimental data under right-censoring

Ilker Demirel, Edward De Brouwer, Zeshan Hussain, Michael Oberst, Anthony Philippakis, and David Sontag. Benchmarking observational studies with experimental data under right-censoring. International Conference on Artificial Intelligence and Statistics, 2024

[11]

Testing the unconfoundedness assumption via inverse probability weighted estimators of (L)ATT

Stephen Donald, Yu-Chin Hsu, and Robert Lieli. Testing the unconfoundedness assumption via inverse probability weighted estimators of (L)ATT. Journal of Business & Economic Statistics, 32(3):395–415, 2014

[12]

Representation of minorities and women in oncology clinical trials: review of the past 14 years

Narjust Duma, Jesus Vera Aguilera, Jonas Paludo, Candace Haddox, Miguel Gonzalez Velez, Yucai Wang, Konstantinos Leventakos, Joleen Hubbard, Aaron Mansfield, Ronald Go, et al. Representation of minorities and women in oncology clinical trials: review of the past 14 years. Journal of Oncology Practice, 14(1):e1–e10, 2018

[13]

Benchmarking observational analyses against randomized trials: a review of studies assessing propensity score methods

Shaun Forbes and Issa Dahabreh. Benchmarking observational analyses against randomized trials: a review of studies assessing propensity score methods. Journal of General Internal Medicine, 35:1396–1404, 2020

[14]

Evaluating the use of nonrandomized real-world data analyses for regulatory decision making

Jessica Franklin, Robert Glynn, David Martin, and Sebastian Schneeweiss. Evaluating the use of nonrandomized real-world data analyses for regulatory decision making. Clinical Pharmacology & Therapeutics, 105(4):867–877, 2019

[15]

Pretest estimation in combining probability and non-probability samples

Chenyin Gao and Shu Yang. Pretest estimation in combining probability and non-probability samples. Electronic Journal of Statistics, 17(1):1492–1546, 2023

[16]

Hormone therapy to prevent disease and prolong life in postmenopausal women

Deborah Grady, Susan Rubin, Diana Petitti, Cary Fox, Dennis Black, Bruce Ettinger, Virginia Ernster, and Steven Cummings. Hormone therapy to prevent disease and prolong life in postmenopausal women. Annals of Internal Medicine, 117(12):1016–1037, 1992

[17]

A prospective, observational study of postmenopausal hormone therapy and primary prevention of cardiovascular disease

Francine Grodstein, JoAnn Manson, Graham Colditz, Walter Willett, Frank Speizer, and Meir Stampfer. A prospective, observational study of postmenopausal hormone therapy and primary prevention of cardiovascular disease. Annals of Internal Medicine, 133(12):933–941, 2000

[18]

Clinical trial generalizability assessment in the big data era: a review

Zhe He, Xiang Tang, Xi Yang, Yi Guo, Thomas George, Neil Charness, Kelsa Bartley Quan Hem, William Hogan, and Jiang Bian. Clinical trial generalizability assessment in the big data era: a review. Clinical and Translational Science, 13(4):675–684, 2020

[19]

Decreased mortality in users of estrogen replacement therapy

Brian Henderson, Annlia Paganini-Hill, and Ronald Ross. Decreased mortality in users of estrogen replacement therapy. Archives of Internal Medicine, 151(1):75–78, 1991

[20]

Using big data to emulate a target trial when a randomized trial is not available

Miguel Hernán and James Robins. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8):758–764, 2016

[21]

The MineThatData e-mail analytics and data mining challenge, 2008

Kevin Hillstrom. The MineThatData e-mail analytics and data mining challenge, 2008

[22]

Vascular effects of early versus late postmenopausal treatment with estradiol

Howard Hodis, Wendy Mack, Victor Henderson, Donna Shoupe, Matthew Budoff, Juliana Hwang-Levine, Yanjie Li, Mei Feng, Laurie Dustin, Naoko Kono, et al. Vascular effects of early versus late postmenopausal treatment with estradiol. New England Journal of Medicine, 374(13):1221–1231, 2016

[23]

Consistency of the generalized bootstrap for degenerate U-statistics

Marie Huskova and Paul Janssen. Consistency of the generalized bootstrap for degenerate U-statistics. The Annals of Statistics, pages 1811–1823, 1993

[24]

Falsification before extrapolation in causal effect estimation

Zeshan Hussain, Michael Oberst, Ming-Chieh Shih, and David Sontag. Falsification before extrapolation in causal effect estimation. Advances in Neural Information Processing Systems, 2022

[25]

Falsification of internal and external validity in observational studies via conditional moment restrictions

Zeshan Hussain, Ming-Chieh Shih, Michael Oberst, Ilker Demirel, and David Sontag. Falsification of internal and external validity in observational studies via conditional moment restrictions. International Conference on Artificial Intelligence and Statistics, 2023

[26]

Removing hidden confounding by experimental grounding

Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems, 2018

[27]

Detecting hidden confounding in observational data using multiple environments

Rickard Karlsson and Jesse Krijthe. Detecting hidden confounding in observational data using multiple environments. Advances in Neural Information Processing Systems, 2023

[28]

Towards optimal doubly robust estimation of heterogeneous causal effects

Edward Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023

[29]

Dimension-agnostic inference using cross U-statistics

Ilmun Kim and Aaditya Ramdas. Dimension-agnostic inference using cross U-statistics. Bernoulli, 30(1):683–711, 2024

[30]

The new FDA real-world evidence program to support development of drugs and biologics

David Klonoff. The new FDA real-world evidence program to support development of drugs and biologics. Journal of Diabetes Science and Technology, 14(2):345–349, 2020

[31]

The 2020 menopausal hormone therapy guidelines

Sa Ra Lee, Moon Kyoung Cho, Yeon Jean Cho, Sungwook Chun, Seung-Hwa Hong, Kyu Ri Hwang, Gyun-Ho Jeon, Jong Kil Joo, Seul Ki Kim, Dong Ock Lee, et al. The 2020 menopausal hormone therapy guidelines. Journal of Menopausal Medicine, 26(2):69, 2020

[32]

Negative controls: a tool for detecting confounding and bias in observational studies

Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21(3):383, 2010

[33]

An omnibus non-parametric test of equality in distribution for unknown functions

Alex Luedtke, Marco Carone, and Mark van der Laan. An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(1):75–99, 2019

[34]

A double machine learning approach to combining experimental and observational data

Marco Morucci, Vittorio Orlandi, Harsh Parikh, Sudeepa Roy, Cynthia Rudin, and Alexander Volfovsky. A double machine learning approach to combining experimental and observational data. arXiv preprint arXiv:2307.01449, 2023

[35]

Kernel conditional moment test via maximum moment restriction

Krikamol Muandet, Wittawat Jitkrittum, and Jonas Kübler. Kernel conditional moment test via maximum moment restriction. Conference on Uncertainty in Artificial Intelligence, 2020

[36]

Nonparametric comparison of regression curves: an empirical process approach

Natalie Neumeyer and Holger Dette. Nonparametric comparison of regression curves: an empirical process approach. The Annals of Statistics, 31(3):880–920, 2003

[37]

Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects

Trang Quynh Nguyen, Cyrus Ebnesajjad, Stephen Cole, and Elizabeth Stuart. Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects. The Annals of Applied Statistics, pages 225–247, 2017

[38]

Trang Quynh Nguyen, Benjamin Ackerman, Ian Schmid, Stephen Cole, and Elizabeth Stuart. Sensitivity analyses for effect modifiers not observed in the target population when generalizing treatment effects from a randomized controlled trial: Assumptions, models, effect scales, data scenarios, and implementation details. PloS One, 13(12):e0208795, 2018

[39]

The FDA Sentinel Initiative—an evolving national resource

Richard Platt, Jeffrey Brown, Melissa Robb, Mark McClellan, Robert Ball, Michael Nguyen, and Rachel Sherman. The FDA Sentinel Initiative—an evolving national resource. New England Journal of Medicine, 379(22):2091–2093, 2018

[40]

Combined postmenopausal hormone therapy and cardiovascular disease: toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial

Ross Prentice, Robert Langer, Marcia Stefanick, Barbara Howard, Mary Pettinger, Garnet Anderson, David Barad, David Curb, Jane Kotchen, Lewis Kuller, et al. Combined postmenopausal hormone therapy and cardiovascular disease: toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial. American Journal of...

[41]

Testing the significance of categorical predictor variables in nonparametric regression models

Jeffery Racine, Jeffrey Hart, and Qi Li. Testing the significance of categorical predictor variables in nonparametric regression models. Econometric Reviews, 25(4):523–544, 2006

[42]

Combining observational and experimental datasets using shrinkage estimators

Evan Rosenman, Guillaume Basse, Art Owen, and Mike Baiocchi. Combining observational and experimental datasets using shrinkage estimators. Biometrics, 79(4):2961–2973, 2023

[43]

External validity of randomised controlled trials: “to whom do the results of this trial apply?”

Peter Rothwell. External validity of randomised controlled trials: “to whom do the results of this trial apply?”. The Lancet, 365(9453):82–93, 2005

[44]

Bayesian inference for causal effects: The role of randomization

Donald Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34–58, 1978

[45]

The framework for FDA’s real-world evidence program

Beth Schurman. The framework for FDA’s real-world evidence program. Applied Clinical Trials, 28(4), 2019

[46]

Approximation Theorems of Mathematical Statistics

Robert Serfling. Approximation Theorems of Mathematical Statistics. Wiley, 1980

[47]

On negative outcome control of unobserved confounding as a generalization of difference-in-differences

Tamar Sofer, David Richardson, Elena Colicino, Joel Schwartz, and Eric Tchetgen Tchetgen. On negative outcome control of unobserved confounding as a generalization of difference-in-differences. Statistical Science: a Review Journal of the Institute of Mathematical Statistics, 31(3):348, 2016

[48]

Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence

Meir Stampfer and Graham Colditz. Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence. Preventive Medicine, 20(1):47–63, 1991

[49]

On the influence of the kernel on the consistency of support vector machines

Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(11):67–93, 2001

[50]

Updated IMS recommendations on postmenopausal hormone therapy and preventive strategies for midlife health

David Sturdee and Amos Pines. Updated IMS recommendations on postmenopausal hormone therapy and preventive strategies for midlife health. Climacteric, 14(3):302–320, 2011

[51]

A distributional approach for causal inference using propensity scores

Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006

[52]

Effects of oral vs transdermal estrogen therapy on sexual function in early postmenopause: ancillary study of the Kronos Early Estrogen Prevention Study (KEEPS)

Hugh Taylor, Aya Tal, Lubna Pal, Fangyong Li, Dennis Black, Eliot Brinton, Matthew Budoff, Marcelle Cedars, Wei Du, Howard Hodis, et al. Effects of oral vs transdermal estrogen therapy on sexual function in early postmenopause: ancillary study of the Kronos Early Estrogen Prevention Study (KEEPS). JAMA Internal Medicine, 177(10):1471–1479, 2017
