arxiv: 2404.18905 · v3 · submitted 2024-04-29 · 📊 stat.ME · cs.LG · stat.ML

Detecting critical treatment effect bias in small subgroups

Pith reviewed 2026-05-24 02:07 UTC · model grok-4.3

classification: stat.ME · cs.LG · stat.ML
keywords: treatment effect bias · observational studies · randomized trials · subgroup analysis · bias lower bound · conditional effects · statistical testing

The pith

A statistical test yields a lower bound on the largest bias in any subgroup's treatment effect when an observational study is benchmarked against a randomized trial.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to benchmark observational studies against randomized trials on treatment effects within subgroups rather than only on the overall average. It tests whether conditional effects differ by more than a chosen tolerance and then derives an asymptotically valid lower bound on the largest bias present in any subgroup. This matters because observational data reaches more patients but can hide biases that hit particular groups hardest, and a bound on the worst-case bias helps judge when the data remains usable for medical decisions. The approach is validated on real data where its conclusions match known medical patterns.

Core claim

We design a statistical test for the null hypothesis that treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study.
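
One way to write the claim down, in notation that is ours rather than the paper's (the symbols $\tau_{\mathrm{rct}}$, $\tau_{\mathrm{os}}$, $b$, $\delta$, $\phi_n$ and the exact form of the null are assumptions made for illustration), is as a family of tolerance-indexed tests that is then inverted. The sketch also assumes the tests are monotone in $\delta$, meaning that rejecting at one tolerance implies rejecting at every smaller one.

```latex
% Conditional treatment effects given features X = x, and their difference (the bias).
b(x) = \tau_{\mathrm{os}}(x) - \tau_{\mathrm{rct}}(x)

% Null hypothesis at tolerance \delta \ge 0: the bias never exceeds \delta.
H_0(\delta):\quad |b(x)| \le \delta \quad \text{for all } x

% Lower bound by test inversion, with \phi_n(\delta) \in \{0, 1\} a level-\alpha test of H_0(\delta).
\hat{\delta}_n = \sup\{\delta \ge 0 : \phi_n(\delta) = 1\}

% Asymptotic validity, with \delta^\star = \sup_x |b(x)| the true maximum bias.
\limsup_{n \to \infty} \Pr\big(\hat{\delta}_n > \delta^\star\big) \le \alpha
```

Under these assumptions the last line follows because $H_0(\delta^\star)$ is true: the bound can overshoot $\delta^\star$ only if the test wrongly rejects a true null, which happens with probability at most $\alpha$ in the limit.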

What carries the argument

The asymptotically valid lower bound on the maximum bias strength across all subgroups, obtained after testing conditional treatment effect differences up to tolerance.

If this is right

  • If the estimated lower bound exceeds the tolerance, the observational study cannot be trusted for decisions in at least one subgroup without further adjustment.
  • The method extends benchmarking from the population average to every subgroup defined by the conditioning features.
  • Validation on real medical data produces conclusions consistent with established clinical knowledge.
  • The bound is asymptotically valid: its error guarantee holds in the large-sample limit, so it is more trustworthy when both studies are large.
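
To make the first and last bullets concrete, here is a minimal sketch of a lower bound on the maximum subgroup bias. It is not the paper's procedure: it swaps in a much simpler technique, namely per-subgroup difference-in-means estimates with a Bonferroni-corrected normal approximation over a small, pre-specified set of subgroups. The function names, the simulated data, and the choice to inject bias directly as a shifted effect in subgroup "B" are all hypothetical.

```python
import numpy as np
from statistics import NormalDist


def effect_and_se(y, t):
    """Difference in mean outcome (treated minus control) and its standard error."""
    y1, y0 = y[t == 1], y[t == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return est, se


def max_bias_lower_bound(rct, obs, alpha=0.05):
    """Lower confidence bound on the largest |obs effect - rct effect| over subgroups.

    rct, obs: dicts mapping a subgroup label to a (y, t) pair of arrays.
    Assumption: a two-sided Bonferroni correction across subgroups, not the paper's test.
    """
    groups = sorted(set(rct) & set(obs))
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(groups)))
    candidates = []
    for g in groups:
        est_r, se_r = effect_and_se(*rct[g])
        est_o, se_o = effect_and_se(*obs[g])
        gap = abs(est_o - est_r)
        # Subtract the noise allowance so the bound errs on the low side.
        candidates.append(gap - z * np.hypot(se_r, se_o))
    return max(0.0, float(max(candidates)))


def simulate(rng, n, effect):
    """Hypothetical data: random binary treatment, unit-variance Gaussian noise."""
    t = rng.integers(0, 2, size=n)
    y = effect * t + rng.normal(size=n)
    return y, t


rng = np.random.default_rng(0)
rct = {"A": simulate(rng, 2000, 1.0), "B": simulate(rng, 2000, 1.0)}
# Subgroup "B" of the observational study carries a bias of 1.0 by construction.
obs = {"A": simulate(rng, 2000, 1.0), "B": simulate(rng, 2000, 2.0)}
lower_bound = max_bias_lower_bound(rct, obs)
print(f"lower bound on max subgroup bias: {lower_bound:.2f}")
```

In this toy setup the bound should land somewhat below the injected bias of 1.0, and comparing the trial with itself returns exactly 0. The Bonferroni step is a deliberately crude design choice: it only works for a short list of subgroups fixed in advance, whereas the paper targets any subgroup defined by the conditioning features.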

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bounding technique could be applied when multiple observational sources are available instead of one.
  • If the bound proves tight in practice, it could help prioritize which additional features to collect in future studies.
  • The approach might generalize to checking consistency across different randomized trials that share the same conditioning features.

Load-bearing premise

The chosen set of features is enough to make the conditional treatment effects identifiable and comparable between the randomized trial and the observational study.

What would settle it

A dataset in which the computed lower bound exceeds the largest treatment effect difference that independent evidence supports for any subgroup defined by the same features would contradict the bound's validity. A bound that stays low while large differences exist would show only that it is loose.

Figures

Figures reproduced from arXiv: 2404.18905 by Fanny Yang, Javier Abad, Konstantin Donhauser, Piersilvio De Bartolomeis.

Figure 1: High-level illustration of our approach. We want to test if the bias in the observational study, i.e. […]
Figure 4: Comparison between the estimated and true bias models for Scenario 2. Our estimates of the bias […]
Figure 5: Heatmap visualizations of the bias for (a) Scenario 2 based on 12 subgroups with different biases […]

Original abstract

Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a statistical test of the null that conditional treatment effects estimated from an RCT and an observational study differ by at most a user-specified tolerance, then derives an asymptotically valid lower bound on the maximum bias strength across any subgroup in the observational study. The method is illustrated and validated on real-world medical data where the resulting conclusions align with established clinical knowledge.

Significance. If the derivation and identifiability assumptions hold, the work supplies a concrete, asymptotically justified tool for benchmarking subgroup-level treatment-effect estimates from observational data against RCTs. This addresses a practical need in medical statistics where RCTs often lack generalizability. The explicit asymptotic validity claim and the real-data validation that matches domain knowledge are positive features.

major comments (2)
  1. [methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.
  2. [methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.
minor comments (2)
  1. [methods] Notation for the conditioning feature set and the tolerance parameter should be introduced once with a clear table or list of symbols to avoid later ambiguity.
  2. [real-world validation] The real-world validation section would benefit from an explicit statement of the exact feature set used and a brief overlap or positivity diagnostic, even if informal.
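
The informal overlap diagnostic requested in the second minor comment could be as simple as a cell-count table. The sketch below is an illustration under assumptions, not something taken from the paper: the function name `overlap_report`, the threshold `min_cell`, and the toy rows are all hypothetical. It counts units per subgroup, study, and treatment arm, and flags subgroups where any cell is too sparse to support a comparison.

```python
from collections import Counter


def overlap_report(rct_rows, obs_rows, min_cell=20):
    """Count units per (subgroup, treatment) cell in each study.

    rct_rows, obs_rows: lists of (subgroup_label, treatment) pairs, treatment in {0, 1}.
    A subgroup is marked ok only if all four cells hold at least min_cell units.
    """
    counts = {"rct": Counter(rct_rows), "obs": Counter(obs_rows)}
    groups = sorted({g for rows in counts.values() for g, _ in rows})
    report = {}
    for g in groups:
        cells = {(s, t): counts[s][(g, t)] for s in ("rct", "obs") for t in (0, 1)}
        report[g] = {"cells": cells, "ok": min(cells.values()) >= min_cell}
    return report


# Hypothetical toy data: the trial has very few treated "old" patients.
rct_rows = [("young", 0)] * 50 + [("young", 1)] * 50 + [("old", 0)] * 30 + [("old", 1)] * 5
obs_rows = [("young", 0)] * 200 + [("young", 1)] * 150 + [("old", 0)] * 80 + [("old", 1)] * 60

report = overlap_report(rct_rows, obs_rows)
flagged = [g for g, r in report.items() if not r["ok"]]
print("subgroups with sparse cells:", flagged)
```

A flagged subgroup does not show bias. It shows that the comparison for that subgroup rests on too few units for a difference to be attributed to bias with any confidence.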

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important considerations for the practical application of our method. We address each major comment below.

  1. Referee: [methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.

    Authors: We agree that the lower bound is interpretable as a bound on bias strength only under the assumption that the selected conditioning features suffice for identifiability and comparability of conditional treatment effects. This assumption is stated explicitly in the paper's setup. While domain knowledge typically guides feature selection in such benchmarking exercises, we acknowledge the value of diagnostics. In the revision we will add a dedicated subsection on this assumption, recommend overlap diagnostics and sensitivity checks for omitted covariates, and include an empirical sensitivity analysis to alternative feature sets in the real-data example. revision: yes

  2. Referee: [methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.

    Authors: The tolerance is a user-specified parameter reflecting the maximum acceptable difference in conditional effects, to be chosen according to clinical or substantive criteria. We agree that explicit guidance and sensitivity analysis would improve usability. In the revised manuscript we will expand the methods section with recommendations for tolerance selection (e.g., linking to minimal clinically important differences) and will report a sensitivity analysis in the real-data application that quantifies how the estimated lower bound changes across a range of tolerance values. revision: yes
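
A tolerance sensitivity analysis of the kind promised in this response might be sketched as a sweep over a grid of tolerance values. This is a simplified stand-in rather than the authors' test: it assumes per-subgroup effect differences with known standard errors and a Bonferroni-corrected normal approximation, and the numbers in `diffs`, `ses`, and `grid` are invented for illustration.

```python
from statistics import NormalDist


def rejects(diffs, ses, delta, alpha=0.05):
    """Test the null that every subgroup's true difference is at most delta in absolute value.

    Rejects when some estimated difference exceeds delta by more than its
    Bonferroni-corrected noise allowance (an assumption, not the paper's test).
    """
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(diffs)))
    return any(abs(d) - z * s > delta for d, s in zip(diffs, ses))


# Hypothetical estimated differences (observational minus trial) and standard errors.
diffs = [0.02, -0.05, 0.40]
ses = [0.03, 0.04, 0.08]

grid = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]
decisions = {delta: rejects(diffs, ses, delta) for delta in grid}
largest_rejected = max((d for d, r in decisions.items() if r), default=None)

for delta, rejected in decisions.items():
    print(f"tolerance {delta:.2f}: {'reject' if rejected else 'do not reject'}")
print("largest rejected tolerance:", largest_rejected)
```

With these invented numbers the decision flips between tolerances 0.20 and 0.25, so the largest rejected grid value is 0.2. Reporting the whole sweep rather than one decision lets a reader see how far their preferred tolerance sits from that flip.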

Circularity Check

0 steps flagged

No circularity: external RCT benchmark and explicit identifiability assumption keep derivation self-contained

The paper's core procedure compares conditional treatment effects from the observational study against an external RCT benchmark via a statistical test of the null that the effects differ by at most a tolerance; the lower bound on maximum bias is then derived from that comparison. This structure uses independent external data and states the sufficiency assumption for identifiability explicitly rather than deriving it from the fitted quantities or the same dataset. No equation reduces by construction to a parameter fitted on the target data, no self-citation chain bears the central claim, and the method does not rename or smuggle in prior results from the same authors. The derivation therefore does not presuppose its own conclusion.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only view yields limited visibility into parameters or axioms; the tolerance parameter in the null hypothesis is likely a free choice, and the feature set is treated as given.

free parameters (1)
  • tolerance parameter
    The null allows effects to differ up to some tolerance; its value is chosen by the analyst.
