Detecting critical treatment effect bias in small subgroups
Pith reviewed 2026-05-24 02:07 UTC · model grok-4.3
The pith
A statistical test yields an asymptotically valid lower bound on the largest bias in any subgroup's treatment effect when an observational study is benchmarked against a randomized trial.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design a statistical test for the null hypothesis that treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study.
What carries the argument
An asymptotically valid lower bound on the maximum bias strength across all subgroups, obtained by testing whether the conditional treatment effects from the two studies differ by more than a given tolerance.
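One way to write this down is sketched below. The notation is assumed here for illustration and is not taken from the paper, whose exact formulation may differ. The bound is obtained by test inversion: it is the largest tolerance the test still rejects, which is valid provided that rejecting a given tolerance implies rejecting every smaller one.

```latex
% Assumed notation: X are the conditioning features, \delta the tolerance,
% \alpha the test level, n the sample size. None of these symbols come from the source.
\begin{align*}
\tau_{\mathrm{rct}}(x) &= \mathbb{E}_{\mathrm{rct}}\bigl[\,Y(1) - Y(0) \mid X = x\,\bigr],
  \qquad \tau_{\mathrm{os}}(x) = \text{the analogous conditional effect in the observational study}, \\
H_0(\delta) &:\ \sup_{x}\,\bigl|\tau_{\mathrm{os}}(x) - \tau_{\mathrm{rct}}(x)\bigr| \le \delta, \\
\hat{\delta}_{\mathrm{LB}} &= \sup\bigl\{\delta \ge 0 : H_0(\delta) \text{ is rejected at level } \alpha\bigr\}, \\
\liminf_{n \to \infty}\ \Pr\bigl(\hat{\delta}_{\mathrm{LB}} \le \delta^{\ast}\bigr) &\ge 1 - \alpha,
  \qquad \delta^{\ast} = \sup_{x}\,\bigl|\tau_{\mathrm{os}}(x) - \tau_{\mathrm{rct}}(x)\bigr|.
\end{align*}
```

Under the load-bearing premise that the features make the two conditional effects comparable, the quantity \delta^{\ast} can be read as the maximum bias strength.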
If this is right
- If the estimated lower bound exceeds the tolerance, the observational study cannot be trusted for decisions on that subgroup without further adjustment.
- The method extends benchmarking from the population average to every subgroup defined by the conditioning features.
- Validation on real medical data produces conclusions consistent with established clinical knowledge.
- The bound is valid only asymptotically, so its guarantee becomes more reliable as sample sizes grow in both studies and may not hold in small samples.
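To make the subgroup-level idea concrete, here is a minimal sketch on simulated data. It is not the paper's test: it swaps in a naive per-subgroup difference in means with a conservative normal critical value, and every name in it (`simulate`, `subgroup_effect`, `max_gap_lower_bound`, the value `z=2.5`) is hypothetical. The simulated observational study is confounded only in subgroup 2, which is where a population-average comparison would dilute the signal.

```python
import math
import random
from statistics import mean, variance


def simulate(n, confounded, rng):
    """Draw (group, treatment, outcome) rows; the true effect is 1.0 in every subgroup."""
    rows = []
    for _ in range(n):
        g = rng.randrange(3)      # subgroup label: 0, 1 or 2
        u = rng.gauss(0.0, 1.0)   # unobserved variable that shifts outcomes in subgroup 2
        # Hidden confounding: in the observational study, u also drives
        # treatment uptake, but only inside subgroup 2.
        p = 1.0 / (1.0 + math.exp(-2.0 * u)) if (confounded and g == 2) else 0.5
        t = 1 if rng.random() < p else 0
        y = 1.0 * t + (2.0 * u if g == 2 else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((g, t, y))
    return rows


def subgroup_effect(rows, group):
    """Difference in means between arms, with its standard error, in one subgroup."""
    y1 = [y for g, t, y in rows if g == group and t == 1]
    y0 = [y for g, t, y in rows if g == group and t == 0]
    est = mean(y1) - mean(y0)
    se = math.sqrt(variance(y1) / len(y1) + variance(y0) / len(y0))
    return est, se


def max_gap_lower_bound(rct, obs, groups, z=2.5):
    """Naive lower bound on the largest subgroup gap between the two studies.

    z is a hypothetical critical value, chosen conservatively to allow for
    looking at several subgroups at once.
    """
    best = 0.0
    for group in groups:
        est_r, se_r = subgroup_effect(rct, group)
        est_o, se_o = subgroup_effect(obs, group)
        gap = abs(est_o - est_r)
        # Shrink the observed gap by its sampling uncertainty; never go below zero.
        best = max(best, gap - z * math.hypot(se_r, se_o))
    return best


rng = random.Random(0)
rct = simulate(6000, confounded=False, rng=rng)
obs_biased = simulate(6000, confounded=True, rng=rng)
obs_clean = simulate(6000, confounded=False, rng=rng)

bound_biased = max_gap_lower_bound(rct, obs_biased, groups=range(3))
bound_clean = max_gap_lower_bound(rct, obs_clean, groups=range(3))
print(f"lower bound, confounded study:   {bound_biased:.2f}")
print(f"lower bound, unconfounded study: {bound_clean:.2f}")
```

The gap between study estimates can be read as bias only if the subgroup labels are enough to make the two effects comparable, which is the premise the page flags as load-bearing.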
Where Pith is reading between the lines
- The same bounding technique could be applied when multiple observational sources are available instead of one.
- If the bound proves tight in practice, it could help prioritize which additional features to collect in future studies.
- The approach might generalize to checking consistency across different randomized trials that share the same conditioning features.
Load-bearing premise
The chosen set of features is enough to make the conditional treatment effects identifiable and comparable between the randomized trial and the observational study.
What would settle it
A dataset in which the computed lower bound stays low while independent evidence shows large treatment effect differences in at least one subgroup defined by the same features would contradict the bound's validity.
Original abstract
Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a statistical test of the null that conditional treatment effects estimated from an RCT and an observational study differ by at most a user-specified tolerance, then derives an asymptotically valid lower bound on the maximum bias strength across any subgroup in the observational study. The method is illustrated and validated on real-world medical data where the resulting conclusions align with established clinical knowledge.
Significance. If the derivation and identifiability assumptions hold, the work supplies a concrete, asymptotically justified tool for benchmarking subgroup-level treatment-effect estimates from observational data against RCTs. This addresses a practical need in medical statistics where RCTs often lack generalizability. The explicit asymptotic validity claim and the real-data validation that matches domain knowledge are positive features.
major comments (2)
- [methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.
- [methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.
minor comments (2)
- [methods] Notation for the conditioning feature set and the tolerance parameter should be introduced once with a clear table or list of symbols to avoid later ambiguity.
- [real-world validation] The real-world validation section would benefit from an explicit statement of the exact feature set used and a brief overlap or positivity diagnostic, even if informal.
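An informal overlap diagnostic of the kind the referee asks for could look like the sketch below. It is an assumption-laden illustration, not something the paper supplies: the function names, the `min_cell` threshold, and the toy counts are all hypothetical. It counts treated and control units per subgroup in each study and keeps only subgroups where both arms are populated in both studies.

```python
from collections import Counter


def arm_counts(rows, min_cell=20):
    """Count units per (subgroup, arm) and flag subgroups where one arm is sparse.

    rows is an iterable of (subgroup, treatment) pairs with treatment in {0, 1}.
    min_cell is a hypothetical threshold, not a value taken from the paper.
    """
    counts = Counter(rows)
    report = {}
    for group in sorted({g for g, _ in counts}):
        treated, control = counts[(group, 1)], counts[(group, 0)]
        report[group] = {
            "treated": treated,
            "control": control,
            "ok": min(treated, control) >= min_cell,
        }
    return report


def comparable_subgroups(rct_report, obs_report):
    """Subgroups that pass the sparsity check in both studies."""
    return sorted(
        g for g in rct_report
        if rct_report[g]["ok"] and obs_report.get(g, {}).get("ok", False)
    )


# Toy data for illustration only.
rct_rows = [("A", 1)] * 30 + [("A", 0)] * 25 + [("B", 1)] * 40 + [("B", 0)] * 35
obs_rows = [("A", 1)] * 200 + [("A", 0)] * 180 + [("B", 1)] * 150 + [("B", 0)] * 3

rct_report = arm_counts(rct_rows)
obs_report = arm_counts(obs_rows)
usable = comparable_subgroups(rct_report, obs_report)
print(usable)
```

A count-based check like this only catches gross positivity failures; it says nothing about omitted covariates, which would need a separate sensitivity analysis.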
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important considerations for the practical application of our method. We address each major comment below.
- Referee: [methods (null hypothesis and lower-bound derivation)] The derivation of the lower bound (methods section, statement of the null and subsequent bound construction) treats the chosen conditioning features as sufficient to render conditional treatment effects identifiable and comparable across the two studies. No diagnostic (e.g., overlap checks, sensitivity to omitted covariates, or empirical test of the identifiability condition) is supplied; violation of this premise would mean observed differences are not necessarily bias, directly undermining the interpretation of the reported lower bound as a bound on bias strength.
Authors: We agree that the lower bound is interpretable as a bound on bias strength only under the assumption that the selected conditioning features suffice for identifiability and comparability of conditional treatment effects. This assumption is stated explicitly in the paper's setup. While domain knowledge typically guides feature selection in such benchmarking exercises, we acknowledge the value of diagnostics. In the revision we will add a dedicated subsection on this assumption, recommend overlap diagnostics and sensitivity checks for omitted covariates, and include an empirical sensitivity analysis to alternative feature sets in the real-data example. revision: yes
- Referee: [methods (tolerance parameter)] The tolerance parameter appears as a free input with no guidance or sensitivity analysis on its selection (methods section). Because the lower bound is a direct function of this tolerance, the absence of any quantification of how the bound changes with the tolerance weakens the practical claim that the procedure yields a useful, interpretable bound on maximum bias.
Authors: The tolerance is a user-specified parameter reflecting the maximum acceptable difference in conditional effects, to be chosen according to clinical or substantive criteria. We agree that explicit guidance and sensitivity analysis would improve usability. In the revised manuscript we will expand the methods section with recommendations for tolerance selection (e.g., linking to minimal clinically important differences) and will report a sensitivity analysis in the real-data application that quantifies how the estimated lower bound changes across a range of tolerance values. revision: yes
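A tolerance sensitivity analysis of the kind promised here can be sketched as a sweep over candidate tolerances. The example below is illustrative only: it uses a simple one-sided z-test on a single estimated effect gap rather than the paper's test, and the inputs (a gap of 0.8 with standard error 0.15) are made-up numbers, not results from the paper.

```python
def rejects(gap, se, delta, z=1.645):
    """One-sided z-test of H0: |true gap| <= delta, given an estimated gap and its SE."""
    return abs(gap) - delta > z * se


def tolerance_sweep(gap, se, deltas, z=1.645):
    """Return the test decision for each tolerance and the largest rejected tolerance."""
    decisions = {delta: rejects(gap, se, delta, z) for delta in deltas}
    rejected = [delta for delta, r in decisions.items() if r]
    return decisions, (max(rejected) if rejected else None)


# Hypothetical inputs: an estimated effect gap of 0.8 with standard error 0.15.
deltas = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
decisions, largest_rejected = tolerance_sweep(gap=0.8, se=0.15, deltas=deltas)
for delta, r in decisions.items():
    print(f"delta={delta:.1f}  reject={r}")
print("largest rejected tolerance:", largest_rejected)
```

Reporting the full sweep, rather than a single tolerance, lets a reader compare the point where rejection stops with a clinically motivated threshold such as a minimal clinically important difference.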
Circularity Check
No circularity: external RCT benchmark and explicit identifiability assumption keep derivation self-contained
The paper's core procedure compares conditional treatment effects from the observational study against an external RCT benchmark via a statistical test of the null that the effects differ by at most a tolerance; the lower bound on maximum bias is then derived from that comparison. This structure uses independent external data and states the sufficiency assumption for identifiability explicitly rather than deriving it from the fitted quantities or the same dataset. No equation reduces by construction to a parameter fitted on the target data, no self-citation chain bears the central claim, and the method does not rename or smuggle in prior results from the same authors. The derivation therefore does not assume what it sets out to show.
Axiom & Free-Parameter Ledger
free parameters (1)
- tolerance parameter
Appendix excerpt (ablation on the feature set)
When |X_J| = 3, we select the features that capture the bias between the RCT and OS datasets (newbie, mens, channel), and hence we achieve the highest power. Intuitively, if the feature set is smaller, some of the bias averages out, and the test loses power. On the other hand, when increasing the feature set, the test loses power due to the curse of dimensi...