Machine learning models for estimating counterfactuals in a single-arm inflammatory bowel disease study
Pith reviewed 2026-05-08 08:20 UTC · model grok-4.3
The pith
Machine learning models trained on external IFX data can generate virtual control arms yielding treatment effect estimates that match propensity score matching in a single-arm pediatric Crohn's study.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training five machine learning models on an external IFX-treated cohort to predict 1-year SFCR and CRP-SFCR counterfactual outcomes for ADA-treated patients produces odds ratios and confidence intervals that closely match those from propensity score matching to external controls, with LightGBM performing best and all intervals consistent with no statistical difference between ADA and IFX.
What carries the argument
Counterfactual prediction models (especially LightGBM) trained on IFX external cohort data to simulate control-arm responses for the ADA single-arm cohort.
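The mechanics can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline. It assumes a fitted model has already produced, for each ADA-treated patient, a predicted probability of remission had that patient received IFX. It also assumes the odds ratio compares the observed ADA remission rate with the mean predicted counterfactual rate. The function name, the direction of the ratio (ADA relative to IFX), and the toy numbers are all hypothetical.

```python
def counterfactual_odds_ratio(observed_outcomes, predicted_cf_probs):
    """Odds ratio of observed remission vs. predicted counterfactual remission.

    observed_outcomes: 0/1 outcomes actually seen in the single (ADA) arm.
    predicted_cf_probs: model-predicted probabilities of the same outcome
        under the comparator (IFX), one per patient. Assumed to come from
        a model trained only on the external cohort.
    """
    if len(observed_outcomes) != len(predicted_cf_probs):
        raise ValueError("one prediction per patient is required")
    n = len(observed_outcomes)
    p_obs = sum(observed_outcomes) / n   # observed remission rate
    p_cf = sum(predicted_cf_probs) / n   # mean predicted counterfactual rate
    if not (0 < p_obs < 1 and 0 < p_cf < 1):
        raise ValueError("odds are undefined when a rate is 0 or 1")
    return (p_obs / (1 - p_obs)) / (p_cf / (1 - p_cf))


# Toy data, invented for illustration only.
ada_observed = [1, 0, 1, 1, 0, 1, 0, 1]
ifx_counterfactual = [0.7, 0.4, 0.6, 0.8, 0.3, 0.5, 0.2, 0.5]
odds_ratio = counterfactual_odds_ratio(ada_observed, ifx_counterfactual)
```

An odds ratio near 1 in this toy setup would correspond to the "no difference" reading. Whether the paper averages probabilities or thresholds them into binary predictions first is not stated in the abstract.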
Load-bearing premise
The IFX-treated external cohort provides a sufficiently representative basis for training models that accurately predict counterfactual outcomes in the ADA-treated cohort without major unmeasured confounding or distribution shift.
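One common way to probe this premise is to compare baseline covariates across the two cohorts with standardized mean differences (SMDs). The sketch below is an assumption about how such a check might look, not something the paper reports. The function name and the age values are invented. The pooled-SD denominator is one of several conventions.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(source_values, target_values):
    """SMD for one continuous covariate: difference in means over pooled SD."""
    pooled_sd = sqrt((variance(source_values) + variance(target_values)) / 2)
    diff = mean(target_values) - mean(source_values)
    if pooled_sd == 0:
        # Both cohorts are constant on this covariate.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd


# Hypothetical ages (years) in the training and target cohorts.
ifx_age = [10, 12, 14, 11, 13]
ada_age = [12, 14, 16, 13, 15]
smd_age = standardized_mean_difference(ifx_age, ada_age)
```

A frequently used rule of thumb treats an absolute SMD above roughly 0.1 as a sign of imbalance. A covariate that fails such a check would suggest the IFX-trained model is being asked to extrapolate.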
What would settle it
A randomized trial directly comparing ADA and IFX in a similar pediatric Crohn's population that finds a statistically significant difference in SFCR or CRP-SFCR would contradict the virtual control estimates.
Original abstract
Single-arm trials accelerate study timelines by reducing the number of patients that must be recruited for a concurrent control group. However, these designs require an alternative comparator to estimate treatment effects. One approach is to construct a virtual control arm using a machine learning (ML) model trained on external control data to predict the counterfactual outcomes of the treatment arm. Our aim in this study was to leverage virtual controls by developing and evaluating ML-based counterfactual outcome models trained on IFX-treated patients to predict 1-year steroid-free clinical remission (SFCR) and a composite of C-reactive protein remission plus steroid-free clinical remission (CRP-SFCR) for ADA-treated pediatric Crohn's disease patients, and to compare the resulting IFX-versus-ADA treatment effect estimates with those obtained using propensity score matching to external controls. Five ML models were used to train counterfactual models on the observed IFX cohort data. The resulting models were used to predict the counterfactual outcomes for the ADA arm patients. LGBM yields the best OR closest to the propensity score matched reference, and all 95% CI results align with the conclusion from the reference study that no statistical difference in the primary and secondary outcomes has been observed between the patients treated with ADA or IFX. Our study supports virtual controls as a viable and effective substitute for expensive, lengthy or unethical patient recruitment in an inflammatory bowel disease (IBD) trial. The developed gradient boosted prediction model can be used as a pretrained model to generate IFX counterfactual predictions in future studies, pending external validation and assessment of transportability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops and evaluates five machine learning models (including LGBM) trained on an external IFX-treated cohort to predict counterfactual 1-year steroid-free clinical remission (SFCR) and CRP-SFCR outcomes for ADA-treated pediatric Crohn's disease patients in a single-arm study. The resulting treatment effect estimates (odds ratios and 95% CIs) are compared to a propensity score matching (PSM) reference analysis on external controls; LGBM produces the OR closest to the PSM benchmark, and all model-derived CIs are consistent with the reference conclusion of no statistically significant difference between ADA and IFX. The work concludes that virtual controls via ML are a viable substitute for concurrent controls in IBD trials and offers the LGBM model as a pretrained tool for future use, pending external validation.
Significance. If the transportability and calibration claims hold after proper validation, the approach could meaningfully accelerate single-arm IBD trials by substituting external-data-driven counterfactuals for randomized controls, with direct implications for pediatric Crohn's studies where ethical or recruitment barriers are high. The explicit benchmarking against PSM and the offer of a reusable pretrained model are concrete strengths that would support reproducibility if accompanied by code, diagnostics, and sensitivity checks.
major comments (3)
- [Abstract] Abstract and Methods (model training and application): No details are provided on model training procedures, validation (e.g., cross-validation strategy, hold-out sets), hyperparameter tuning, feature preprocessing, handling of missing data, or performance metrics (AUC, calibration, Brier score) on the IFX training data. Without these, the claim that LGBM produces the 'best' OR cannot be evaluated and the numerical agreement with PSM may reflect shared bias rather than valid counterfactual estimation.
- [Abstract] Abstract and Results (counterfactual prediction): The manuscript reports no diagnostics for covariate overlap, propensity-score distribution overlap, or transportability between the IFX training cohort and ADA target cohort (e.g., no overlap plots, standardized mean differences, or sensitivity analyses for unmeasured confounding or distribution shift). This assumption is load-bearing for the headline result that LGBM ORs and CIs align with the PSM reference.
- [Abstract] Abstract: The statement that 'all 95% CI results align with the conclusion... no statistical difference' is presented without the actual OR/CI values, sample sizes, or variance estimation method for the ML-derived estimates, making it impossible to assess whether the alignment is substantive or merely non-informative.
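The discrimination and calibration metrics named in the first comment are simple to compute once held-out predictions exist. The sketch below shows the Brier score and a rank-based AUC in plain Python. The helper names and toy predictions are hypothetical, and nothing here reflects how the paper evaluated its models.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative prediction count as half.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy held-out outcomes and predicted probabilities, invented for illustration.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7]
brier = brier_score(y, p)
auc = roc_auc(y, p)
```

In practice these would be computed on predictions for patients the model did not see during fitting, for example the held-out fold of a cross-validation split on the IFX cohort.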
minor comments (2)
- [Abstract] The abstract refers to 'five ML models' but does not name them beyond LGBM; listing the full set (e.g., in a table) would improve clarity.
- [Abstract] Clarify whether the PSM reference uses the same external IFX cohort or a distinct one, and whether any patients overlap between training and target cohorts.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the transparency and rigor of our work on ML-based virtual controls in single-arm IBD studies. We address each major comment point by point below and have prepared revisions to add the requested details, diagnostics, and numerical results.
Point-by-point responses
- Referee: [Abstract] Abstract and Methods (model training and application): No details are provided on model training procedures, validation (e.g., cross-validation strategy, hold-out sets), hyperparameter tuning, feature preprocessing, handling of missing data, or performance metrics (AUC, calibration, Brier score) on the IFX training data. Without these, the claim that LGBM produces the 'best' OR cannot be evaluated and the numerical agreement with PSM may reflect shared bias rather than valid counterfactual estimation.
  Authors: We agree that the abstract and Methods section as currently written do not provide these implementation details, which are necessary to fully evaluate model selection and the risk of shared bias with PSM. The manuscript describes training five models (including LGBM) on the external IFX cohort but omits the requested specifics. In the revised version we will expand the Methods with a dedicated model development subsection reporting: stratified 5-fold cross-validation on the IFX data, grid-search hyperparameter tuning, feature preprocessing steps, missing-data handling via multiple imputation, and performance metrics (AUC, calibration plots, Brier scores) for all five models. These additions will allow readers to assess whether LGBM's closer alignment with the PSM benchmark reflects genuine counterfactual validity.
  Revision: yes
- Referee: [Abstract] Abstract and Results (counterfactual prediction): The manuscript reports no diagnostics for covariate overlap, propensity-score distribution overlap, or transportability between the IFX training cohort and ADA target cohort (e.g., no overlap plots, standardized mean differences, or sensitivity analyses for unmeasured confounding or distribution shift). This assumption is load-bearing for the headline result that LGBM ORs and CIs align with the PSM reference.
  Authors: This is a substantive point; the transportability assumption is indeed central and was not explicitly diagnosed in the current manuscript. While the PSM benchmark implicitly relies on overlap, we did not report supporting diagnostics. The revised manuscript will add: covariate overlap plots, standardized mean differences between IFX and ADA cohorts, propensity-score distribution comparisons, and sensitivity analyses for unmeasured confounding and distribution shift (including simulation-based checks). These will be presented alongside the main results to strengthen the justification for applying the IFX-trained model to the ADA cohort.
  Revision: yes
- Referee: [Abstract] Abstract: The statement that 'all 95% CI results align with the conclusion... no statistical difference' is presented without the actual OR/CI values, sample sizes, or variance estimation method for the ML-derived estimates, making it impossible to assess whether the alignment is substantive or merely non-informative.
  Authors: We acknowledge that the abstract's qualitative summary of alignment lacks the quantitative detail needed for evaluation. The manuscript states that LGBM yields the OR closest to PSM and that all CIs are consistent with no difference, but does not report the numbers. In the revision we will update the abstract and Results to include the specific ORs and 95% CIs from the LGBM model and the PSM reference, the sample sizes for the IFX training cohort and ADA target cohort, and the variance estimation method (bootstrap resampling of the predicted counterfactual outcomes). This will make the degree of alignment transparent and evaluable.
  Revision: yes
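The last response names bootstrap resampling as the variance estimation method but does not say how it would be carried out. One plausible reading is a percentile bootstrap over patients, sketched below. The function names, the choice to resample observed outcomes and predictions jointly, and the toy data are all assumptions made for illustration.

```python
import random


def odds(p):
    return p / (1 - p)


def bootstrap_or_ci(observed, cf_probs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the observed-vs-counterfactual OR.

    Patients are resampled with replacement; each resample keeps a patient's
    observed outcome and predicted counterfactual probability together.
    """
    rng = random.Random(seed)
    n = len(observed)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        p_obs = sum(observed[i] for i in idx) / n
        p_cf = sum(cf_probs[i] for i in idx) / n
        # Skip degenerate resamples where the odds are undefined.
        if 0 < p_obs < 1 and 0 < p_cf < 1:
            estimates.append(odds(p_obs) / odds(p_cf))
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    last = len(estimates) - 1
    return estimates[int((alpha / 2) * last)], estimates[int((1 - alpha / 2) * last)]


# Toy data, invented for illustration only.
ada_observed = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
ifx_counterfactual = [0.7, 0.4, 0.6, 0.8, 0.3, 0.5, 0.2, 0.5, 0.6, 0.4, 0.7, 0.3]
ci_low, ci_high = bootstrap_or_ci(ada_observed, ifx_counterfactual)
```

A scheme like this holds the fitted model fixed, so it does not carry over uncertainty from training on the IFX cohort. An interval that also refits the model within each resample would generally be wider.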
Circularity Check
No significant circularity; derivation relies on external training data and independent benchmark
The paper trains five ML models (including LGBM) exclusively on the observed IFX-treated external cohort to predict counterfactual SFCR and CRP-SFCR outcomes for the ADA-treated target cohort using their observed covariates. The resulting ORs and 95% CIs are then compared to a separate propensity-score-matched analysis performed on external controls. No equations, fitting procedures, or self-citations reduce the reported treatment-effect estimates to parameters defined by the ADA target data itself. The method is self-contained against the external IFX training set and the independent PSM reference; transportability assumptions are stated but do not create definitional or fitted-input circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The IFX cohort data are representative of the ADA population, so that models trained on them transport to it for counterfactual prediction.