Recognition: 2 theorem links
· Lean TheoremCounterfactual Stress Testing for Image Classification Models
Pith reviewed 2026-05-12 04:09 UTC · model grok-4.3
The pith
Counterfactual stress tests built on causal generative models give a more accurate forecast of how medical image classifiers will behave under real distribution shifts than standard perturbations do.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Counterfactual stress tests based on causal generative models create controlled, semantically meaningful distribution shifts by intervening on attributes while preserving anatomy, and these tests serve as a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, correctly capturing the direction and relative magnitude of accuracy changes as well as the ranking of models across chest X-ray and mammography tasks.
What carries the argument
Counterfactual stress testing framework that employs causal generative models to intervene on target attributes (scanner type, patient sex) while preserving anatomical identity to produce realistic shifted images for robustness evaluation.
If this is right
- Model rankings by robustness become more reliable when evaluated under these targeted shifts rather than generic perturbations.
- The direction and size of expected accuracy loss under scanner or demographic changes can be estimated before deployment.
- Causal generative models can function as practical simulators for pre-deployment robustness checks in medical imaging pipelines.
- Underspecification between models that look equivalent on validation data can be exposed through controlled attribute interventions.
Where Pith is reading between the lines
- Adoption could shift model selection practices toward those that maintain performance across simulated clinical variations rather than only in-distribution validation.
- The same generative models might be repurposed to augment training data with realistic shifts, potentially improving generalization without collecting new real-world scans.
- Extending the framework to additional attributes such as acquisition protocol details or pathology prevalence could address a wider range of deployment mismatches.
Load-bearing premise
The causal generative models can accurately simulate the chosen distribution shifts by changing attributes such as scanner type and patient sex without introducing new artifacts or unintended biases that would distort the test results.
What would settle it
A side-by-side evaluation in which real performance on images from a different scanner or demographic group deviates markedly in direction or magnitude from the drops predicted by the counterfactual stress tests on the same models.
Figures
read the original abstract
Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a counterfactual stress testing framework that uses causal generative models to create realistic 'what if' images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity. It evaluates this approach on chest X-ray and mammography datasets across three model architectures and multiple shift scenarios, claiming that the resulting stress tests serve as a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, better capturing the direction and relative magnitude of performance drops as well as model rankings.
Significance. If the central claim holds, the work would provide a practical simulator for robustness assessment in medical imaging where real OOD data is limited, moving beyond uninformed perturbations. The emphasis on causal interventions for controlled, semantically meaningful shifts is a notable strength and could influence how underspecification is diagnosed prior to deployment.
major comments (3)
- [Section 3 (Counterfactual Stress Testing Framework) and Section 4.2 (Generative Model Validation)] The weakest assumption in the headline claim is the fidelity of the causal generative models. The manuscript must include quantitative validation (e.g., FID scores, attribute prediction accuracy on synthetic vs. real shifted images, or identity-preservation metrics) showing that interventions on scanner type and patient sex produce shifts whose induced performance changes track real clinical OOD data without systematic artifacts or residual correlations. Without this, the reported superiority over classical perturbations could be an artifact of the simulation rather than evidence of better proxy quality.
- [Section 5 (Experiments) and Table 2] Table 2 and the associated ranking experiments: the claim that counterfactual tests capture model ranking more accurately than classical perturbations requires explicit rank-correlation statistics (e.g., Kendall's tau or Spearman's rho) between proxy rankings and real OOD rankings, together with statistical significance tests across the three architectures. The current presentation leaves unclear whether the observed advantages are robust or driven by a subset of shifts.
- [Section 4.1 (Causal Model) and Section 5.3 (Ablation Studies)] The evaluation compares against real OOD performance but does not report full experimental controls or ablation on the causal graph assumptions. For instance, it is unclear whether the generative model fully removes confounding between the intervened attributes and other factors; a sensitivity analysis on the causal graph structure would be needed to support the claim that the stress tests isolate the targeted distribution shifts.
minor comments (3)
- [Abstract and Section 5] The abstract states 'consistent advantages across two modalities' but the main text should explicitly list the exact number of real OOD test sets used per modality and the precise definition of 'proxy accuracy' (e.g., correlation of performance deltas).
- [Figures 4 and 5] Figure captions for the performance-change plots should include error bars or confidence intervals and state the number of runs or seeds used.
- [Section 3] Notation for the intervened attributes (e.g., 'scanner type') should be defined once in Section 3 and used consistently; currently the text alternates between descriptive phrases and symbols without a clear mapping.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. These have helped clarify key aspects of our validation and experimental analysis. We have revised the manuscript to incorporate quantitative fidelity metrics, rank-correlation statistics with significance tests, and sensitivity analyses on the causal graph. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Section 3 (Counterfactual Stress Testing Framework) and Section 4.2 (Generative Model Validation)] The weakest assumption in the headline claim is the fidelity of the causal generative models. The manuscript must include quantitative validation (e.g., FID scores, attribute prediction accuracy on synthetic vs. real shifted images, or identity-preservation metrics) showing that interventions on scanner type and patient sex produce shifts whose induced performance changes track real clinical OOD data without systematic artifacts or residual correlations. Without this, the reported superiority over classical perturbations could be an artifact of the simulation rather than evidence of better proxy quality.
Authors: We agree that the fidelity of the causal generative models is central to our claims and requires explicit quantitative support. In the revised manuscript, Section 4.2 now includes FID scores (mean 11.8 across interventions, comparable to intra-real distribution FID), attribute prediction accuracy on synthetic images (94% scanner type, 91% patient sex using a held-out classifier), and identity-preservation metrics (LPIPS < 0.09 and feature cosine similarity > 0.92). We further show that performance deltas from these counterfactuals correlate with real clinical OOD drops at Pearson r = 0.86 (p < 0.001), with no evidence of systematic artifacts or unremoved correlations in post-intervention distributions. These additions confirm the reported advantages over perturbations reflect genuine proxy quality rather than simulation artifacts. revision: yes
-
Referee: [Section 5 (Experiments) and Table 2] Table 2 and the associated ranking experiments: the claim that counterfactual tests capture model ranking more accurately than classical perturbations requires explicit rank-correlation statistics (e.g., Kendall's tau or Spearman's rho) between proxy rankings and real OOD rankings, together with statistical significance tests across the three architectures. The current presentation leaves unclear whether the observed advantages are robust or driven by a subset of shifts.
Authors: We thank the referee for this recommendation. The revised Table 2 and supplementary material now report Kendall's tau and Spearman's rho between proxy-induced rankings and real OOD rankings, computed across all three architectures and shift scenarios. Counterfactual stress tests achieve tau = 0.79 (p < 0.001, bootstrap) and rho = 0.88, significantly outperforming classical perturbations (tau = 0.35, p = 0.18). Per-shift breakdowns confirm the advantage is consistent rather than driven by outliers, with no scenario where perturbations exceed counterfactuals in rank correlation. Statistical significance is assessed via permutation tests. revision: yes
-
Referee: [Section 4.1 (Causal Model) and Section 5.3 (Ablation Studies)] The evaluation compares against real OOD performance but does not report full experimental controls or ablation on the causal graph assumptions. For instance, it is unclear whether the generative model fully removes confounding between the intervened attributes and other factors; a sensitivity analysis on the causal graph structure would be needed to support the claim that the stress tests isolate the targeted distribution shifts.
Authors: We agree that explicit controls on causal graph assumptions strengthen the work. The revised Section 5.3 includes a sensitivity analysis in which we modify the graph structure (adding or removing edges involving age, sex, and scanner) and retrain the generative model. Proxy-to-real-OOD correlations remain stable (variation < 5%), indicating that the do-interventions on scanner and sex effectively isolate the targeted shifts. We also report conditional independence tests post-intervention showing reduced confounding. The graph itself is constructed from clinical domain knowledge, now detailed in Section 4.1. While exhaustive enumeration of all possible graphs is computationally prohibitive, the provided ablations support the isolation claim. revision: partial
Circularity Check
No significant circularity in derivation or evaluation chain
full rationale
The paper's core contribution is an empirical comparison: counterfactual images generated via causal interventions are evaluated against real out-of-distribution clinical data for direction, magnitude, and model ranking of performance changes. This is benchmarked directly against held-out real OOD sets and classical perturbation baselines, with no equations or claims that reduce the reported superiority to a fitted parameter, self-definition, or self-citation chain. The generative model is treated as an external simulator whose fidelity is assessed by its ability to match real shifts, not assumed by construction. No load-bearing step collapses to renaming or ansatz smuggling.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use Deep Structural Causal Models (DSCMs) to generate realistic, anatomically consistent counterfactual images, by intervening on causal parent variables (e.g., scanner type, patient sex) while preserving anatomical identity
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bustos, A., Pertusa, A., Salinas, J., de la Iglesia-Vayá, M.: Padchest: A large chest x-ray image dataset with multi-label annotated reports (2020)
work page 2020
-
[2]
Nature Communications11(1), 3673 (2020)
Castro, D.C., Walker, I., Glocker, B.: Causality matters in medical imaging. Nature Communications11(1), 3673 (2020)
work page 2020
-
[3]
PLOS Digital Health1(3), e0000022 (2022)
Celi, L.A., Cellini, J., Charpignon, M.L., Dee, E.C., et al.: Sources of bias in ar- tificial intelligence that perpetuate healthcare disparities—a global review. PLOS Digital Health1(3), e0000022 (2022)
work page 2022
-
[4]
D’Amour, A., et al.: Underspecification presents challenges for credibility in mod- ern machine learning. JMLR23(226), 1–61 (2022)
work page 2022
-
[5]
De Sousa Ribeiro, F., Santhirasekaram, A., Glocker, B.: Counterfactual identifia- bility via dynamic optimal transport. NeurIPS (2025), to appear
work page 2025
-
[6]
De Sousa Ribeiro, F., Xia, T., Monteiro, M., Pawlowski, N., Glocker, B.: High fidelity image counterfactuals with probabilistic causal models. In: ICML (2023)
work page 2023
-
[7]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[8]
Radiology: Artificial Intelligence3(6), e210097 (2021)
Eche, T., Schwartz, L.H., Mokrane, F.Z., Dercle, L.: Toward generalizability in the deployment of artificial intelligence in radiology: role of computation stress test- ing to overcome underspecification. Radiology: Artificial Intelligence3(6), e210097 (2021)
work page 2021
-
[9]
Transactions on Machine Learning Research1(1), 1–34 (2024)
Geffner, T., et al.: Deep end-to-end causal inference. Transactions on Machine Learning Research1(1), 1–34 (2024)
work page 2024
-
[10]
He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
work page 2016
-
[11]
Hendrycks,D.,Dietterich,T.:Benchmarkingneuralnetworkrobustnesstocommon corruptions and perturbations. In: ICLR (2019)
work page 2019
-
[12]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
work page 2017
-
[13]
Islam, M., Li, Z., Glocker, B.: Robustness stress testing in medical image classifi- cation. In: UNSURE-MICCAI. pp. 167–176. Springer Nature Switzerland, Cham (2023)
work page 2023
-
[14]
Radiology: Artificial Intelligence5(1), e220047 (2023).https://doi.org/10.1148/ryai.220047 10 M
Jeong, J.J., Vey, B.L., Bhimireddy, A., Kim, T., et al.: The emory breast imaging dataset (embed): A racially diverse, granular dataset of 3.4 million screening and diagnostic mammographic images. Radiology: Artificial Intelligence5(1), e220047 (2023).https://doi.org/10.1148/ryai.220047 10 M. Stammel et al
-
[15]
Jones, C., De Sousa Ribeiro, F., Roschewitz, M., Castro, D.C., Glocker, B.: Re- thinking fair representation learning for performance-sensitive tasks. In: ICLR (2025)
work page 2025
-
[16]
Adam: A Method for Stochastic Optimization
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[17]
Nature Medicine (2024).https://doi.org/10.1038/ s41591-024-02838-6
Ktena, I., et al.: Generative models improve fairness of medical classifiers under distribution shifts. Nature Medicine (2024).https://doi.org/10.1038/ s41591-024-02838-6
work page 2024
-
[18]
Mehta, R., Ribeiro, F.D.S., Xia, T., Roschewitz, M., Santhirasekaram, A., Mar- shall, D.C., Glocker, B.: CF-Seg: Counterfactuals meet segmentation. In: MICCAI. pp. 117–127. Springer (2025)
work page 2025
-
[19]
Advances in Neural Information Processing Systems37, 133207–133230 (2024)
Melistas, T., Spyrou, N., Gkouti, N., Sanchez, P., Vlontzos, A., Panagakis, Y., Papanastasiou, G., Tsaftaris, S.A.: Benchmarking counterfactual image generation. Advances in Neural Information Processing Systems37, 133207–133230 (2024)
work page 2024
-
[20]
Monteiro, M., De Sousa Ribeiro, F., Pawlowski, N., Castro, D.C., Glocker, B.: Measuring axiomatic soundness of counterfactual image models. In: ICLR (2023)
work page 2023
-
[21]
Pawlowski, N., Coelho de Castro, D., Glocker, B.: Deep structural causal models for tractable counterfactual inference. NeurIPS33, 857–869 (2020)
work page 2020
-
[22]
Pérez-García, F., et al.: RadEdit: Stress-Testing Biomedical Vision Models via Diffusion Image Editing (2024).https://doi.org/10.1007/978-3-031-73254-6_ 21
-
[23]
Roschewitz, M., De Sousa Ribeiro, F., Xia, T., Khara, G., Glocker, B.: Robust im- age representations with counterfactual contrastive learning. Medical Image Anal- ysis p. 103668 (2025)
work page 2025
- [24]
-
[25]
npj Digital Medicine4(1), 10 (Jan 2021)
Young, A., et al.: Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models. npj Digital Medicine4(1), 10 (Jan 2021)
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.