Mitigating Clever Hans Strategies in Image Classifiers through Generating Counterexamples
Pith reviewed 2026-05-18 05:46 UTC · model grok-4.3
The pith
Counterfactual Knowledge Distillation corrects Clever Hans predictors in image classifiers by generating annotated counterexamples without confounder labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CFKD sidesteps the need for confounder labels by generating diverse counterfactuals that let a human annotator efficiently explore and correct the model's decision boundaries; the corrections are transferred to the model through a knowledge distillation step. The method reweights undersampled groups and also enriches them with new data points, which is claimed to yield balanced generalization across groups.
What carries the argument
Counterfactual Knowledge Distillation (CFKD), which combines a counterfactual explainer to produce new examples with human annotation followed by distillation to transfer corrected decision boundaries into the final model.
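The data-collection half of such a loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `collect_annotated_counterfactuals`, the two-feature toy inputs, and the three stand-in callables (`student`, `flip_spurious`, `oracle`) are all hypothetical, and the distillation step (retraining the model on the augmented set) is left out.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # toy input: (true_feature, spurious_feature)
Labeled = Tuple[Sample, int]


def collect_annotated_counterfactuals(
    data: List[Labeled],
    student_predict: Callable[[Sample], int],
    explain: Callable[[Sample, int], Sample],
    annotate: Callable[[Sample], int],
) -> List[Labeled]:
    """Return the original data plus one annotated counterfactual per sample."""
    augmented = list(data)
    for x, _ in data:
        target = 1 - student_predict(x)  # binary task: request the opposite prediction
        x_cf = explain(x, target)        # explainer edits x so the student predicts `target`
        y_cf = annotate(x_cf)            # annotator supplies the true label of the edited input
        augmented.append((x_cf, y_cf))
    return augmented


# Hypothetical stand-ins: the student looks only at the spurious feature.
def student(x: Sample) -> int:
    return int(x[1] > 0)


def flip_spurious(x: Sample, target: int) -> Sample:
    # Toy explainer: flips the student's output by editing the spurious feature only.
    return (x[0], abs(x[1]) if target == 1 else -abs(x[1]))


def oracle(x: Sample) -> int:
    # Toy annotator: the true label depends on the first feature only.
    return int(x[0] > 0)


train = [((1.0, 1.0), 1), ((-1.0, -1.0), 0)]
augmented = collect_annotated_counterfactuals(train, student, flip_spurious, oracle)
```

In this toy setup the two added points keep their true labels while the spurious feature is flipped, so the augmented set no longer has a perfect correlation between the spurious feature and the label.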
If this is right
- CFKD reweights and enriches undersampled subgroups with newly generated data points rather than only reweighting existing ones.
- The method scales to multiple simultaneous spurious correlations without the data fragmentation that occurs when using explicit group labels.
- Gains are largest in low-data regimes where spurious correlations are pronounced.
- Performance holds across synthetic tasks and an industrial application without requiring any confounder labels.
Where Pith is reading between the lines
- Automating the human annotation step with a learned verifier could reduce the remaining manual effort while preserving the label-free advantage.
- The enrichment mechanism might combine naturally with self-supervised pretraining to improve robustness in large foundation models.
- Extending the framework to video or multimodal data could test whether the same counterfactual correction scales beyond static images.
Load-bearing premise
The counterfactual explainer must produce sufficiently diverse and semantically meaningful examples that a human annotator can use to identify and correct the model's decision boundaries.
What would settle it
Apply CFKD to an image classifier trained on data with a single known spurious correlation, such as background color, then measure whether its accuracy on a test set with that correlation removed is higher than that of a baseline trained without the method.
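A test of this kind is usually summarized by per-group and worst-group accuracy. The sketch below assumes group labels are available for evaluation only (which holds when the spurious correlation is known by construction); the group names and the prediction lists are made-up illustrative values, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_accuracies(
    labels: Sequence[int], preds: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Accuracy computed separately for each evaluation group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(
    labels: Sequence[int], preds: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Lowest per-group accuracy; low values indicate reliance on a shortcut."""
    return min(group_accuracies(labels, preds, groups).values())


# Illustrative values only: "aligned" samples follow the spurious correlation,
# "conflicting" samples break it.
labels = [1, 1, 0, 0, 1, 0]
groups = ["aligned"] * 4 + ["conflicting"] * 2
baseline_preds = [1, 1, 0, 0, 0, 1]
corrected_preds = [1, 1, 0, 0, 1, 1]

baseline_worst = worst_group_accuracy(labels, baseline_preds, groups)
corrected_worst = worst_group_accuracy(labels, corrected_preds, groups)
```

Comparing `baseline_worst` with `corrected_worst` on the correlation-free groups is the comparison the paragraph above describes.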
Original abstract
Deep learning models remain vulnerable to spurious correlations, leading to so-called Clever Hans predictors that undermine robustness even in large-scale foundation and self-supervised models. Group distributional robustness methods, such as Deep Feature Reweighting (DFR), rely on explicit group labels to upweight underrepresented subgroups, but face key limitations: (1) group labels are often unavailable, (2) low within-group sample sizes hinder coverage of the subgroup distribution, and (3) performance degrades sharply when multiple spurious correlations fragment the data into even smaller groups. We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step. Unlike DFR, our method not only reweights the undersampled groups, but it also enriches them with new data points. Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups. We demonstrate CFKD's efficacy across five datasets, spanning synthetic tasks to an industrial application, with particularly strong gains in low-data regimes with pronounced spurious correlations. Additionally, we provide an ablation study on the effect of the chosen counterfactual explainer and teacher model, highlighting their impact on robustness.
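To make the contrast between reweighting and enrichment concrete, the sketch below computes generic inverse-frequency group weights. It is an assumption-laden illustration of group upweighting in general, not DFR's actual procedure; the function name and the toy group sizes are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inverse_frequency_weights(groups: Sequence[Hashable]) -> List[float]:
    """Per-sample weights that give every group the same total weight."""
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Toy example: 8 majority-group samples and 2 minority-group samples.
groups = ["majority"] * 8 + ["minority"] * 2
weights = inverse_frequency_weights(groups)
```

Here each minority sample receives weight 2.5 and each majority sample 0.625, so both groups contribute equally in total. The minority group still consists of the same two distinct samples, which is the coverage limitation the abstract attributes to reweighting alone.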
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Counterfactual Knowledge Distillation (CFKD) as a framework to mitigate Clever Hans (spurious correlation) predictors in image classifiers. It generates diverse counterfactual examples via an explainer, enables a human annotator to identify and correct decision boundaries, and performs knowledge distillation to produce a more robust model. Unlike Deep Feature Reweighting (DFR), CFKD claims to require no confounder/group labels, scale effectively to multiple confounders, and enrich undersampled subgroups with new synthetic points rather than merely reweighting. Efficacy is demonstrated across five datasets (synthetic to industrial) with ablations on the explainer and teacher model, with emphasis on low-data regimes.
Significance. If the counterfactuals prove sufficiently diverse and human-useful, CFKD could provide a practical label-free route to distributional robustness that addresses DFR's limitations on label availability, small subgroup sizes, and fragmentation under multiple confounders. The combination of counterfactual generation with distillation and human-in-the-loop correction is a potentially useful direction for enriching training distributions without explicit annotations.
major comments (2)
- [Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.
- [Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.
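One generic way to quantify the diversity the referee asks about is the mean pairwise cosine distance between counterfactual embeddings. The sketch below is an assumption, not a metric reported in the paper; the embeddings would come from some feature extractor that is not specified here, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # raises ZeroDivisionError for zero vectors


def mean_pairwise_cosine_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average of (1 - cosine similarity) over all unordered pairs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


collapsed = mean_pairwise_cosine_distance([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
spread = mean_pairwise_cosine_distance([[2.0, 0.0], [0.0, 3.0]])
```

A score near 0 means the counterfactuals point in nearly the same direction in embedding space; larger values indicate more varied edits. The score says nothing about semantic validity, which would need a separate check.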
minor comments (2)
- [Method section] Clarify the precise interface between the human annotator and the distillation objective (e.g., how corrections are encoded as soft labels or augmented data).
- [Discussion or conclusion] Add a limitations paragraph discussing failure modes when the chosen explainer produces low-diversity counterfactuals.
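The page does not say how annotator corrections enter the distillation objective. One common option, shown here purely as an assumption, is to turn each correction into a target distribution over classes and minimize the cross-entropy between that distribution and the student's softmax output. The function names and the temperature parameter are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(
    student_logits: Sequence[float],
    soft_targets: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Cross-entropy between a target distribution and the student's prediction."""
    probs = softmax(student_logits, temperature)
    # Terms with zero target mass contribute nothing and are skipped.
    return -sum(q * math.log(p) for q, p in zip(soft_targets, probs) if q > 0)


# The annotator says the counterfactual belongs to class 0.
uniform_loss = soft_label_cross_entropy([0.0, 0.0], [1.0, 0.0])
agree_loss = soft_label_cross_entropy([5.0, 0.0], [1.0, 0.0])
disagree_loss = soft_label_cross_entropy([0.0, 5.0], [1.0, 0.0])
```

A hard correction corresponds to a one-hot target as above; a less certain annotation could spread mass across classes, for example `[0.9, 0.1]`.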
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. The feedback highlights important areas for strengthening the quantitative support and validation of our claims. We address each major comment point by point below, indicating where revisions will be made to the next version of the paper.
- Referee: [Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.
  Authors: We acknowledge that the current presentation of results is primarily high-level and would benefit from additional quantitative rigor. In the revised manuscript, we will expand the experimental results section to include: (i) explicit control experiments (e.g., CFKD without the human correction step and without distillation), (ii) statistical significance testing such as bootstrap confidence intervals or paired statistical tests on the performance differences across the five datasets, and (iii) quantitative metrics for counterfactual quality and diversity, including feature-space distance from original samples and a diversity score based on pairwise embedding similarities. These additions will allow clearer attribution of gains to the CFKD framework rather than other factors.
  Revision: yes
- Referee: [Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.
  Authors: We agree that the load-bearing assumption regarding counterfactual quality requires stronger empirical support. Although the explainer ablation demonstrates performance variation, we did not report explicit metrics for diversity, semantic validity, or human usefulness. In the revision, we will add these to the framework description and ablation study: measures of deviation from spurious features (e.g., correlation analysis with Clever Hans attributes), semantic validity via reconstruction or classifier consistency checks, and results from a small-scale human study assessing usefulness for boundary correction. This will directly address whether the generated points provide new information beyond the original model.
  Revision: yes
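The bootstrap confidence intervals mentioned in the rebuttal could be computed along the following lines. This is a minimal percentile-bootstrap sketch over paired scores (for example, one score per random seed for each method); the function name, default parameters, and the example numbers are assumptions and not values from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    method_scores: Sequence[float],
    baseline_scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(diffs, k=len(diffs))  # sample pairs with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * (n_resamples - 1))]
    upper = means[int(round((1 - alpha / 2) * (n_resamples - 1)))]
    return lower, upper


# Made-up paired accuracies for illustration only.
low, high = paired_bootstrap_ci([0.81, 0.78, 0.85, 0.80], [0.70, 0.74, 0.79, 0.71])
```

If the resulting interval excludes zero, the difference between the method and the baseline is unlikely to be explained by seed-to-seed variation alone, under the usual bootstrap assumptions.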
Circularity Check
No circularity: procedural framework defined independently of fitted quantities or self-citation chains
The paper describes CFKD as a sequence of steps—counterfactual generation via an explainer, human annotation to identify spurious features, and subsequent knowledge distillation—without any equations, parameter fits, or derivations that reduce the reported robustness gains to quantities defined on the same data or to prior self-citations. The central claims rest on empirical demonstration across datasets and an ablation on the explainer choice rather than on a mathematical identity or load-bearing self-reference. This is a standard empirical methods paper whose performance claims are externally falsifiable via the reported experiments and do not collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: counterfactual generation produces examples that differ from the original only along the spurious feature while preserving the true label.
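On a synthetic benchmark where each image comes with known attribute annotations, this assumption can be checked directly. The sketch below assumes such annotations exist as plain dictionaries; the attribute names and the helper functions are hypothetical.

```python
from typing import Dict, Set


def changed_attributes(original: Dict[str, object], counterfactual: Dict[str, object]) -> Set[str]:
    """Names of attributes whose values differ between the two samples."""
    keys = set(original) | set(counterfactual)
    return {k for k in keys if original.get(k) != counterfactual.get(k)}


def satisfies_assumption(
    original: Dict[str, object],
    counterfactual: Dict[str, object],
    spurious: Set[str],
    label_key: str = "label",
) -> bool:
    """True if the label is unchanged and every changed attribute is spurious."""
    changed = changed_attributes(original, counterfactual)
    # Note: an unchanged sample passes trivially, since nothing non-spurious changed.
    return label_key not in changed and changed <= spurious


original = {"label": "cat", "background": "grass", "shape": "round"}
good_cf = {"label": "cat", "background": "snow", "shape": "round"}
bad_cf = {"label": "cat", "background": "snow", "shape": "square"}

good_ok = satisfies_assumption(original, good_cf, {"background"})
bad_ok = satisfies_assumption(original, bad_cf, {"background"})
```

On real images without attribute annotations, no such direct check is available, which is why the assumption is listed as an axiom rather than a verified property.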
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them
XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...