Mitigating Clever Hans Strategies in Image Classifiers through Generating Counterexamples

Grégoire Montavon; Heike Antje Marxfeld; Jan Herrmann; Klaus-Robert Müller; Ole Delzer; Sidney Bender

arxiv: 2510.17524 · v1 · submitted 2025-10-20 · 💻 cs.LG


Pith reviewed 2026-05-18 05:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords counterfactual knowledge distillation · spurious correlations · Clever Hans predictors · image classification robustness · knowledge distillation · counterfactual examples · group robustness


The pith

Counterfactual Knowledge Distillation corrects Clever Hans predictors in image classifiers by generating annotated counterexamples without confounder labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning image classifiers often rely on spurious correlations that create Clever Hans predictors, which fail under distribution shifts. Methods like Deep Feature Reweighting require explicit group labels to upweight underrepresented subgroups but struggle when labels are missing, sample sizes are low, or multiple confounders split the data further. The paper proposes Counterfactual Knowledge Distillation, which generates diverse counterfactual examples so a human annotator can explore and adjust the model's decision boundaries. These annotated examples then guide a knowledge distillation step that reweights and enriches the subgroups. The approach scales to multiple confounders and performs well in low-data regimes across synthetic and real datasets.
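To make the fragmentation point concrete, the sketch below counts the expected size of each group as binary confounders are added. It assumes a balanced binary label and independent binary confounders that each agree with the label with probability `corr`; the function name and these settings are illustrative assumptions, not the paper's setup.

```python
from itertools import product

def expected_group_sizes(n_samples, n_confounders, corr):
    """Expected sample count per (label, confounder values) group.

    Assumes a balanced binary label and independent binary confounders,
    each equal to the label with probability `corr` (illustrative only).
    """
    sizes = {}
    for label in (0, 1):
        for conf in product((0, 1), repeat=n_confounders):
            p = 0.5  # balanced label prior
            for c in conf:
                p *= corr if c == label else 1.0 - corr
            sizes[(label, conf)] = n_samples * p
    return sizes

# Number of groups and expected size of the rarest group for 1, 2, 3 confounders.
for k in (1, 2, 3):
    sizes = expected_group_sizes(1000, k, corr=0.9)
    print(k, len(sizes), round(min(sizes.values()), 2))
```

Under these assumptions, with 1000 samples and `corr=0.9`, the rarest group falls from about 50 expected samples with one confounder to about 5 with two, which is the regime where reweighting existing points has very little to work with.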

Core claim

CFKD sidesteps the need for confounder labels by generating diverse counterfactuals that enable a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step, reweighting undersampled groups while also enriching them with new data points to yield balanced generalization across groups.

What carries the argument

Counterfactual Knowledge Distillation (CFKD), which combines a counterfactual explainer to produce new examples with human annotation followed by distillation to transfer corrected decision boundaries into the final model.
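The loop described above can be sketched on a toy problem. Everything here is a simplified stand-in and an assumption of this sketch: the student is a two-feature logistic regression rather than an image classifier, `counterfactual` is a hypothetical explainer that takes the smallest step across the student's decision boundary, and `teacher` is an oracle that plays the role of the human annotator by looking only at the core feature.

```python
import math
import random

def predict(w, b, x):
    """Student probability for class 1 (logistic regression on 2 features)."""
    z = w[0] * x[0] + w[1] * x[1] + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=300, lr=0.1):
    """Fit the student by batch gradient descent on the log loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        gw0 = gw1 = gb = 0.0
        for x, y in data:
            err = predict(w, b, x) - y
            gw0 += err * x[0]
            gw1 += err * x[1]
            gb += err
        w = [w[0] - lr * gw0 / n, w[1] - lr * gw1 / n]
        b -= lr * gb / n
    return w, b

def counterfactual(w, b, x, overshoot=0.5):
    """Hypothetical explainer: smallest L2 step that crosses the boundary."""
    z = w[0] * x[0] + w[1] * x[1] + b
    norm2 = w[0] ** 2 + w[1] ** 2
    if norm2 == 0.0:
        return x  # untrained student: no direction to move in
    step = (1.0 + overshoot) * z / norm2
    return (x[0] - step * w[0], x[1] - step * w[1])

def teacher(x):
    """Stand-in for the human annotator: labels by the core feature only."""
    return 1 if x[0] > 0 else 0

def make_data(n, corr, rng):
    """Core feature decides the label; the spurious one agrees with prob. corr."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        core = (1 if y else -1) * rng.uniform(0.2, 1.0)
        agree = rng.random() < corr
        spur = (1 if (y == 1) == agree else -1) * rng.uniform(1.0, 2.0)
        data.append(((core, spur), y))
    return data

def cfkd_round(data, w, b):
    """One round: explain, annotate, enrich the data, retrain the student."""
    new_points = []
    for x, _ in data:
        x_cf = counterfactual(w, b, x)
        new_points.append((x_cf, teacher(x_cf)))
    enriched = data + new_points
    return enriched, train(enriched)

rng = random.Random(0)
train_set = make_data(200, corr=0.95, rng=rng)
w, b = train(train_set)
enriched, (w2, b2) = cfkd_round(train_set, w, b)
print("weights before:", w, "after:", w2)
```

The correcting signal comes from disagreement: whenever a counterfactual crosses the student's boundary mostly by changing the spurious feature, the teacher keeps the original label, so the retrained student sees labeled points that contradict its shortcut.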

If this is right

CFKD reweights and enriches undersampled subgroups with newly generated data points rather than only reweighting existing ones.
The method scales to multiple simultaneous spurious correlations without the data fragmentation that occurs when using explicit group labels.
Gains are largest in low-data regimes where spurious correlations are pronounced.
Performance holds across synthetic tasks and an industrial application without requiring any confounder labels.
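Claims about balanced generalization across groups are usually checked with per-group and worst-group accuracy. A minimal sketch follows; it assumes group annotations are available for evaluation only, and the function names are illustrative.

```python
from collections import defaultdict

def group_accuracies(y_true, y_pred, groups):
    """Accuracy computed separately for each group identifier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(y_true, y_pred, groups):
    """Lowest per-group accuracy; insensitive to how large each group is."""
    return min(group_accuracies(y_true, y_pred, groups).values())

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
groups = ["a", "a", "b", "a", "b", "b"]
print(group_accuracies(y_true, y_pred, groups))
print(worst_group_accuracy(y_true, y_pred, groups))
```

In this toy input, group "a" is classified perfectly while group "b" gets one of three right, so the average accuracy of 4/6 hides a worst-group accuracy of 1/3.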

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automating the human annotation step with a learned verifier could reduce the remaining manual effort while preserving the label-free advantage.
The enrichment mechanism might combine naturally with self-supervised pretraining to improve robustness in large foundation models.
Extending the framework to video or multimodal data could test whether the same counterfactual correction scales beyond static images.

Load-bearing premise

The counterfactual explainer must produce sufficiently diverse and semantically meaningful examples that a human annotator can use to identify and correct the model's decision boundaries.

What would settle it

Apply CFKD to an image classifier trained on data with a known single spurious correlation such as background color, then measure whether accuracy on a test set that removes that correlation remains higher than a baseline without the method.
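The measurement protocol in this test can be sketched with two hand-written stand-in classifiers. The assumptions are loud ones: inputs are reduced to a `(shape, background)` pair, `shortcut` represents a Clever Hans model that reads only the background, and `robust` represents a corrected model. None of this is the paper's experimental code.

```python
import random

def sample(n, corr, seed=0):
    """Rows of ((shape, background), label); shape always equals the label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        background = y if rng.random() < corr else 1 - y
        rows.append(((y, background), y))
    return rows

def accuracy(classifier, rows):
    """Fraction of rows on which the classifier returns the label."""
    return sum(classifier(x) == y for x, y in rows) / len(rows)

def shortcut(x):
    """Clever Hans stand-in: reads only the background."""
    return x[1]

def robust(x):
    """Corrected stand-in: reads only the causal feature."""
    return x[0]

biased_test = sample(2000, corr=0.95, seed=1)   # correlation still present
shifted_test = sample(2000, corr=0.5, seed=2)   # correlation removed
for name, clf in (("shortcut", shortcut), ("robust", robust)):
    print(name, accuracy(clf, biased_test), accuracy(clf, shifted_test))
```

The comparison of interest is the second column: a model that still leans on the background should fall toward chance once the correlation is removed, while a corrected model should not.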

Figures

Figures reproduced from arXiv: 2510.17524 by Grégoire Montavon, Heike Antje Marxfeld, Jan Herrmann, Klaus-Robert Müller, Ole Delzer, Sidney Bender.

**Figure 2.** Cartoon depiction of the Follicle dataset and its spurious correlation. If the number of inner Granulosa cells […]

**Figure 3.** Effect of sample size and spurious correlation level on accuracy of benchmarked methods. Results are […]

**Figure 4.** A comparison between the different teachers that were used on the Square dataset with correlation […]

**Figure 5.** Qualitative analysis of CFKD, where for a variety of data points from the datasets considered (top-row), […]

Original abstract

Deep learning models remain vulnerable to spurious correlations, leading to so-called Clever Hans predictors that undermine robustness even in large-scale foundation and self-supervised models. Group distributional robustness methods, such as Deep Feature Reweighting (DFR) rely on explicit group labels to upweight underrepresented subgroups, but face key limitations: (1) group labels are often unavailable, (2) low within-group sample sizes hinder coverage of the subgroup distribution, and (3) performance degrades sharply when multiple spurious correlations fragment the data into even smaller groups. We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step. Unlike DFR, our method not only reweights the undersampled groups, but it also enriches them with new data points. Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups. We demonstrate CFKD's efficacy across five datasets, spanning synthetic tasks to an industrial application, with particularly strong gains in low-data regimes with pronounced spurious correlations. Additionally, we provide an ablation study on the effect of the chosen counterfactual explainer and teacher model, highlighting their impact on robustness.
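The idea of upweighting underrepresented subgroups can be illustrated with plain inverse-frequency group weights. This is a generic sketch, not the DFR procedure itself, and it makes the dependence on group labels explicit: without the `groups` list there is nothing to compute.

```python
from collections import Counter

def group_balanced_weights(groups):
    """Per-sample weights such that every group carries equal total weight."""
    counts = Counter(groups)
    n, n_groups = len(groups), len(counts)
    return [n / (n_groups * counts[g]) for g in groups]

groups = ["majority"] * 8 + ["minority"] * 2
weights = group_balanced_weights(groups)
print(weights[0], weights[-1], sum(weights))
```

Each minority sample is weighted four times as heavily as a majority sample here, but no new minority points are created, which is the gap that the enrichment step in CFKD is meant to fill.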

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CFKD gives a label-free route to fixing multiple spurious correlations in image classifiers by generating counterfactuals for human correction then distilling, but the gains stand or fall on whether those examples actually add useful diversity.


Hi colleague, the main point is that this paper puts forward Counterfactual Knowledge Distillation as a way to reduce Clever Hans behavior in vision models without any group or confounder labels. It generates counterfactual examples, lets a human spot the biased features, and then uses distillation to push the model toward more balanced performance while also adding new points to the undersampled groups. That combination is presented as new relative to label-dependent methods like DFR, and the authors test it on five datasets that range from synthetic cases to a real industrial task, with the biggest reported lifts in low-data regimes that have strong biases. They also run an ablation on the explainer and the teacher model, which is useful for seeing which parts of the pipeline matter most.

The work does a clean job of spelling out why existing approaches break when multiple confounders split the data into tiny subgroups and why enriching the data with new points could help where simple reweighting does not. The procedural framing is straightforward and the human-in-the-loop angle feels practical for settings where balanced labels are expensive to collect.

The softer part is the dependence on the counterfactual explainer actually producing diverse, semantically valid examples that sit outside the original biased manifold. The abstract claims balanced generalization across groups and effective scaling to multiple confounders, yet it gives no numbers on diversity metrics, human usefulness, or statistical controls for the generated points. If the examples remain correlated with the Clever Hans features, the enrichment step adds little and the distillation step cannot deliver the promised robustness. Judging from the summary, the stress test of this assumption looks on target, and the paper would be stronger with explicit checks that the new points are not just recycling the same correlations.

This is the kind of paper that would interest people working on deployable robust vision systems in domains where subgroup labels are scarce. Readers who care about practical fixes for spurious correlations without extra supervision would get something out of the framework and the ablation results. It shows clear engagement with the literature and a coherent procedural approach, so it deserves a serious referee rather than a desk reject. I would send it for review and ask for more quantitative validation on the counterfactual quality and diversity.

Referee Report

2 major / 2 minor

Summary. The paper proposes Counterfactual Knowledge Distillation (CFKD) as a framework to mitigate Clever Hans (spurious correlation) predictors in image classifiers. It generates diverse counterfactual examples via an explainer, enables a human annotator to identify and correct decision boundaries, and performs knowledge distillation to produce a more robust model. Unlike Deep Feature Reweighting (DFR), CFKD claims to require no confounder/group labels, scale effectively to multiple confounders, and enrich undersampled subgroups with new synthetic points rather than merely reweighting. Efficacy is demonstrated across five datasets (synthetic to industrial) with ablations on the explainer and teacher model, with emphasis on low-data regimes.

Significance. If the counterfactuals prove sufficiently diverse and human-useful, CFKD could provide a practical label-free route to distributional robustness that addresses DFR's limitations on label availability, small subgroup sizes, and fragmentation under multiple confounders. The combination of counterfactual generation with distillation and human-in-the-loop correction is a potentially useful direction for enriching training distributions without explicit annotations.

major comments (2)

[Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.
[Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.
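One simple way to quantify the diversity that the second major comment asks about is the mean pairwise cosine distance between embeddings of the generated counterfactuals. The sketch below assumes embeddings are already available as plain lists of floats; the choice of embedding model and of cosine distance are assumptions of this sketch, not something the paper reports.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity_score(embeddings):
    """Mean pairwise cosine distance; 0 means all vectors point the same way."""
    pairs = list(combinations(embeddings, 2))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)

collapsed = [[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]]   # same direction
spread = [[1.0, 0.0], [0.0, 1.0]]                  # orthogonal directions
print(diversity_score(collapsed), diversity_score(spread))
```

A score near zero for a batch of counterfactuals would suggest the explainer keeps producing the same kind of edit, which is the failure mode the comment describes.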

minor comments (2)

[Method section] Clarify the precise interface between the human annotator and the distillation objective (e.g., how corrections are encoded as soft labels or augmented data).
[Discussion or conclusion] Add a limitations paragraph discussing failure modes when the chosen explainer produces low-diversity counterfactuals.
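On the first minor comment, one common way to encode teacher feedback in a distillation objective is cross-entropy between teacher soft labels and the student's softmax output. The sketch below shows that standard form only; whether CFKD encodes annotator corrections this way is not stated in the material summarized here, so treat it as an assumption.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy of the student's softmax against teacher soft labels."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(q)
                for t, q in zip(teacher_probs, student_probs) if t > 0)

teacher_probs = [1.0, 0.0]  # teacher is certain about class 0
print(distillation_loss([2.0, 0.0], teacher_probs))  # student agrees
print(distillation_loss([0.0, 2.0], teacher_probs))  # student disagrees
```

A hard correction from the annotator corresponds to a one-hot `teacher_probs`, while an uncertain judgment could be passed as a softer distribution.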

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. The feedback highlights important areas for strengthening the quantitative support and validation of our claims. We address each major comment point by point below and indicate where the next version of the paper will be revised.


Referee: [Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.

Authors: We acknowledge that the current presentation of results is primarily high-level and would benefit from additional quantitative rigor. In the revised manuscript, we will expand the experimental results section to include: (i) explicit control experiments (e.g., CFKD without the human correction step and without distillation), (ii) statistical significance testing such as bootstrap confidence intervals or paired statistical tests on the performance differences across the five datasets, and (iii) quantitative metrics for counterfactual quality and diversity, including feature-space distance from original samples and a diversity score based on pairwise embedding similarities. These additions will allow clearer attribution of gains to the CFKD framework rather than other factors.

Revision: yes

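The bootstrap confidence intervals mentioned in this response could take the following paired form, where two models are scored on the same test items. This is a generic percentile bootstrap under assumed inputs (lists of 0/1 correctness indicators); the function name and defaults are illustrative and not taken from the paper.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(a) - accuracy(b) on the same test items."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical correctness indicators for two models on 100 shared items.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 70 + [0] * 30
print(paired_bootstrap_ci(model_a, model_b))
```

If the interval excludes zero, the accuracy gap is unlikely to be a resampling artifact of this particular test set; it says nothing about variation across training seeds, which would need separate runs.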
Referee: [Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.

Authors: We agree that the load-bearing assumption regarding counterfactual quality requires stronger empirical support. Although the explainer ablation demonstrates performance variation, we did not report explicit metrics for diversity, semantic validity, or human usefulness. In the revision, we will add these to the framework description and ablation study: measures of deviation from spurious features (e.g., correlation analysis with Clever Hans attributes), semantic validity via reconstruction or classifier consistency checks, and results from a small-scale human study assessing usefulness for boundary correction. This will directly address whether the generated points provide new information beyond the original model.

Revision: yes
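The correlation analysis with Clever Hans attributes proposed here can be sketched as a before-and-after check on the training set. The example assumes binary labels and a binary spurious attribute that is annotated for auditing purposes only; the data values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; for binary inputs this is the phi coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Original data: the spurious attribute perfectly tracks the label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
spurious = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical annotated counterexamples that break the pattern.
cf_labels = [1, 1, 0, 0]
cf_spurious = [0, 0, 1, 1]

print(pearson(labels, spurious))
print(pearson(labels + cf_labels, spurious + cf_spurious))
```

In this made-up example the label-attribute correlation drops from 1 to 1/3 after enrichment. If the generated points left the correlation unchanged, that would indicate they are recycling the original shortcut.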

Circularity Check

0 steps flagged

No circularity: procedural framework defined independently of fitted quantities or self-citation chains

full rationale

The paper describes CFKD as a sequence of steps—counterfactual generation via an explainer, human annotation to identify spurious features, and subsequent knowledge distillation—without any equations, parameter fits, or derivations that reduce the reported robustness gains to quantities defined on the same data or to prior self-citations. The central claims rest on empirical demonstration across datasets and an ablation on the explainer choice rather than on a mathematical identity or load-bearing self-reference. This is a standard empirical methods paper whose performance claims are externally falsifiable via the reported experiments and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that generated counterfactuals are useful for human correction and that distillation transfers the corrected behavior; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

Domain assumption: counterfactual generation produces examples that differ from the original only along the spurious feature while preserving the true label.
Required for the human annotation step to isolate and correct Clever Hans behavior.

pith-pipeline@v0.9.0 · 5780 in / 1262 out tokens · 38866 ms · 2026-05-18T05:46:30.970574+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them
cs.LG 2026-04 unverdicted novelty 4.0

XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper

[1] O. Pfungst, Clever Hans: (the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology, Holt, Rinehart and Winston, 1911
[2] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K.-R. Müller, Unmasking clever hans predictors and assessing what machines really learn, Nature communications 10 (1) (2019) 1096
[3] Organisation for Economic Co-operation and Development, Recommendation of the Council on Artificial Intelligence, originally adopted in 2019, amended in 2024 (2024). URL https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
[4] J. Kauffmann, J. Dippel, L. Ruff, W. Samek, K.-R. Müller, G. Montavon, Explainable ai reveals clever hans effects in unsupervised learning models, Nature Machine Intelligence 7 (2025) 412–422
[5] J. Kömen, E. D. de Jong, J. Hense, H. Marienwald, J. Dippel, P. Naumann, E. Marcus, L. Ruff, M. Alber, J. Teuwen, et al., Towards robust foundation models for digital pathology, arXiv preprint arXiv:2507.17845 (2025)
[6] H. Zhang, M. Cissé, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2018
[7] B. Trabucco, K. Doherty, M. Gurinas, R. Salakhutdinov, Effective data augmentation with diffusion models, ICLR (2024)
[8] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2020
[9] E. Z. Liu, B. Haghgoo, A. S. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, C. Finn, Just train twice: Improving group robustness without training group information, in: International Conference on Machine Learning, PMLR, 2021, pp. 6781–6792
[10] P. Kirichenko, P. Izmailov, A. G. Wilson, Last layer re-training is sufficient for robustness to spurious correlations, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2023
[11] C. J. Anders, L. Weber, D. Neumann, W. Samek, K.-R. Müller, S. Lapuschkin, Finding and removing clever hans: using explanation methods to debug and improve deep models, Information Fusion 77 (2022) 261–295
[12] M. Dreyer, F. Pahde, C. J. Anders, W. Samek, S. Lapuschkin, From hope to safety: Unlearning biases of deep models via gradient penalization in latent space, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 21046–21054
[13] F. Pahde, M. Dreyer, L. Weber, M. Weckbecker, C. J. Anders, T. Wiegand, W. Samek, S. Lapuschkin, Navigating neural space: Revisiting concept activation vectors to overcome directional divergence, ICLR (2025)
[14] L. Linhardt, K.-R. Müller, G. Montavon, Preemptively pruning clever-hans strategies in deep neural networks, Information Fusion 103 (2024) 102094
[15] S. Bender, C. J. Anders, P. Chormai, H. Marxfeld, J. Herrmann, G. Montavon, Towards fixing clever-hans predictors with counterfactual knowledge distillation, in: ICCV (Workshops), IEEE, 2023, pp. 2599–2607
[16] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with cutout, in: ICLR workshop, 2017
[17] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: Regularization strategy to train strong classifiers with localizable features, in: IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6023–6032
[18] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 702–703
[19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le, Autoaugment: Learning augmentation policies from data, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113–123
[20] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, B. Lakshminarayanan, Augmix: A simple data processing method to improve robustness and uncertainty, in: International Conference on Learning Representations (ICLR), 2020
[21] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, S. Savarese, Generalizing to unseen domains via adversarial data augmentation, in: Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 5334–5344
[22] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Brendel, Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness, in: International Conference on Learning Representations (ICLR), 2019
[23] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H.-G. Luigs, A.-K. Mahlein, K. Kersting, Making deep neural networks right for the right scientific reasons by interacting with their explanations, Nature Machine Intelligence 2 (8) (2020) 476–486
[24] S. Teso, K. Kersting, Explanatory interactive machine learning, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 239–245
[25] A. S. Ross, M. C. Hughes, F. Doshi-Velez, Right for the right reasons: Training differentiable models by constraining their explanations, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (2017)
[26] D. Bareeva, M. Dreyer, F. Pahde, W. Samek, S. Lapuschkin, Reactive model correction: Mitigating harm to task-relevant features via conditional bias suppression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3532–3541
[27] D. Kaushik, E. Hovy, Z. C. Lipton, Learning the difference that makes a difference with counterfactually-augmented data, ICLR (2020)
[28] X. Deng, W. Wang, F. Feng, H. Zhang, X. He, Y. Liao, Counterfactual active learning for out-of-distribution generalization, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 11362–11377
[29] A. Balashankar, X. Wang, Y. Qin, B. Packer, N. Thain, J. Chen, E. H. Chi, A. Beutel, Improving classifier robustness through active generation of pairwise counterfactuals, EMNLP (2023)
[30] K. Margatina, G. Vernikos, L. Barrault, N. Aletras, Active learning by acquiring contrastive examples, EMNLP (2021)
[31] A.-K. Dombrowski, J. E. Gerken, K.-R. Müller, P. Kessel, Diffeomorphic counterfactuals with generative models, IEEE Transactions on Pattern Recognition and Machine Intelligence (2024)
[32] P. Rodriguez, M. Caccia, A. Lacoste, L. Zamparo, I. Laradji, L. Charlin, D. Vazquez, Beyond trivial counterfactual explanations with diverse valuable explanations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1056–1065
[33] G. Jeanneret, L. Simon, F. Jurie, Adversarial counterfactual visual explanations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16425–16435
[34] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in neural information processing systems 33 (2020) 6840–6851
[35] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, L. Van Gool, Repaint: Inpainting using denoising diffusion probabilistic models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11461–11471
[36] M. Augustin, V. Boreiko, F. Croce, M. Hein, Diffusion visual counterfactual explanations, Advances in Neural Information Processing Systems 35 (2022) 364–377
[37] M. Augustin, Y. Neuhaus, M. Hein, Dig-in: Diffusion guidance for investigating networks - uncovering classifier differences neuron visualisations and visual counterfactual explanations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11093–11103
[38] G. Jeanneret, L. Simon, F. Jurie, Diffusion models for counterfactual explanations, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp. 858–876
[39] N. Weng, P. Pegios, E. Petersen, A. Feragen, S. Bigdeli, Fast diffusion-based counterfactuals for shortcut removal and generation, in: European Conference on Computer Vision, Springer, 2025, pp. 338–357
[40] P. Dhariwal, A. Q. Nichol, Diffusion models beat gans on image synthesis, in: Advances in Neural Information Processing Systems 34 (NeurIPS), 2021, pp. 8780–8794
[41] T. D. Ha, S. Bender, Diffusion counterfactuals for image regressors, 3rd Annual Conference for explainable Machine Learning (2025)
[42] S. Bender, J. Herrmann, K.-R. Müller, G. Montavon, Towards desiderata-driven design of visual counterfactual explainers, arXiv preprint arXiv:2506.14698 (2025)
[43] K. L. Hermann, H. Mobahi, T. Fel, M. C. Mozer, On the foundations of shortcut learning, International Conference On Learning Representations 2024 (2023)
[44] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015
[45] P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al., From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge, IEEE transactions on medical imaging 38 (2) (2018) 550–560
[46] Organisation for Economic Co-operation and Development, Test No. 443: Extended One-Generation Reproductive Toxicity Study, OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris (2025). URL https://doi.org/10.1787/9789264185371-en
[47] R. J. Chen, T. Ding, M. Y. Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang, D. Shao, M. Shaban, et al., Towards a general-purpose foundation model for computational pathology, Nature medicine 30 (3) (2024) 850–862

[1] [1]

Pfungst, Clever Hans: (the horse of Mr

O. Pfungst, Clever Hans: (the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology, Holt, Rinehart and Winston, 1911

work page 1911

[2] [2]

Lapuschkin, S

S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K.-R. Müller, Unmasking clever hans predictors and assessing what machines really learn, Nature communications 10 (1) (2019) 1096

work page 2019

[3] [3]

URL https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449

Organisation for Economic Co-operation and Development, Recommendation of the Council on Artificial Intelligence, originally adopted in 2019, amended in 2024 (2024). URL https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449

work page 2019

[4] [4]

Kauffmann, J

J. Kauffmann, J. Dippel, L. Ruff, W. Samek, K.-R. Müller, G. Montavon, Explainable ai reveals clever hans effects in unsupervised learning models, Nature Machine Intelligence 7 (2025) 412–422

work page 2025

[5] [5]

Kömen, E

J. Kömen, E. D. de Jong, J. Hense, H. Marienwald, J. Dippel, P. Naumann, E. Marcus, L. Ruff, M. Alber, J. Teuwen, et al., Towards robust foundation models for digital pathology, arXiv preprint arXiv:2507.17845 (2025)

work page arXiv 2025

[6] [6]

Zhang, M

H. Zhang, M. Cissé, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2018

work page 2018

[7] [7]

Trabucco, K

B. Trabucco, K. Doherty, M. Gurinas, R. Salakhutdinov, Effective data augmentation with diffusion models, ICLR (2024)

work page 2024

[8] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2020

[9] E. Z. Liu, B. Haghgoo, A. S. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, C. Finn, Just train twice: Improving group robustness without training group information, in: International Conference on Machine Learning, PMLR, 2021, pp. 6781–6792

[10] P. Kirichenko, P. Izmailov, A. G. Wilson, Last layer re-training is sufficient for robustness to spurious correlations, in: International Conference on Learning Representations (ICLR), OpenReview.net, 2023

[11] C. J. Anders, L. Weber, D. Neumann, W. Samek, K.-R. Müller, S. Lapuschkin, Finding and removing clever hans: using explanation methods to debug and improve deep models, Information Fusion 77 (2022) 261–295

[12] M. Dreyer, F. Pahde, C. J. Anders, W. Samek, S. Lapuschkin, From hope to safety: Unlearning biases of deep models via gradient penalization in latent space, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 21046–21054

[13] F. Pahde, M. Dreyer, L. Weber, M. Weckbecker, C. J. Anders, T. Wiegand, W. Samek, S. Lapuschkin, Navigating neural space: Revisiting concept activation vectors to overcome directional divergence, ICLR (2025)

[14] L. Linhardt, K.-R. Müller, G. Montavon, Preemptively pruning clever-hans strategies in deep neural networks, Information Fusion 103 (2024) 102094

[15] S. Bender, C. J. Anders, P. Chormai, H. Marxfeld, J. Herrmann, G. Montavon, Towards fixing clever-hans predictors with counterfactual knowledge distillation, in: ICCV (Workshops), IEEE, 2023, pp. 2599–2607

[16] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with cutout, in: ICLR workshop, 2017

[17] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cutmix: Regularization strategy to train strong classifiers with localizable features, in: IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6023–6032

[18] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 702–703

[19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le, Autoaugment: Learning augmentation policies from data, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113–123

[20] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, B. Lakshminarayanan, Augmix: A simple data processing method to improve robustness and uncertainty, in: International Conference on Learning Representations (ICLR), 2020

[21] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, S. Savarese, Generalizing to unseen domains via adversarial data augmentation, in: Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 5334–5344

[22] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Brendel, Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness, in: International Conference on Learning Representations (ICLR), 2019

[23] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H.-G. Luigs, A.-K. Mahlein, K. Kersting, Making deep neural networks right for the right scientific reasons by interacting with their explanations, Nature Machine Intelligence 2 (8) (2020) 476–486

[24] S. Teso, K. Kersting, Explanatory interactive machine learning, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 239–245

[25] A. S. Ross, M. C. Hughes, F. Doshi-Velez, Right for the right reasons: Training differentiable models by constraining their explanations, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (2017)

[26] D. Bareeva, M. Dreyer, F. Pahde, W. Samek, S. Lapuschkin, Reactive model correction: Mitigating harm to task-relevant features via conditional bias suppression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3532–3541

[27] D. Kaushik, E. Hovy, Z. C. Lipton, Learning the difference that makes a difference with counterfactually-augmented data, ICLR (2020)

[28] X. Deng, W. Wang, F. Feng, H. Zhang, X. He, Y. Liao, Counterfactual active learning for out-of-distribution generalization, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 11362–11377

[29] A. Balashankar, X. Wang, Y. Qin, B. Packer, N. Thain, J. Chen, E. H. Chi, A. Beutel, Improving classifier robustness through active generation of pairwise counterfactuals, EMNLP (2023)

[30] K. Margatina, G. Vernikos, L. Barrault, N. Aletras, Active learning by acquiring contrastive examples, EMNLP (2021)

[31] A.-K. Dombrowski, J. E. Gerken, K.-R. Müller, P. Kessel, Diffeomorphic counterfactuals with generative models, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

[32] P. Rodriguez, M. Caccia, A. Lacoste, L. Zamparo, I. Laradji, L. Charlin, D. Vazquez, Beyond trivial counterfactual explanations with diverse valuable explanations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1056–1065

[33] G. Jeanneret, L. Simon, F. Jurie, Adversarial counterfactual visual explanations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16425–16435

[34] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in neural information processing systems 33 (2020) 6840–6851

[35] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, L. Van Gool, Repaint: Inpainting using denoising diffusion probabilistic models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11461–11471

[36] M. Augustin, V. Boreiko, F. Croce, M. Hein, Diffusion visual counterfactual explanations, Advances in Neural Information Processing Systems 35 (2022) 364–377

[37] M. Augustin, Y. Neuhaus, M. Hein, Dig-in: Diffusion guidance for investigating networks - uncovering classifier differences neuron visualisations and visual counterfactual explanations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11093–11103

[38] G. Jeanneret, L. Simon, F. Jurie, Diffusion models for counterfactual explanations, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp. 858–876

[39] N. Weng, P. Pegios, E. Petersen, A. Feragen, S. Bigdeli, Fast diffusion-based counterfactuals for shortcut removal and generation, in: European Conference on Computer Vision, Springer, 2025, pp. 338–357

[40] P. Dhariwal, A. Q. Nichol, Diffusion models beat gans on image synthesis, in: Advances in Neural Information Processing Systems 34 (NeurIPS), 2021, pp. 8780–8794

[41] T. D. Ha, S. Bender, Diffusion counterfactuals for image regressors, 3rd Annual Conference for explainable Machine Learning (2025)

[42] S. Bender, J. Herrmann, K.-R. Müller, G. Montavon, Towards desiderata-driven design of visual counterfactual explainers, arXiv preprint arXiv:2506.14698 (2025)

[43] K. L. Hermann, H. Mobahi, T. Fel, M. C. Mozer, On the foundations of shortcut learning, International Conference On Learning Representations 2024 (2023)

[44] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015

[45] P. Bandi, O. Geessink, Q. Manson, M. Van Dijk, M. Balkenhol, M. Hermsen, B. E. Bejnordi, B. Lee, K. Paeng, A. Zhong, et al., From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge, IEEE transactions on medical imaging 38 (2) (2018) 550–560

[46] Organisation for Economic Co-operation and Development, Test No. 443: Extended One-Generation Reproductive Toxicity Study, OECD Guidelines for the Testing of Chemicals, Section 4, OECD Publishing, Paris (2025). URL https://doi.org/10.1787/9789264185371-en

[47] R. J. Chen, T. Ding, M. Y. Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang, D. Shao, M. Shaban, et al., Towards a general-purpose foundation model for computational pathology, Nature medicine 30 (3) (2024) 850–862