arxiv: 2510.17524 · v1 · submitted 2025-10-20 · 💻 cs.LG

Mitigating Clever Hans Strategies in Image Classifiers through Generating Counterexamples

Pith reviewed 2026-05-18 05:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords counterfactual knowledge distillation · spurious correlations · Clever Hans predictors · image classification robustness · knowledge distillation · counterfactual examples · group robustness

The pith

Counterfactual Knowledge Distillation corrects Clever Hans predictors in image classifiers by generating annotated counterexamples without confounder labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning image classifiers often rely on spurious correlations that create Clever Hans predictors, which fail under distribution shifts. Methods like Deep Feature Reweighting require explicit group labels to upweight underrepresented subgroups but struggle when labels are missing, sample sizes are low, or multiple confounders split the data further. The paper proposes Counterfactual Knowledge Distillation, which generates diverse counterfactual examples so a human annotator can explore and adjust the model's decision boundaries. These annotated examples then guide a knowledge distillation step that reweights and enriches the subgroups. The approach scales to multiple confounders and performs well in low-data regimes across synthetic and real datasets.

Core claim

CFKD sidesteps the need for confounder labels by generating diverse counterfactuals that enable a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step, reweighting undersampled groups while also enriching them with new data points to yield balanced generalization across groups.

What carries the argument

Counterfactual Knowledge Distillation (CFKD), which combines a counterfactual explainer to produce new examples with human annotation followed by distillation to transfer corrected decision boundaries into the final model.
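
The generate, annotate, retrain loop can be illustrated with a deliberately tiny sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: inputs are two numbers (a weak "core" feature and a strong "spurious" feature) instead of images, the student is a linear classifier, `counterfactual` is a toy explainer that edits only the feature the student weights most, `teacher` stands in for the human annotator, and the distillation step is reduced to retraining on the teacher-labelled counterfactuals.

```python
import math

def predict(w, x):
    """Probability of class 1 from a linear student; x = (core, spurious)."""
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=300, lr=0.5):
    """Full-batch gradient descent on the logistic loss."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w

def counterfactual(w, x):
    """Toy explainer (assumption): edit only the feature the student weights
    most, mirroring the logit so that the student's prediction flips."""
    j = 0 if abs(w[0]) >= abs(w[1]) else 1
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    x_cf = list(x)
    x_cf[j] -= 2.0 * z / w[j]
    return tuple(x_cf)

def teacher(x):
    """Stand-in for the human annotator: labels by the core feature only."""
    return 1 if x[0] > 0 else 0

def accuracy(w, data):
    return sum((predict(w, x) > 0.5) == bool(y) for x, y in data) / len(data)

cores = (0.2, 0.3, 0.4)
# Training set: the spurious feature always agrees with the label.
train_set = [((c, 1.0), 1) for c in cores] + [((-c, -1.0), 0) for c in cores]
# Shifted test set: the spurious feature always disagrees with the label.
shifted = [((c, -1.0), 1) for c in cores] + [((-c, 1.0), 0) for c in cores]

student = train(train_set)
# Generate counterfactuals, let the teacher annotate them, then retrain.
cfs = [counterfactual(student, x) for x, _ in train_set]
annotated = [(x_cf, teacher(x_cf)) for x_cf in cfs]
corrected = train(train_set + annotated)

acc_before = accuracy(student, shifted)
acc_after = accuracy(corrected, shifted)
print(acc_before, acc_after)
```

In this toy setting the student leans on the spurious feature, so its counterfactuals change only that feature. The teacher then reports that the true class did not change, and those annotated counterexamples are what break the correlation during retraining. The sketch says nothing about in-distribution accuracy or about how well real image explainers behave.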

If this is right

  • CFKD reweights and enriches undersampled subgroups with newly generated data points rather than only reweighting existing ones.
  • The method scales to multiple simultaneous spurious correlations without the data fragmentation that occurs when using explicit group labels.
  • Gains are largest in low-data regimes where spurious correlations are pronounced.
  • Performance holds across synthetic tasks and an industrial application without requiring any confounder labels.
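
The fragmentation point in the second bullet can be made concrete with a back-of-the-envelope calculation. The model below is an assumption, not something taken from the paper: classes are balanced, each confounder is binary, and each confounder independently agrees with the label with probability `rho`. The function names are hypothetical.

```python
def n_groups(n_confounders, n_classes=2):
    """Number of (label, confounder values) groups with binary confounders."""
    return n_classes * 2 ** n_confounders

def rarest_group_size(n_samples, rho, n_confounders, n_classes=2):
    """Expected size of the group in which every confounder disagrees
    with the label, under the independence assumption stated above."""
    per_class = n_samples / n_classes
    return per_class * (1.0 - rho) ** n_confounders

for k in (1, 2, 3):
    print(k, n_groups(k), round(rarest_group_size(1000, 0.9, k), 2))
```

Under these assumptions, with 1000 samples and `rho = 0.9`, the rarest group shrinks from about 50 expected samples with one confounder to about 5 with two and less than one with three. That is the regime in which reweighting existing samples has little left to reweight.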

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automating the human annotation step with a learned verifier could reduce the remaining manual effort while preserving the label-free advantage.
  • The enrichment mechanism might combine naturally with self-supervised pretraining to improve robustness in large foundation models.
  • Extending the framework to video or multimodal data could test whether the same counterfactual correction scales beyond static images.

Load-bearing premise

The counterfactual explainer must produce sufficiently diverse and semantically meaningful examples that a human annotator can use to identify and correct the model's decision boundaries.

What would settle it

Apply CFKD to an image classifier trained on data with a known single spurious correlation such as background color, then measure whether accuracy on a test set that removes that correlation remains higher than a baseline without the method.
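
One way to score such a test is to report accuracy per (label, confounder) group alongside the average, since a Clever Hans predictor can look acceptable on average while failing completely on the groups where the correlation is broken. The sketch below assumes that confounder values are known for the test set only. The record layout and function names are hypothetical, and the records are made-up placeholders, not results from the paper.

```python
from collections import defaultdict

def group_accuracies(records):
    """records: list of (true_label, confounder_value, predicted_label)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for y, c, y_hat in records:
        totals[(y, c)] += 1
        hits[(y, c)] += int(y == y_hat)
    return {g: hits[g] / totals[g] for g in totals}

def summarize(records):
    """Average accuracy plus the accuracy of the worst-performing group."""
    per_group = group_accuracies(records)
    average = sum(int(y == y_hat) for y, _, y_hat in records) / len(records)
    return {
        "average": average,
        "worst_group": min(per_group.values()),
        "per_group": per_group,
    }

# Placeholder test set with five samples per group, scored against a
# classifier that simply outputs the confounder value (e.g. background color).
records = [(y, c, c) for y in (0, 1) for c in (0, 1) for _ in range(5)]
report = summarize(records)
print(report["average"], report["worst_group"])
```

For this placeholder classifier the average accuracy is 0.5 while the worst-group accuracy is 0.0. Comparing both numbers before and after applying a correction method is what the proposed experiment amounts to.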

Figures

Figures reproduced from arXiv: 2510.17524 by Grégoire Montavon, Heike Antje Marxfeld, Jan Herrmann, Klaus-Robert Müller, Ole Delzer, Sidney Bender.

Figure 1: Proposed CFKD method compared to a subgroup reweighting baseline (e.g. DFR). In the shown ‘blond’ vs. ...
Figure 2: Cartoon depiction of the Follicle dataset and its spurious correlation. If the number of inner Granulosa cells ...
Figure 3: Effect of sample size and spurious correlation level on accuracy of benchmarked methods. Results are ...
Figure 4: A comparison between the different teachers that were used on the Square dataset with correlation ...
Figure 5: Qualitative analysis of CFKD, where for a variety of data points from the datasets considered (top-row), ...
Original abstract

Deep learning models remain vulnerable to spurious correlations, leading to so-called Clever Hans predictors that undermine robustness even in large-scale foundation and self-supervised models. Group distributional robustness methods, such as Deep Feature Reweighting (DFR), rely on explicit group labels to upweight underrepresented subgroups, but face key limitations: (1) group labels are often unavailable, (2) low within-group sample sizes hinder coverage of the subgroup distribution, and (3) performance degrades sharply when multiple spurious correlations fragment the data into even smaller groups. We propose Counterfactual Knowledge Distillation (CFKD), a framework that sidesteps these issues by generating diverse counterfactuals, enabling a human annotator to efficiently explore and correct the model's decision boundaries through a knowledge distillation step. Unlike DFR, our method not only reweights the undersampled groups, but it also enriches them with new data points. Our method does not require any confounder labels, achieves effective scaling to multiple confounders, and yields balanced generalization across groups. We demonstrate CFKD's efficacy across five datasets, spanning synthetic tasks to an industrial application, with particularly strong gains in low-data regimes with pronounced spurious correlations. Additionally, we provide an ablation study on the effect of the chosen counterfactual explainer and teacher model, highlighting their impact on robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Counterfactual Knowledge Distillation (CFKD) as a framework to mitigate Clever Hans (spurious correlation) predictors in image classifiers. It generates diverse counterfactual examples via an explainer, enables a human annotator to identify and correct decision boundaries, and performs knowledge distillation to produce a more robust model. Unlike Deep Feature Reweighting (DFR), CFKD claims to require no confounder/group labels, scale effectively to multiple confounders, and enrich undersampled subgroups with new synthetic points rather than merely reweighting. Efficacy is demonstrated across five datasets (synthetic to industrial) with ablations on the explainer and teacher model, with emphasis on low-data regimes.

Significance. If the counterfactuals prove sufficiently diverse and human-useful, CFKD could provide a practical label-free route to distributional robustness that addresses DFR's limitations on label availability, small subgroup sizes, and fragmentation under multiple confounders. The combination of counterfactual generation with distillation and human-in-the-loop correction is a potentially useful direction for enriching training distributions without explicit annotations.

major comments (2)
  1. [Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.
  2. [Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.
minor comments (2)
  1. [Method section] Clarify the precise interface between the human annotator and the distillation objective (e.g., how corrections are encoded as soft labels or augmented data).
  2. [Discussion or conclusion] Add a limitations paragraph discussing failure modes when the chosen explainer produces low-diversity counterfactuals.
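
To make the first minor comment concrete, here is one possible encoding of annotator corrections as soft labels, paired with a generic distillation-style cross-entropy. Nothing on this page states that the paper uses this objective. `soft_label`, its `confidence` parameter, and `distillation_loss` are hypothetical names chosen for the sketch.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label(annotated_class, n_classes, confidence=0.9):
    """Hypothetical encoding: put `confidence` on the annotated class and
    spread the remainder uniformly over the other classes."""
    rest = (1.0 - confidence) / (n_classes - 1)
    return [confidence if k == annotated_class else rest for k in range(n_classes)]

def distillation_loss(student_logits, target, temperature=1.0):
    """Cross-entropy between the soft target and the student's softmax."""
    probs = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# The annotator says a counterfactual belongs to class 1.
target = soft_label(annotated_class=1, n_classes=2)
loss_agree = distillation_loss([-2.0, 2.0], target)     # student agrees
loss_disagree = distillation_loss([2.0, -2.0], target)  # student disagrees
print(loss_agree < loss_disagree)
```

The alternative the comment mentions, adding the annotated counterfactual to the training set with a hard label, corresponds to `confidence=1.0` in this encoding.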

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. The feedback highlights important areas for strengthening the quantitative support and validation of our claims. We address each major comment point by point below, indicating where revisions will be made in the next version of the paper.

  1. Referee: [Abstract and experimental results section] Abstract and experimental results: gains are reported across five datasets and an explainer/teacher ablation, yet no quantitative details appear on control experiments, statistical significance testing, or metrics confirming counterfactual quality/diversity. This leaves the central claims of balanced generalization and scaling to multiple confounders resting on high-level outcomes whose attribution to CFKD versus other factors cannot be assessed.

    Authors: We acknowledge that the current presentation of results is primarily high-level and would benefit from additional quantitative rigor. In the revised manuscript, we will expand the experimental results section to include: (i) explicit control experiments (e.g., CFKD without the human correction step and without distillation), (ii) statistical significance testing such as bootstrap confidence intervals or paired statistical tests on the performance differences across the five datasets, and (iii) quantitative metrics for counterfactual quality and diversity, including feature-space distance from original samples and a diversity score based on pairwise embedding similarities. These additions will allow clearer attribution of gains to the CFKD framework rather than other factors. revision: yes

  2. Referee: [Framework description and explainer ablation] Framework description (implicit in the method overview and explainer ablation): the claim that CFKD avoids confounder labels while enriching groups rests on the counterfactual explainer producing examples sufficiently outside the model's biased manifold for reliable human correction. No diversity, semantic validity, or human-usefulness metrics are referenced, which is load-bearing; if the generated points remain correlated with original Clever Hans features, the distillation step adds no new information and performance would not exceed the original model.

    Authors: We agree that the load-bearing assumption regarding counterfactual quality requires stronger empirical support. Although the explainer ablation demonstrates performance variation, we did not report explicit metrics for diversity, semantic validity, or human usefulness. In the revision, we will add these to the framework description and ablation study: measures of deviation from spurious features (e.g., correlation analysis with Clever Hans attributes), semantic validity via reconstruction or classifier consistency checks, and results from a small-scale human study assessing usefulness for boundary correction. This will directly address whether the generated points provide new information beyond the original model. revision: yes
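
Two of the additions promised in the responses above, bootstrap confidence intervals and a diversity score based on pairwise embedding similarities, could look roughly like the following. This is a sketch under assumptions. The percentile bootstrap and the "one minus mean pairwise cosine similarity" score are common choices, not ones confirmed by the authors. The function names are hypothetical, and the per-sample correctness values are made-up placeholders, not results from the paper.

```python
import math
import random

def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity(embeddings):
    """One minus the mean pairwise cosine similarity (higher = more diverse)."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_sim = sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
    return 1.0 - mean_sim

# Placeholder per-sample correctness (1 = correct) on the same test samples.
method_hits = [1] * 90 + [0] * 10
baseline_hits = [1] * 70 + [0] * 30
diffs = [a - b for a, b in zip(method_hits, baseline_hits)]
lo, hi = bootstrap_ci(diffs)
print(lo, hi)
print(diversity([(1.0, 0.0), (0.0, 1.0)]))
```

An interval for the paired difference that excludes zero supports attributing the gain to the method rather than to sampling noise. A diversity score near zero would flag counterfactuals that are near-duplicates in embedding space.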

Circularity Check

0 steps flagged

No circularity: procedural framework defined independently of fitted quantities or self-citation chains

The paper describes CFKD as a sequence of steps—counterfactual generation via an explainer, human annotation to identify spurious features, and subsequent knowledge distillation—without any equations, parameter fits, or derivations that reduce the reported robustness gains to quantities defined on the same data or to prior self-citations. The central claims rest on empirical demonstration across datasets and an ablation on the explainer choice rather than on a mathematical identity or load-bearing self-reference. This is a standard empirical methods paper whose performance claims are externally falsifiable via the reported experiments and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that generated counterfactuals are useful for human correction and that distillation transfers the corrected behavior; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Counterfactual generation produces examples that differ from the original only along the spurious feature while preserving the true label.
    Required for the human annotation step to isolate and correct Clever Hans behavior.

pith-pipeline@v0.9.0 · 5780 in / 1262 out tokens · 38866 ms · 2026-05-18T05:46:30.970574+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

    cs.LG 2026-04 unverdicted novelty 4.0

    XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...
