pith. sign in

arxiv: 2606.30498 · v1 · pith:BY3IL3R6new · submitted 2026-06-29 · 💻 cs.CV · cs.AI

On the Faithfulness of Post-Hoc Concept Bottleneck Models

Pith reviewed 2026-06-30 06:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords post-hoc concept bottleneck modelsconcept faithfulnesscovariate shiftlabel noisevision-language modelsinterpretabilitydeep learning evaluation
0
0 comments X

The pith

Covariate shifts and label noise make post-hoc concept bottleneck models learn unfaithful projections even at high accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Post-hoc concept bottleneck models project latent features from deep networks onto human-interpretable concepts, either by training on auxiliary datasets or by using labels from vision-language models. The paper establishes that these projections become unfaithful to the target concepts under covariate shift between auxiliary and target data, and supplies an explicit upper bound on the error that shift produces. It further shows that systematic noise in the surrogate labels supplied by vision-language models creates a second, independent source of unfaithfulness. Because overall task accuracy can stay high for projections that are semantically meaningless, the authors introduce new metrics that measure the alignment of the learned projection with the intended concepts directly, without reference to downstream accuracy. Experiments on both synthetic and real benchmarks confirm that the new metrics flag failures that accuracy-based checks overlook.

Core claim

Post-hoc concept projections learned from auxiliary data become unfaithful under covariate shift, with an upper bound given on the resulting error, while surrogate labels from vision-language models introduce systematic unfaithfulness; new metrics that evaluate the projections themselves separate faithfulness from task accuracy.

What carries the argument

Novel metrics that directly quantify faithfulness of the learned concept projection independent of predictive accuracy on the target task.

If this is right

  • Accuracy alone is insufficient to certify that a post-hoc CBM has learned semantically meaningful concepts.
  • Covariate shift between auxiliary and target distributions produces a bounded but nonzero error in concept faithfulness.
  • Systematic label noise from vision-language models produces unfaithful projections even when target accuracy remains competitive.
  • The new metrics detect unfaithful projections on both synthetic and real-world data where accuracy metrics do not.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners may need to collect or simulate target-distribution concept labels rather than relying exclusively on auxiliary sources.
  • The error bound could be tightened or extended to other distribution mismatches such as label shift or domain gap.
  • If the metrics generalize, they offer a template for auditing other post-hoc interpretability techniques that also rely on auxiliary supervision.

Load-bearing premise

The new metrics measure concept faithfulness without inheriting biases from the auxiliary data or vision-language model labels that create the unfaithfulness they are meant to detect.

What would settle it

A controlled experiment in which a known unfaithful projection (either from a large covariate shift or from noisy VLM labels) receives a high faithfulness score from the new metrics while a faithful projection receives a low score.

Figures

Figures reproduced from arXiv: 2606.30498 by Jan Blunk, Joachim Denzler, Julia Niebling, Laines Schmalwasser, Niklas Penzel.

Figure 1
Figure 1. Figure 1: In the post-hoc CBM setup [49, 63, 71], inputs are mapped to activations using some frozen backbone model f. Then, a learned concept projection πθ extracts concepts before a final classifier h predicts a class based on these concepts. Unfortunately, concept labels in the downstream task domain are often unavailable. Thus, ○1 auxiliary concept sets (e.g., Broden [7]) or ○2 surrogate VLM labels are used [49,… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the error orthogonality condition in a 2D sample space. The axes represent two samples r and s from the training set of πθ. (a) A faithful model is learned if the surrogate label errors of these samples (∆k) are orthogonal to their activations (αj ), resulting in a zero projection. (b) Non-orthogonal errors have a non-zero projection onto the feature vector, resulting in a non-zero gradien… view at source ↗
Figure 3
Figure 3. Figure 3: Samples of the concept “striped” for each visual set used. In Figure 3a, there is exactly one object with the concept. Figure 3b shows a sample from the backbone domain where all objects share the concept. Figure 3c shows a sample from the training domain and Figure 3d shows a complete out-of-domain example. CIFAR-10/100 [33] using CLIP ResNet-50 (RN50) [13] as backbone. Additionally, we consider the synth… view at source ↗
Figure 4
Figure 4. Figure 4: High downstream accuracy does not imply faithfulness. The dashed line shows the classifier’s performance without a bottleneck layer. The downstream classification accuracy of a post-hoc CBM increases with the number of random concept vectors. Ad￾ditionally, various post-hoc CBMs that use meaningful concept sources only marginally outperform random concepts. Hence, downstream performance does not indicate f… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of the estimated H∆H-divergence as a proxy for the upper bound of the task domain error. We pairwise compare the four probing datasets for the synthetic Elements downstream task [48], highlighting the actual downstream task in our setup. Note that the colormaps in (a) and (b) use different scales. We find that stronger covariate shifts lead to increased unfaithfulness, whereas lower divergenc… view at source ↗
Figure 6
Figure 6. Figure 6: Systematic vs. Unsystematic Surrogate Errors. Unfaithful projections occur when surrogate errors are both large (|∆|k ) and highly correlated (|ρ|k ) with activations αj . (a) Surrogates from Grounding DINO [39] exhibit systematic, correlated errors, causing unfaithful projections. (b) Random unsystematic label noise produces no correlation. If the VLM perfectly predicts a concept (|∆|k = 0), correlation i… view at source ↗
Figure 7
Figure 7. Figure 7: Geometric intuition for surrogate faithfulness based on a concept encoded as binary classification (shapes denote true concept labels according to the ground-truth concept labeling function c ∗ ; colored borders denote surrogate labels generated with c˜). (a) The ideal boundary learned from true labels. (b) If the surrogate labeling function generates errors that are systematically correlated with specific… view at source ↗
Figure 8
Figure 8. Figure 8: Extended results for the random concept projection experiments in [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of the estimated H∆H-divergence [8] as a proxy for the upper bound of the task generalization error for the CUB dataset [70], when comparing training πθ based on the ground-truth concept labels to training it on the Broden dataset [7]. As in the main text, stronger distribution shifts between the concept-probing datasets and the downstream domain, as indicated by our divergence metric, indica… view at source ↗
Figure 10
Figure 10. Figure 10: Systematic vs. unsystematic surrogate errors for various VLMs (Appendix A.4), extending the results from [PITH_FULL_IMAGE:figures/full_fig_p038_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Statistically significant Pearson correlation coefficients [51] from [PITH_FULL_IMAGE:figures/full_fig_p039_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Systematic vs. unsystematic surrogate errors for various VLMs (Appendix A.4) evaluated on the CUB dataset [70]. Following the methodology of [PITH_FULL_IMAGE:figures/full_fig_p040_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Statistically significant Pearson correlation coefficients [51] from [PITH_FULL_IMAGE:figures/full_fig_p041_13.png] view at source ↗
read the original abstract

Human decision-making interprets the world through high-level concepts, such as recognizing a bird by its belly color. To bridge the gap between opaque deep learning representations and human understanding, Post-Hoc Concept Bottleneck Models (post-hoc CBMs) project latent features onto interpretable concept spaces using auxiliary datasets or vision-language models. However, relying on target task accuracy as the primary measure of post-hoc CBM success obscures whether the learned concepts are semantically meaningful or merely predictive artifacts. For example, random concept projections can achieve competitive accuracy despite being semantically meaningless. In this work, we analyze the learned projections directly and identify two failure cases: First, for concept projections learned from auxiliary data, covariate shifts can lead to unfaithful concept representations for the target task. In particular, we provide an upper bound on the error introduced by this shift. Second, systematic label noise in surrogate concept labels generated by vision-language models leads to unfaithful projections. After formalizing these failure modes, we introduce novel metrics that decouple concept faithfulness from predictive accuracy. Our empirical results across real-world and synthetic benchmarks confirm that these metrics identify unfaithful behaviors that standard accuracy-based evaluation fails to detect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that post-hoc CBMs can produce unfaithful concept projections due to covariate shifts between auxiliary data and the target task (with an explicit upper bound on the resulting error) and due to systematic label noise in VLM-generated surrogate labels. It formalizes these failure modes, introduces novel metrics intended to measure concept faithfulness independently of predictive accuracy, and reports that experiments on real-world and synthetic benchmarks show these metrics detect unfaithful behavior that accuracy-based evaluation misses.

Significance. If the upper bound is rigorously derived and the new metrics are shown to be independent of the auxiliary data and VLM labels that induce the failures, the work would usefully shift evaluation practice in interpretable ML away from accuracy alone toward direct assessment of semantic faithfulness.

major comments (2)
  1. [Failure Modes / Upper Bound] The upper bound on covariate-shift error is presented as a central contribution, yet the abstract provides no derivation, assumptions, or proof sketch; if this bound is load-bearing for the first failure mode, the manuscript must include the full derivation (likely in the formalization section) with explicit conditions under which it holds.
  2. [Novel Metrics] The claim that the novel metrics 'decouple concept faithfulness from predictive accuracy' is central, but the construction must be shown to avoid circular dependence on the same auxiliary datasets and VLM surrogate labels used to induce the documented failure modes; without an explicit independence argument or ablation, the metrics may simply rediscover the shift or noise rather than measure semantic faithfulness.
minor comments (1)
  1. [Experiments] Empirical results are described as confirming the metrics' utility, but the abstract lacks any mention of error bars, number of runs, or statistical tests; these details should be added to the experimental section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of the upper bound and the independence of the proposed metrics.

read point-by-point responses
  1. Referee: [Failure Modes / Upper Bound] The upper bound on covariate-shift error is presented as a central contribution, yet the abstract provides no derivation, assumptions, or proof sketch; if this bound is load-bearing for the first failure mode, the manuscript must include the full derivation (likely in the formalization section) with explicit conditions under which it holds.

    Authors: We agree that the derivation, assumptions, and conditions should be stated explicitly. The bound appears in the formalization section under the assumption of bounded distribution shift and Lipschitz continuity of the projection function, but we will add a complete proof sketch with all conditions to that section and a brief mention of the key assumptions to the abstract in the revision. revision: yes

  2. Referee: [Novel Metrics] The claim that the novel metrics 'decouple concept faithfulness from predictive accuracy' is central, but the construction must be shown to avoid circular dependence on the same auxiliary datasets and VLM surrogate labels used to induce the documented failure modes; without an explicit independence argument or ablation, the metrics may simply rediscover the shift or noise rather than measure semantic faithfulness.

    Authors: We acknowledge the need for an explicit independence argument. The metrics are defined using only target-task inputs and (where available) ground-truth concept labels, without reference to auxiliary data or VLM outputs at evaluation time. We will add a dedicated paragraph proving this separation and an ablation that recomputes the metrics after removing auxiliary/VLM influence to confirm they do not rediscover the original failure modes. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained against external benchmarks

full rationale

The abstract formalizes two failure modes (covariate shift with an upper bound on error, and label noise from VLMs) then introduces novel metrics claimed to decouple faithfulness from accuracy. No equations, parameter fits, or self-citations are shown that reduce the bound or metrics to inputs by construction. The provided text contains no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. Claims rest on analysis of existing post-hoc CBM methods without evident reduction to the authors' own auxiliary data or VLM labels. This is the common honest non-finding for papers whose central contributions are empirical identification of failure modes rather than closed-form derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claims rest on unstated assumptions about how covariate shift and label noise are measured and how the new metrics are constructed.

pith-pipeline@v0.9.1-grok · 5747 in / 1212 out tokens · 33406 ms · 2026-06-30T06:34:15.801338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

84 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    In: The Fourteenth International Conference on Learning Representations (2026) 1

    Almudévar, A., Hernández-Lobato, J.M., Ortega, A.: There Was Never a Bottleneck in Concept Bottleneck Models. In: The Fourteenth International Conference on Learning Representations (2026) 1

  2. [2]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Alukaev, D., Kiselev, S., Pershin, I., Ibragimov, B., Ivanov, V., Kornaev, A., Titov, I.: Cross-Modal Conceptualization in Bottleneck Models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 5241–5253 (2023) 1

  3. [3]

    Advances in Neural Information Processing Systems32(2019) 2, 3, 5, 6, 21

    Ansuini, A., Laio, A., Macke, J.H., Zoccolan, D.: Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems32(2019) 2, 3, 5, 6, 21

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  5. [5]

    Baraniuk, R.G., Wakin, M.B.: Random Projections of Smooth Manifolds. Found. Comput. Math.9(1), 51–77 (2009) 2, 3, 5, 6, 15, 21

  6. [6]

    Barsalou, L.W.: Perceptual symbol systems. Behav. Brain. Sci.22(4), 577–660 (1999) 1

  7. [7]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network Dissection: Quanti- fying Interpretability of Deep Visual Representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6541–6549 (2017) 1, 2, 3, 7, 10, 11, 32, 36, 37

  8. [8]

    Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn.79(1-2), 151–175 (2010) 3, 7, 13, 15, 22, 35, 36, 37

  9. [9]

    Springer (2006) 6

    Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006) 6

  10. [10]

    Advances in Neural Information Processing Systems33, 1877–1901 (2020) 4, 11, 32

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A....

  11. [11]

    Chauhan, K., Tiwari, R., Freyberg, J., Shenoy, P., Dvijotham, K.: Interactive concept bottleneck models. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence. AAAI’23/IAAI...

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261 (2025) 24 On the Faithfulness of Post-Hoc Concept Bottlenec...

  13. [13]

    In: 2009 IEEE Conference on Computer Vision and Pattern Recognition

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009) 10, 31

  14. [14]

    In: International Conference on Learning Representations (2021) 10

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations (2021) 10

  15. [15]

    Advances in Neural Infor- mation Processing Systems35, 21400–21413 (2022) 2, 4, 5, 21, 25

    Espinosa Zarlenga, M., Barbiero, P., Ciravegna, G., Marra, G., Giannini, F., Dili- genti, M., Shams, Z., Precioso, F., Melacci, S., Weller, A., et al.: Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off. Advances in Neural Infor- mation Processing Systems35, 21400–21413 (2022) 2, 4, 5, 21, 25

  16. [16]

    Fahrmeir, L., Lang, S.: Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors. J. R. Stat. Soc. Ser. C50(2), 201–220 (2001) 8, 25, 26, 34

  17. [17]

    In: The Fourteenth International Conference on Learning Representations (2026) 1

    Galliamov, K., Kazmi, S.M.A., Khan, A., Rivera, A.R.: Concepts’ Information Bottleneck Models. In: The Fourteenth International Conference on Learning Representations (2026) 1

  18. [18]

    MIT press (2004) 1

    Gardenfors, P.: Conceptual spaces: The geometry of thought. MIT press (2004) 1

  19. [19]

    MIT Press (2016) 6

    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016) 6

  20. [20]

    Advances in Neural Information Processing Systems35, 23386–23397 (2022) 3, 5

    Havasi, M., Parbhoo, S., Doshi-Velez, F.: Addressing Leakage in Concept Bottleneck Models. Advances in Neural Information Processing Systems35, 23386–23397 (2022) 3, 5

  21. [21]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 9, 31

    He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recogni- tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 9, 31

  22. [22]

    Holm, S.: A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 6(2), 65–70 (1979) 15, 39, 41

  23. [23]

    Neural Netw.2(5), 359–366 (1989) 5, 21

    Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw.2(5), 359–366 (1989) 5, 21

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Huang, Q., Song, J., Hu, J., Zhang, H., Wang, Y., Song, M.: On the concept trust- worthiness in concept bottleneck models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 21161–21168 (2024) 4

  25. [25]

    Jacovi, A., Goldberg, Y.: Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4198–4205 (2020) 2

  26. [26]

    Johnson, W.B., Lindenstrauss, J., et al.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math.26(189-206), 1 (1984) 2, 3, 5, 21

  27. [27]

    Jørgensen, B.: Exponential Dispersion Models. J. R. Stat. Soc. Ser. B49(2), 127–162 (1987) 8, 26

  28. [28]

    Transact

    Kazmierczak, R., Berthier, E., Frehse, G., Franchi, G.: CLIP-QDA: An Explainable Concept Bottleneck Model. Transact. Mach. Learn. Res. (2024) 1

  29. [29]

    In: Pro- ceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30

    Kifer, D., Ben-David, S., Gehrke, J.: Detecting Change in Data Streams. In: Pro- ceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. p. 180–191. VLDB ’04, VLDB Endowment (2004) 7, 13

  30. [30]

    In: Proceedings of the 35th International Conference on Machine Learning

    Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., Sayres, R.: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In: Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 2668–2677. PMLR (2018) 7, 31, 32

  31. [31]

    In: The Third International Conference on Learning Representations (2015) 31 18 L

    Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: The Third International Conference on Learning Representations (2015) 31 18 L. Schmalwasser et al

  32. [32]

    In: Proceedings of the 37th International Conference on Machine Learning

    Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., Liang, P.: Concept Bottleneck Models. In: Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 5338–5348. PMLR (2020) 1, 2, 6, 12

  33. [33]

    Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Tech. rep. (2009) 3, 10, 11, 31, 32, 33

  34. [34]

    arXiv preprint arXiv:2504.10833 (2025) 4

    Kumar, S., Ahuja, N.: Measuring the (Un) Faithfulness of Concept-Based Explana- tions. arXiv preprint arXiv:2504.10833 (2025) 4

  35. [35]

    In: The Twelfth International Conference on Learning Representations (2024) 4

    Lai, S., Hu, L., Wang, J., Berti-Equille, L., Wang, D.: Faithful Vision-Language Inter- pretation via Concept Bottleneck Models. In: The Twelfth International Conference on Learning Representations (2024) 4

  36. [36]

    In: Proceedings of the 39th International Conference on Machine Learning

    Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In: Proceedings of the 39th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 162, pp. 12888–12900 (2022) 24

  37. [37]

    Lipton, Z.C.: The Mythos of Model Interpretability. Commun. ACM61(10), 36–43 (2018) 1

  38. [38]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (2024) 24

  39. [39]

    In: Computer Vision – ECCV 2024

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In: Computer Vision – ECCV 2024. p. 38–55. Lecture Notes in Computer Science (2024) 3, 11, 12, 13, 14, 15, 24, 36, 38, 39

  40. [40]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024) 4

    Luyten, M.R., van der Schaar, M.: A theoretical design of concept sets: improving the predictability of concept bottleneck models. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024) 4

  41. [41]

    arXiv preprint arXiv:2106.13314 (2021) 2, 3

    Mahinpei, A., Clark, J., Lage, I., Doshi-Velez, F., Pan, W.: Promises and Pitfalls of Black-Box Concept Learning Models. arXiv preprint arXiv:2106.13314 (2021) 2, 3

  42. [42]

    In: ICLR 2025 Workshop: XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge (2025) 2, 5, 21, 34

    Makonnen, M., Vandenhirtz, M., Laguna, S., Vogt, J.E.: Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach. In: ICLR 2025 Workshop: XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge (2025) 2, 5, 21, 34

  43. [43]

    Margeloiu,A.,Ashman,M.,Bhatt,U.,Chen,Y.,Jamnik,M.,Weller,A.:DoConcept Bottleneck Models Learn as Intended? ICLR 2021 Workshop on Responsible AI (2021) 2, 3, 5

  44. [44]

    Monographs on statistics and applied probability, 2 edn

    McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monographs on statistics and applied probability, 2 edn. (1989) 8, 25, 26, 27, 28, 34

  45. [45]

    Midavaine, N., Go, G.H.T., Canez, D., Simion, I., Chatterji, S.: [Re] On the Reproducibility of Post-Hoc Concept Bottleneck Models. Trans. Mach. Learn. Res. (2024) 2, 5, 11, 30

  46. [46]

    In: Proceedings of the 40th International Conference on Machine Learning

    Moayeri, M., Rezaei, K., Sanjabi, M., Feizi, S.: Text-To-Concept (and Back) via Cross-Model Alignment. In: Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 25037–25060. PMLR (2023) 1

  47. [47]

    MIT Press (2002) 1

    Murphy, G.: The Big Book of Concepts. MIT Press (2002) 1

  48. [48]

    Nicolson, A., Schut, L., Noble, A., Gal, Y.: Explaining Explainability: Recommen- dations for Effective Use of Concept Activation Vectors. Trans. Mach. Learn. Res. (2025) 3, 10, 12, 13, 14, 31, 32, 33, 34, 36, 38, 39, 40, 41

  49. [49]

    Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.W.: Label-free Concept Bottleneck Models (2023) 2, 3, 4, 7, 8, 10, 11, 12, 13, 23, 31, 32, 34 On the Faithfulness of Post-Hoc Concept Bottleneck Models 19

  50. [50]

    In: The Eleventh International Conference on Learning Representations (2023) 2, 23

    Oikarinen, T., Weng, T.W.: Clip-dissect: Automatic description of neuron repre- sentations in deep vision networks. In: The Eleventh International Conference on Learning Representations (2023) 2, 23

  51. [51]

    Mathematical Contributions to the Theory of Evolution

    Pearson, K.: VII. Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia. Philos. Trans. R. Soc. A. (187), 253–318 (12

  52. [52]

    9, 14, 36, 38, 39, 40, 41

  53. [53]

    In: Math

    Penrose, R.: On best approximate solutions of linear matrix equations. In: Math. Proc. Camb. Philos. Soc. vol. 52, pp. 17–19. Cambridge University Press (1956) 6, 22

  54. [54]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Penzel, N., Denzler, J.: Locally explaining prediction behavior via gradual inter- ventions and measuring property gradients. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 7398– 7408 (March 2026). https://doi.org/10.48550/arXiv.2503.05424 , https: //propgrad.github.io25

  55. [55]

    In: International Conference on Learning Representations (2021) 2, 3, 5, 6, 21

    Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., Goldstein, T.: The Intrinsic Dimension of Images and Its Impact on Learning. In: International Conference on Learning Representations (2021) 2, 3, 5, 6, 21

  56. [56]

    In: Proceedings of the 38th International Conference on Machine Learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763...

  57. [57]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ramaswamy, V.V., Kim, S.S., Fong, R., Russakovsky, O.: Overlooked Factors in Concept-based Explanations: Dataset Choice, Concept Learnability, and Human Capability. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10932–10941 (2023) 4, 6

  58. [58]

    Sampat, S.K., Patel, M., Yang, Y., Baral, C.: Help Me Identify: Is an LLM+ VQA System All We Need to Identify Visual Concepts? arXiv preprint arXiv:2410.13651 (2024) 2, 8, 23, 24

  59. [59]

    In: International Conference on Machine Learning (ICML) (2025),https://fastcav.github.io/ 7

    Schmalwasser,L.,Penzel,N.,Denzler,J.,Niebling,J.:Fastcav:Efficientcomputation of concept activation vectors for explaining deep neural networks. In: International Conference on Machine Learning (ICML) (2025),https://fastcav.github.io/ 7

  60. [60]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Schoen, R., Abeloos, B., Herbin, S.: Measuring and Addressing Information Leakage in Concept Bottleneck Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 624–632 (2025) 3, 5, 34

  61. [61]

    Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference90(2), 227–244 (2000) 7, 23

  62. [62]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3. arXiv preprint a...

  63. [63]

    Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: an open multilingual graph of generalknowledge.In:ProceedingsoftheThirty-FirstAAAIConferenceonArtificial Intelligence. p. 4444–4451 (2017) 11, 32

  64. [64]

    Schmalwasser et al

    Srivastava, D., Yan, G., Weng, T.W.: VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance (2024) 2, 3, 4, 6, 7, 8, 11, 12, 13, 14, 21, 23, 24, 32, 34 20 L. Schmalwasser et al

  65. [65]

    The MIT Press (2012) 7, 23

    Sugiyama, M., Kawanabe, M.: Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. The MIT Press (2012) 7, 23

  66. [66]

    arXiv preprint arXiv:2402.05945 (2024) 3

    Sun, A., Yuan, Y., Ma, P., Wang, S.: Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning. arXiv preprint arXiv:2402.05945 (2024) 3

  67. [67]

    https://github.com/osmr/ imgclsmob(2024), GitHub repository, accessed February 2026 9

    Sémery, O.: imgclsmob: Deep learning networks. https://github.com/osmr/ imgclsmob(2024), GitHub repository, accessed February 2026 9

  68. [68]

    In: Computer Vision – ECCV 2024

    Tan, A., Zhou, F., Chen, H.: Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts. In: Computer Vision – ECCV 2024. p. 123–138. Lecture Notes in Computer Science, Springer-Verlag (2024) 1

  69. [69]

    Vandenhirtz, M., Laguna, S., Marcinkevičs, R., Vogt, J.E.: Stochastic Concept Bottleneck Models (2024) 1

  70. [70]

    47 (2018) 22

    Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (2018) 22

  71. [71]

    Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200 (2010) 3, 9, 14, 31, 32, 33, 34, 36, 37, 40, 41

  72. [72]

    In: The Eleventh International Conference on Learning Representations (2023) 2, 3, 4, 6, 7, 10, 11, 12, 14, 21, 25, 30, 31, 32, 34

    Yuksekgonul, M., Wang, M., Zou, J.: Post-hoc concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2023) 2, 3, 4, 6, 7, 10, 11, 12, 14, 21, 25, 30, 31, 32, 34

  73. [73]

    In: Proceedings of the Twenty-First International Conference on Machine Learning

    Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning. Association for Computing Machinery (2004) 13, 36

  74. [74]

    Is the concept ’tk’ present in this image? Answer Yes or No

    Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B67(2), 301–320 (03 2005) 31 On the Faithfulness of Post-Hoc Concept Bottleneck Models 21 A Additional Theoretical Details A.1 Johnson-Lindenstrauss for Smooth Manifolds While neural activation spaces are high-dimensional (A ⊂R d), valid activations empir...

  75. [75]

    Distributional Assumption.Because concepts in post-hoc CBMs can be mod- eled in various ways, such as binary indicators (e.g., presence of “wings”) or continuous similarity scores (e.g., similarity with a VLM prompt), we need a gen- eral framework to model the concept distribution. TheMultivariate Exponential Dispersion Family(EDF) [27,44] provides this g...

  76. [76]

    Structural Assumption.To formally model the concept projection, we follow the standard GLM framework [16,44], which connects the backbone activations to the expected concept valuesµ through a linear predictorζ and an invertible link function ψ. In the context of post-hoc CBMs, the linear predictor corresponds to On the Faithfulness of Post-Hoc Concept Bot...

  77. [77]

    Gradient of the Log-Likelihood.Our optimization objective is to minimize the Negative Log-Likelihood (NLL) on the training data. For a single observation (f(x), c)∈ A × C, the loss functionLis derived from the EDF density: L(πΘ(f(x)), c) =−logp(c|η, ϕ)(28) =−log exp ⟨c, η⟩ −A(η) s(ϕ) +k(c, ϕ) (29) =− ⟨c, η⟩ −A(η) s(ϕ) +k(c, ϕ) (30) ∝ − 1 s(ϕ) (⟨c, η⟩ −A(η...

  78. [78]

    Thus, the surrogate objective ˜Jtask(Θ; L)is the expected risk over the task distributionPtask, calculated using these surrogate labels: ˜Jtask(Θ;L) =E x∼Ptask [L(πΘ(f(x)),˜c(x))]

    Surrogate Optimality Condition.We assume the projection parametersΘ are learned using a surrogate labeling function˜c: X → C (see Appendix A.4 for examples) rather than the inaccessible ground truth. Thus, the surrogate objective ˜Jtask(Θ; L)is the expected risk over the task distributionPtask, calculated using these surrogate labels: ˜Jtask(Θ;L) =E x∼Pta...

  79. [79]

    (41) We evaluate the gradient of this objective at the parameters˜Θ learned via the surrogate optimization

    Gradient of the True Objective.We define thetrueobjective J ∗ task(Θ; L)as the expected risk with respect to the ground-truth concept functionc∗(x)and the true dispersion parameterϕ∗: J ∗ task(Θ;L) =E x∼Ptask [L(πΘ(f(x)), c ∗(x))]. (41) We evaluate the gradient of this objective at the parameters˜Θ learned via the surrogate optimization. Substituting the ...

  80. [80]

    For the concept projectionπΘrand, we scale the bottleneck dimensionK of the weight matrixΘrand ∈R K×d to align with the number of concepts present in the trained PCBMs

    with concept projections trained on the CUB [70], CIFAR-10 [33], and Elements [48] datasets and compares them with PCBMs utilizing a semantically meaningless concept projectionπΘrand for the respective dataset (Appendix B.1). For the concept projectionπΘrand, we scale the bottleneck dimensionK of the weight matrixΘrand ∈R K×d to align with the number of c...

Showing first 80 references.