If Concept Bottlenecks are the Question, are Foundation Models the Answer?

Andrea Passerini; Emanuele Marconato; Francesco Giannini; Nicola Debole; Pietro Barbiero; Stefano Teso

arxiv: 2504.19774 · v2 · submitted 2025-04-28 · 💻 cs.LG

If Concept Bottlenecks are the Question, are Foundation Models the Answer?

Nicola Debole , Pietro Barbiero , Francesco Giannini , Andrea Passerini , Stefano Teso , Emanuele Marconato This is my paper

Pith reviewed 2026-05-22 17:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords concept bottleneck modelsvision-language modelsweak supervisionconcept qualityinterpretabilityfoundation modelsempirical evaluation

0 comments

The pith

Vision-language models supply concept labels for bottleneck models that often differ from expert annotations and whose accuracy does not track their actual quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Concept bottleneck models first map inputs to human-understandable concepts and then use those concepts to make the final prediction. This design promises both accuracy and built-in interpretability, yet it has long depended on costly expert labels for the concepts. Recent work replaces those labels with weak supervision drawn from foundation models. The paper tests whether this substitution preserves concept quality by running state-of-the-art VLM-CBMs through a battery of quality metrics on multiple tasks. The results indicate that VLM supervision produces noticeably different concepts from expert supervision and that simple accuracy on the concepts is only weakly related to their usefulness for the downstream task or for interpretability.

Core claim

The authors evaluate several VLM-CBM architectures by comparing the concepts they learn under VLM supervision against those learned under expert supervision. They measure concept quality with a selection of established metrics and find that the two sources of supervision diverge substantially on some tasks while agreeing on others. In addition, they report that concept accuracy and the chosen quality metrics are not strongly correlated, implying that accuracy alone is an unreliable indicator of whether a concept will support good performance or clear explanations.

What carries the argument

The empirical comparison of VLM-supervised CBMs against expert-supervised CBMs using a battery of concept quality metrics.

If this is right

Concept accuracy should not be used as the sole criterion when selecting or validating concepts in bottleneck models.
VLM supervision may be adequate for some tasks but requires task-specific validation before it replaces expert labels.
Quality metrics beyond accuracy become necessary to decide whether a VLM-CBM meets interpretability goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future CBMs may need to combine VLM and expert signals rather than treat them as interchangeable.
The gap between VLM and expert concepts could be narrowed by more targeted prompting or fine-tuning of the foundation model on the target domain.

Load-bearing premise

The chosen metrics for concept quality give a valid and complete picture of what makes concepts useful for both task performance and interpretability.

What would settle it

A follow-up study in which the same models are evaluated with a different set of quality metrics or with direct human judgments of interpretability, and the new metrics reverse the ranking of VLM versus expert concepts.

Figures

Figures reproduced from arXiv: 2504.19774 by Andrea Passerini, Emanuele Marconato, Francesco Giannini, Nicola Debole, Pietro Barbiero, Stefano Teso.

**Figure 1.** Figure 1: A prototypical VLM-CBM. Left: given textual descriptions T of visual concepts, a VLM is used to annotate a training set D with per-concept labels or activation scores and (optionally) masks indicating where each concept activates. Right: the supervision is typically used to fine-tune a backbone f that extracts concepts from new inputs and a linear layer g for inferring predictions from the concepts. See Se… view at source ↗

**Figure 2.** Figure 2: Issues with CBM concepts. An illustration of the four issues with concept quality and usage affecting CBMs on the CelebA task (cf. Section 4): (1) Attaining high label accuracy does not prevent learning inaccurate concepts. (2) CBMs can maximize label performance by learning leaky concepts that include irrelevant contextual cues. (3) Entangled concepts encode unwanted information about one another, affecti… view at source ↗

**Figure 3.** Figure 3: Gaps of the concept leakage test in CelebA and CUB. The higher the gap, the more the subset of learned concepts leaks information for label prediction. (Left) Gaps on CelebA upon varying the subset of concepts from 1 to 39. All model concepts always display a non-negative gain in label F1 compared to ground-truth concepts. (Right) Gaps on CUB upon varying the subset of concepts from 1 to 112. Only CBM is a… view at source ↗

**Figure 4.** Figure 4: Gaps in concept leakage test on Shapes3d dataset. Negative values indicate that the classifier using predicted concepts has lower F1 score than the same which has access to ground-truth concept annotations. B.1 Disentanglement We provide in [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Annotations obtained using G-DINO on the 3 different datasets. On CelebA (left) and Shapes3d (center) the VLM makes several mistakes highlighted by red crosses. On the other hand, the model does a good job annotating CUB (right). CBM Naïve LF-CBM LaBo VLG-CBM [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Importance matrices in DCI on Shapes3d. High disentanglement is achieved when, for each row, only one square is active. For reference, a perfect DCI matrix is a diagonal matrix. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Importance matrices in DCI on CelebA. High disentanglement is achieved when, for each row, only one square is active. For reference, a perfect DCI matrix is a diagonal matrix. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Importance matrices in DCI on CUB. High disentanglement is achieved when, for each row, only one square is active. For reference, a perfect DCI matrix is a diagonal matrix. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

read the original abstract

Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. CBMs work by first mapping inputs (e.g., images) to high-level concepts (e.g., visible objects and their properties) and then use these to solve a downstream task (e.g., tagging or scoring an image) in an interpretable manner. Their performance and interpretability, however, hinge on the quality of the concepts they learn. The go-to strategy for ensuring good quality concepts is to leverage expert annotations, which are expensive to collect and seldom available in applications. Researchers have recently addressed this issue by introducing "VLM-CBM" architectures that replace manual annotations with weak supervision from foundation models. It is however unclear what is the impact of doing so on the quality of the learned concepts. To answer this question, we put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics. Our results show that, depending on the task, VLM supervision can sensibly differ from expert annotations, and that concept accuracy and quality are not strongly correlated. Our code is available at https://github.com/debryu/CQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLM supervision for CBMs produces concepts that diverge from expert labels on some tasks, with concept accuracy showing only weak ties to the quality metrics they measured.

read the letter

The key point is that swapping expert annotations for VLM supervision in concept bottleneck models does not preserve the same concepts across tasks, and accuracy on those concepts does not track well with the other quality measures the authors tracked. This is the main empirical observation they surface. The work is straightforward: they take existing VLM-CBM pipelines, apply them to standard image datasets, and score the resulting concepts with a handful of metrics that go beyond simple accuracy. Releasing the code is useful and lets others rerun or extend the comparisons directly. That part is clean and reproducible on its own terms. The finding that accuracy and quality metrics are not strongly correlated is new relative to the earlier VLM-CBM papers they cite, and it is worth noting because many follow-up papers still treat concept accuracy as a sufficient proxy. The soft spot is the metric selection itself. The claim about weak correlation stands or falls on whether those particular quality metrics actually capture what matters for downstream interpretability and robustness. If they under-weight factors like human alignment or behavior under distribution shift, the reported disconnect could be narrower than it first appears. The experiments stay within common benchmarks, so it is still open how large the gap becomes in applied settings where expert labels are truly scarce. This is the kind of paper that belongs in a reading group focused on interpretable models or weak supervision. People already working on CBMs or VLM-based pipelines will find the comparison concrete enough to build on. It is worth sending to peer review because the empirical observation is falsifiable and the code is public, even if the authors need to tighten the justification for their quality metrics and add checks on downstream task impact.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an empirical study on the use of Vision-Language Models (VLMs) for supervising Concept Bottleneck Models (CBMs) in place of expert annotations. The authors evaluate state-of-the-art VLM-CBM architectures across different tasks using multiple metrics for concept accuracy and quality. Key findings indicate that VLM supervision can differ from expert annotations in a task-dependent manner and that concept accuracy does not strongly correlate with the assessed concept quality.

Significance. Should the empirical results prove robust, this work underscores the challenges in replacing expert supervision with VLMs for CBMs and questions the adequacy of accuracy as a measure of concept quality. It contributes to the field by providing a comparative analysis and releasing code, which aids in assessing the practical utility of VLM-CBMs for interpretable machine learning.

major comments (1)

[Empirical Analysis] The conclusion that concept accuracy and quality are not strongly correlated hinges on the chosen metrics providing a valid assessment of concept usefulness for CBM performance and interpretability. The manuscript should include analysis showing how these metrics correlate with downstream task accuracy or human-judged interpretability to rule out the possibility that the observed lack of correlation is due to metric selection rather than an inherent property of VLM supervision.

minor comments (1)

[Abstract] The phrase 'a selection of significant metrics' is vague; specifying the metrics or pointing to the relevant section would improve clarity for readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for the positive assessment of our work and the detailed feedback. Below we provide a point-by-point response to the major comment.

read point-by-point responses

Referee: [Empirical Analysis] The conclusion that concept accuracy and quality are not strongly correlated hinges on the chosen metrics providing a valid assessment of concept usefulness for CBM performance and interpretability. The manuscript should include analysis showing how these metrics correlate with downstream task accuracy or human-judged interpretability to rule out the possibility that the observed lack of correlation is due to metric selection rather than an inherent property of VLM supervision.

Authors: We appreciate this suggestion, which helps strengthen the validity of our conclusions. In the original manuscript, we selected metrics based on established practices in the CBM literature to assess concept quality in terms of their alignment with semantic meaning and contribution to interpretability. To address the referee's concern directly, we will add a new analysis in the revised manuscript that examines the correlation between our quality metrics and the downstream task accuracy for each VLM-CBM architecture and task. This will provide evidence that the observed weak correlation between concept accuracy and quality metrics reflects a genuine property rather than a limitation of the metrics. We note that conducting human-judged interpretability studies would require additional resources and participant recruitment, which we did not perform in the current work; however, we will expand the discussion to acknowledge this and suggest it as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison against external annotations

full rationale

The paper performs direct empirical evaluation of VLM-CBM concept quality against expert annotations and independent metrics on multiple tasks. No derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatzes appear in the reported results. All claims rest on observable differences between VLM outputs and held-out expert labels, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard machine-learning assumptions about the validity of expert concept annotations as a benchmark and the relevance of the chosen quality metrics for assessing CBM interpretability.

axioms (1)

domain assumption Expert annotations represent a reliable ground-truth benchmark for evaluating the quality of learned concepts.
Used as the reference against which VLM supervision is compared throughout the empirical analysis.

pith-pipeline@v0.9.0 · 5765 in / 1132 out tokens · 134539 ms · 2026-05-22T17:47:01.283716+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We measure disentanglement with DCI... OIS = 2/k ∑ |Rij − R(gt)ij|²
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LEAK = 1/k · Z ∑ max(gapℓ, 0)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 4 internal anchors

[1]

To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty

Yasin Abbasi Yadkori et al. To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty. NeurIPS, 2024

work page 2024
[2]

Towards robust interpretability with self-explaining neural networks

David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. NeurIPS, 2018

work page 2018
[3]

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

Jason Ansel et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In ASPLOS, 2024

work page 2024
[4]

Debiasing concept-based explanations with causal analysis

Mohammad Taha Bahadori and David Heckerman. Debiasing concept-based explanations with causal analysis. In ICLR, 2021

work page 2021
[5]

Entropy-based logic explanations of neural networks

Pietro Barbiero et al. Entropy-based logic explanations of neural networks. In AAAI, 2022

work page 2022
[6]

Interpretable neural-symbolic concept reasoning

Pietro Barbiero et al. Interpretable neural-symbolic concept reasoning. In ICML, 2023

work page 2023
[7]

Relational concept bottleneck models

Pietro Barbiero et al. Relational concept bottleneck models. NeurIPS, 2024

work page 2024
[8]

Neural interpretable reasoning

Pietro Barbiero et al. Neural interpretable reasoning. arXiv:2502.11639, 2025

work page arXiv 2025
[9]

Concept-level debugging of part-prototype networks

Andrea Bontempelli et al. Concept-level debugging of part-prototype networks. In ICLR, 2023

work page 2023
[10]

Shortcuts and identifiability in concept-based models from a neuro-symbolic lens

Samuele Bortolotti et al. Shortcuts and identifiability in concept-based models from a neuro-symbolic lens. arXiv:2502.11245, 2025

work page arXiv 2025
[11]

Logically consistent language models via neuro-symbolic integration

Diego Calanzone et al. Logically consistent language models via neuro-symbolic integration. In ICLR, 2025

work page 2025
[12]

Interactive concept bottleneck models

Kushal Chauhan et al. Interactive concept bottleneck models. In AAAI, 2023

work page 2023
[13]

This looks like that: Deep learning for interpretable image recognition

Chaofan Chen et al. This looks like that: Deep learning for interpretable image recognition. NeurIPS, 2019

work page 2019
[14]

Concept whitening for interpretable image recognition

Zhi Chen et al. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2020

work page 2020
[15]

Xtuner: A toolkit for efficiently fine-tuning llm

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/ xtuner, 2023

work page 2023
[16]

Support-vector networks

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 1995

work page 1995
[17]

Causally reliable concept bottleneck models

Giovanni De Felice et al. Causally reliable concept bottleneck models. arXiv:2503.04363, 2025

work page arXiv 2025
[18]

Interpretable concept-based memory reasoning

David Debot et al. Interpretable concept-based memory reasoning. arXiv:2407.15527, 2024

work page arXiv 2024
[19]

Anycbms: How to turn any black box into a concept bottleneck model, 2024

Gabriele Dominici et al. Anycbms: How to turn any black box into a concept bottleneck model, 2024

work page 2024
[20]

Causal concept graph models: Beyond causal opacity in deep learning.arXiv:2405.16507, 2024

Gabriele Dominici et al. Causal concept graph models: Beyond causal opacity in deep learning.arXiv:2405.16507, 2024

work page arXiv 2024
[21]

Counterfactual concept bottleneck models

Gabriele Dominici et al. Counterfactual concept bottleneck models. arXiv:2402.01408, 2024

work page arXiv 2024
[22]

A framework for the quantitative evaluation of disentangled representations

Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018

work page 2018
[23]

Learning to receive help: Intervention-aware concept embedding models

Mateo Espinosa Zarlenga et al. Learning to receive help: Intervention-aware concept embedding models. NeurIPS, 2023

work page 2023
[24]

Learning to receive help: Intervention-aware concept embedding models

Mateo Espinosa Zarlenga et al. Learning to receive help: Intervention-aware concept embedding models. NeurIPS, 2024

work page 2024
[25]

Bayesian concept bottleneck models with llm priors

Jean Feng et al. Bayesian concept bottleneck models with llm priors. arXiv:2410.15555, 2024

work page arXiv 2024
[26]

Sample-efficient learning of concepts with theoretical guarantees: from data to concepts without interventions

Hidde Fokkema et al. Sample-efficient learning of concepts with theoretical guarantees: from data to concepts without interventions. arXiv:2502.06536, 2025

work page arXiv 2025
[27]

Towards a deeper understanding of concept bottleneck models through end-to-end explanation

Jack Furby et al. Towards a deeper understanding of concept bottleneck models through end-to-end explanation. In Workshop on Representation Learning for Responsible Human-Centric AI @ AAAI, 2023

work page 2023
[28]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Addressing leakage in concept bottleneck models

Marton Havasi et al. Addressing leakage in concept bottleneck models. In NeurIPS, 2022

work page 2022
[30]

Deep residual learning for image recognition

Kaiming He et al. Deep residual learning for image recognition. In CVPR, 2016

work page 2016
[31]

Towards a Definition of Disentangled Representations

Irina Higgins et al. Towards a definition of disentangled representations. arXiv:1812.02230, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM TOIS, 2023

work page 2023
[33]

GPT-4o System Card

Aaron Hurst et al. Gpt-4o system card. arXiv:2410.21276, 2024. 11 A PREPRINT - A PRIL 30, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Concept bottleneck generative models

Aya Abdelsalam Ismail et al. Concept bottleneck generative models. In ICLR, 2023

work page 2023
[35]

A comprehensive survey on self-interpretable neural networks

Yang Ji et al. A comprehensive survey on self-interpretable neural networks. arXiv:2501.15638, 2025

work page arXiv 2025
[36]

Is disentanglement all you need? comparing concept-based & disentanglement approaches

Dmitry Kazhdan et al. Is disentanglement all you need? comparing concept-based & disentanglement approaches. arXiv:2104.06917, 2021

work page arXiv 2021
[37]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors

Been Kim et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors. In ICML, 2018

work page 2018
[38]

Probabilistic concept bottleneck models

Eunji Kim et al. Probabilistic concept bottleneck models. In ICML, 2023

work page 2023
[39]

Disentangling by factorising

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018

work page 2018
[40]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[41]

Concept bottleneck models

Pang Wei Koh et al. Concept bottleneck models. In ICML, 2020

work page 2020
[42]

Beyond concept bottleneck models: How to make black boxes intervenable? NeurIPS, 2024

Sonia Laguna et al. Beyond concept bottleneck models: How to make black boxes intervenable? NeurIPS, 2024

work page 2024
[43]

Faithful vision-language interpretation via concept bottleneck models

Songning Lai et al. Faithful vision-language interpretation via concept bottleneck models. In ICLR, 2024

work page 2024
[44]

Find: human-in-the-loop debugging deep text classifiers

Piyawat Lertvittayakumjorn et al. Find: human-in-the-loop debugging deep text classifiers. In EMNLP, 2020

work page 2020
[45]

Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions

Oscar Li et al. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In AAAI, 2018

work page 2018
[46]

On erroneous agreements of clip image embeddings

Siting Li et al. On erroneous agreements of clip image embeddings. arXiv:2411.05195, 2024

work page arXiv 2024
[47]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In ECCV, 2024

work page 2024
[48]

Deep learning face attributes in the wild

Ziwei Liu et al. Deep learning face attributes in the wild. In ICCV, 2015

work page 2015
[49]

Towards learning to explain with concept bottleneck models: mitigating information leakage

Joshua Lockhart et al. Towards learning to explain with concept bottleneck models: mitigating information leakage. arXiv:2211.03656, 2022

work page arXiv 2022
[50]

Promises and pitfalls of black-box concept learning models

Anita Mahinpei et al. Promises and pitfalls of black-box concept learning models. In Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI @ ICML, 2021

work page 2021
[51]

GlanceNets: Interpretabile, Leak-proof Concept-based Models

Emanuele Marconato et al. GlanceNets: Interpretabile, Leak-proof Concept-based Models. In NeurIPS, 2022

work page 2022
[52]

Interpretability is in the mind of the beholder: A causal framework for human- interpretable representation learning

Emanuele Marconato et al. Interpretability is in the mind of the beholder: A causal framework for human- interpretable representation learning. Entropy, 2023

work page 2023
[53]

Not all neuro-symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts

Emanuele Marconato et al. Not all neuro-symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts. NeurIPS, 2024

work page 2024
[54]

Do concept bottleneck models learn as intended? arXiv:2105.04289, 2021

Andrei Margeloiu et al. Do concept bottleneck models learn as intended? arXiv:2105.04289, 2021

work page arXiv 2021
[55]

Evaluating the stability of semantic concept representations in CNNs for robust explainability

Georgii Mikriukov et al. Evaluating the stability of semantic concept representations in CNNs for robust explainability. In World Conference on Explainable Artificial Intelligence, 2023

work page 2023
[56]

Lost in latent space: Examining failures of disentangled models at combinatorial generalisa- tion

Milton Montero et al. Lost in latent space: Examining failures of disentangled models at combinatorial generalisa- tion. NeurIPS, 2022

work page 2022
[57]

Diconstruct: Causal concept-based explanations through black-box distillation

Ricardo Moreira et al. Diconstruct: Causal concept-based explanations through black-box distillation. arXiv:2401.08534, 2024

work page arXiv 2024
[58]

Label-free concept bottleneck models

Tuomas Oikarinen et al. Label-free concept bottleneck models. In ICLR, 2023

work page 2023
[59]

Concept-based explainable artificial intelligence: A survey

Eleonora Poeta et al. Concept-based explainable artificial intelligence: A survey. arXiv:2312.12936, 2023

work page arXiv 2023
[60]

Learning transferable visual models from natural language supervision

Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021
[61]

From causal to concept-based representation learning

Goutham Rajendran et al. From causal to concept-based representation learning. NeurIPS, 2024

work page 2024
[62]

Do concept bottleneck models obey locality? In XAI in Action: Past, Present, and Future Applications, 2023

Naveen Raman et al. Do concept bottleneck models obey locality? In XAI in Action: Past, Present, and Future Applications, 2023

work page 2023
[63]

Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery, 2024

Sukrut Rao et al. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery, 2024

work page 2024
[64]

Unpacking large language models with conceptual consistency

Pritish Sahu et al. Unpacking large language models with conceptual consistency. arXiv:2209.15093, 2022

work page arXiv 2022
[65]

C-senn: Contrastive self-explaining neural network

Yoshihide Sawada and Keigo Nakamura. C-senn: Contrastive self-explaining neural network. arXiv:2206.09575, 2022

work page arXiv 2022
[66]

Concept bottleneck model with additional unsupervised concepts

Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 2022. 12 A PREPRINT - A PRIL 30, 2025

work page 2022
[67]

New support vector algorithms

Bernhard Schölkopf et al. New support vector algorithms. Neural computation, 2000

work page 2000
[68]

Toward causal representation learning

Bernhard Schölkopf et al. Toward causal representation learning. IEEE, 2021

work page 2021
[69]

Concept bottleneck models without predefined concepts

Simon Schrodi et al. Concept bottleneck models without predefined concepts. arXiv:2407.03921, 2024

work page arXiv 2024
[70]

Concept embedding analysis: A review

Gesina Schwalbe. Concept embedding analysis: A review. arXiv:2203.13909, 2022

work page arXiv 2022
[71]

A closer look at the intervention procedure of concept bottleneck models

Sungbin Shin et al. A closer look at the intervention procedure of concept bottleneck models. In ICML, 2023

work page 2023
[72]

VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance

Divyansh Srivastava et al. VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance. In NeurIPS, 2024

work page 2024
[73]

Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations

Wolfgang Stammer et al. Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations. In CVPR, 2021

work page 2021
[74]

Learning to intervene on concept bottlenecks

David Steinmann et al. Learning to intervene on concept bottlenecks. In ICML, 2024

work page 2024
[75]

Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness

Raphael Suter et al. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In ICML, 2019

work page 2019
[76]

Leveraging explanations in interactive machine learning: An overview

Stefano Teso et al. Leveraging explanations in interactive machine learning: An overview. Frontiers in Artificial Intelligence, 2023

work page 2023
[77]

Stochastic concept bottleneck models

Moritz Vandenhirtz et al. Stochastic concept bottleneck models. arXiv:2406.19272, 2024

work page arXiv 2024
[78]

The caltech-ucsd birds-200-2011 dataset

Catherine Wah et al. The caltech-ucsd birds-200-2011 dataset. 2011

work page 2011
[79]

Leveraging sparse linear layers for debuggable deep networks

Eric Wong et al. Leveraging sparse linear layers for debuggable deep networks. In ICML, 2021

work page 2021
[80]

Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

Yue Yang et al. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In CVPR, 2023

work page 2023

Showing first 80 references.

[1] [1]

To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty

Yasin Abbasi Yadkori et al. To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty. NeurIPS, 2024

work page 2024

[2] [2]

Towards robust interpretability with self-explaining neural networks

David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. NeurIPS, 2018

work page 2018

[3] [3]

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

Jason Ansel et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In ASPLOS, 2024

work page 2024

[4] [4]

Debiasing concept-based explanations with causal analysis

Mohammad Taha Bahadori and David Heckerman. Debiasing concept-based explanations with causal analysis. In ICLR, 2021

work page 2021

[5] [5]

Entropy-based logic explanations of neural networks

Pietro Barbiero et al. Entropy-based logic explanations of neural networks. In AAAI, 2022

work page 2022

[6] [6]

Interpretable neural-symbolic concept reasoning

Pietro Barbiero et al. Interpretable neural-symbolic concept reasoning. In ICML, 2023

work page 2023

[7] [7]

Relational concept bottleneck models

Pietro Barbiero et al. Relational concept bottleneck models. NeurIPS, 2024

work page 2024

[8] [8]

Neural interpretable reasoning

Pietro Barbiero et al. Neural interpretable reasoning. arXiv:2502.11639, 2025

work page arXiv 2025

[9] [9]

Concept-level debugging of part-prototype networks

Andrea Bontempelli et al. Concept-level debugging of part-prototype networks. In ICLR, 2023

work page 2023

[10] [10]

Shortcuts and identifiability in concept-based models from a neuro-symbolic lens

Samuele Bortolotti et al. Shortcuts and identifiability in concept-based models from a neuro-symbolic lens. arXiv:2502.11245, 2025

work page arXiv 2025

[11] [11]

Logically consistent language models via neuro-symbolic integration

Diego Calanzone et al. Logically consistent language models via neuro-symbolic integration. In ICLR, 2025

work page 2025

[12] [12]

Interactive concept bottleneck models

Kushal Chauhan et al. Interactive concept bottleneck models. In AAAI, 2023

work page 2023

[13] [13]

This looks like that: Deep learning for interpretable image recognition

Chaofan Chen et al. This looks like that: Deep learning for interpretable image recognition. NeurIPS, 2019

work page 2019

[14] [14]

Concept whitening for interpretable image recognition

Zhi Chen et al. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2020

work page 2020

[15] [15]

Xtuner: A toolkit for efficiently fine-tuning llm

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/ xtuner, 2023

work page 2023

[16] [16]

Support-vector networks

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 1995

work page 1995

[17] [17]

Causally reliable concept bottleneck models

Giovanni De Felice et al. Causally reliable concept bottleneck models. arXiv:2503.04363, 2025

work page arXiv 2025

[18] [18]

Interpretable concept-based memory reasoning

David Debot et al. Interpretable concept-based memory reasoning. arXiv:2407.15527, 2024

work page arXiv 2024

[19] [19]

Anycbms: How to turn any black box into a concept bottleneck model, 2024

Gabriele Dominici et al. Anycbms: How to turn any black box into a concept bottleneck model, 2024

work page 2024

[20] [20]

Causal concept graph models: Beyond causal opacity in deep learning.arXiv:2405.16507, 2024

Gabriele Dominici et al. Causal concept graph models: Beyond causal opacity in deep learning.arXiv:2405.16507, 2024

work page arXiv 2024

[21] [21]

Counterfactual concept bottleneck models

Gabriele Dominici et al. Counterfactual concept bottleneck models. arXiv:2402.01408, 2024

work page arXiv 2024

[22] [22]

A framework for the quantitative evaluation of disentangled representations

Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018

work page 2018

[23] [23]

Learning to receive help: Intervention-aware concept embedding models

Mateo Espinosa Zarlenga et al. Learning to receive help: Intervention-aware concept embedding models. NeurIPS, 2023

work page 2023

[24] [24]

Learning to receive help: Intervention-aware concept embedding models

Mateo Espinosa Zarlenga et al. Learning to receive help: Intervention-aware concept embedding models. NeurIPS, 2024

work page 2024

[25] [25]

Bayesian concept bottleneck models with llm priors

Jean Feng et al. Bayesian concept bottleneck models with llm priors. arXiv:2410.15555, 2024

work page arXiv 2024

[26] [26]

Sample-efficient learning of concepts with theoretical guarantees: from data to concepts without interventions

Hidde Fokkema et al. Sample-efficient learning of concepts with theoretical guarantees: from data to concepts without interventions. arXiv:2502.06536, 2025

work page arXiv 2025

[27] [27]

Towards a deeper understanding of concept bottleneck models through end-to-end explanation

Jack Furby et al. Towards a deeper understanding of concept bottleneck models through end-to-end explanation. In Workshop on Representation Learning for Responsible Human-Centric AI @ AAAI, 2023

work page 2023

[28] [28]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Addressing leakage in concept bottleneck models

Marton Havasi et al. Addressing leakage in concept bottleneck models. In NeurIPS, 2022

work page 2022

[30] [30]

Deep residual learning for image recognition

Kaiming He et al. Deep residual learning for image recognition. In CVPR, 2016

work page 2016

[31] [31]

Towards a Definition of Disentangled Representations

Irina Higgins et al. Towards a definition of disentangled representations. arXiv:1812.02230, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM TOIS, 2023

work page 2023

[33] [33]

GPT-4o System Card

Aaron Hurst et al. Gpt-4o system card. arXiv:2410.21276, 2024. 11 A PREPRINT - A PRIL 30, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Concept bottleneck generative models

Aya Abdelsalam Ismail et al. Concept bottleneck generative models. In ICLR, 2023

work page 2023

[35] [35]

A comprehensive survey on self-interpretable neural networks

Yang Ji et al. A comprehensive survey on self-interpretable neural networks. arXiv:2501.15638, 2025

work page arXiv 2025

[36] [36]

Is disentanglement all you need? comparing concept-based & disentanglement approaches

Dmitry Kazhdan et al. Is disentanglement all you need? comparing concept-based & disentanglement approaches. arXiv:2104.06917, 2021

work page arXiv 2021

[37] [37]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors

Been Kim et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors. In ICML, 2018

work page 2018

[38] [38]

Probabilistic concept bottleneck models

Eunji Kim et al. Probabilistic concept bottleneck models. In ICML, 2023

work page 2023

[39] [39]

Disentangling by factorising

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018

work page 2018

[40] [40]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[41] [41]

Concept bottleneck models

Pang Wei Koh et al. Concept bottleneck models. In ICML, 2020

work page 2020

[42] [42]

Beyond concept bottleneck models: How to make black boxes intervenable? NeurIPS, 2024

Sonia Laguna et al. Beyond concept bottleneck models: How to make black boxes intervenable? NeurIPS, 2024

work page 2024

[43] [43]

Faithful vision-language interpretation via concept bottleneck models

Songning Lai et al. Faithful vision-language interpretation via concept bottleneck models. In ICLR, 2024

work page 2024

[44] [44]

Find: human-in-the-loop debugging deep text classifiers

Piyawat Lertvittayakumjorn et al. Find: human-in-the-loop debugging deep text classifiers. In EMNLP, 2020

work page 2020

[45] [45]

Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions

Oscar Li et al. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In AAAI, 2018

work page 2018

[46] [46]

On erroneous agreements of clip image embeddings

Siting Li et al. On erroneous agreements of clip image embeddings. arXiv:2411.05195, 2024

work page arXiv 2024

[47] [47]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In ECCV, 2024

work page 2024

[48] [48]

Deep learning face attributes in the wild

Ziwei Liu et al. Deep learning face attributes in the wild. In ICCV, 2015

work page 2015

[49] [49]

Towards learning to explain with concept bottleneck models: mitigating information leakage

Joshua Lockhart et al. Towards learning to explain with concept bottleneck models: mitigating information leakage. arXiv:2211.03656, 2022

work page arXiv 2022

[50] [50]

Promises and pitfalls of black-box concept learning models

Anita Mahinpei et al. Promises and pitfalls of black-box concept learning models. In Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI @ ICML, 2021

work page 2021

[51] [51]

GlanceNets: Interpretabile, Leak-proof Concept-based Models

Emanuele Marconato et al. GlanceNets: Interpretabile, Leak-proof Concept-based Models. In NeurIPS, 2022

work page 2022

[52] [52]

Interpretability is in the mind of the beholder: A causal framework for human- interpretable representation learning

Emanuele Marconato et al. Interpretability is in the mind of the beholder: A causal framework for human- interpretable representation learning. Entropy, 2023

work page 2023

[53] [53]

Not all neuro-symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts

Emanuele Marconato et al. Not all neuro-symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts. NeurIPS, 2024

work page 2024

[54] [54]

Do concept bottleneck models learn as intended? arXiv:2105.04289, 2021

Andrei Margeloiu et al. Do concept bottleneck models learn as intended? arXiv:2105.04289, 2021

work page arXiv 2021

[55] [55]

Evaluating the stability of semantic concept representations in CNNs for robust explainability

Georgii Mikriukov et al. Evaluating the stability of semantic concept representations in CNNs for robust explainability. In World Conference on Explainable Artificial Intelligence, 2023

work page 2023

[56] [56]

Lost in latent space: Examining failures of disentangled models at combinatorial generalisa- tion

Milton Montero et al. Lost in latent space: Examining failures of disentangled models at combinatorial generalisa- tion. NeurIPS, 2022

work page 2022

[57] [57]

Diconstruct: Causal concept-based explanations through black-box distillation

Ricardo Moreira et al. Diconstruct: Causal concept-based explanations through black-box distillation. arXiv:2401.08534, 2024

work page arXiv 2024

[58] [58]

Label-free concept bottleneck models

Tuomas Oikarinen et al. Label-free concept bottleneck models. In ICLR, 2023

work page 2023

[59] [59]

Concept-based explainable artificial intelligence: A survey

Eleonora Poeta et al. Concept-based explainable artificial intelligence: A survey. arXiv:2312.12936, 2023

work page arXiv 2023

[60] [60]

Learning transferable visual models from natural language supervision

Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021

[61] [61]

From causal to concept-based representation learning

Goutham Rajendran et al. From causal to concept-based representation learning. NeurIPS, 2024

work page 2024

[62] [62]

Do concept bottleneck models obey locality? In XAI in Action: Past, Present, and Future Applications, 2023

Naveen Raman et al. Do concept bottleneck models obey locality? In XAI in Action: Past, Present, and Future Applications, 2023

work page 2023

[63] [63]

Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery, 2024

Sukrut Rao et al. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery, 2024

work page 2024

[64] [64]

Unpacking large language models with conceptual consistency

Pritish Sahu et al. Unpacking large language models with conceptual consistency. arXiv:2209.15093, 2022

work page arXiv 2022

[65] [65]

C-senn: Contrastive self-explaining neural network

Yoshihide Sawada and Keigo Nakamura. C-senn: Contrastive self-explaining neural network. arXiv:2206.09575, 2022

work page arXiv 2022

[66] [66]

Concept bottleneck model with additional unsupervised concepts

Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 2022. 12 A PREPRINT - A PRIL 30, 2025

work page 2022

[67] [67]

New support vector algorithms

Bernhard Schölkopf et al. New support vector algorithms. Neural computation, 2000

work page 2000

[68] [68]

Toward causal representation learning

Bernhard Schölkopf et al. Toward causal representation learning. IEEE, 2021

work page 2021

[69] [69]

Concept bottleneck models without predefined concepts

Simon Schrodi et al. Concept bottleneck models without predefined concepts. arXiv:2407.03921, 2024

work page arXiv 2024

[70] [70]

Concept embedding analysis: A review

Gesina Schwalbe. Concept embedding analysis: A review. arXiv:2203.13909, 2022

work page arXiv 2022

[71] [71]

A closer look at the intervention procedure of concept bottleneck models

Sungbin Shin et al. A closer look at the intervention procedure of concept bottleneck models. In ICML, 2023

work page 2023

[72] [72]

VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance

Divyansh Srivastava et al. VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance. In NeurIPS, 2024

work page 2024

[73] [73]

Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations

Wolfgang Stammer et al. Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations. In CVPR, 2021

work page 2021

[74] [74]

Learning to intervene on concept bottlenecks

David Steinmann et al. Learning to intervene on concept bottlenecks. In ICML, 2024

work page 2024

[75] [75]

Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness

Raphael Suter et al. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In ICML, 2019

work page 2019

[76] [76]

Leveraging explanations in interactive machine learning: An overview

Stefano Teso et al. Leveraging explanations in interactive machine learning: An overview. Frontiers in Artificial Intelligence, 2023

work page 2023

[77] [77]

Stochastic concept bottleneck models

Moritz Vandenhirtz et al. Stochastic concept bottleneck models. arXiv:2406.19272, 2024

work page arXiv 2024

[78] [78]

The caltech-ucsd birds-200-2011 dataset

Catherine Wah et al. The caltech-ucsd birds-200-2011 dataset. 2011

work page 2011

[79] [79]

Leveraging sparse linear layers for debuggable deep networks

Eric Wong et al. Leveraging sparse linear layers for debuggable deep networks. In ICML, 2021

work page 2021

[80] [80]

Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

Yue Yang et al. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In CVPR, 2023

work page 2023