pith. sign in

arxiv: 2605.16405 · v1 · pith:RNFUEIKNnew · submitted 2026-05-13 · 💻 cs.CV

Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations

Pith reviewed 2026-05-20 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords concept bottleneck modelsvision-language modelsgaussian processesminimal annotationsmodel interpretabilityconcept calibrationactive learning
0
0 comments X

The pith

A Gaussian process in VLM embedding space propagates sparse human annotations to create more accurate concept predictions in bottleneck models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VH-CBM, a hybrid method that improves vision-language model guided concept bottleneck models by adding a small set of human-provided concept labels. It places a Gaussian process over the VLM embeddings so that the few expert annotations can be spread reliably to the rest of the data. This produces higher concept accuracy and better calibration than pure VLM guidance, even when only one percent of the data receives human labels, and it also enables active learning. A sympathetic reader would care because the approach lowers the practical cost of building transparent models without sacrificing their interpretability advantages.

Core claim

VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows how VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while sporting better concept calibration and supporting active learning.

What carries the argument

Gaussian Process in the VLM embedding space that propagates the expert's sparse supervision to unlabeled points.

If this is right

  • Concept bottleneck models become practical with very limited human labeling effort.
  • Downstream predictions gain reliability from improved concept calibration.
  • Active learning can be used to select the most informative points for the next round of annotation.
  • The hybrid method narrows the performance gap between fully automated VLM methods and fully supervised interpretable models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar label-propagation techniques could be tested in other weak-supervision settings where embedding spaces already exist.
  • Pairing the method with active learning loops might further cut annotation budgets in deployed systems.
  • Evaluating the same pipeline on medical or robotics image collections would test how far the embedding-space propagation generalizes.

Load-bearing premise

The Gaussian process operating in the VLM embedding space is assumed to capture useful global information about the target domain that allows reliable propagation of the expert's sparse supervision to unlabeled points.

What would settle it

A direct comparison experiment on held-out data where VH-CBM shows no improvement in concept prediction accuracy or calibration over VLM-guided CBMs when only 1% of annotations are provided.

Figures

Figures reproduced from arXiv: 2605.16405 by Andrea Passerini, Andrea Pugnana, Emanuele Marconato, Nicola Debole, Stefano Teso.

Figure 1
Figure 1. Figure 1: Top: VLM-CBMs rely solely on concepts provided by VLMs – either at inference time (as above) or for distilling a concept extractor g, cf. Section 2 – which can be inaccurate, compromising interpretability [Debole et al., 2026]. Bottom: VH-CBM exploits the VLM’s embedding space to propagate expert concept supervision to obtain concept statistics (i.e., µ and σ from the GP) for any test point, improving conc… view at source ↗
Figure 2
Figure 2. Figure 2: VH-CBM improves concept accuracy. F1(C) of VH-CBM with varying percentages of concept supervision for the CLIP (top) and DINO (bottom) backbones. We report average F1(C) over three runs, with confidence intervals estimated via bootstrap resampling across runs. on DINO) it performs on par with LP-@. We speculate this occurs because in these cases the concepts are linearly retrievable, meaning a simple linea… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between Active vs Random acquisition rule for the selection of new concepts to annotate. We tested two different backbones: CLIP (top) and DINO (bottom). While Random takes a pre-determined number of random concepts to annotate, Active uses the GP uncertainty to select which concepts to annotate. B Random concept selection VH-CBM We report results for the random annotation baseline, where at eac… view at source ↗
Figure 4
Figure 4. Figure 4: VH-CBM improves concept accuracy also with Random annotation strategy. F1(C) of VH-CBM with varying percentages of concept supervision for the CLIP (top) and DINO (bottom) backbones. Results are averaged over three runs, with confidence intervals estimated via bootstrap resampling across runs. Random acquisition with VH-CBM surpasses linear probing at early regimes in 6 to 8 cases, and wins at the last ste… view at source ↗
read the original abstract

Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts -- and the predictions they entail -- by learning these from concept-level annotations, which are however seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While greatly broadening applicability, doing so can yield lower quality concepts and therefore less interpretable models. We strike for a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows how VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while sporting better concept calibration and supporting active learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vision-plus-Human-guided Concept Bottleneck Models (VH-CBM). It augments VLM-guided CBMs with a small fraction (as low as 1%) of dense human concept annotations that are propagated to the full dataset via a Gaussian process operating in the VLM embedding space. The central empirical claim is that VH-CBM yields higher concept accuracy, improved calibration, and active-learning support relative to pure VLM-guided CBM baselines.

Significance. If the reported gains prove robust, the hybrid approach supplies a practical route to higher-quality concept bottlenecks without requiring full supervision. The use of GP label propagation to leverage global structure in VLM embeddings is a concrete technical contribution that could be adopted in other low-annotation regimes.

major comments (2)
  1. [§4] §4 (Empirical Evaluation): the headline result that VH-CBM outperforms VLM-guided CBMs at 1% annotation is presented without error bars, multiple random seeds, explicit train/validation/test splits, or statistical significance tests. These omissions make it impossible to judge whether the accuracy and calibration improvements are reliable or sensitive to post-hoc data choices.
  2. [§3.2] §3.2 (Gaussian Process propagation): the claim that the GP in VLM embedding space 'captures useful global information' and reliably propagates sparse labels is load-bearing for the hybrid advantage. No diagnostic is shown that distances in the embedding space correlate with concept similarity (e.g., nearest-neighbor concept agreement, manifold smoothness plots, or kernel ablation). If the embedding manifold is not smooth w.r.t. the target concepts, the GP posterior reduces to local interpolation and the reported gains over a VLM baseline plus the same 1% labels may vanish.
minor comments (2)
  1. [§3.2] Specify the exact VLM backbone, embedding dimensionality, and kernel hyperparameters (length-scale, variance) used for the GP; these choices directly affect propagation quality.
  2. [§4.3] Clarify how the active-learning acquisition function is defined and whether it operates on GP predictive variance or on downstream task loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will incorporate revisions to improve the rigor and clarity of the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Empirical Evaluation): the headline result that VH-CBM outperforms VLM-guided CBMs at 1% annotation is presented without error bars, multiple random seeds, explicit train/validation/test splits, or statistical significance tests. These omissions make it impossible to judge whether the accuracy and calibration improvements are reliable or sensitive to post-hoc data choices.

    Authors: We agree that the current empirical presentation would benefit from greater statistical rigor to demonstrate robustness. In the revised manuscript, we will update §4 to report results averaged over multiple random seeds (at least five), include error bars showing standard deviation, explicitly detail the train/validation/test splits used for each dataset, and add statistical significance tests (e.g., paired t-tests with p-values) comparing VH-CBM to the VLM-guided baselines. These changes will allow readers to better assess the reliability of the accuracy and calibration gains. revision: yes

  2. Referee: [§3.2] §3.2 (Gaussian Process propagation): the claim that the GP in VLM embedding space 'captures useful global information' and reliably propagates sparse labels is load-bearing for the hybrid advantage. No diagnostic is shown that distances in the embedding space correlate with concept similarity (e.g., nearest-neighbor concept agreement, manifold smoothness plots, or kernel ablation). If the embedding manifold is not smooth w.r.t. the target concepts, the GP posterior reduces to local interpolation and the reported gains over a VLM baseline plus the same 1% labels may vanish.

    Authors: We acknowledge that direct evidence linking embedding distances to concept similarity would strengthen the justification for using the GP in this space. While the consistent outperformance of VH-CBM over VLM-guided CBMs with identical 1% annotations provides indirect support that the GP exploits global structure beyond local interpolation, we agree an explicit diagnostic is warranted. In the revision, we will add to §3.2 a nearest-neighbor analysis showing concept agreement rates as a function of embedding distance, along with a brief kernel ablation comparing the RBF kernel to a local baseline to illustrate the value of the global component. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VH-CBM derivation or empirical claims

full rationale

The paper's central construction applies standard Gaussian process regression over fixed external VLM embeddings to propagate a small set of human annotations; the resulting concept predictions are generated by an independent kernel-based interpolator whose posterior is not algebraically identical to the input labels or VLM outputs. Empirical accuracy and calibration results are obtained by comparing these GP-augmented predictions against held-out ground truth and against pure VLM baselines, with no equations showing that reported improvements reduce by construction to quantities fitted on the evaluation set itself. No self-definitional steps, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central claim rests on the unstated assumption that VLM embeddings form a suitable metric space for Gaussian-process interpolation and that the target domain exhibits sufficient smoothness for propagation to succeed. No free parameters or invented entities are explicitly named in the provided text.

axioms (1)
  • domain assumption VLM embedding space captures useful global information about the target domain that permits reliable label propagation
    Invoked to justify the Gaussian process step; if false, sparse annotations cannot be effectively spread.

pith-pipeline@v0.9.0 · 5729 in / 1295 out tokens · 53705 ms · 2026-05-20T21:41:46.120073+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1]

    Unpacking large language models with conceptual consistency

    Pritish Sahu, Michael Cogswell, Yunye Gong, and Ajay Divakaran. Unpacking large language models with conceptual consistency.arXiv:2209.15093,

  2. [2]

    Christopher KI Williams and Carl Edward Rasmussen.Gaussian processes for machine learning, volume

    doi: 10.1007/s10994-026-06999-y. Christopher KI Williams and Carl Edward Rasmussen.Gaussian processes for machine learning, volume

  3. [3]

    Deferring concept bottleneck models: Learning to defer interventions to inaccurate experts.CoRR, abs/2503.16199,

    Andrea Pugnana, Riccardo Massidda, Francesco Giannini, Pietro Barbiero, Mateo Espinosa Zarlenga, Roberto Pellun- grini, Gabriele Dominici, Fosca Giannotti, and Davide Bacciu. Deferring concept bottleneck models: Learning to defer interventions to inaccurate experts.CoRR, abs/2503.16199,

  4. [4]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  5. [5]

    Post-hoc concept bottleneck models

    10 APREPRINT- MAY19, 2026 Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. InICLR,

  6. [6]

    Towards a Definition of Disentangled Representations

    Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations.arXiv:1812.02230,

  7. [7]

    Approximations for binary gaussian process classification.Journal of Machine Learning Research, 9(10):2035–2078,

    Hannes Nickisch, Carl Edward Rasmussen, et al. Approximations for binary gaussian process classification.Journal of Machine Learning Research, 9(10):2035–2078,

  8. [8]

    Gaussian Processes for Big Data

    James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data.arXiv preprint arXiv:1309.6835,

  9. [9]

    DINOv3

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104,

  10. [10]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset

  11. [11]

    Understanding intermediate layers using linear classifier probes

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10(1):41, 2023b. Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1...

  12. [12]

    Concept-based explainable artificial intelligence: A survey

    Eleonora Poeta, Gabriele Ciravegna, Eliana Pastor, Tania Cerquitelli, and Elena Baralis. Concept-based explainable artificial intelligence: A survey.arXiv:2312.12936,

  13. [13]

    A comprehensive survey on self-interpretable neural networks

    11 APREPRINT- MAY19, 2026 Yang Ji, Ying Sun, Yuting Zhang, Zhigaoyuan Wang, Yuanxin Zhuang, Zheng Gong, Dazhong Shen, Chuan Qin, Hengshu Zhu, and Hui Xiong. A comprehensive survey on self-interpretable neural networks.arXiv:2501.15638,

  14. [14]

    Nonparametric identification of latent concepts.arXiv preprint arXiv:2510.00136,

    Yujia Zheng, Shaoan Xie, and Kun Zhang. Nonparametric identification of latent concepts.arXiv preprint arXiv:2510.00136,

  15. [15]

    Dcbm: Data-efficient visual concept bottleneck models.arXiv:2412.11576,

    Patrick Knab, Katharina Prasse, Sascha Marton, Christian Bartelt, and Margret Keuper. Dcbm: Data-efficient visual concept bottleneck models.arXiv:2412.11576,

  16. [16]

    Bayesian concept bottleneck models with llm priors

    Jean Feng, Avni Kothari, Luke Zier, Chandan Singh, and Yan Shuo Tan. Bayesian concept bottleneck models with llm priors.arXiv:2410.15555,

  17. [17]

    Do concept bottleneck models learn as intended? arXiv:2105.04289, 2021

    Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended?arXiv:2105.04289,

  18. [18]

    Is disentanglement all you need? comparing concept-based & disentanglement approaches

    Dmitry Kazhdan, Botty Dimanov, Helena Andres Terre, Mateja Jamnik, Pietro Liò, and Adrian Weller. Is disentan- glement all you need? comparing concept-based & disentanglement approaches.arXiv preprint arXiv:2104.06917,

  19. [19]

    Enhancing concept localization in clip-based concept bottleneck models.arXiv preprint arXiv:2510.07115,

    Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Goran Frehse, and Gianni Franchi. Enhancing concept localization in clip-based concept bottleneck models.arXiv preprint arXiv:2510.07115,

  20. [20]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

  21. [21]

    We choose 40 as the initial value to keep annotation costs low while providing enough samples for the GP to begin propagating information

    A Pipeline The choice of the number of annotated samples to start with, as well as the number of concepts to annotate at each step, is arbitrary. We choose 40 as the initial value to keep annotation costs low while providing enough samples for the GP to begin propagating information. We tested other values with consistent results. The same applies to the ...

  22. [22]

    A.1 Uncertainty-based Acquisition Function More specifically, our active acquisition function begins by randomly sampling a subset of the training data. This promotes exploration and avoids clustering annotations around uncertain samples that are nonetheless close to one another in the embedding space, which would be inefficient in terms of annotation cos...

  23. [23]

    We observe thatVH-CBMgives the best calibration results across datasets, except in Shapes3d where it ranks second-best and LP-@ demonstrates particularly competitive

    It should be noted that calibration is measured with respect to the concept predictions — that is, the output of the backbone — within each CBM variant, rather than with respect to the final label predictions. We observe thatVH-CBMgives the best calibration results across datasets, except in Shapes3d where it ranks second-best and LP-@ demonstrates partic...