arxiv: 2509.20899 · v3 · submitted 2025-09-25 · 💻 cs.CV

Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

Pith reviewed 2026-05-18 14:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords concept bottleneck models · video classification · interpretable AI · temporal modeling · vision-language models · transformer · self-attention

The pith

MoTIF adds per-concept temporal self-attention to concept bottlenecks for video classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Concept bottleneck models make predictions interpretable by routing them through human-understandable intermediate concepts. Videos introduce the extra requirement of tracking how those concepts evolve and recur across frames. The paper introduces MoTIF, which feeds sequences of concept activations into a transformer that applies self-attention separately to each concept's timeline. Concepts themselves are discovered automatically by a class-conditioned vision-language model that pulls object and action descriptions from the training videos. The resulting model outperforms static global concept bottlenecks on video benchmarks and stays competitive with other interpretable approaches while reducing the accuracy gap to black-box video classifiers.

Core claim

The paper claims that structuring video predictions around sequences of temporally grounded concept activations, discovered without manual labels by a class-conditioned VLM and modeled via per-concept temporal self-attention, produces an interpretable classifier that improves over global concept bottlenecks while remaining competitive in the interpretable setting and narrowing the gap to strong black-box baselines.

What carries the argument

Per-concept temporal self-attention operating on sequences of concept activations inside the Moving Temporal Interpretable Framework (MoTIF), paired with class-conditioned VLM concept discovery.
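
As a rough illustration of what "per-concept temporal self-attention" can mean, the sketch below attends over time separately for each concept channel, so no information is mixed across concepts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `per_concept_attention` and the scalar projections `wq`, `wk`, `wv` (one per concept) are hypothetical simplifications of whatever embedding the paper uses.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def per_concept_attention(acts, wq, wk, wv):
    """acts[t][c]: activation of concept c at time step t.
    wq, wk, wv: one scalar projection per concept (hypothetical parameters).
    Returns (out, attn): out[t][c] is the attended activation and
    attn[c] is a T x T attention map for concept c."""
    T, C = len(acts), len(acts[0])
    out = [[0.0] * C for _ in range(T)]
    attn = []
    for c in range(C):
        seq = [acts[t][c] for t in range(T)]      # timeline of one concept
        q = [wq[c] * x for x in seq]
        k = [wk[c] * x for x in seq]
        v = [wv[c] * x for x in seq]
        # Attention only compares time steps of the same concept.
        A = [softmax([q[t] * k[u] for u in range(T)]) for t in range(T)]
        for t in range(T):
            out[t][c] = sum(A[t][u] * v[u] for u in range(T))
        attn.append(A)
    return out, attn

acts = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.4]]       # 3 time steps, 2 concepts
out, attn = per_concept_attention(acts, wq=[2.0, 1.0], wk=[2.0, 1.0], wv=[1.0, 1.0])
```

Because each concept is processed in its own loop iteration, changing the timeline of one concept leaves the outputs of every other concept untouched, which is the property that keeps each attention map attributable to a single concept.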

If this is right

  • The model improves accuracy over global, non-temporal concept bottlenecks on multiple video benchmarks.
  • Performance stays competitive with other methods inside the interpretable concept-bottleneck category.
  • The gap to strong black-box video classification baselines narrows while retaining interpretability.
  • Concept timelines become available for inspection without requiring any manual concept labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention weights over each concept's timeline could be inspected to identify the exact frames that most influence a prediction.
  • The same per-concept temporal structure might transfer to other time-series tasks such as audio event classification if suitable concept extractors exist.
  • Automatic concept discovery could lower the barrier for applying interpretable bottlenecks to new video domains where experts have not predefined concepts.
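
The first extension above could be prototyped roughly as follows. This is a hedged sketch: it assumes a T x T attention map for one concept is already available as a list of rows, and it treats the average attention a frame receives as a proxy for its influence, which attention weights do not guarantee. The helper names `frame_importance` and `top_frames` are hypothetical.

```python
def frame_importance(attn):
    """Average attention received by each time step u (column mean)."""
    T = len(attn)
    return [sum(attn[t][u] for t in range(T)) / T for u in range(T)]

def top_frames(attn, k=1):
    """Indices of the k time steps that receive the most attention."""
    imp = frame_importance(attn)
    return sorted(range(len(imp)), key=lambda u: imp[u], reverse=True)[:k]

# Toy attention map for a single concept; each row sums to 1.
attn = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
]
best = top_frames(attn, k=1)
```

In this toy map the middle time step receives most of the attention mass, so it is the one returned.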

Load-bearing premise

The class-conditioned VLM extracts object- and action-centric textual concepts from training videos that are complete and accurate enough for the downstream temporal modeling to work without manual annotation.
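
One way to make "complete and accurate enough" measurable is to compare discovered concepts against a human-annotated set. The sketch below is illustrative only: the concept strings are placeholders rather than data from the paper, exact string matching after normalization is a crude stand-in for semantic matching, and the function name `concept_recall_precision` is hypothetical.

```python
def normalize(concept):
    # Lowercase and collapse whitespace so trivially different strings match.
    return " ".join(concept.lower().split())

def concept_recall_precision(discovered, annotated):
    """Recall: share of annotated concepts that were discovered.
    Precision: share of discovered concepts that are annotated."""
    d = {normalize(c) for c in discovered}
    a = {normalize(c) for c in annotated}
    hits = d & a
    recall = len(hits) / len(a) if a else 0.0
    precision = len(hits) / len(d) if d else 0.0
    return recall, precision

# Placeholder concept lists, for illustration only.
discovered = ["Paddle", "kayak", "water  splash", "helmet"]
annotated = ["paddle", "kayak", "life jacket", "water splash"]
recall, precision = concept_recall_precision(discovered, annotated)
```

Recall speaks to completeness and precision to accuracy, so reporting both would address the two halves of the premise separately.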

What would settle it

Replace the VLM-discovered concepts with a fixed set of generic or random textual descriptors unrelated to the video classes and measure whether accuracy falls to the level of non-temporal concept bottlenecks; if accuracy remains high, the claim that the specific discovered concepts plus temporal modeling are responsible would be falsified.

Figures

Figures reproduced from arXiv: 2509.20899 by Christian Bartelt, Drago Guggiana, Patrick Knab, Philipp J. Schubert, Sascha Marton.

Figure 1
Figure 1: MoTIF. The framework takes videos as input and produces local concept explanations for local windows, global explanations for entire videos, and temporal dependency maps from the attention heads of the transformer module. Model represents MoTIF (ViT-L14) and sample frames are from HMDB51 (Kuehne et al., 2011), licensed under CC BY 4.0.
Figure 2
Figure 2: MoTIF pipeline. Videos are embedded with a vision–language backbone and mapped to concept activations via cosine similarity. Per-channel temporal self-attention models dynamics independently for each concept, followed by a nonnegative affine transformation and classification. MoTIF enables explanations across three views: global concepts, local concepts, and temporal dependencies. Sample frames from SSv2 …
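
The first pipeline step in the caption, mapping embeddings to concept activations via cosine similarity, can be sketched as below. This is a minimal sketch under stated assumptions: the embeddings are tiny hand-written vectors standing in for the outputs of a vision-language backbone, and `concept_activations` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concept_activations(frame_embs, concept_embs):
    """Returns acts[t][c]: similarity of frame (or window) t to concept c."""
    return [[cosine(f, c) for c in concept_embs] for f in frame_embs]

# Stand-in embeddings: 2 frames and 2 concept texts in a 3-d space.
frame_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
concept_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]]
acts = concept_activations(frame_embs, concept_embs)
```

The resulting T x C matrix is the sequence of concept activations that the temporal module would consume, one column per concept timeline.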
Figure 3
Figure 3: Effect of log-sum-exp temperature τ on accuracy and entropy. Accuracy is stable across τ, while both concept- and logit-level entropy decrease as τ increases, yielding sharper time-importance distributions.
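
The relationship between τ and entropy in this caption can be reproduced on toy numbers. The sketch assumes τ multiplies the scores, i.e. pooling of the form (1/τ)·log Σ_t exp(τ·x_t) with time weights softmax(τ·x); that convention is chosen here because it matches the caption's statement that larger τ gives lower entropy, and it may differ from the paper's exact definition.

```python
import math

def softmax(xs, tau):
    # Time-importance weights; larger tau concentrates mass on the peak.
    m = max(xs)
    exps = [math.exp(tau * (x - m)) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def lse_pool(xs, tau):
    # Stable log-sum-exp pooling over time: (1/tau) * log(sum(exp(tau * x))).
    m = max(xs)
    return m + math.log(sum(math.exp(tau * (x - m)) for x in xs)) / tau

def entropy(ps):
    # Shannon entropy (natural log) of a probability vector.
    return -sum(p * math.log(p) for p in ps if p > 0.0)

activations = [0.1, 0.9, 0.3, 0.2]   # toy timeline of one concept
for tau in (1.0, 5.0, 20.0):
    weights = softmax(activations, tau)
    print(tau, round(lse_pool(activations, tau), 3), round(entropy(weights), 3))
```

As τ grows, the pooled value approaches the maximum activation and the weight distribution over time steps becomes sharper.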
Figure 4
Figure 4: Full vs. diagonal attention. Train and test accuracy with and without enforcing diagonal attention over five seeds. Temperature τ. We vary the log-sum-exp pooling temperature τ to assess its effect on both accuracy and the sharpness of the time-importance distribution. Sharpness is quantified by the entropy of the softmax weights, computed either at the concept or at the logits level. Experiments show (s…
Figure 5
Figure 5: Concept set influence. Distribution of test accuracy with five different concept sets. To investigate the influence of the concept proposal set on the performance of MoTIF, we prompt GPT-5 five times, each time requesting a distinct set of candidate concepts (with some natural overlap across runs). All resulting concept sets for the five datasets are listed in Appendix F. Our expe…
Figure 6
Figure 6: MoTIF explanations. Example videos from Breakfast and SSv2 with correct classifications, illustrating the three explanation modes supported by MoTIF (ViT-L14). … most salient concepts are paddle and kayak, which remain stable over the short clip. The attention map for paddle shows a more uniform distribution across time, with slight emphasis at u = 3 and u = 6, indicating consistent expression of the concep…
Figure 7
Figure 7: MoTIF explanations. Example videos from all datasets with incorrect classifications. The first example, from Pirates of the Caribbean, shows Barbossa eating an apple before throwing it away. Our model predicts eat, while the ground truth is throw. Global and local concept attributions reveal that the concept eat, triggered by the apple, was most strongly activated. The corresponding attention map illustrat…
Figure 8
Figure 8: Architectural choices. Effects of non-negativity and weight decay. For completeness, we report further ablation studies that complement the main paper (Section 3.2.2). Nonnegativity. The non-negativity constraint on classifier weights improves interpretability by ensuring class predictions are explained solely by positive concept evidence (Zhang et al., 2021). We enforce non-negativity of classifier weigh…
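
The source text is cut off before it says how non-negativity is enforced, so the sketch below shows only one common option, projecting weights onto the non-negative orthant by clamping, and should not be read as the paper's method. It also assumes the concept evidence fed to the classifier is itself non-negative. All names (`project_nonneg`, `logits`, `contributions`) are hypothetical.

```python
def project_nonneg(W):
    # Clamp every classifier weight at zero (one possible constraint, assumed).
    return [[max(0.0, w) for w in row] for row in W]

def logits(concepts, W, b):
    # Linear classifier on pooled concept evidence: one row of W per class.
    return [sum(w * c for w, c in zip(row, concepts)) + bias
            for row, bias in zip(W, b)]

def contributions(concepts, W, k):
    # Per-concept additive contribution to the logit of class k.
    return [w * c for w, c in zip(W[k], concepts)]

W = project_nonneg([[0.5, -0.2], [0.0, 1.5]])   # 2 classes, 2 concepts
concepts = [0.4, 0.6]                            # assumed non-negative evidence
scores = logits(concepts, W, b=[0.0, 0.0])
parts = contributions(concepts, W, k=1)
```

With non-negative weights and non-negative evidence, every entry of `parts` is at least zero, so a class score can only be explained by concepts that speak for it, never by the absence of others.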
Figure 9
Figure 9: Effect of the per-concept affine transformation. Accuracy improves marginally, while entropy in both logits and concept activations decreases, suggesting that the affine block stabilizes and sharpens the CBM's internal representations. Classifier sparsity. The sparsity penalty on classifier weights has a pronounced impact on test accuracy, as shown in Figure 10a. Larger values of λℓ1 consistently reduce ac…
Figure 10
Figure 10: Architectural choices. Effects of classifier sparsity, activation sparsity, and learning rate. Activation sparsity. The activation sparsity penalty shows little variation in test accuracy (see Figure 10b) across different values of λsparse, except for Breakfast. We attribute this to the comparatively long video sequences in that dataset. Accordingly, we use λsparse = 10^-3 for all experiments, except for…
Figure 11
Figure 11: Window size influence. Test accuracy across different window sizes. Window size. We evaluate MoTIF with varying temporal input lengths to study robustness to sequence duration and efficiency trade-offs. While increasing the number of frames provides more temporal context, it also raises memory requirements and may not yield consistent accuracy gains. For comparability, we fix the batch size to 8 across …
Figure 12
Figure 12: … illustrates, the influence of the seed depends on the dataset. For HMDB51 the effect is negligible, whereas for Breakfast, particularly with the RN/50 backbone, we observe noticeably higher variance. Nevertheless, MoTIF consistently outperforms the linear probe. Moreover, the figure indicates that the reported numbers in …
Original abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MoTIF, a Moving Temporal Interpretable Framework for video classification. It extends concept bottleneck models to videos by using a class-conditioned VLM to automatically discover object- and action-centric textual concepts from training videos, then feeding sequences of concept activations into a transformer with per-concept temporal self-attention to model temporal dynamics. The paper reports that this approach improves upon global concept bottlenecks on several video benchmarks, stays competitive among interpretable models, and reduces the performance gap to black-box video classifiers.

Significance. If the central modeling choices prove robust, the work could meaningfully advance interpretable video classification by incorporating temporal dynamics into concept-based predictions while avoiding manual concept annotation. The public code release is a clear strength for reproducibility.

major comments (2)
  1. [§3.2] §3.2 (Concept Discovery Module): the claim that the class-conditioned VLM yields object- and action-centric concepts that are 'sufficiently complete and accurate' for downstream temporal modeling is load-bearing, yet the manuscript provides no quantitative fidelity metrics such as recall against human-annotated action/object sets or performance sensitivity when concepts are replaced or perturbed.
  2. [§4] §4 (Experiments and Tables): reported gains over global (non-temporal) concept bottlenecks are presented without error bars, multi-seed statistics, or an ablation that isolates per-concept temporal self-attention from the VLM discovery step; this leaves open whether the temporal component drives the improvement or whether gains are attributable to VLM supervision alone.
minor comments (2)
  1. [§3.1] Notation for concept activation sequences and the per-concept attention mask could be clarified with an explicit equation or pseudocode block.
  2. [Figure 1] Figure 1 (architecture diagram): the flow from VLM concept extraction to temporal transformer layers would benefit from explicit time-axis annotations.
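
For the notation requested in the first minor comment, one possible explicit form is given below. It is a sketch under assumed notation, not the paper's: $T$ time steps, $C$ concepts, a visual encoder $f$, a text encoder $g$, frame or window $v_t$, concept text $c_k$, and per-concept query and key matrices $Q^{(k)}, K^{(k)} \in \mathbb{R}^{T \times d}$ are all introduced here for illustration.

```latex
\begin{align}
  x_{t,k} &= \cos\!\big(f(v_t),\, g(c_k)\big),
    \qquad t = 1,\dots,T,\quad k = 1,\dots,C, \\
  \mathbf{x}_k &= (x_{1,k}, \dots, x_{T,k})^{\top} \in \mathbb{R}^{T}, \\
  A^{(k)} &= \operatorname{softmax}\!\left(
      \frac{Q^{(k)} \big(K^{(k)}\big)^{\top}}{\sqrt{d}} \right)
    \in \mathbb{R}^{T \times T}, \\
  M_{(t,k),(t',k')} &=
    \begin{cases}
      0 & \text{if } k = k', \\
      -\infty & \text{if } k \neq k'.
    \end{cases}
\end{align}
```

Read this way, the per-concept attention mask $M$ is added to the attention scores before the softmax, so a token for concept $k$ at time $t$ can only attend to tokens of the same concept at other time steps, and each $A^{(k)}$ is a temporal dependency map for exactly one concept.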

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and commit to incorporating revisions that enhance the rigor of our experimental analysis and concept evaluation.

Point-by-point responses
  1. Referee: §3.2 (Concept Discovery Module): the claim that the class-conditioned VLM yields object- and action-centric concepts that are 'sufficiently complete and accurate' for downstream temporal modeling is load-bearing, yet the manuscript provides no quantitative fidelity metrics such as recall against human-annotated action/object sets or performance sensitivity when concepts are replaced or perturbed.

    Authors: We agree that providing quantitative metrics for the concept discovery module would better support our claims regarding the sufficiency of the discovered concepts. Although the primary evaluation in the manuscript focuses on end-to-end classification performance, we recognize the value of direct assessment. In the revised version, we will add quantitative fidelity metrics, including recall of discovered concepts against human-annotated object and action sets on applicable benchmarks. We will also include a sensitivity analysis by systematically replacing or perturbing subsets of concepts and reporting the resulting changes in model performance. This will help validate the completeness and accuracy of the VLM-based discovery process.

    Revision: yes

  2. Referee: §4 (Experiments and Tables): reported gains over global (non-temporal) concept bottlenecks are presented without error bars, multi-seed statistics, or an ablation that isolates per-concept temporal self-attention from the VLM discovery step; this leaves open whether the temporal component drives the improvement or whether gains are attributable to VLM supervision alone.

    Authors: We appreciate this observation, as it highlights the need for more robust statistical reporting and targeted ablations. The current manuscript reports results from single experimental runs without variance estimates. To address this, we will re-run all experiments across multiple random seeds (e.g., 5 seeds) and include error bars along with mean and standard deviation in the updated tables. Additionally, we will introduce a new ablation study that fixes the VLM-discovered concepts and compares the full temporal self-attention model against a non-temporal baseline (such as mean-pooling of concept activations over time). This will isolate the contribution of the per-concept temporal self-attention mechanism from the benefits of the concept discovery alone.

    Revision: yes
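
The two ingredients named in this response, a mean-pooling baseline and multi-seed reporting, are simple enough to sketch. This is an illustration under stated assumptions: `mean_pool` and `summarize` are hypothetical helpers, and the accuracy values are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev

def mean_pool(acts):
    """Non-temporal baseline: average each concept's activation over time.
    acts[t][c] is the activation of concept c at time step t."""
    T = len(acts)
    return [sum(acts[t][c] for t in range(T)) / T for c in range(len(acts[0]))]

def summarize(accs):
    """Mean and sample standard deviation of accuracies across seeds."""
    return mean(accs), stdev(accs)

acts = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.4]]   # 3 time steps, 2 concepts
pooled = mean_pool(acts)

# Placeholder accuracies for 5 seeds, for illustration only.
placeholder_accs = [0.70, 0.72, 0.71, 0.69, 0.73]
m, s = summarize(placeholder_accs)
print(f"{m:.3f} +/- {s:.3f}")
```

Mean pooling gives the same vector for any ordering of the time steps, which is what makes it a useful control: any gain of the temporal model over it, on identical concepts, cannot come from the concept set.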

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical architecture (MoTIF) whose performance claims are evaluated via standard benchmark comparisons on video datasets against global CBM baselines and black-box models. The class-conditioned VLM concept discovery is described as an independent upstream module that produces textual concepts fed into per-concept temporal self-attention; no equations define the downstream predictions in terms of the discovery step itself, nor are any fitted parameters from the model renamed as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central results. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the untested premise that VLM-extracted concepts are temporally expressive and sufficient; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Vision-language models conditioned on class labels can extract object- and action-centric textual concepts from video frames that capture the information needed for classification.
    Invoked in the description of the concept discovery module.

pith-pipeline@v0.9.0 · 5710 in / 1198 out tokens · 44836 ms · 2026-05-18T14:34:01.400806+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 3 internal anchors

  2. [2]

    Is space-time attention all you need for video understanding?

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, p. 4, 2021

  3. [3]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network, 2025. URL https...

  4. [4]

    Interactive concept bottleneck models

    Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, and Krishnamurthy Dvijotham. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5948–5955, 2023

  5. [5]

    Planning with reasoning using vision language world model

    Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025

  6. [6]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  7. [7]

    Towards automatic concept-based explanations

    Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept-based explanations. Advances in neural information processing systems, 32, 2019

  8. [8]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850, 2017

  9. [9]

    Hierarchical explanations for video action recognition

    Sadaf Gulshad, Teng Long, and Nanne van Noord. Hierarchical explanations for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3703–3708, 2023

  10. [10]

    Self-attention attribution: Interpreting information interactions inside transformer

    Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12963–12971, 2021

  11. [11]

    Addressing leakage in concept bottleneck models

    Marton Havasi, Sonali Parbhoo, and Finale Doshi-Velez. Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems, 35:23386–23397, 2022

  12. [12]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016

  13. [13]

    Concept bottleneck language models for protein design

    Aya Abdelsalam Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Don Stanton, Hector Corrada Bravo, Kyunghyun Cho, and Nathan C. Frey. Concept bottleneck language models for protein design. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Yt9CFhOOFe

  14. [14]

    Automatic concept extraction for concept bottleneck-based video classification

    Jeya Vikranth Jeyakumar, Luke Dickens, Yu-Hsi Cheng, Joseph Noor, Luis Antonio Garcia, Diego Ramirez Echavarria, Alessandra Russo, Lance M. Kaplan, and Mani Srivastava. Automatic concept extraction for concept bottleneck-based video classification, 2022. URL https://openreview.net/forum?id=66kgCIYQW3

  15. [15]

    Spatial-temporal concept based explanation of 3d convnets

    Ying Ji, Yu Wang, and Jien Kato. Spatial-temporal concept based explanation of 3d convnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15444–15453, 2023

  16. [16]

    Beyond pixels: Enhancing LIME with hierarchical features and segmentation foundation models

    Patrick Knab, Sascha Marton, and Christian Bartelt. Beyond pixels: Enhancing LIME with hierarchical features and segmentation foundation models. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025a. URL https://openreview.net/forum?id=JHs5p6nPbG

  17. [17]

    Which LIME should I trust? Concepts, challenges, and solutions

    Patrick Knab, Sascha Marton, Udo Schlegel, and Christian Bartelt. Which LIME should I trust? Concepts, challenges, and solutions, 2025b. URL https://arxiv.org/abs/2503.24365

  18. [18]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedin...

  19. [19]

    Understanding video transformers via universal concept discovery

    Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G Derpanis, and Pavel Tokmakov. Understanding video transformers via universal concept discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10946–10956, 2024

  20. [20]

    HMDB: a large video database for human motion recognition

    H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011

  21. [21]

    The language of actions: Recovering the syntax and semantics of goal-directed human activities

    H. Kuehne, A. B. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of Computer Vision and Pattern Recognition Conference (CVPR), 2014

  22. [22]

    Pcbear: Pose concept bottleneck for explainable action recognition

    Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, and Jinwoo Choi. Pcbear: Pose concept bottleneck for explainable action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2690–2699, June 2025

  23. [23]

    Tsm: Temporal shift module for efficient video understanding

    Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093, 2019

  24. [24]

    No frame left behind: Full video action recognition

    Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, and Jan C. van Gemert. No frame left behind: Full video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14892–14901, June 2021

  25. [25]

    Something-else: Compositional action recognition with spatial-temporal interaction networks

    Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. pp. 1049–1059, 2020

  26. [26]

    Visual classification via description from large language models

    Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In International Conference on Learning Representations, 2023

  27. [27]

    Interpretable machine learning – a brief history, state-of-the-art and challenges

    Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl. Interpretable machine learning – a brief history, state-of-the-art and challenges. In Irena Koprinska, Michael Kamp, Annalisa Appice, Corrado Loglisci, Luiza Antonie, Albrecht Zimmermann, Riccardo Guidotti, Özlem Özgöbek, Rita P. Ribeiro, Ricard Gavaldà, João Gama, Linara Adilova...

  28. [28]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

  29. [29]

    A survey on video-based human action recognition: recent updates, datasets, challenges, and applications

    Preksha Pareek and Ankit Thakkar. A survey on video-based human action recognition: recent updates, datasets, challenges, and applications. Artificial Intelligence Review, 54(3):2259–2322, 2021

  30. [30]

    DCBM: Data-efficient visual concept bottleneck models

    Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, and Margret Keuper. DCBM: Data-efficient visual concept bottleneck models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=BdO4R6XxUH

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021

  32. [32]

    Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery

    Sukrut Rao, Sweta Mahajan, Moritz Böhle, and Bernt Schiele. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, pp. 444–461, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-72979-...

  33. [33]

    Exploring explainability in video action recognition

    Avinab Saha, Shashank Gupta, Sravan Kumar Ankireddy, Karl Chahine, and Joydeep Ghosh. Exploring explainability in video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 8176–8181, June 2024

  34. [34]

    Concept bottleneck model with additional unsupervised concepts

    Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 10:41758–41765, 2022

  35. [35]

    Selective concept bottleneck models without predefined concepts

    S. Schrodi, J. Schur, M. Argus, and T. Brox. Selective concept bottleneck models without predefined concepts. Transactions on Machine Learning Research (TMLR), May 2025. URL http://lmb.informatik.uni-freiburg.de/Publications/2025/SAB25

  36. [36]

    A dataset of 101 human action classes from videos in the wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11):1–7, 2012

  37. [37]

    Concept bottleneck large language models

    Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept bottleneck large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=RC5FPYVQaH

  38. [38]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022

  39. [39]

    Learning spatiotemporal features with 3d convolutional networks

    Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015

  40. [40]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  41. [41]

    Cmf-transformer: cross-modal fusion transformer for human action recognition

    Jun Wang, Limin Xia, and Xin Wen. Cmf-transformer: cross-modal fusion transformer for human action recognition. Mach. Vision Appl., 35(5), August 2024a. ISSN 0932-8092. doi:10.1007/s00138-024-01598-0. URL https://doi.org/10.1007/s00138-024-01598-0

  42. [42]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549–14560, 2023

  43. [43]

    Revisiting multiple instance neural networks

    Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern Recognition, 74:15–24, 2018

  44. [44]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pp. 396–416. Springer, 2024b

  45. [45]

    Learning optimal summaries of clinical time-series with concept bottleneck models

    Carissa Wu, Sonali Parbhoo, Marton Havasi, and Finale Doshi-Velez. Learning optimal summaries of clinical time-series with concept bottleneck models. In Zachary Lipton, Rajesh Ranganath, Mark Sendak, Michael Sjoding, and Serena Yeung (eds.), Proceedings of the 7th Machine Learning for Healthcare Conference, volume 182 of Proceedings of Machine Learning Re...

  46. [46]

    Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

    Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19187–19197, 2023

  47. [47]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11975–11986, October 2023

  48. [48]

    Invertible concept-based explanations for cnn models with non-negative concept activation vectors

    Ruihan Zhang, Prashan Madumal, Tim Miller, Krista A Ehinger, and Benjamin IP Rubinstein. Invertible concept-based explanations for cnn models with non-negative concept activation vectors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11682–11690, 2021
