Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

Christian Bartelt; Drago Guggiana; Patrick Knab; Philipp J. Schubert; Sascha Marton

arXiv: 2509.20899 · v3 · submitted 2025-09-25 · 💻 cs.CV

Pith reviewed 2026-05-18 14:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: concept bottleneck models · video classification · interpretable AI · temporal modeling · vision-language models · transformer · self-attention

The pith

MoTIF adds per-concept temporal self-attention to concept bottlenecks for video classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Concept bottleneck models make predictions interpretable by routing them through human-understandable intermediate concepts. Videos introduce the extra requirement of tracking how those concepts evolve and recur across frames. The paper introduces MoTIF, which feeds sequences of concept activations into a transformer that applies self-attention separately to each concept's timeline. Concepts themselves are discovered automatically by a class-conditioned vision-language model that pulls object and action descriptions from the training videos. The resulting model outperforms static global concept bottlenecks on video benchmarks and stays competitive with other interpretable approaches while reducing the accuracy gap to black-box video classifiers.

Core claim

The paper claims that structuring video predictions around sequences of temporally grounded concept activations, discovered without manual labels by a class-conditioned VLM and modeled via per-concept temporal self-attention, produces an interpretable classifier that improves over global concept bottlenecks while remaining competitive in the interpretable setting and narrowing the gap to strong black-box baselines.

What carries the argument

Per-concept temporal self-attention operating on sequences of concept activations inside the Moving Temporal Interpretable Framework (MoTIF), paired with class-conditioned VLM concept discovery.
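To make the mechanism concrete, here is a minimal sketch of per-concept temporal self-attention in plain Python. It is not the authors' implementation: it assumes a single attention head, one scalar activation per concept per time step, and hypothetical scalar projection weights `wq`, `wk`, `wv`. The concept names and activation values are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def per_concept_attention(timeline, wq=1.0, wk=1.0, wv=1.0):
    """Single-head self-attention over ONE concept's activation timeline.

    timeline: list of T scalar activations for a single concept.
    wq, wk, wv: hypothetical scalar projections (an assumption, not from the paper).
    Returns (outputs, attn), where attn[t][s] is the weight step t puts on step s.
    """
    q = [wq * a for a in timeline]
    k = [wk * a for a in timeline]
    v = [wv * a for a in timeline]
    attn = [softmax([q_t * k_s for k_s in k]) for q_t in q]
    outputs = [sum(w * v_s for w, v_s in zip(row, v)) for row in attn]
    return outputs, attn

def per_concept_temporal_attention(activations):
    """Run attention independently per concept, so concepts never mix."""
    return {name: per_concept_attention(tl) for name, tl in activations.items()}

# Illustrative activations only (T = 4 time steps, two concepts).
activations = {
    "paddle": [0.8, 0.7, 0.9, 0.8],
    "kayak": [0.1, 0.6, 0.7, 0.2],
}
result = per_concept_temporal_attention(activations)
```

Because each timeline is processed on its own, every attention map `result[name][1]` describes temporal dependencies of exactly one concept, which is what keeps the temporal view readable per concept.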

If this is right

The model improves accuracy over global, non-temporal concept bottlenecks on multiple video benchmarks.
Performance stays competitive with other methods inside the interpretable concept-bottleneck category.
The gap to strong black-box video classification baselines narrows while retaining interpretability.
Concept timelines become available for inspection without requiring any manual concept labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Attention weights over each concept's timeline could be inspected to identify the exact frames that most influence a prediction.
The same per-concept temporal structure might transfer to other time-series tasks such as audio event classification if suitable concept extractors exist.
Automatic concept discovery could lower the barrier for applying interpretable bottlenecks to new video domains where experts have not predefined concepts.

Load-bearing premise

The class-conditioned VLM extracts object- and action-centric textual concepts from training videos that are complete and accurate enough for the downstream temporal modeling to work without manual annotation.

What would settle it

Replace the VLM-discovered concepts with a fixed set of generic or random textual descriptors unrelated to the video classes and measure whether accuracy falls to the level of non-temporal concept bottlenecks; if accuracy remains high, the claim that the specific discovered concepts plus temporal modeling are responsible would be falsified.

Figures

Figures reproduced from arXiv: 2509.20899 by Christian Bartelt, Drago Guggiana, Patrick Knab, Philipp J. Schubert, Sascha Marton.

**Figure 1.** MoTIF. The framework takes videos as input and produces local concept explanations for local windows, global explanations for entire videos, and temporal dependency maps from the attention heads of the transformer module. Model represents MoTIF (ViT-L14) and sample frames are from HMDB51 (Kuehne et al., 2011), licensed under CC BY 4.0.

**Figure 2.** MoTIF pipeline. Videos are embedded with a vision–language backbone and mapped to concept activations via cosine similarity. Per-channel temporal self-attention models dynamics independently for each concept, followed by a nonnegative affine transformation and classification. MoTIF enables explanations across three views: global concepts, local concepts, and temporal dependencies. Sample frames from SSv2 …
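The first pipeline step in the Figure 2 caption, mapping embeddings to concept activations via cosine similarity, can be sketched as follows. This is an illustration under assumptions: the toy 3-dimensional vectors stand in for the outputs of a vision-language backbone, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_activations(frame_embeddings, concept_embeddings):
    """Return a T x C matrix: row t holds frame t's similarity to each concept."""
    return [[cosine(f, c) for c in concept_embeddings] for f in frame_embeddings]

# Toy embeddings (assumed values): 3 frames, 2 concept text embeddings.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
concepts = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
acts = concept_activations(frames, concepts)
```

Reading `acts` column by column gives one activation timeline per concept, which is the input the per-concept temporal attention operates on.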

**Figure 3.** Effect of log-sum-exp temperature τ on accuracy and entropy. Accuracy is stable across τ, while both concept- and logit-level entropy decrease as τ increases, yielding sharper time-importance distributions.
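The relationship in the Figure 3 caption between temperature and sharpness can be reproduced on toy numbers. The sketch below assumes the common form of log-sum-exp pooling, (1/τ) · log Σ exp(τ · x_t), with softmax(τ · x) as the implied time-importance weights; the paper's exact normalisation is not given here, and the input values are made up.

```python
import math

def lse_pool(values, tau):
    """Log-sum-exp pooling over time with temperature tau (assumed form)."""
    m = max(values)
    return m + math.log(sum(math.exp(tau * (v - m)) for v in values)) / tau

def time_weights(values, tau):
    """Softmax weights over time steps implied by the pooling."""
    m = max(values)
    exps = [math.exp(tau * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy (in nats) of a weight distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

values = [0.1, 0.5, 0.9, 0.3]  # illustrative activations over 4 time steps
taus = [0.5, 1.0, 5.0, 20.0]
entropies = [entropy(time_weights(values, tau)) for tau in taus]
pooled = [lse_pool(values, tau) for tau in taus]
```

On this input the entropies shrink as τ grows while the pooled value approaches the maximum activation, which matches the sharper time-importance distributions the caption describes.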

**Figure 4.** Full vs. diagonal attention. Train and test accuracy with and without enforcing diagonal attention over five seeds. Temperature τ. We vary the log-sum-exp pooling temperature τ to assess its effect on both accuracy and the sharpness of the time-importance distribution. Sharpness is quantified by the entropy of the softmax weights, computed either at the concept or at the logits level. Experiments show (s…

**Figure 5.** Concept set influence. Distribution of test accuracy with five different concept sets. To investigate the influence of the concept proposal set on the performance of MoTIF, we prompt GPT-5 five times, each time requesting a distinct set of candidate concepts (with some natural overlap across runs). All resulting concept sets for the five datasets are listed in Appendix F. Our expe…

**Figure 6.** MoTIF explanations. Example videos from Breakfast and SSv2 with correct classifications, illustrating the three explanation modes supported by MoTIF (ViT-L14). … most salient concepts are paddle and kayak, which remain stable over the short clip. The attention map for paddle shows a more uniform distribution across time, with slight emphasis at u = 3 and u = 6, indicating consistent expression of the concep…

**Figure 7.** MoTIF explanations. Example videos from all datasets with incorrect classifications. The first example, from Pirates of the Caribbean, shows Barbossa eating an apple before throwing it away. Our model predicts eat, while the ground truth is throw. Global and local concept attributions reveal that the concept eat, triggered by the apple, was most strongly activated. The corresponding attention map illustrat…

**Figure 8.** Architectural choices. Effects of non-negativity and weight decay. For completeness, we report further ablation studies that complement the main paper (Section 3.2.2). Non-negativity. The non-negativity constraint on classifier weights improves interpretability by ensuring class predictions are explained solely by positive concept evidence (Zhang et al., 2021). We enforce non-negativity of classifier weigh…
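The Figure 8 caption is cut off before it says how non-negativity is enforced. One common option, assumed here purely for illustration, is projected gradient descent: take an ordinary gradient step, then clamp negative classifier weights to zero. All names and numbers below are hypothetical.

```python
def clamp_nonnegative(weights):
    """Project every classifier weight onto the non-negative range."""
    return [[max(0.0, w) for w in row] for row in weights]

def sgd_step_nonnegative(weights, grads, lr=0.1):
    """One plain SGD step followed by the projection (an assumed scheme)."""
    updated = [[w - lr * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(weights, grads)]
    return clamp_nonnegative(updated)

def logits(weights, concept_scores):
    """Class logits as weighted sums of concept scores (rows = classes)."""
    return [sum(w * s for w, s in zip(row, concept_scores)) for row in weights]

# Toy setup: 2 classes x 2 concepts, made-up weights and gradients.
weights = [[0.5, 0.05], [0.2, 0.3]]
grads = [[1.0, 1.0], [-1.0, 0.5]]
weights = sgd_step_nonnegative(weights, grads)
class_logits = logits(weights, [0.9, 0.4])
```

With non-negative weights and non-negative concept scores, every term in a logit is a positive contribution, so a prediction can be read as accumulated evidence for a class rather than a balance of evidence for and against.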

**Figure 9.** Effect of the per-concept affine transformation. Accuracy improves marginally, while entropy in both logits and concept activations decreases, suggesting that the affine block stabilizes and sharpens the CBM’s internal representations. Classifier sparsity. The sparsity penalty on classifier weights has a pronounced impact on test accuracy, as shown in Figure 10a. Larger values of λℓ1 consistently reduce ac…

**Figure 10.** Architectural choices. Effects of classifier sparsity, activation sparsity, and learning rate. Activation sparsity. The activation sparsity penalty shows little variation in test accuracy (see Figure 10b) across different values of λsparse, except for Breakfast. We attribute this to the comparatively long video sequences in that dataset. Accordingly, we use λsparse = 10⁻³ for all experiments, except for…

**Figure 11.** Window size influence. Test accuracy across different window sizes. Window size. We evaluate MoTIF with varying temporal input lengths to study robustness to sequence duration and efficiency trade-offs. While increasing the number of frames provides more temporal context, it also raises memory requirements and may not yield consistent accuracy gains. For comparability, we fix the batch size to 8 across …

**Figure 12.** As the figure illustrates, the influence of the seed depends on the dataset. For HMDB51 the effect is negligible, whereas for Breakfast, particularly with the RN/50 backbone, we observe noticeably higher variance. Nevertheless, MoTIF consistently outperforms the linear probe. Moreover, the figure indicates that the reported numbers in …

Original abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds per-concept temporal self-attention to video concept bottlenecks on top of class-conditioned VLM concept extraction, with reported gains over static CBMs but thin direct evidence that the temporal piece drives the results.

The main takeaway is that this work gives a practical way to add time modeling to concept bottleneck models for video. They pull object and action concepts from training videos with a class-conditioned VLM, turn those into activation sequences, and then run per-concept temporal self-attention inside a transformer to let each concept track its own recurrence patterns before the final prediction.

That combination is not in the earlier static CBM literature they cite, and the results show clear lifts over global (non-temporal) concept bottlenecks on the video benchmarks they test, while staying inside the interpretable CBM regime and narrowing the distance to black-box video models they include as reference points. Releasing the code is a straightforward plus for anyone who wants to inspect or extend it. The VLM step removes the need for manual concept lists, which helps with scaling to new datasets.

The soft spot is the missing quantitative grounding on the concepts themselves. The paper does not report recall against human-annotated action or object sets, nor does it show sensitivity tests where the discovered concepts are swapped or perturbed to isolate whether the temporal attention is actually carrying the performance lift. Without those checks, the benchmark improvements could still be driven more by the quality of the VLM supervision than by the new temporal component. The comparisons to baselines are present, but targeted ablations on the discovery module would make the central claim tighter.

This is useful reading for people already working on interpretable video models or trying to move CBMs beyond images. It gives a concrete architecture and open code that a reader can run and modify. I would send it to peer review; the core design is clear enough and the empirical claims are testable with what is provided.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MoTIF, a Moving Temporal Interpretable Framework for video classification. It extends concept bottleneck models to videos by using a class-conditioned VLM to automatically discover object- and action-centric textual concepts from training videos, then feeding sequences of concept activations into a transformer with per-concept temporal self-attention to model temporal dynamics. The paper reports that this approach improves upon global concept bottlenecks on several video benchmarks, stays competitive among interpretable models, and reduces the performance gap to black-box video classifiers.

Significance. If the central modeling choices prove robust, the work could meaningfully advance interpretable video classification by incorporating temporal dynamics into concept-based predictions while avoiding manual concept annotation. The public code release is a clear strength for reproducibility.

major comments (2)

[§3.2] Concept Discovery Module: the claim that the class-conditioned VLM yields object- and action-centric concepts that are 'sufficiently complete and accurate' for downstream temporal modeling is load-bearing, yet the manuscript provides no quantitative fidelity metrics such as recall against human-annotated action/object sets or performance sensitivity when concepts are replaced or perturbed.
[§4] Experiments and Tables: reported gains over global (non-temporal) concept bottlenecks are presented without error bars, multi-seed statistics, or an ablation that isolates per-concept temporal self-attention from the VLM discovery step; this leaves open whether the temporal component drives the improvement or whether gains are attributable to VLM supervision alone.

minor comments (2)

[§3.1] Notation for concept activation sequences and the per-concept attention mask could be clarified with an explicit equation or pseudocode block.
[Figure 1] Architecture diagram: the flow from VLM concept extraction to temporal transformer layers would benefit from explicit time-axis annotations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and commit to incorporating revisions that enhance the rigor of our experimental analysis and concept evaluation.

Referee: §3.2 (Concept Discovery Module): the claim that the class-conditioned VLM yields object- and action-centric concepts that are 'sufficiently complete and accurate' for downstream temporal modeling is load-bearing, yet the manuscript provides no quantitative fidelity metrics such as recall against human-annotated action/object sets or performance sensitivity when concepts are replaced or perturbed.

Authors: We agree that providing quantitative metrics for the concept discovery module would better support our claims regarding the sufficiency of the discovered concepts. Although the primary evaluation in the manuscript focuses on end-to-end classification performance, we recognize the value of direct assessment. In the revised version, we will add quantitative fidelity metrics, including recall of discovered concepts against human-annotated object and action sets on applicable benchmarks. We will also include a sensitivity analysis by systematically replacing or perturbing subsets of concepts and reporting the resulting changes in model performance. This will help validate the completeness and accuracy of the VLM-based discovery process.

Revision: yes

Referee: §4 (Experiments and Tables): reported gains over global (non-temporal) concept bottlenecks are presented without error bars, multi-seed statistics, or an ablation that isolates per-concept temporal self-attention from the VLM discovery step; this leaves open whether the temporal component drives the improvement or whether gains are attributable to VLM supervision alone.

Authors: We appreciate this observation, as it highlights the need for more robust statistical reporting and targeted ablations. The current manuscript reports results from single experimental runs without variance estimates. To address this, we will re-run all experiments across multiple random seeds (e.g., 5 seeds) and include error bars along with mean and standard deviation in the updated tables. Additionally, we will introduce a new ablation study that fixes the VLM-discovered concepts and compares the full temporal self-attention model against a non-temporal baseline (such as mean-pooling of concept activations over time). This will isolate the contribution of the per-concept temporal self-attention mechanism from the benefits of the concept discovery alone.

Revision: yes
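The two ingredients of the promised ablation, a mean-pooling baseline over time and mean/standard deviation across seeds, are simple to state in code. The sketch below is illustrative only: the function names are hypothetical and the accuracy values are placeholders, not results from the paper.

```python
from statistics import mean, stdev

def mean_pool(activations):
    """Non-temporal baseline: average each concept's activations over time.

    activations: T x C list of lists (time steps x concepts).
    """
    num_frames = len(activations)
    num_concepts = len(activations[0])
    return [sum(row[c] for row in activations) / num_frames
            for c in range(num_concepts)]

def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

# Toy activations: 2 time steps, 2 concepts.
pooled = mean_pool([[0.0, 1.0], [1.0, 1.0]])

# Placeholder per-seed accuracies (NOT reported numbers), one entry per seed.
placeholder_accuracies = [0.5, 0.5, 0.5, 0.5, 0.5]
avg, spread = summarize(placeholder_accuracies)
```

Holding the concept set fixed and swapping only the temporal module for `mean_pool` is what would separate the contribution of temporal attention from that of concept discovery.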

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical architecture (MoTIF) whose performance claims are evaluated via standard benchmark comparisons on video datasets against global CBM baselines and black-box models. The class-conditioned VLM concept discovery is described as an independent upstream module that produces textual concepts fed into per-concept temporal self-attention; no equations define the downstream predictions in terms of the discovery step itself, nor are any fitted parameters from the model renamed as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central results. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the untested premise that VLM-extracted concepts are temporally expressive and sufficient; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Vision-language models conditioned on class labels can extract object- and action-centric textual concepts from video frames that capture the information needed for classification.
Invoked in the description of the concept discovery module.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 3 internal anchors

[2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, pp. 4, 2021.
[3] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network, 2025. URL https...
[4] Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, and Krishnamurthy Dvijotham. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5948–5955, 2023.
[5] Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[7] Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept-based explanations. Advances in neural information processing systems, 32, 2019.
[8] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017.
[9] Sadaf Gulshad, Teng Long, and Nanne van Noord. Hierarchical explanations for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3703–3708, 2023.
[10] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12963–12971, 2021.
[11] Marton Havasi, Sonali Parbhoo, and Finale Doshi-Velez. Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems, 35:23386–23397, 2022.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[13] Aya Abdelsalam Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Don Stanton, Hector Corrada Bravo, Kyunghyun Cho, and Nathan C. Frey. Concept bottleneck language models for protein design. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Yt9CFhOOFe
[14] Jeya Vikranth Jeyakumar, Luke Dickens, Yu-Hsi Cheng, Joseph Noor, Luis Antonio Garcia, Diego Ramirez Echavarria, Alessandra Russo, Lance M. Kaplan, and Mani Srivastava. Automatic concept extraction for concept bottleneck-based video classification, 2022. URL https://openreview.net/forum?id=66kgCIYQW3
[15] Ying Ji, Yu Wang, and Jien Kato. Spatial-temporal concept based explanation of 3d convnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15444–15453, 2023.
[16] Patrick Knab, Sascha Marton, and Christian Bartelt. Beyond pixels: Enhancing LIME with hierarchical features and segmentation foundation models. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025a. URL https://openreview.net/forum?id=JHs5p6nPbG
[17] Patrick Knab, Sascha Marton, Udo Schlegel, and Christian Bartelt. Which lime should i trust? concepts, challenges, and solutions, 2025b. URL https://arxiv.org/abs/2503.24365
[18] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedin...
[19] Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G Derpanis, and Pavel Tokmakov. Understanding video transformers via universal concept discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10946–10956, 2024.
[20] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
[21] H. Kuehne, A. B. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of Computer Vision and Pattern Recognition Conference (CVPR), 2014.
[22] Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, and Jinwoo Choi. Pcbear: Pose concept bottleneck for explainable action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2690–2699, June 2025.
[23] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 7083–7093, 2019.
[24] Xin Liu, Silvia L. Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, and Jan C. van Gemert. No frame left behind: Full video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14892–14901, June 2021.
[25] Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. pp. 1049–1059, 2020.
[26] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In International Conference on Learning Representations, 2023.
[27] Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl. Interpretable machine learning -- a brief history, state-of-the-art and challenges. In Irena Koprinska, Michael Kamp, Annalisa Appice, Corrado Loglisci, Luiza Antonie, Albrecht Zimmermann, Riccardo Guidotti, Özlem Özgöbek, Rita P. Ribeiro, Ricard Gavaldà, João Gama, Linara Adilova...
[28] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774
[29] Preksha Pareek and Ankit Thakkar. A survey on video-based human action recognition: recent updates, datasets, challenges, and applications. Artificial Intelligence Review, 54(3):2259–2322, 2021.
[30] Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, and Margret Keuper. DCBM: Data-efficient visual concept bottleneck models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=BdO4R6XxUH
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
[32] Sukrut Rao, Sweta Mahajan, Moritz Böhle, and Bernt Schiele. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXVII, pp. 444–461, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-72979-... doi:10.1007/978-3-031-72980-5_26
[33] Avinab Saha, Shashank Gupta, Sravan Kumar Ankireddy, Karl Chahine, and Joydeep Ghosh. Exploring explainability in video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 8176–8181, June 2024.
[34] Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 10:41758–41765, 2022.
[35] S. Schrodi, J. Schur, M. Argus, and T. Brox. Selective concept bottleneck models without predefined concepts. Transactions on Machine Learning Research (TMLR), May 2025. URL http://lmb.informatik.uni-freiburg.de/Publications/2025/SAB25
[36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11):1–7, 2012.
[37]

Concept bottleneck large language models

Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept bottleneck large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=RC5FPYVQaH

[38]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078--10093, 2022
[39]

Learning spatiotemporal features with 3d convolutional networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489--4497, 2015
[40]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
[41]

Cmf-transformer: cross-modal fusion transformer for human action recognition

Jun Wang, Limin Xia, and Xin Wen. Cmf-transformer: cross-modal fusion transformer for human action recognition. Mach. Vision Appl., 35(5), August 2024a. ISSN 0932-8092. doi:10.1007/s00138-024-01598-0. URL https://doi.org/10.1007/s00138-024-01598-0
[42]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14549--14560, 2023
[43]

Revisiting multiple instance neural networks

Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern recognition, 74:15--24, 2018
[44]

Internvideo2: Scaling foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pp. 396--416. Springer, 2024b
[45]

Learning optimal summaries of clinical time-series with concept bottleneck models

Carissa Wu, Sonali Parbhoo, Marton Havasi, and Finale Doshi-Velez. Learning optimal summaries of clinical time-series with concept bottleneck models. In Zachary Lipton, Rajesh Ranganath, Mark Sendak, Michael Sjoding, and Serena Yeung (eds.), Proceedings of the 7th Machine Learning for Healthcare Conference, volume 182 of Proceedings of Machine Learning Re...

[46]

Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19187--19197, 2023
[47]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11975--11986, October 2023
[48]

Invertible concept-based explanations for cnn models with non-negative concept activation vectors

Ruihan Zhang, Prashan Madumal, Tim Miller, Krista A Ehinger, and Benjamin IP Rubinstein. Invertible concept-based explanations for cnn models with non-negative concept activation vectors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11682--11690, 2021
