
arxiv: 2601.16211 · v2 · submitted 2026-01-22 · 💻 cs.CV · cs.AI

Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Pith reviewed 2026-05-16 11:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot compositional action recognition · object-driven shortcuts · co-occurrence regularization · temporal order regularization · video action recognition · compositional generalization · hard negative sampling

The pith

Regularizing against frequent verb-object co-occurrences and enforcing temporal order reduces object-driven shortcuts in zero-shot compositional action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zero-shot compositional action recognition requires identifying novel verb-object pairs from primitives seen only in other combinations. Models often bypass the action's temporal dynamics and instead predict the verb from the object class alone. The paper attributes this to sparse compositional supervision (most verb-object pairs never appear in training) and to asymmetric learning (objects are learned faster than verbs); its diagnostic metrics show existing methods overfitting to training co-occurrence patterns. The paper counters the shortcut with RCORE, whose Co-occurrence Prior Regularization treats common pairings as hard negatives while Temporal Order Regularization for Composition forces sensitivity to event sequence. The reported result is lower shortcut scores and higher accuracy on held-out compositions in both Sth-com and EK100-com.

Core claim

Object-driven shortcuts in ZS-CAR arise from overfitting to training co-occurrence patterns and from verb-object learning asymmetry. RCORE counters them with two regularizers: Co-occurrence Prior Regularization (CPR), which supplies explicit supervision for unseen compositions and treats frequent co-occurrence priors as hard negatives, and Temporal Order Regularization for Composition (TORC), which grounds verb representations in temporal order. Together they lower the shortcut diagnostics and raise compositional generalization on Sth-com and EK100-com.
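
As a rough illustration of the hard-negative idea in the core claim, the Python sketch below adds a margin to the logits of verbs that frequently co-occur with the ground-truth object, then computes a softmax cross-entropy over all verb-object cells. This is not the paper's CPR loss, whose exact form is not given on this page. The function names, the `top_k` and `margin` parameters, and the margin-based formulation are all assumptions made for illustration.

```python
import math

def hard_negative_verbs(cooc, verb, obj, top_k=2):
    """Verbs seen most often with `obj` in training, excluding the true verb.

    cooc[v][o] is a (hypothetical) training co-occurrence count table.
    """
    candidates = [v for v in range(len(cooc)) if v != verb and cooc[v][obj] > 0]
    ranked = sorted(candidates, key=lambda v: -cooc[v][obj])
    return ranked[:top_k]

def cpr_style_loss(logits, verb, obj, cooc, top_k=2, margin=1.0):
    """Softmax cross-entropy over all (verb, object) cells.

    Assumed rule: `margin` is added to the logits of the hard-negative cells
    before normalising, so frequent priors become harder to beat.
    """
    hard = set(hard_negative_verbs(cooc, verb, obj, top_k))
    adjusted = [
        [s + margin if (o == obj and v in hard) else s for o, s in enumerate(row)]
        for v, row in enumerate(logits)
    ]
    flat = [s for row in adjusted for s in row]
    peak = max(flat)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(s - peak) for s in flat))
    return log_z - adjusted[verb][obj]

if __name__ == "__main__":
    # Toy table: 3 verbs x 2 objects; verb 1 is the frequent partner of object 1.
    cooc = [[5, 0], [1, 3], [0, 2]]
    logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    print(hard_negative_verbs(cooc, verb=2, obj=1))
    print(cpr_style_loss(logits, 2, 1, cooc, margin=0.0))
    print(cpr_style_loss(logits, 2, 1, cooc, margin=1.0))
```

With identical logits, the margin version gives a strictly larger loss than plain cross-entropy, because the frequent pairing now competes harder against the true composition.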

What carries the argument

RCORE framework, whose CPR component treats frequent co-occurrences as hard negatives and whose TORC component enforces temporal-order sensitivity on verb features.
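
The temporal-order component can be pictured with a small sketch. The figure captions reproduced further down say that TORC penalizes alignment between original and temporally perturbed features and that the cosine similarity between original and reversed verb features turns negative under RCORE. The code below uses that cosine similarity directly as a penalty. The toy encoders, the function names, and the choice of reversal as the only perturbation are assumptions for illustration, not the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def encode_verb(frames):
    """Toy order-sensitive encoder: mean of frame-to-frame differences."""
    diffs = [
        [b - a for a, b in zip(frames[t], frames[t + 1])]
        for t in range(len(frames) - 1)
    ]
    return [sum(col) / len(diffs) for col in zip(*diffs)]

def mean_pool(frames):
    """Toy order-blind encoder: average of the frame features."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def torc_style_penalty(frames, encoder):
    """Cosine similarity between features of a clip and of its reversal.

    Minimising this value pushes the two representations apart.
    """
    return cosine(encoder(frames), encoder(frames[::-1]))

if __name__ == "__main__":
    clip = [[0.0, 1.0], [1.0, 1.0], [3.0, 2.0]]  # three frames, two features each
    print(torc_style_penalty(clip, encode_verb))  # order-sensitive encoder
    print(torc_style_penalty(clip, mean_pool))    # order-blind encoder
```

An encoder that ignores frame order produces the same feature for a clip and its reversal, so it pays the maximum penalty; an encoder built on temporal differences pays the minimum.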

If this is right

  • Models exhibit lower reliance on object class for verb prediction.
  • Accuracy rises on novel verb-object pairs in both Sth-com and EK100-com.
  • Temporal cues receive greater weight in the learned representations.
  • Shortcut diagnostic metrics decrease consistently across the two datasets.
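
This page names the diagnostics (False Seen Prediction, False Co-occurrence Prediction, and the seen-unseen accuracy gap) but does not give their formulas. The sketch below shows one plausible way to compute such quantities, under assumed definitions: FSP is the share of unseen-composition samples predicted as any seen composition, and FCP is the share predicted as the most frequent training verb for the true object. Both definitions and all names are assumptions, not the paper's.

```python
from collections import Counter

def shortcut_diagnostics(train_pairs, true_pairs, pred_pairs):
    """Return assumed FSP and FCP ratios plus the seen-unseen accuracy gap.

    Each pair is a (verb, object) tuple.
    """
    counts = Counter(train_pairs)
    seen = set(counts)

    # Most frequent training verb for each object (first one wins on ties).
    top_verb = {}
    for (verb, obj), n in counts.items():
        if obj not in top_verb or n > counts[(top_verb[obj], obj)]:
            top_verb[obj] = verb

    def ratio(hits, total):
        return hits / total if total else 0.0

    seen_idx = [i for i, p in enumerate(true_pairs) if p in seen]
    unseen_idx = [i for i, p in enumerate(true_pairs) if p not in seen]

    def correct(indices):
        return sum(pred_pairs[i] == true_pairs[i] for i in indices)

    fsp_hits = sum(pred_pairs[i] in seen for i in unseen_idx)
    fcp_hits = sum(
        pred_pairs[i] == (top_verb.get(true_pairs[i][1]), true_pairs[i][1])
        for i in unseen_idx
    )
    return {
        "fsp": ratio(fsp_hits, len(unseen_idx)),
        "fcp": ratio(fcp_hits, len(unseen_idx)),
        "delta_su": ratio(correct(seen_idx), len(seen_idx))
        - ratio(correct(unseen_idx), len(unseen_idx)),
    }

if __name__ == "__main__":
    train = [("open", "drawer")] * 3 + [("close", "drawer")] + [("close", "box")] * 2
    truth = [("open", "box"), ("open", "box"), ("close", "drawer"), ("open", "drawer")]
    preds = [("close", "box"), ("open", "box"), ("close", "drawer"), ("open", "drawer")]
    print(shortcut_diagnostics(train, truth, preds))
```

In the toy data, one of the two unseen "open box" clips is pulled to the frequent "close box" pairing, which is the kind of error these ratios are meant to surface.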

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hard-negative and temporal-order regularizers could be tested on other video-text tasks that suffer from static co-occurrence bias.
  • Applying the method to longer untrimmed videos would reveal whether the temporal-order term scales beyond short clips.
  • If the regularizers prove stable, they could be combined with existing data-augmentation strategies to reduce the amount of labeled composition data needed.

Load-bearing premise

Treating frequent co-occurrences as hard negatives and enforcing temporal-order sensitivity will shift models to temporal verb cues without creating new biases or hurting performance on seen compositions.

What would settle it

An evaluation in which RCORE lowers the shortcut diagnostic scores yet fails to raise accuracy on unseen compositions, or in which accuracy on seen compositions drops by more than a few points.

Figures

Figures reproduced from arXiv: 2601.16211 by Dongyoon Wee, Geo Ahn, Inwoong Lee, Jinwoo Choi, Minho Shim, Taeoh Kim.

Figure 1: Why do object-driven shortcuts emerge in compositional video understanding? (a) Co-occurrence bias. Datasets are intrinsically sparse and highly skewed in their verb–object combinations, creating strong co-occurrence priors. Models exploit these priors as a shortcut: once the object is recognized, the model often predicts the most frequent verb paired with it, ignoring temporal evidence. (b) Asymmetric learn…
Figure 2: Controlled experiments demonstrate object-driven shortcut learning in ZS-CAR. We empirically identify a key failure mode in ZS-CAR: object-driven shortcuts. (a) Objects are easier to learn than verbs. We train a randomly initialized ViT [10] on a balanced 10 × 10 verb-object subset from Sth-com [16]. The learning curves show that object accuracy increases much faster than verb accuracy, indicating that obje…
Figure 3: Learning curve of the SOTA model with our diagnostic metrics. We plot the learning curve of C2C [16] trained on Sth-com [16]. We measure the False Seen Prediction (FSP) and False Co-occurrence Prediction (FCP) ratios, and observe that the seen–unseen accuracy gap (∆SU) correlates strongly with both metrics. These observations indicate that the current SOTA model exhibits overfitting to seen compositions.…
Figure 4: Overview of RCORE. (a) Overview of our proposed RCORE framework. (b) VOCAMix synthesizes plausible yet unseen verb–object compositions while preserving the temporal structure of the primary video. (c) TORC penalizes alignment between original and temporally perturbed feature vectors, enforcing explicit temporal order modeling and reducing object-driven shortcuts.
Figure 5: Analysis of the effects of RCORE on the Sth-com [16] dataset. (a) RCORE prevents the False Co-occurrence Prediction (FCP) ratio from increasing during training, whereas the baseline shows a clear rise in FCP. As a result, RCORE consistently maintains a smaller seen–unseen accuracy gap (∆SU) throughout training. (b) The cosine similarity between the original and reversed verb features becomes strongly nega…
Figure 6: RCORE mitigates object-driven shortcuts in verb learning. We visualize confusion matrices for six representative verbs to compare the ability of RCORE and C2C to distinguish opposite temporal semantics on unseen compositions of the Sth-com [16] test set. All values in the confusion matrices are normalized frequencies across the entire verb classes in the dataset.
Figure 7: Top/Bottom-30 frequent compositions in the EK100-com training set.
Figure 8: Example of seen/unseen composition samples in the EK100-com dataset. (a) Seen composition examples: (Take, Cup), (Wash, Spatula). (b) Unseen composition examples: (Put, Cup), (Wash, Pot).
Figure 9: Learning curve of the baseline and RCORE on the EK100-com dataset. RCORE suppresses the increase of the FCP ratio during training, effectively narrowing the performance gap between seen and unseen composition validation accuracies. We use CoOp-style [43] learnable text prompts for all models. Training Strategy. We train all models, including baselines and RCORE, for 30 epochs with a total batch…
Figure 10: Performance on the Temporal/Static splits of Sth-com. We evaluate the models on Sth-com [16] using both (a) our reconstructed splits and (b) the splits from Sevilla-Lara et al. [31]. We utilize both original and temporally shuffled inputs to assess the model’s temporal modeling capability and its reliance on static cues. A larger performance gap between original and shuffled inputs indicates that the model predicts…
Figure 11: Conditional modeling overfits to co-occurrence statistics. We track how the composition prediction confidence of C2C [16] evolves during training on both the Sth-com [16] and EK100-com datasets. As training progresses, C2C [16] increasingly ignores input evidence and misclassifies unseen compositions as seen ones.
Original abstract

Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence. We argue that sparse compositional supervision and verb-object learning asymmetry can promote object-driven shortcut learning. Our analysis with proposed diagnostic metrics shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions. To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE) with two components. Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent co-occurrence priors by treating them as hard negatives. Temporal Order Regularization for Composition (TORC) enforces temporal-order sensitivity to learn temporally grounded verb representations. Across Sth-com and EK100-com, RCORE reduces shortcut diagnostics and consequently improves compositional generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies object-driven shortcuts in zero-shot compositional action recognition (ZS-CAR), where models rely on labeled object classes rather than temporal verb cues due to sparse compositional supervision and verb-object learning asymmetry. Diagnostic metrics reveal overfitting to training co-occurrence patterns. It proposes RCORE with two components: Co-occurrence Prior Regularization (CPR), which adds supervision for unseen compositions and treats frequent co-occurrences as hard negatives, and Temporal Order Regularization for Composition (TORC), which enforces temporal-order sensitivity. The work claims that RCORE reduces shortcut diagnostics and improves compositional generalization on the Sth-com and EK100-com benchmarks.

Significance. If the empirical claims hold with supporting results, this addresses a core limitation in compositional video understanding by shifting models toward temporally grounded representations, with potential benefits for robust generalization in applications like robotics and human activity analysis.

major comments (2)
  1. [Abstract] Abstract: The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.
  2. [CPR] CPR description: Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.
minor comments (1)
  1. [Abstract] The abstract mentions diagnostic metrics but does not define them explicitly; including their formulations would improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.

    Authors: We agree that the abstract, as a high-level summary, does not include specific numbers. The full manuscript reports quantitative results with error bars, ablations, and baseline comparisons in Sections 4 and 5. To address the concern, we will revise the abstract to incorporate key quantitative findings, such as the reduction in shortcut diagnostics and the gains on unseen compositions for Sth-com and EK100-com.
    Revision: yes

  2. Referee: [CPR] CPR description: Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.

    Authors: CPR applies hard-negative regularization selectively to frequent co-occurrence priors to discourage object-driven shortcuts, while the primary cross-entropy loss continues to supervise all seen compositions. This does not indiscriminately penalize valid training pairs. We acknowledge that accuracy on seen compositions was not explicitly reported in the initial submission. We will add these results in the revised version, demonstrating that performance on held-out seen pairs is preserved while generalization to unseen compositions improves.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: RCORE regularization is explicitly constructed to target identified diagnostics rather than reducing claims to fitted inputs or self-citations.

full rationale

The paper first defines diagnostic metrics for object-driven shortcuts based on training co-occurrence patterns, then introduces CPR (treating frequent pairs as hard negatives) and TORC (enforcing temporal order) as targeted regularizers. These steps are forward-designed interventions, not tautological reductions where a 'prediction' equals the input fit by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided chain. The central claim of reduced diagnostics and improved unseen generalization is presented as an empirical outcome of the proposed terms, with independent content from the diagnostics themselves. This matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on standard assumptions from video action recognition literature and introduces regularization terms without new physical entities or many explicitly fitted parameters beyond typical training hyperparameters.

axioms (1)
  • Domain assumption: video sequences contain distinguishable temporal-order information usable for verb discrimination.
    Invoked to justify the TORC component's effectiveness.
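
One way to probe this assumption, in the spirit of the original-versus-shuffled comparison described in the Figure 10 caption, is to measure how much accuracy a model loses when frame order is destroyed. The sketch below is a toy version of that check. The two stand-in models, the scalar "frames", and the function names are hypothetical and only illustrate the measurement.

```python
import random

def shuffled_copy(clip, rng):
    """Return the frames of `clip` in a random order, leaving `clip` intact."""
    frames = list(clip)
    rng.shuffle(frames)
    return frames

def shuffle_gap(model, clips, labels, seed=0):
    """Accuracy on original clips minus accuracy on temporally shuffled clips."""
    rng = random.Random(seed)

    def accuracy(batch):
        return sum(model(c) == y for c, y in zip(batch, labels)) / len(labels)

    shuffled = [shuffled_copy(c, rng) for c in clips]
    return accuracy(clips) - accuracy(shuffled)

def order_sensitive(clip):
    """Toy model that reads the direction of change between first and last frame."""
    return "open" if clip[-1] > clip[0] else "close"

def order_blind(clip):
    """Toy model that only looks at order-independent (static) content."""
    return "open" if sum(clip) > 5 else "close"

if __name__ == "__main__":
    # Each clip is a list of scalar drawer positions over time.
    clips = [[0, 1, 2, 3], [3, 2, 1, 0]]
    labels = ["open", "close"]
    print(shuffle_gap(order_sensitive, clips, labels))
    print(shuffle_gap(order_blind, clips, labels))
```

A gap of zero means the model's predictions do not depend on frame order at all, which is the signature of reliance on static cues; a model that uses temporal order can only lose accuracy when frames are shuffled.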

pith-pipeline@v0.9.0 · 5497 in / 1181 out tokens · 27722 ms · 2026-05-16T11:39:33.799490+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

  1. [1]

    Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  2. [2]

    Kyungho Bae, Geo Ahn, Youngrae Kim, and Jinwoo Choi. Devias: Learning disentangled video representations of action and scene for holistic video understanding. In European Conference on Computer Vision (ECCV), 2024.

  3. [3]

    Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.

  4. [4]

    Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2021.

  5. [5]

    Dibyadip Chatterjee, Fadime Sener, Shugao Ma, and Angela Yao. Masked autoencoders are scalable vision learners

  6. [6]

    Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. 2019.

  7. [7]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.

  8. [8]

    Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jürgen Gall, Rainer Stiefelhagen, and Luc Van Gool. Large scale holistic video understanding. In European Conference on Computer Vision (ECCV), 2020.

  9. [9]

    Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  10. [10]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021....

  11. [11]

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR),

  12. [12]

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), 2017.

  13. [13]

    Dongyao Jiang, Haodong Jing, Yongqiang Ma, and Nanning Zheng. Beyond image classification: A video benchmark and dual-branch hybrid discrimination framework for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  14. [14]

    Heeseok Jung, Jun-Hyeon Bak, Yujin Jeong, Gyugeun Lee, Jinwoo Ahn, and Eun-Sol Kim. Zero-shot compositional video learning with coding rate reduction. In IEEE International Conference on Computer Vision (ICCV), 2025.

  15. [15]

    Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In IEEE International Conference on Computer Vision (ICCV), 2017.

  16. [16]

    Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, and Josef Kittler. C2c: Component-to-composition learning for zero-shot compositional action recognition. In European Conference on Computer Vision (ECCV), 2024.

  17. [17]

    Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.

  18. [18]

    Yun Li, Zhe Liu, Hang Chen, and Lina Yao. Context-based and diversity-driven specificity in compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  19. [19]

    Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.

  20. [20]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

  21. [21]

    Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. Open world compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

  22. [22]

    Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  23. [23]

    Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  24. [24]

    Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. Learning graph embeddings for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

  25. [25]

    Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. 2020.

  26. [26]

    Nihal V. Nayak, Peilin Yu, and Stephen Bach. Learning to compose soft prompts for compositional zero-shot learning. In International Conference on Learning Representations (ICLR), 2023.

  27. [27]

    Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato. Task-driven modular networks for zero-shot compositional learning. In IEEE International Conference on Computer Vision (ICCV), 2019.

  28. [28]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

  29. [29]

    Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  30. [30]

    Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, and Sangdoo Yun. Which shortcut cues will DNNs choose? a study from the parameter-space perspective. In International Conference on Learning Representations (ICLR),

  31. [31]

    Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.

  32. [32]

    Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don’t judge an object by its context: learning to overcome contextual bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  33. [33]

    Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

  34. [34]

    Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, Peng Wang, and Chunhua Shen. Learning conditional attributes for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  35. [35]

    Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  36. [36]

    Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin, Xiankai Lu, and Wenguan Wang. A conditional probability framework for compositional zero-shot learning. In IEEE International Conference on Computer Vision (ICCV), 2025.

  37. [37]

    Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video understanding. In International Conference on Learning Representations (ICLR), 2018.

  38. [38]

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In IEEE International Conference on Computer Vision (ICCV), 2019.

  39. [39]

    Sukmin Yun, Jaehyung Kim, Dongyoon Han, Hwanjun Song, Jung-Woo Ha, and Jinwoo Shin. Time is matter: Temporal self-supervision for video transformers. In International Conference on Machine Learning (ICML), 2022.

  40. [40]

    Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, and Gang Hua. Soar: Scene-debiasing open-set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2023.

  41. [41]

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

  42. [42]

    Tian Zhang, Kongming Liang, Ruoyi Du, Wei Chen, and Zhanyu Ma. Disentangling before composing: Learning invariant disentangled features for compositional zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 47(2):1132–1147, 2024.

  43. [43]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022.

    Appendix. In this appendix, we provide comprehensive implementation/dataset/method details and quantitative/qualitative results to complement the main paper. We orga...

  44. [44]

    Details on EK100-com dataset (Section A)

  45. [45]

    Complete implementation details (Section B)

  46. [46]

    Additional evidence of object-driven shortcuts (Section C)

  47. [47]

    Additional results (Section D). A. Details on EK100-com dataset. In this section, we provide details about our curated ZS-CAR benchmark, EPIC-KITCHENS-100-composition (EK100-com). We construct EK100-com by repurposing EPIC-KITCHENS-100 (EK100) [7], following the same protocol used to construct Sth-com [16]. In particular, we use the original training (67217...

  48. [48]

    (36.4% vs. 35.5%). This suggests that penalizing only the most confusing candidates—specifically those that frequently co-occur with the input components—effectively mitigates co-occurrence bias without compromising the model’s overall performance. For instance, given an input like ‘(Pretending to tear, Paper)’, our intention is to penalize plausible but ...

  49. [49]

    To balance the sample sizes between the Temporal and Static splits, we excluded a few verbs with the fewest samples from the Temporal split. We then utilize both original and temporally shuffled inputs to assess the model’s temporal modeling capability and its reliance on static cues. In Figure 10, we present the results for the Temporal/Static splits...