Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Dongyoon Wee; Geo Ahn; Inwoong Lee; Jinwoo Choi; Minho Shim; Taeoh Kim

arxiv: 2601.16211 · v2 · submitted 2026-01-22 · 💻 cs.CV · cs.AI

Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi

Pith reviewed 2026-05-16 11:39 UTC · model grok-4.3

keywords: zero-shot compositional action recognition · object-driven shortcuts · co-occurrence regularization · temporal order regularization · video action recognition · compositional generalization · hard negative sampling

The pith

Regularizing against frequent verb-object co-occurrences and enforcing temporal order reduces object-driven shortcuts in zero-shot compositional action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zero-shot compositional action recognition requires identifying novel verb-object pairs from primitives seen only in other combinations. Models often bypass the action's temporal dynamics and instead predict the verb from the object class alone. Diagnostic metrics indicate that this pattern arises because the training data leave most compositions unseen and because objects are learned more readily than verbs. The paper counters the shortcut with RCORE, whose Co-occurrence Prior Regularization treats common pairings as hard negatives while Temporal Order Regularization forces sensitivity to event sequence. The reported result is lower shortcut scores and higher accuracy on held-out compositions on both Sth-com and EK100-com.

Core claim

Object-driven shortcuts in ZS-CAR arise from overfitting to training co-occurrence patterns and from verb-object learning asymmetry. RCORE counters them with two regularizers: Co-occurrence Prior Regularization (CPR), which supplies explicit supervision for unseen compositions and treats frequent co-occurrence priors as hard negatives, and Temporal Order Regularization for Composition (TORC), which grounds verb representations in temporal sequence. Together they lower the shortcut diagnostics and raise compositional generalization on Sth-com and EK100-com.

What carries the argument

The RCORE framework: its CPR component treats frequent co-occurrences as hard negatives, and its TORC component enforces temporal-order sensitivity on verb features.
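The hard-negative idea behind CPR can be illustrated with a minimal NumPy sketch. It is not the paper's loss: the function names, the `top_k` selection rule, and the hinge `margin` are assumptions made for illustration, and it covers only the hard-negative half of CPR, not the added supervision for unseen compositions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cooccurrence_prior(train_verbs, train_objects, n_verbs, n_objects):
    # Count how often each (verb, object) pair appears in the training labels.
    counts = np.zeros((n_verbs, n_objects))
    idx = (np.asarray(train_verbs, dtype=int), np.asarray(train_objects, dtype=int))
    np.add.at(counts, idx, 1.0)
    return counts

def cpr_style_loss(logits, verb, obj, counts, top_k=1, margin=0.5):
    """Cross-entropy plus a hinge penalty on frequent co-occurring pairs.

    logits: (n_verbs, n_objects) composition scores for one clip.
    Assumed rule (hypothetical): the top_k verbs that most often co-occur
    with the labelled object, excluding the true verb, are hard negatives.
    """
    log_probs = log_softmax(logits.reshape(-1)).reshape(logits.shape)
    ce = -log_probs[verb, obj]
    candidates = np.delete(np.arange(logits.shape[0]), verb)
    order = np.argsort(-counts[candidates, obj])
    hard = candidates[order[:top_k]]
    # The true pair should outscore each hard negative by at least `margin`.
    hinge = np.maximum(0.0, margin - (logits[verb, obj] - logits[hard, obj]))
    return float(ce + hinge.mean()), hard
```

With uniform scores the hinge term is active, so the loss pushes the true pair above the most frequent competing pair for the same object; once the true pair leads by the margin, only the cross-entropy term remains.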

If this is right

Models exhibit lower reliance on object class for verb prediction.
Accuracy rises on novel verb-object pairs in both Sth-com and EK100-com.
Temporal cues receive greater weight in the learned representations.
Shortcut diagnostic metrics decrease consistently across the two datasets.
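The page names the diagnostics (False Seen Prediction, False Co-occurrence Prediction, and the seen-unseen gap ΔSU) without giving their formulas. The sketch below is therefore an assumed reading, not the paper's definition: it takes an FSP-style ratio to be the share of unseen-composition samples that the model assigns to a seen composition, and ΔSU to be seen accuracy minus unseen accuracy. The function name and input layout are hypothetical.

```python
import numpy as np

def shortcut_diagnostics(pred_pairs, true_pairs, seen_pairs):
    """pred_pairs / true_pairs: sequences of (verb, object) label pairs.
    seen_pairs: set of compositions present in the training split.
    Both seen and unseen samples must be present in true_pairs."""
    pred = [tuple(p) for p in pred_pairs]
    true = [tuple(t) for t in true_pairs]
    correct = np.array([p == t for p, t in zip(pred, true)])
    is_seen = np.array([t in seen_pairs for t in true])
    pred_seen = np.array([p in seen_pairs for p in pred])
    seen_acc = float(correct[is_seen].mean())
    unseen_acc = float(correct[~is_seen].mean())
    # Assumed FSP-style ratio: unseen samples wrongly mapped to a seen pair.
    fsp = float(pred_seen[~is_seen].mean())
    return {
        "seen_acc": seen_acc,
        "unseen_acc": unseen_acc,
        "delta_su": seen_acc - unseen_acc,
        "fsp": fsp,
    }
```

Tracking these values per epoch, as the learning-curve figures do, would show whether a growing seen-unseen gap coincides with a growing share of unseen clips being pulled toward seen compositions.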

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hard-negative and temporal-order regularizers could be tested on other video-text tasks that suffer from static co-occurrence bias.
Applying the method to longer untrimmed videos would reveal whether the temporal-order term scales beyond short clips.
If the regularizers prove stable, they could be combined with existing data-augmentation strategies to reduce the amount of labeled composition data needed.

Load-bearing premise

Treating frequent co-occurrences as hard negatives and enforcing temporal-order sensitivity will shift models to temporal verb cues without creating new biases or hurting performance on seen compositions.

What would settle it

An evaluation in which RCORE lowers the shortcut diagnostic scores yet fails to raise accuracy on unseen compositions, or in which accuracy on seen compositions drops by more than a few points.
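The two failure conditions above can be written as a small check. This is a sketch under stated assumptions: the dictionary keys are hypothetical, and `max_seen_drop=3.0` is only one possible reading of "a few points", not a threshold taken from the paper.

```python
def settles_against(baseline, method, max_seen_drop=3.0):
    """Each argument is a dict with 'shortcut', 'unseen_acc', and 'seen_acc',
    all in percentage points. Returns which failure condition, if any, holds."""
    shortcut_lowered = method["shortcut"] < baseline["shortcut"]
    unseen_raised = method["unseen_acc"] > baseline["unseen_acc"]
    seen_drop = baseline["seen_acc"] - method["seen_acc"]
    return {
        # Diagnostics improve but unseen accuracy does not.
        "diagnostics_without_gain": shortcut_lowered and not unseen_raised,
        # Seen-composition accuracy falls by more than the tolerance.
        "seen_cost_too_high": seen_drop > max_seen_drop,
    }
```

Either flag being true would count against the load-bearing premise; both being false is consistent with it but does not by itself rule out new biases.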

Figures

Figures reproduced from arXiv: 2601.16211 by Dongyoon Wee, Geo Ahn, Inwoong Lee, Jinwoo Choi, Minho Shim, Taeoh Kim.

**Figure 1.** Why do object-driven shortcuts emerge in compositional video understanding? (a) Co-occurrence bias. Datasets are intrinsically sparse and highly skewed in their verb–object combinations, creating strong co-occurrence priors. Models exploit these priors as a shortcut: once the object is recognized, the model often predicts the most frequent verb paired with it, ignoring temporal evidence. (b) Asymmetric learn…

**Figure 2.** Controlled experiments demonstrate object-driven shortcut learning in ZS-CAR. We empirically identify a key failure mode in ZS-CAR: object-driven shortcuts. (a) Objects are easier to learn than verbs. We train a randomly initialized ViT [10] on a balanced 10 × 10 verb-object subset from Sth-com [16]. The learning curves show that object accuracy increases much faster than verb accuracy, indicating that obje…

**Figure 3.** Learning curve of the SOTA model with our diagnostic metrics. We plot the learning curve of C2C [16] trained on Sth-com [16]. We measure the False Seen Prediction (FSP) and False Co-occurrence Prediction (FCP) ratios, and observe that the seen–unseen accuracy gap (ΔSU) correlates strongly with both metrics. These observations indicate that the current SOTA model exhibits overfitting to seen compositions.…

**Figure 4.** Overview of RCORE. (a) Overview of our proposed RCORE framework. (b) VOCAMix synthesizes plausible yet unseen verb–object compositions while preserving the temporal structure of the primary video. (c) TORC penalizes alignment between original and temporally perturbed feature vectors, enforcing explicit temporal order modeling and reducing object-driven shortcuts.

**Figure 5.** Analysis of the effects of RCORE on the Sth-com [16] dataset. (a) RCORE prevents the False Co-occurrence Prediction (FCP) ratio from increasing during training, whereas the baseline shows a clear rise in FCP. As a result, RCORE consistently maintains a smaller seen–unseen accuracy gap (ΔSU) throughout training. (b) The cosine similarity between the original and reversed verb features becomes strongly nega…

**Figure 6.** RCORE mitigates object-driven shortcuts in verb learning. We visualize confusion matrices for six representative verbs to compare the ability of RCORE and C2C to distinguish opposite temporal semantics on unseen compositions of the Sth-com [16] test set. All values in the confusion matrices are frequencies normalized across all verb classes in the dataset.

**Figure 7.** Top/Bottom-30 frequent compositions in the EK100-com training set.

**Figure 8.** Example of seen/unseen composition samples in the EK100-com dataset. (a) Seen composition examples: (Take, Cup), (Wash, Spatula). (b) Unseen composition examples: (Put, Cup), (Wash, Pot).

**Figure 9.** Learning curve of the baseline and RCORE on the EK100-com dataset. RCORE suppresses the increase of the FCP ratio during training, effectively narrowing the performance gap between seen and unseen composition validation accuracies.

**Figure 10.** Performance on the Temporal/Static splits of Sth-com. We evaluate the models on Sth-com [16] using both (a) our reconstructed splits and (b) the splits from Sevilla-Lara et al. [31]. We utilize both original and temporally shuffled inputs to assess the model’s temporal modeling capability and its reliance on static cues. A larger performance gap between original and shuffled inputs indicates that the model predicts…

**Figure 11.** Conditional modeling overfits to co-occurrence statistics. We track how the composition prediction confidence of C2C [16] evolves during training on both the Sth-com [16] and EK100-com datasets. As training progresses, C2C [16] increasingly ignores input evidence and misclassifies unseen compositions as seen ones.

Original abstract

Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence. We argue that sparse compositional supervision and verb-object learning asymmetry can promote object-driven shortcut learning. Our analysis with proposed diagnostic metrics shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions. To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE) with two components. Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent co-occurrence priors by treating them as hard negatives. Temporal Order Regularization for Composition (TORC) enforces temporal-order sensitivity to learn temporally grounded verb representations. Across Sth-com and EK100-com, RCORE reduces shortcut diagnostics and consequently improves compositional generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RCORE targets object-driven shortcuts in ZS-CAR with CPR and TORC, but the key open question is whether seen-composition accuracy holds up under the hard-negative treatment.

The main point is that this paper pins down a concrete failure mode in zero-shot compositional action recognition: models default to object class cues instead of temporal verb evidence, especially when compositional supervision is sparse. They introduce diagnostic metrics to measure that reliance on training co-occurrence patterns, then propose RCORE with two regularizers to counter it. CPR treats frequent verb-object pairs as hard negatives to discourage shortcut learning, while TORC adds explicit pressure for temporal-order sensitivity in the representations. That combination is the actual new piece, and it is applied to the compositional splits of Something-Something and Epic-Kitchens-100.

The diagnostics and the explicit handling of unseen compositions through regularization are useful additions that make the problem measurable rather than just asserted. The approach stays grounded in the observed asymmetry between verb and object learning, which is a fair diagnosis.

The soft spot is the potential cost to seen compositions. Penalizing frequent training pairs as hard negatives could reduce accuracy on combinations the model should still handle correctly, unless the temporal signal fully compensates. The abstract highlights reduced shortcut scores and unseen gains but does not report seen-composition numbers or regularization-strength ablations, so the trade-off remains unclear. If the full results show no meaningful drop on seen pairs, the method is more convincing; otherwise the practical value shrinks.

This work is aimed at people working on video action recognition and compositional generalization who already use these datasets. It shows clear engagement with the literature on shortcuts and offers implementable components worth testing. I would send it to peer review because the problem is well-defined and the proposed fix is straightforward to evaluate against existing baselines.

Referee Report

2 major / 1 minor

Summary. The paper identifies object-driven shortcuts in zero-shot compositional action recognition (ZS-CAR), where models rely on labeled object classes rather than temporal verb cues due to sparse compositional supervision and verb-object learning asymmetry. Diagnostic metrics reveal overfitting to training co-occurrence patterns. It proposes RCORE with two components: Co-occurrence Prior Regularization (CPR), which adds supervision for unseen compositions and treats frequent co-occurrences as hard negatives, and Temporal Order Regularization for Composition (TORC), which enforces temporal-order sensitivity. The work claims that RCORE reduces shortcut diagnostics and improves compositional generalization on the Sth-com and EK100-com benchmarks.

Significance. If the empirical claims hold with supporting results, this addresses a core limitation in compositional video understanding by shifting models toward temporally grounded representations, with potential benefits for robust generalization in applications like robotics and human activity analysis.

major comments (2)

[Abstract] The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.
[CPR] Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.

minor comments (1)

[Abstract] The abstract mentions diagnostic metrics but does not define them explicitly; including their formulations would improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.

Authors: We agree that the abstract, as a high-level summary, does not include specific numbers. The full manuscript reports quantitative results with error bars, ablations, and baseline comparisons in Sections 4 and 5. To address the concern, we will revise the abstract to incorporate key quantitative findings, such as the reduction in shortcut diagnostics and the gains on unseen compositions for Sth-com and EK100-com.

Revision: yes

Referee: [CPR] Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.

Authors: CPR applies hard-negative regularization selectively to frequent co-occurrence priors to discourage object-driven shortcuts, while the primary cross-entropy loss continues to supervise all seen compositions. This does not indiscriminately penalize valid training pairs. We acknowledge that accuracy on seen compositions was not explicitly reported in the initial submission. We will add these results in the revised version, demonstrating that performance on held-out seen pairs is preserved while generalization to unseen compositions improves.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: RCORE regularization is explicitly constructed to target identified diagnostics rather than reducing claims to fitted inputs or self-citations.

full rationale

The paper first defines diagnostic metrics for object-driven shortcuts based on training co-occurrence patterns, then introduces CPR (treating frequent pairs as hard negatives) and TORC (enforcing temporal order) as targeted regularizers. These steps are forward-designed interventions, not tautological reductions where a 'prediction' equals the input fit by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided chain. The central claim of reduced diagnostics and improved unseen generalization is presented as an empirical outcome of the proposed terms, with independent content from the diagnostics themselves. This matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on standard assumptions from video action recognition literature and introduces regularization terms without new physical entities or many explicitly fitted parameters beyond typical training hyperparameters.

axioms (1)

Domain assumption: video sequences contain distinguishable temporal-order information usable for verb discrimination.
Invoked to justify the TORC component's effectiveness.
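The role of this assumption in TORC can be illustrated with a toy sketch. The Figure 4 caption says only that TORC penalizes alignment between original and temporally perturbed features, so everything else here is an assumption: reversal as the perturbation, a hinge at zero cosine similarity, and the stand-in `verb_feature` encoder are all hypothetical choices.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def verb_feature(frames):
    # Stand-in encoder (assumption): mean frame-to-frame difference.
    # It changes sign when the clip is reversed, so it is order-sensitive.
    return np.diff(frames, axis=0).mean(axis=0)

def order_penalty(frames, encoder):
    """frames: (T, D) array of per-frame features for one clip."""
    original = encoder(frames)
    reversed_feat = encoder(frames[::-1])
    # Penalize positive alignment only; the penalty is zero once the
    # original and reversed features are no longer positively aligned.
    return max(0.0, cosine(original, reversed_feat))
```

An order-blind encoder such as `lambda f: f.mean(axis=0)` yields identical features for a clip and its reversal, so it receives a penalty close to 1, while the difference-based encoder above receives 0. If clips carried no usable order information, no encoder could drive this penalty down without discarding content, which is why the axiom is load-bearing.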

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[2] Kyungho Bae, Geo Ahn, Youngrae Kim, and Jinwoo Choi. Devias: Learning disentangled video representations of action and scene for holistic video understanding. In European Conference on Computer Vision (ECCV), 2024.

[3] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.

[4] Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2021.

[5] Dibyadip Chatterjee, Fadime Sener, Shugao Ma, and Angela Yao. Masked autoencoders are scalable vision learners

[6] Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. 2019.

[7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.

[8] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jürgen Gall, Rainer Stiefelhagen, and Luc Van Gool. Large scale holistic video understanding. In European Conference on Computer Vision (ECCV), 2020.

[9] Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021....

[11] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR),

[12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), 2017.

[13] Dongyao Jiang, Haodong Jing, Yongqiang Ma, and Nanning Zheng. Beyond image classification: A video benchmark and dual-branch hybrid discrimination framework for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[14] Heeseok Jung, Jun-Hyeon Bak, Yujin Jeong, Gyugeun Lee, Jinwoo Ahn, and Eun-Sol Kim. Zero-shot compositional video learning with coding rate reduction. In IEEE International Conference on Computer Vision (ICCV), 2025.

[15] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In IEEE International Conference on Computer Vision (ICCV), 2017.

[16] Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, and Josef Kittler. C2c: Component-to-composition learning for zero-shot compositional action recognition. In European Conference on Computer Vision (ECCV), 2024.

[17] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.

[18] Yun Li, Zhe Liu, Hang Chen, and Lina Yao. Context-based and diversity-driven specificity in compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[19] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.

[20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[21] Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. Open world compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[22] Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[23] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[24] Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. Learning graph embeddings for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[25] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. 2020.

[26] Nihal V. Nayak, Peilin Yu, and Stephen Bach. Learning to compose soft prompts for compositional zero-shot learning. In International Conference on Learning Representations (ICLR), 2023.

[27] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato. Task-driven modular networks for zero-shot compositional learning. In IEEE International Conference on Computer Vision (ICCV), 2019.

[28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

[29] Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[30] Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, and Sangdoo Yun. Which shortcut cues will DNNs choose? a study from the parameter-space perspective. In International Conference on Learning Representations (ICLR),

[31] Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.

[32] Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don’t judge an object by its context: learning to overcome contextual bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[33] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[34] Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, Peng Wang, and Chunhua Shen. Learning conditional attributes for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[35] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[36] Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin, Xiankai Lu, and Wenguan Wang. A conditional probability framework for compositional zero-shot learning. In IEEE International Conference on Computer Vision (ICCV), 2025.

[37] Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video understanding. In International Conference on Learning Representations (ICLR), 2018.

[38] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In IEEE International Conference on Computer Vision (ICCV), 2019.

[39] Sukmin Yun, Jaehyung Kim, Dongyoon Han, Hwanjun Song, Jung-Woo Ha, and Jinwoo Shin. Time is matter: Temporal self-supervision for video transformers. In International Conference on Machine Learning (ICML), 2022.

[40] Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, and Gang Hua. Soar: Scene-debiasing open-set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2023.

[41] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

[42] Tian Zhang, Kongming Liang, Ruoyi Du, Wei Chen, and Zhanyu Ma. Disentangling before composing: Learning invariant disentangled features for compositional zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 47(2):1132–1147, 2024.

[43] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022.

Appendix. In this appendix, we provide comprehensive implementation/dataset/method details and quantitative/qualitative results to complement the main paper. We orga...

[44] Details on EK100-com dataset (Section A)

[45] Complete implementation details (Section B)

[46] Additional evidence of object-driven shortcuts (Section C)

[47]

Additional results (Section D). A. Details on EK100-com dataset In this section, we provide details about our curated ZS-CAR benchmark, EPIC-KITCHENS-100-composition (EK100-com). We construct EK100-com by repurpos- ing EPIC-KITCHENS-100 (EK100) [7] following the same protocol of constructing Sth-com [16]. In particular, we use the original training (67217...

work page
(36.4% vs. 35.5%). This suggests that penalizing only the most confusing candidates, specifically those that frequently co-occur with the input components, effectively mitigates co-occurrence bias without compromising the model’s overall performance. For instance, given an input like ‘(Pretending to tear, Paper)’, our intention is to penalize plausible but ...
To balance the sample sizes between the Temporal and Static splits, we excluded a few verbs with the fewest samples from the Temporal split. We then use both original and temporally shuffled inputs to assess the model’s temporal modeling capability and its reliance on static cues. In Figure 10, we present the results for the Temporal/Static splits...

[1]

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[2]

Kyungho Bae, Geo Ahn, Youngrae Kim, and Jinwoo Choi. Devias: Learning disentangled video representations of action and scene for holistic video understanding. In European Conference on Computer Vision (ECCV), 2024.

[3]

Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.

[4]

Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2021.

[5]

Dibyadip Chatterjee, Fadime Sener, Shugao Ma, and Angela Yao. Masked autoencoders are scalable vision learners

[6]

Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. 2019.

[7]

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.

[8]

Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jürgen Gall, Rainer Stiefelhagen, and Luc Van Gool. Large scale holistic video understanding. In European Conference on Computer Vision (ECCV), 2020.

[9]

Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[10]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021....

[11]

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR),

[12]

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), 2017.

[13]

Dongyao Jiang, Haodong Jing, Yongqiang Ma, and Nanning Zheng. Beyond image classification: A video benchmark and dual-branch hybrid discrimination framework for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[14]

Heeseok Jung, Jun-Hyeon Bak, Yujin Jeong, Gyugeun Lee, Jinwoo Ahn, and Eun-Sol Kim. Zero-shot compositional video learning with coding rate reduction. In IEEE International Conference on Computer Vision (ICCV), 2025.

[15]

Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In IEEE International Conference on Computer Vision (ICCV), 2017.

[16]

Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, and Josef Kittler. C2c: Component-to-composition learning for zero-shot compositional action recognition. In European Conference on Computer Vision (ECCV), 2024.

[17]

Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.

[18]

Yun Li, Zhe Liu, Hang Chen, and Lina Yao. Context-based and diversity-driven specificity in compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[19]

Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.

[20]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[21]

Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. Open world compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[22]

Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[23]

Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[24]

Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. Learning graph embeddings for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[25]

Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. 2020.

[26]

Nihal V. Nayak, Peilin Yu, and Stephen Bach. Learning to compose soft prompts for compositional zero-shot learning. In International Conference on Learning Representations (ICLR), 2023.

[27]

Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato. Task-driven modular networks for zero-shot compositional learning. In IEEE International Conference on Computer Vision (ICCV), 2019.

[28]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

[29]

Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[30]

Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, and Sangdoo Yun. Which shortcut cues will DNNs choose? a study from the parameter-space perspective. In International Conference on Learning Representations (ICLR),

[31]

Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.

[32]

Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don’t judge an object by its context: learning to overcome contextual bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[33]

Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[34]

Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, Peng Wang, and Chunhua Shen. Learning conditional attributes for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[35]

Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[36]

Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin, Xiankai Lu, and Wenguan Wang. A conditional probability framework for compositional zero-shot learning. In IEEE International Conference on Computer Vision (ICCV), 2025.

[37]

Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video understanding. In International Conference on Learning Representations (ICLR), 2018.
