Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
Pith reviewed 2026-05-16 11:39 UTC · model grok-4.3
The pith
Regularizing against frequent verb-object co-occurrences and enforcing temporal order reduces object-driven shortcuts in zero-shot compositional action recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Object-driven shortcuts in ZS-CAR arise from overfitting to training co-occurrence patterns and from verb-object learning asymmetry. RCORE counters them with two regularizers. Co-occurrence Prior Regularization (CPR) supplies explicit supervision for unseen compositions and treats frequent co-occurrence priors as hard negatives. Temporal Order Regularization for Composition (TORC) grounds verb representations in temporal order. Together they lower shortcut diagnostics and raise compositional generalization on Sth-com and EK100-com.
What carries the argument
RCORE framework, whose CPR component treats frequent co-occurrences as hard negatives and whose TORC component enforces temporal-order sensitivity on verb features.
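To make the hard-negative idea concrete, here is a minimal Python sketch of one way frequent co-occurrences could be mined from training labels. It is an illustration only: this page does not give CPR's actual formulation, and the function names (`cooccurrence_counts`, `hard_negatives`), the parameter `k`, and the toy label set are all assumptions introduced here.

```python
from collections import Counter


def cooccurrence_counts(train_pairs):
    """Count how often each (verb, object) composition appears in training."""
    return Counter(train_pairs)


def hard_negatives(verb, obj, counts, k=2):
    """Hypothetical selection rule: return the k most frequent training
    compositions that share the input's verb or object but differ from it."""
    candidates = [
        (pair, n)
        for pair, n in counts.items()
        if pair != (verb, obj) and (pair[0] == verb or pair[1] == obj)
    ]
    # Most frequent first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in candidates[:k]]


# Toy training labels (made up for illustration, not taken from any dataset).
train_pairs = (
    [("open", "drawer")] * 5
    + [("close", "drawer")] * 3
    + [("open", "door")] * 2
    + [("push", "box")] * 4
)
counts = cooccurrence_counts(train_pairs)

# ("close", "door") never appears in training, so it plays the unseen pair.
negatives = hard_negatives("close", "door", counts)
print(negatives)
```

In this toy setup the unseen pair ("close", "door") gets its hard negatives from seen compositions that share one of its primitives, which is the kind of candidate a co-occurrence prior would tend to favor.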
If this is right
- Models exhibit lower reliance on object class for verb prediction.
- Accuracy rises on novel verb-object pairs in both Sth-com and EK100-com.
- Temporal cues receive greater weight in the learned representations.
- Shortcut diagnostic metrics decrease consistently across the two datasets.
Where Pith is reading between the lines
- The same hard-negative and temporal-order regularizers could be tested on other video-text tasks that suffer from static co-occurrence bias.
- Applying the method to longer untrimmed videos would reveal whether the temporal-order term scales beyond short clips.
- If the regularizers prove stable, they could be combined with existing data-augmentation strategies to reduce the amount of labeled composition data needed.
Load-bearing premise
Treating frequent co-occurrences as hard negatives and enforcing temporal-order sensitivity will shift models to temporal verb cues without creating new biases or hurting performance on seen compositions.
What would settle it
An evaluation in which RCORE lowers the shortcut diagnostic scores yet fails to raise accuracy on unseen compositions, or in which accuracy on seen compositions drops by more than a few points.
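The criterion above can be written down as a small check. The sketch below is hypothetical: the metric keys, the function name `falsifies_claim`, and the reading of "a few points" as 3.0 accuracy points are assumptions made here, and the numbers are placeholders rather than results from the paper.

```python
def falsifies_claim(before, after, seen_drop_tolerance=3.0):
    """Return True if the outcome matches either falsifying case.

    `before` and `after` are dicts with hypothetical keys:
    "shortcut" (lower is better), "unseen_acc", and "seen_acc".
    """
    shortcut_lowered = after["shortcut"] < before["shortcut"]
    unseen_improved = after["unseen_acc"] > before["unseen_acc"]
    seen_drop = before["seen_acc"] - after["seen_acc"]
    # Case 1: diagnostics improve but unseen accuracy does not.
    # Case 2: seen accuracy falls by more than the tolerance.
    return (shortcut_lowered and not unseen_improved) or seen_drop > seen_drop_tolerance


# Placeholder numbers, chosen only to exercise each branch.
baseline = {"shortcut": 0.40, "unseen_acc": 30.0, "seen_acc": 60.0}
supports = {"shortcut": 0.25, "unseen_acc": 33.0, "seen_acc": 59.0}
no_unseen_gain = {"shortcut": 0.25, "unseen_acc": 29.5, "seen_acc": 59.0}
seen_collapse = {"shortcut": 0.25, "unseen_acc": 33.0, "seen_acc": 55.0}

print(falsifies_claim(baseline, supports))
print(falsifies_claim(baseline, no_unseen_gain))
print(falsifies_claim(baseline, seen_collapse))
```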
Original abstract
Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence. We argue that sparse compositional supervision and verb-object learning asymmetry can promote object-driven shortcut learning. Our analysis with proposed diagnostic metrics shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions. To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE) with two components. Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent co-occurrence priors by treating them as hard negatives. Temporal Order Regularization for Composition (TORC) enforces temporal-order sensitivity to learn temporally grounded verb representations. Across Sth-com and EK100-com, RCORE reduces shortcut diagnostics and consequently improves compositional generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies object-driven shortcuts in zero-shot compositional action recognition (ZS-CAR), where models rely on labeled object classes rather than temporal verb cues due to sparse compositional supervision and verb-object learning asymmetry. Diagnostic metrics reveal overfitting to training co-occurrence patterns. It proposes RCORE with two components: Co-occurrence Prior Regularization (CPR), which adds supervision for unseen compositions and treats frequent co-occurrences as hard negatives, and Temporal Order Regularization for Composition (TORC), which enforces temporal-order sensitivity. The work claims that RCORE reduces shortcut diagnostics and improves compositional generalization on the Sth-com and EK100-com benchmarks.
Significance. If the empirical claims hold with supporting results, this addresses a core limitation in compositional video understanding by shifting models toward temporally grounded representations, with potential benefits for robust generalization in applications like robotics and human activity analysis.
major comments (2)
- [Abstract] Abstract: The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.
- [CPR] CPR description: Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.
minor comments (1)
- [Abstract] The abstract mentions diagnostic metrics but does not define them explicitly; including their formulations would improve clarity and reproducibility.
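As an illustration of what an explicit formulation might look like, the sketch below defines one possible shortcut diagnostic: among wrong verb predictions, the fraction that equal the most frequent training verb for the labeled object. This is not the paper's metric, which is not defined on this page. The names `most_frequent_verb_per_object` and `shortcut_rate` and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def most_frequent_verb_per_object(train_pairs):
    """Map each object to the verb it most often co-occurs with in training."""
    by_object = defaultdict(Counter)
    for verb, obj in train_pairs:
        by_object[obj][verb] += 1
    return {obj: verbs.most_common(1)[0][0] for obj, verbs in by_object.items()}


def shortcut_rate(samples, prior_verb):
    """Hypothetical diagnostic over (true_verb, object, predicted_verb) triples:
    share of verb errors that match the object's most frequent training verb."""
    errors = [(t, o, p) for t, o, p in samples if p != t]
    if not errors:
        return 0.0
    hits = sum(1 for _, o, p in errors if prior_verb.get(o) == p)
    return hits / len(errors)


# Toy labels and predictions, made up for illustration.
train_pairs = (
    [("open", "drawer")] * 3
    + [("close", "drawer")]
    + [("close", "door")] * 2
)
prior_verb = most_frequent_verb_per_object(train_pairs)

samples = [
    ("close", "drawer", "open"),   # error that follows the object prior
    ("open", "door", "close"),     # error that follows the object prior
    ("close", "drawer", "push"),   # error unrelated to the prior
    ("open", "drawer", "open"),    # correct prediction, ignored
]
rate = shortcut_rate(samples, prior_verb)
print(round(rate, 3))
```

A value near 1 would suggest that most verb errors track the object prior; a value near 0 would suggest errors come from elsewhere.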
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major point below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Abstract] Abstract: The central claim that RCORE reduces shortcut diagnostics and improves compositional generalization is stated without any quantitative results, error bars, ablation studies, or baseline comparisons, leaving the effectiveness of CPR and TORC unverified.
  Authors: We agree that the abstract, as a high-level summary, does not include specific numbers. The full manuscript reports quantitative results with error bars, ablations, and baseline comparisons in Sections 4 and 5. To address the concern, we will revise the abstract to incorporate key quantitative findings, such as the reduction in shortcut diagnostics and the gains on unseen compositions for Sth-com and EK100-com.
  Revision: yes
- Referee: [CPR] CPR description: Treating frequent training co-occurrences as hard negatives directly penalizes object-verb pairs that are valid in the training distribution; without reported accuracy on seen compositions, it is impossible to confirm that the regularization does not degrade performance on held-out seen pairs or introduce new biases.
  Authors: CPR applies hard-negative regularization selectively to frequent co-occurrence priors to discourage object-driven shortcuts, while the primary cross-entropy loss continues to supervise all seen compositions. This does not indiscriminately penalize valid training pairs. We acknowledge that accuracy on seen compositions was not explicitly reported in the initial submission. We will add these results in the revised version, demonstrating that performance on held-out seen pairs is preserved while generalization to unseen compositions improves.
  Revision: yes
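The structure described in this response, a primary cross-entropy term plus a selective hard-negative term, can be sketched as follows. This is a hedged illustration, not the authors' loss: the hinge form, the `margin` and `weight` parameters, and the function names are assumptions, and the logits are made-up numbers.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy over a dict of composition -> logit."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[target]


def hard_negative_penalty(logits, target, negatives, margin=1.0):
    """Assumed hinge penalty: non-zero only when a hard negative scores
    within `margin` of the target composition."""
    return sum(max(0.0, margin - (logits[target] - logits[n])) for n in negatives)


def total_loss(logits, target, negatives, weight=0.5):
    """Cross-entropy supervises every sample; the penalty touches only
    the selected hard negatives."""
    penalty = hard_negative_penalty(logits, target, negatives)
    return cross_entropy(logits, target) + weight * penalty


# Made-up logits for a clip labeled ("close", "drawer").
logits = {
    ("close", "drawer"): 2.0,
    ("open", "drawer"): 1.5,   # frequent pair for "drawer", treated as hard negative
    ("push", "box"): -1.0,
}
target = ("close", "drawer")

near = hard_negative_penalty(logits, target, [("open", "drawer")])
far = hard_negative_penalty(logits, target, [("push", "box")])
print(near, far)
print(total_loss(logits, target, [("open", "drawer")]) > cross_entropy(logits, target))
```

Under these assumptions, a competing composition that is already well separated from the target adds nothing to the loss, which is one way a regularizer could stay selective rather than penalizing every valid training pair.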
Circularity Check
No significant circularity: RCORE regularization is explicitly constructed to target identified diagnostics rather than reducing claims to fitted inputs or self-citations.
The paper first defines diagnostic metrics for object-driven shortcuts based on training co-occurrence patterns, then introduces CPR (treating frequent pairs as hard negatives) and TORC (enforcing temporal order) as targeted regularizers. These steps are forward-designed interventions, not tautological reductions where a 'prediction' equals the input fit by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided chain. The central claim of reduced diagnostics and improved unseen generalization is presented as an empirical outcome of the proposed terms, with independent content from the diagnostics themselves. This matches the default non-circular case.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Video sequences contain distinguishable temporal order information usable for verb discrimination.
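A toy example shows why this assumption matters. The sketch below represents each frame by a single hypothetical number (how open a drawer is) and compares an order-invariant feature with an order-sensitive one. The scalar "openness" signal and the function names are assumptions made for illustration; real video features are high-dimensional.

```python
def mean_pool(frames):
    """Order-invariant summary: identical for any permutation of the frames."""
    return sum(frames) / len(frames)


def net_change(frames):
    """Order-sensitive summary: sign flips when the clip is reversed."""
    return frames[-1] - frames[0]


# Hypothetical per-frame "drawer openness" values for an opening clip.
opening = [0.0, 0.25, 0.5, 0.75, 1.0]
# The same frames played backwards look like a closing clip.
closing = list(reversed(opening))

print(mean_pool(opening), mean_pool(closing))    # same value for both clips
print(net_change(opening), net_change(closing))  # opposite signs
```

Because the mean-pooled feature cannot separate the two clips, a model relying on it has only the object left to guess the verb from, which is the kind of shortcut the paper targets.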
Reference graph
Works this paper leans on
- [1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [2] Kyungho Bae, Geo Ahn, Youngrae Kim, and Jinwoo Choi. Devias: Learning disentangled video representations of action and scene for holistic video understanding. In European Conference on Computer Vision (ECCV), 2024.
- [3] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.
- [4] Wentao Bao, Qi Yu, and Yu Kong. Evidential deep learning for open set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2021.
- [5] Dibyadip Chatterjee, Fadime Sener, Shugao Ma, and Angela Yao. Masked autoencoders are scalable vision learners
- [6] Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. 2019.
- [7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.
- [8] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jürgen Gall, Rainer Stiefelhagen, and Luc Van Gool. Large scale holistic video understanding. In European Conference on Computer Vision (ECCV), 2020.
- [9] Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
- [11] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR),
- [12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), 2017.
- [13] Dongyao Jiang, Haodong Jing, Yongqiang Ma, and Nanning Zheng. Beyond image classification: A video benchmark and dual-branch hybrid discrimination framework for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [14] Heeseok Jung, Jun-Hyeon Bak, Yujin Jeong, Gyugeun Lee, Jinwoo Ahn, and Eun-Sol Kim. Zero-shot compositional video learning with coding rate reduction. In IEEE International Conference on Computer Vision (ICCV), 2025.
- [15] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In IEEE International Conference on Computer Vision (ICCV), 2017.
- [16] Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiao-Jun Wu, Muhammad Awais, Sara Atito, and Josef Kittler. C2c: Component-to-composition learning for zero-shot compositional action recognition. In European Conference on Computer Vision (ECCV), 2024.
- [17] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.
- [18] Yun Li, Zhe Liu, Hang Chen, and Lina Yao. Context-based and diversity-driven specificity in compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [19] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.
- [20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
- [21] Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. Open world compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [22] Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [23] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [24] Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. Learning graph embeddings for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [25] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. 2020.
- [26] Nihal V. Nayak, Peilin Yu, and Stephen Bach. Learning to compose soft prompts for compositional zero-shot learning. In International Conference on Learning Representations (ICLR), 2023.
- [27] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato. Task-driven modular networks for zero-shot compositional learning. In IEEE International Conference on Computer Vision (ICCV), 2019.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
- [29] Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [30] Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, and Sangdoo Yun. Which shortcut cues will DNNs choose? a study from the parameter-space perspective. In International Conference on Learning Representations (ICLR),
- [31] Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.
- [32] Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don’t judge an object by its context: learning to overcome contextual bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [33] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [34] Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, Peng Wang, and Chunhua Shen. Learning conditional attributes for compositional zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [35] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [36] Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin, Xiankai Lu, and Wenguan Wang. A conditional probability framework for compositional zero-shot learning. In IEEE International Conference on Computer Vision (ICCV), 2025.
- [37] Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video understanding. In International Conference on Learning Representations (ICLR), 2018.
- [38] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In IEEE International Conference on Computer Vision (ICCV), 2019.
- [39] Sukmin Yun, Jaehyung Kim, Dongyoon Han, Hwanjun Song, Jung-Woo Ha, and Jinwoo Shin. Time is matter: Temporal self-supervision for video transformers. In International Conference on Machine Learning (ICML), 2022.
- [40] Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, and Gang Hua. Soar: Scene-debiasing open-set action recognition. In IEEE International Conference on Computer Vision (ICCV), 2023.
- [41] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
- [42] Tian Zhang, Kongming Liang, Ruoyi Du, Wei Chen, and Zhanyu Ma. Disentangling before composing: Learning invariant disentangled features for compositional zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 47(2):1132–1147, 2024.
- [43] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022.
Appendix
In this appendix, we provide comprehensive implementation/dataset/method details and quantitative/qualitative results to complement the main paper. We orga...
- [44] Details on EK100-com dataset (Section A)
- [45] Complete implementation details (Section B)
- [46] Additional evidence of object-driven shortcuts (Section C)
- [47] Additional results (Section D).
A. Details on EK100-com dataset
In this section, we provide details about our curated ZS-CAR benchmark, EPIC-KITCHENS-100-composition (EK100-com). We construct EK100-com by repurposing EPIC-KITCHENS-100 (EK100) [7] following the same protocol of constructing Sth-com [16]. In particular, we use the original training (67217...
- [48] (36.4% vs. 35.5%). This suggests that penalizing only the most confusing candidates, specifically those that frequently co-occur with the input components, effectively mitigates co-occurrence bias without compromising the model’s overall performance. For instance, given an input like ‘(Pretending to tear, Paper)’, our intention is to penalize plausible but ...
- [49] To balance the sample sizes between the Temporal and Static splits, we excluded a few verbs with the fewest samples from the Temporal split. We then utilize both original and temporally shuffled inputs to assess the model’s temporal modeling capability and its reliance on static cues. In Figure 10, we present the results for the Temporal/Static splits...