Recognition: no theorem link
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3
The pith
Decomposing interaction phrases into state slots lets a detector verify multiple visual cues jointly, improving rare and unseen human-object interaction detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScriptHOI represents each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict to calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations through interval partial-label learning and counterfactual script contrast loss.
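A minimal sketch of how coverage and conflict could calibrate logits, assuming per-slot agreement scores from the tokenizer and slot-wise matcher; the function name, tensor shapes, and the additive combination rule are all assumptions, since the review does not reproduce the paper's formulas.

```python
# Hypothetical sketch: slot-wise coverage/conflict calibration (assumed form).
import torch

SLOTS = ["body_role", "contact", "geometry", "affordance", "motion", "object_state"]

def calibrate_logits(hoi_logits, slot_match, slot_required, alpha=1.0, beta=1.0):
    """hoi_logits:    [B, C] raw phrase logits for B human-object pairs.
    slot_match:    [B, C, 6] per-slot agreement in [0, 1] between visual state
                   tokens and each phrase's script (tokenizer + matcher output).
    slot_required: [C, 6] binary mask of the slots each phrase's script specifies.
    """
    req = slot_required.unsqueeze(0).float()                 # [1, C, 6]
    # Coverage: mean agreement over the slots the script actually specifies.
    coverage = (slot_match * req).sum(-1) / req.sum(-1).clamp(min=1)
    # Conflict: strongest contradiction among the specified slots.
    conflict = ((1.0 - slot_match) * req).max(-1).values
    # Raise logits for well-covered scripts, suppress contradicted ones.
    return hoi_logits + alpha * coverage - beta * conflict
```

Coverage alone would reward partial evidence; subtracting conflict is what would target the affordance-conflict false positives the abstract highlights.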
What carries the argument
The soft scripted state transition, which breaks an interaction phrase into six slots to jointly check whether visual evidence supports the action rather than relying on object affordance alone.
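For concreteness, a script for the abstract's "cut cake" example might look like the following; the slot values and the helper are illustrative assumptions, since the review does not list the paper's token vocabulary.

```python
# Hypothetical six-slot script for "cut cake"; values are illustrative only.
cut_cake_script = {
    "body_role":    "hand-wields-tool",
    "contact":      "tool-touches-target",
    "geometry":     "blade-aligned-with-surface",
    "affordance":   "knife-cuts, cake-cuttable",
    "motion":       "downward-or-sawing",
    "object_state": "intact -> divided",
}

def script_supported(visual_tokens: dict, script: dict) -> bool:
    # Every specified slot must be matched by the parsed visual state tokens;
    # object affordance alone (a knife near a cake) is not enough.
    return all(visual_tokens.get(slot) == value for slot, value in script.items())
```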
If this is right
- Higher accuracy on rare and unseen interaction classes in benchmarks like HICO-DET and V-COCO.
- Fewer false positives from cases where object affordance suggests an action the visual states do not support.
- Logit calibration that raises or lowers scores according to how completely a script is visually realized.
- Training signals that bound probabilities for unannotated candidates instead of treating them as strict negatives.
- Reduced reliance on object-only cues through losses that swap individual script slots (minimal sketches of both training signals follow this list).
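The last two bullets describe concrete training signals. Here is a minimal sketch of both under assumed forms; the review states what they do, not their equations, so the function names, arguments, and margin value are assumptions.

```python
# Hypothetical sketches of the two training signals (assumed forms).
import torch
import torch.nn.functional as F

def interval_partial_label_loss(probs, lower, upper):
    """Penalize an unannotated candidate only when its predicted probability
    leaves a script-derived interval [lower, upper], rather than pushing it
    toward zero as a closed-world negative. All tensors are [B, C]."""
    below = F.relu(lower - probs)   # probability fell under the lower bound
    above = F.relu(probs - upper)   # probability exceeded the upper bound
    return (below + above).mean()

def counterfactual_script_contrast(score_fn, pair_feat, script, swaps, margin=0.2):
    """Swap one script slot at a time for a contradictory value and require the
    true script to outscore each counterfactual by a margin, discouraging
    object-only shortcuts. score_fn(pair_feat, script) -> scalar tensor."""
    pos = score_fn(pair_feat, script)
    losses = [F.relu(margin - (pos - score_fn(pair_feat, {**script, slot: val})))
              for slot, val in swaps.items()]
    return torch.stack(losses).mean()
```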
Where Pith is reading between the lines
- The slot structure could transfer to video settings by adding temporal consistency checks across frames.
- Real-world deployment might benefit from scripts that also encode scene context to handle cluttered environments.
- Extending coverage to multi-person or tool-use sequences would test whether the same calibration logic scales.
Load-bearing premise
The visual state tokenizer can reliably parse human-object pairs into accurate tokens across all six slots, and the resulting coverage and conflict scores give valid calibration without new biases or overlooked cues.
What would settle it
A dataset of human-object pairs with expert-annotated states for each slot: the claim would fail if the tokenizer matched those labels only at chance level and the rare-class gains disappeared on held-out splits.
read the original abstract
Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ScriptHOI, a framework for open-vocabulary HOI detection that represents each interaction as a soft scripted state transition decomposed into six slots (body-role, contact, geometry, affordance, motion, object-state). A visual state tokenizer parses detected human-object pairs into state tokens; a slot-wise matcher then computes script coverage and script conflict to calibrate logits, expose missing evidence, and supply training constraints. Interval partial-label learning replaces closed-world negatives with script-derived probability bounds, and a counterfactual script contrast loss discourages object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary splits report gains on rare/unseen classes and fewer affordance-conflict false positives.
Significance. If the structured components prove load-bearing, the work offers a concrete mechanism for injecting state-transition logic into vision-language HOI detectors, addressing the well-known problem of affordance and co-occurrence shortcuts. The interval partial-label learning and script-derived bounds directly target incomplete annotation, a persistent issue in HOI benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the explicit decomposition into slots and the counterfactual contrast loss constitute reproducible design choices that could be tested on other structured-prediction tasks.
major comments (3)
- [§4.2] Visual state tokenizer: the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations; a sketch of what such an evaluation could look like follows this list). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
- [§5.3] Ablation on script coverage/conflict: the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
- [§3.3] Interval partial-label learning: the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
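A minimal sketch of the per-slot evaluation the first major comment requests, assuming a held-out set with expert labels per slot; pred_tokens, gold_tokens, and the helper itself are hypothetical, since the paper as reviewed contains no such annotations.

```python
# Hypothetical per-slot tokenizer evaluation against expert state annotations.
from sklearn.metrics import precision_recall_fscore_support

SLOTS = ["body_role", "contact", "geometry", "affordance", "motion", "object_state"]

def per_slot_report(pred_tokens, gold_tokens, slots=SLOTS):
    """pred_tokens, gold_tokens: lists of dicts mapping slot -> token string,
    one dict per detected human-object pair."""
    report = {}
    for slot in slots:
        y_pred = [p[slot] for p in pred_tokens]
        y_true = [g[slot] for g in gold_tokens]
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0)
        report[slot] = {"precision": prec, "recall": rec, "f1": f1}
    return report
```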
minor comments (2)
- [§3.1] Notation for the six state slots is introduced in the abstract but the precise token vocabulary size and embedding dimension for each slot are not stated until the implementation details; moving this information to §3.1 would improve readability.
- [Figure 2] Figure 2 (slot-wise matcher diagram) uses the same color for 'coverage' and 'conflict' arrows; distinct colors or hatching would reduce visual ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to the manuscript where appropriate.
read point-by-point responses
- Referee: [§4.2] Visual state tokenizer: the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
Authors: We agree that an independent quantitative evaluation of the tokenizer on the six slots would strengthen the claims regarding the reliability of coverage and conflict estimates. The current manuscript does not include held-out state annotations for these slots, as generating them would require substantial new labeling effort outside the paper's scope. The tokenizer is trained end-to-end, and its utility is shown through overall gains on rare/unseen HOI classes plus qualitative reductions in affordance conflicts. We will add qualitative visualizations of tokenizer outputs on example pairs in the revision. revision: partial
- Referee: [§5.3] Ablation on script coverage/conflict: the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
Authors: The referee correctly notes the missing ablation. We will add an experiment that removes only the script coverage and conflict calibration while retaining the visual state tokenizer and counterfactual contrast loss. This will isolate whether gains arise from state-transition logic versus general regularization. revision: yes
- Referee: [§3.3] Interval partial-label learning: the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
Authors: The bounds are derived conservatively from script coverage to avoid hard negatives on unannotated candidates. We do not provide a formal proof of unbiasedness, but the design uses loose intervals to accommodate potential unscripted interactions, and experiments show gains on rare classes without suppressing annotated positives. We will expand §3.3 with this rationale and empirical support. revision: partial
not addressed (1)
- Independent quantitative per-slot evaluation of the visual state tokenizer, due to absence of held-out state annotations in the current experimental setup.
Circularity Check
Low circularity: script coverage/conflict computed from visual tokenizer rather than target labels
full rationale
The framework decomposes phrases into slots, uses a visual state tokenizer on detected pairs to produce tokens, then computes coverage and conflict from those tokens to calibrate logits. These quantities are derived from visual inputs and the proposed tokenizer, not defined directly from HOI class labels by construction. Interval partial-label learning and counterfactual contrast loss add constraints without reducing the central claims to fitted inputs or self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The evaluation rests on external benchmarks processed through independent visual steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human-object interactions can be decomposed into the six slots of body-role, contact, geometry, affordance, motion, and object-state.
- domain assumption A visual state tokenizer can parse detected human-object pairs into corresponding state tokens.
invented entities (3)
- script coverage (no independent evidence)
- script conflict (no independent evidence)
- interval partial-label learning (no independent evidence)