pith. machine review for the scientific record.

arxiv: 2605.05057 · v2 · submitted 2026-05-06 · 💻 cs.CV

Recognition: no theorem link

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords human-object interaction detection · open-vocabulary learning · state transitions · scripted interactions · partial label learning · vision-language models · affordance modeling

The pith

Decomposing interaction phrases into state slots lets the detector verify multiple visual cues, improving rare and unseen human-object interaction detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current open-vocabulary detectors often predict actions from object presence and typical co-occurrences, such as guessing "cut cake" from a knife and a cake without checking hand position or actual contact. ScriptHOI instead models each phrase as a soft scripted state transition split across six slots for body-role, contact, geometry, affordance, motion, and object state. A tokenizer turns visual human-object pairs into state tokens, while a matcher computes coverage and conflict scores to adjust logits and add training constraints. Interval partial-label learning handles missing annotations, and a contrast loss prevents object-only shortcuts. The approach yields gains on infrequent and novel interactions, plus fewer false positives driven by affordance mismatches.
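A minimal sketch of the slot-wise matching this describes. The six slot names come from the paper; the token shapes, the similarity thresholds, and the linear calibration rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

SLOTS = ["body_role", "contact", "geometry", "affordance", "motion", "object_state"]

def match_script(visual_tokens, script_tokens, tau=0.5):
    """Compare per-slot visual state tokens against a phrase's script.

    visual_tokens, script_tokens: dicts mapping slot name -> unit vector.
    Returns (coverage, conflict): the fraction of slots the visual evidence
    supports, and the fraction it actively contradicts (similarity < -tau).
    """
    sims = np.array([float(visual_tokens[s] @ script_tokens[s]) for s in SLOTS])
    coverage = float((sims > tau).mean())   # slots with supporting evidence
    conflict = float((sims < -tau).mean())  # slots with contradicting evidence
    return coverage, conflict

def calibrate_logit(logit, coverage, conflict, alpha=1.0, beta=2.0):
    # Raise the HOI score with realized coverage, penalize contradicted slots.
    return logit + alpha * coverage - beta * conflict
```

Under this sketch, "cut cake" with a knife in frame but no hand contact would score high on affordance yet low on contact, so conflict pulls the calibrated logit down.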

Core claim

ScriptHOI represents each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict to calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations through interval partial-label learning and counterfactual script contrast loss.

What carries the argument

The soft scripted state transition, which breaks an interaction phrase into six slots to jointly check whether visual evidence supports the action rather than relying on object affordance alone.

If this is right

  • Higher accuracy on rare and unseen interaction classes in benchmarks like HICO-DET and V-COCO.
  • Fewer false positives from cases where object affordance suggests an action the visual states do not support.
  • Logit calibration that raises or lowers scores according to how completely a script is visually realized.
  • Training signals that bound probabilities for unannotated candidates instead of treating them as strict negatives.
  • Reduced reliance on object-only cues through losses that swap individual script slots.
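The slot-swapping in the last bullet can be sketched as a hinge-style contrast. The single-slot swap is the paper's idea; treating negation of the slot token as the counterfactual, the margin value, and the scoring function are assumptions for illustration.

```python
import numpy as np

def slot_swap_contrast(score_fn, tokens, script, slots, margin=0.2):
    """Hinge loss asking each counterfactual pair (one slot's evidence
    contradicted) to score at least `margin` below the original pair."""
    base = score_fn(tokens, script)
    penalties = []
    for s in slots:
        cf = dict(tokens)
        cf[s] = -tokens[s]  # counterfactual: flip this slot's state token
        penalties.append(max(0.0, score_fn(cf, script) - base + margin))
    return sum(penalties) / len(slots)
```

If contradicting a single slot barely moves the score, the model is leaning on object-only cues, and the hinge term becomes positive.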

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The slot structure could transfer to video settings by adding temporal consistency checks across frames.
  • Real-world deployment might benefit from scripts that also encode scene context to handle cluttered environments.
  • Extending coverage to multi-person or tool-use sequences would test whether the same calibration logic scales.

Load-bearing premise

The visual state tokenizer can reliably parse human-object pairs into accurate tokens across all six slots, and the resulting coverage and conflict scores give valid calibration without new biases or overlooked cues.

What would settle it

A dataset of human-object pairs with expert-annotated states for each slot: if the tokenizer matched those labels only at chance level and the rare-class gains disappeared on held-out splits, the core claim would fail.

Figures

Figures reproduced from arXiv: 2605.05057 by Bao Ngoc Le, Linh Chi Vo, Minh Anh Nguyen, Quang Huy Tran, Suiyang Guang, Tuan Kiet Pham.

Figure 1. Overall framework of ScriptHOI. The visual branch parses a detected human-object pair into state tokens, while the language …

Original abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ScriptHOI, a framework for open-vocabulary HOI detection that represents each interaction as a soft scripted state transition decomposed into six slots (body-role, contact, geometry, affordance, motion, object-state). A visual state tokenizer parses detected human-object pairs into state tokens; a slot-wise matcher then computes script coverage and script conflict to calibrate logits, expose missing evidence, and supply training constraints. Interval partial-label learning replaces closed-world negatives with script-derived probability bounds, and a counterfactual script contrast loss discourages object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary splits report gains on rare/unseen classes and fewer affordance-conflict false positives.

Significance. If the structured components prove load-bearing, the work offers a concrete mechanism for injecting state-transition logic into vision-language HOI detectors, addressing the well-known problem of affordance and co-occurrence shortcuts. The interval partial-label learning and script-derived bounds directly target incomplete annotation, a persistent issue in HOI benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the explicit decomposition into slots and the counterfactual contrast loss constitute reproducible design choices that could be tested on other structured-prediction tasks.

major comments (3)
  1. [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
  2. [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
  3. [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
minor comments (2)
  1. [§3.1] Notation for the six state slots is introduced in the abstract but the precise token vocabulary size and embedding dimension for each slot are not stated until the implementation details; moving this information to §3.1 would improve readability.
  2. [Figure 2] Figure 2 (slot-wise matcher diagram) uses the same color for 'coverage' and 'conflict' arrows; distinct colors or hatching would reduce visual ambiguity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.

    Authors: We agree that an independent quantitative evaluation of the tokenizer on the six slots would strengthen the claims regarding the reliability of coverage and conflict estimates. The current manuscript does not include held-out state annotations for these slots, as generating them would require substantial new labeling effort outside the paper's scope. The tokenizer is trained end-to-end, and its utility is shown through overall gains on rare/unseen HOI classes plus qualitative reductions in affordance conflicts. We will add qualitative visualizations of tokenizer outputs on example pairs in the revision. revision: partial

  2. Referee: [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.

    Authors: The referee correctly notes the missing ablation. We will add an experiment that removes only the script coverage and conflict calibration while retaining the visual state tokenizer and counterfactual contrast loss. This will isolate whether gains arise from state-transition logic versus general regularization. revision: yes

  3. Referee: [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.

    Authors: The bounds are derived conservatively from script coverage to avoid hard negatives on unannotated candidates. We do not provide a formal proof of unbiasedness, but the design uses loose intervals to accommodate potential unscripted interactions, and experiments show gains on rare classes without suppressing annotated positives. We will expand §3.3 with this rationale and empirical support. revision: partial
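A minimal sketch of what such script-derived interval bounds could look like as a training loss. Bounding unannotated candidates between a lower and upper probability is the paper's idea; the quadratic penalty and the specific bound values below are assumptions.

```python
import numpy as np

def interval_pll_loss(p, lower, upper):
    """Penalize a predicted probability only when it leaves [lower, upper].

    Annotated positives use lower = upper = 1; a closed-world negative would
    use lower = upper = 0. Unannotated candidates get script-derived loose
    bounds, so plausible-but-unlabeled interactions are never pushed to zero.
    """
    p, lower, upper = map(np.asarray, (p, lower, upper))
    below = np.maximum(lower - p, 0.0)  # distance under the lower bound
    above = np.maximum(p - upper, 0.0)  # distance over the upper bound
    return float(np.mean(below**2 + above**2))
```

With loose bounds such as [0.1, 0.9], a candidate the scripts neither confirm nor contradict contributes no gradient, which is the conservative behavior the rebuttal describes.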

standing simulated objections not resolved
  • Independent quantitative per-slot evaluation of the visual state tokenizer, due to absence of held-out state annotations in the current experimental setup.

Circularity Check

0 steps flagged

Low circularity: script coverage/conflict computed from visual tokenizer rather than target labels

full rationale

The framework decomposes phrases into slots, uses a visual state tokenizer on detected pairs to produce tokens, then computes coverage and conflict from those tokens to calibrate logits. These quantities are derived from visual inputs and the proposed tokenizer, not defined directly from HOI class labels by construction. Interval partial-label learning and counterfactual contrast loss add constraints without reducing the central claims to fitted inputs or self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The paper is self-contained against external benchmarks with independent visual processing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about interaction decomposability and introduces several new entities without independent evidence provided in the abstract.

axioms (2)
  • domain assumption Human-object interactions can be decomposed into the six slots of body-role, contact, geometry, affordance, motion, and object-state.
    This decomposition is the foundation of the ScriptHOI representation.
  • domain assumption A visual state tokenizer can parse detected human-object pairs into corresponding state tokens.
    Required for the slot-wise matcher to operate.
invented entities (3)
  • script coverage no independent evidence
    purpose: Estimates the degree to which visual evidence supports the interaction script
    New quantity used to calibrate HOI logits and expose missing evidence.
  • script conflict no independent evidence
    purpose: Identifies inconsistencies or missing visual support for the script
    Used alongside coverage to adjust predictions.
  • interval partial-label learning no independent evidence
    purpose: Provides probability bounds for unannotated interaction candidates during training
    New constraint to avoid treating valid but unlabeled interactions as negatives.

pith-pipeline@v0.9.0 · 5594 in / 1522 out tokens · 52545 ms · 2026-05-13T01:40:01.161646+00:00 · methodology

discussion (0)

