pith. machine review for the scientific record.

arxiv: 2605.05057 · v2 · submitted 2026-05-06 · 💻 cs.CV

Recognition: no theorem link

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords human-object interaction detection · open-vocabulary learning · state transitions · scripted interactions · partial label learning · vision-language models · affordance modeling

The pith

Decomposing interaction phrases into state slots lets the detector verify multiple visual cues, improving rare and unseen human-object interaction detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current open-vocabulary detectors often predict actions from object presence and typical co-occurrences, such as guessing "cut cake" from a knife and a cake without checking hand position or actual contact. ScriptHOI instead models each phrase as a soft scripted state transition split across six slots for body-role, contact, geometry, affordance, motion, and object state. A tokenizer turns visual human-object pairs into state tokens, while a matcher computes coverage and conflict scores to adjust logits and add training constraints. Interval partial-label learning handles missing annotations, and a contrast loss prevents object-only shortcuts. The approach yields gains on infrequent and novel interactions, plus fewer false positives driven by affordance mismatches.
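A minimal sketch of the slot-wise matching this describes. The six slot names come from the paper; the token shapes, the similarity thresholds, and the linear calibration rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

SLOTS = ["body_role", "contact", "geometry", "affordance", "motion", "object_state"]

def match_script(visual_tokens, script_tokens, tau=0.5):
    """Compare per-slot visual state tokens against a phrase's script.

    visual_tokens, script_tokens: dicts mapping slot name -> unit vector.
    Returns (coverage, conflict): the fraction of slots the visual evidence
    supports, and the fraction it actively contradicts (similarity < -tau).
    """
    sims = np.array([float(visual_tokens[s] @ script_tokens[s]) for s in SLOTS])
    coverage = float((sims > tau).mean())   # slots with supporting evidence
    conflict = float((sims < -tau).mean())  # slots with contradicting evidence
    return coverage, conflict

def calibrate_logit(logit, coverage, conflict, alpha=1.0, beta=2.0):
    # Raise the HOI score with realized coverage, penalize contradicted slots.
    return logit + alpha * coverage - beta * conflict
```

Under this sketch, "cut cake" with a knife in frame but no hand contact would score high on affordance yet low on contact, so conflict pulls the calibrated logit down.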

Core claim

ScriptHOI represents each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict to calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations through interval partial-label learning and counterfactual script contrast loss.

What carries the argument

The soft scripted state transition, which breaks an interaction phrase into six slots to jointly check whether visual evidence supports the action rather than relying on object affordance alone.

If this is right

  • Higher accuracy on rare and unseen interaction classes in benchmarks like HICO-DET and V-COCO.
  • Fewer false positives from cases where object affordance suggests an action the visual states do not support.
  • Logit calibration that raises or lowers scores according to how completely a script is visually realized.
  • Training signals that bound probabilities for unannotated candidates instead of treating them as strict negatives.
  • Reduced reliance on object-only cues through losses that swap individual script slots.
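The slot-swapping in the last bullet can be sketched as a hinge-style contrast. The single-slot swap is the paper's idea; treating negation of the slot token as the counterfactual, the margin value, and the scoring function are assumptions for illustration.

```python
import numpy as np

def slot_swap_contrast(score_fn, tokens, script, slots, margin=0.2):
    """Hinge loss asking each counterfactual pair (one slot's evidence
    contradicted) to score at least `margin` below the original pair."""
    base = score_fn(tokens, script)
    penalties = []
    for s in slots:
        cf = dict(tokens)
        cf[s] = -tokens[s]  # counterfactual: flip this slot's state token
        penalties.append(max(0.0, score_fn(cf, script) - base + margin))
    return sum(penalties) / len(slots)
```

If contradicting a single slot barely moves the score, the model is leaning on object-only cues, and the hinge term becomes positive.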

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The slot structure could transfer to video settings by adding temporal consistency checks across frames.
  • Real-world deployment might benefit from scripts that also encode scene context to handle cluttered environments.
  • Extending coverage to multi-person or tool-use sequences would test whether the same calibration logic scales.

Load-bearing premise

The visual state tokenizer can reliably parse human-object pairs into accurate tokens across all six slots, and the resulting coverage and conflict scores give valid calibration without new biases or overlooked cues.

What would settle it

A dataset of human-object pairs with expert-annotated states for each slot: if the tokenizer matched those labels only at chance level and the rare-class gains disappeared on held-out splits, the core claim would fail.

Figures

Figures reproduced from arXiv: 2605.05057 by Bao Ngoc Le, Linh Chi Vo, Minh Anh Nguyen, Quang Huy Tran, Suiyang Guang, Tuan Kiet Pham.

Figure 1. Overall framework of ScriptHOI. The visual branch parses a detected human-object pair into state tokens, while the language …

Original abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ScriptHOI, a framework for open-vocabulary HOI detection that represents each interaction as a soft scripted state transition decomposed into six slots (body-role, contact, geometry, affordance, motion, object-state). A visual state tokenizer parses detected human-object pairs into state tokens; a slot-wise matcher then computes script coverage and script conflict to calibrate logits, expose missing evidence, and supply training constraints. Interval partial-label learning replaces closed-world negatives with script-derived probability bounds, and a counterfactual script contrast loss discourages object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary splits report gains on rare/unseen classes and fewer affordance-conflict false positives.

Significance. If the structured components prove load-bearing, the work offers a concrete mechanism for injecting state-transition logic into vision-language HOI detectors, addressing the well-known problem of affordance and co-occurrence shortcuts. The interval partial-label learning and script-derived bounds directly target incomplete annotation, a persistent issue in HOI benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the explicit decomposition into slots and the counterfactual contrast loss constitute reproducible design choices that could be tested on other structured-prediction tasks.

major comments (3)
  1. [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
  2. [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
  3. [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
minor comments (2)
  1. [§3.1] Notation for the six state slots is introduced in the abstract but the precise token vocabulary size and embedding dimension for each slot are not stated until the implementation details; moving this information to §3.1 would improve readability.
  2. [Figure 2] Figure 2 (slot-wise matcher diagram) uses the same color for 'coverage' and 'conflict' arrows; distinct colors or hatching would reduce visual ambiguity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.

    Authors: We agree that an independent quantitative evaluation of the tokenizer on the six slots would strengthen the claims regarding the reliability of coverage and conflict estimates. The current manuscript does not include held-out state annotations for these slots, as generating them would require substantial new labeling effort outside the paper's scope. The tokenizer is trained end-to-end, and its utility is shown through overall gains on rare/unseen HOI classes plus qualitative reductions in affordance conflicts. We will add qualitative visualizations of tokenizer outputs on example pairs in the revision. revision: partial

  2. Referee: [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.

    Authors: The referee correctly notes the missing ablation. We will add an experiment that removes only the script coverage and conflict calibration while retaining the visual state tokenizer and counterfactual contrast loss. This will isolate whether gains arise from state-transition logic versus general regularization. revision: yes

  3. Referee: [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.

    Authors: The bounds are derived conservatively from script coverage to avoid hard negatives on unannotated candidates. We do not provide a formal proof of unbiasedness, but the design uses loose intervals to accommodate potential unscripted interactions, and experiments show gains on rare classes without suppressing annotated positives. We will expand §3.3 with this rationale and empirical support. revision: partial
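A minimal sketch of what such script-derived interval bounds could look like as a training loss. Bounding unannotated candidates between a lower and upper probability is the paper's idea; the quadratic penalty and the specific bound values below are assumptions.

```python
import numpy as np

def interval_pll_loss(p, lower, upper):
    """Penalize a predicted probability only when it leaves [lower, upper].

    Annotated positives use lower = upper = 1; a closed-world negative would
    use lower = upper = 0. Unannotated candidates get script-derived loose
    bounds, so plausible-but-unlabeled interactions are never pushed to zero.
    """
    p, lower, upper = map(np.asarray, (p, lower, upper))
    below = np.maximum(lower - p, 0.0)  # distance under the lower bound
    above = np.maximum(p - upper, 0.0)  # distance over the upper bound
    return float(np.mean(below**2 + above**2))
```

With loose bounds such as [0.1, 0.9], a candidate the scripts neither confirm nor contradict contributes no gradient, which is the conservative behavior the rebuttal describes.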

standing simulated objections not resolved
  • Independent quantitative per-slot evaluation of the visual state tokenizer, due to absence of held-out state annotations in the current experimental setup.

Circularity Check

0 steps flagged

Low circularity: script coverage/conflict computed from visual tokenizer rather than target labels

full rationale

The framework decomposes phrases into slots, uses a visual state tokenizer on detected pairs to produce tokens, then computes coverage and conflict from those tokens to calibrate logits. These quantities are derived from visual inputs and the proposed tokenizer, not defined directly from HOI class labels by construction. Interval partial-label learning and counterfactual contrast loss add constraints without reducing the central claims to fitted inputs or self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The paper is self-contained against external benchmarks with independent visual processing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about interaction decomposability and introduces several new entities without independent evidence provided in the abstract.

axioms (2)
  • domain assumption Human-object interactions can be decomposed into the six slots of body-role, contact, geometry, affordance, motion, and object-state.
    This decomposition is the foundation of the ScriptHOI representation.
  • domain assumption A visual state tokenizer can parse detected human-object pairs into corresponding state tokens.
    Required for the slot-wise matcher to operate.
invented entities (3)
  • script coverage no independent evidence
    purpose: Estimates the degree to which visual evidence supports the interaction script
    New quantity used to calibrate HOI logits and expose missing evidence.
  • script conflict no independent evidence
    purpose: Identifies inconsistencies or missing visual support for the script
    Used alongside coverage to adjust predictions.
  • interval partial-label learning no independent evidence
    purpose: Provides probability bounds for unannotated interaction candidates during training
    New constraint to avoid treating valid but unlabeled interactions as negatives.

pith-pipeline@v0.9.0 · 5594 in / 1522 out tokens · 52545 ms · 2026-05-13T01:40:01.161646+00:00 · methodology

discussion (0)

