ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Amir Hosseini; Sara Farahani; Suiyang Guang; Xinyi Li

arxiv: 2604.22546 · v5 · submitted 2026-04-24 · 💻 cs.CV

Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords scene graph generation · open-vocabulary learning · relation lattice · incomplete annotations · positive-unlabeled learning · semantic consistency · visual language models

The pith

ReLIC-SGG infers missing relations in open-vocabulary scene graphs using a semantic predicate lattice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses incomplete annotations in scene graph generation for open vocabularies, where many true relations go unlabeled and predicates can describe the same interaction at different levels. It introduces a framework that models unannotated object-pair relations as latent variables instead of assuming they are negatives. A semantic relation lattice is constructed to capture how predicates relate through similarity, entailment, and contradiction. This lattice helps infer which missing relations are likely true based on visual-language matches, surrounding graph context, and consistency rules, leading to improved handling of rare and unseen predicates.

Core claim

ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs.

What carries the argument

The semantic relation lattice that encodes relationships like similarity, entailment, and contradiction between predicates to infer and complete missing positive relations.
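As a rough illustration of what such a lattice might store, the sketch below keeps entailment edges and contradiction pairs between predicate strings. The class name, method names, and the example edges (including "on" entailing "supported by") are assumptions made for illustration, not the paper's construction; similarity scores are omitted for brevity.

```python
from itertools import product


class RelationLattice:
    """Toy predicate lattice: entailment edges plus symmetric contradiction pairs."""

    def __init__(self):
        self.entails = {}         # predicate -> directly entailed (more general) predicates
        self.contradicts = set()  # frozenset({a, b}) pairs that cannot both hold

    def add_entailment(self, specific, general):
        self.entails.setdefault(specific, set()).add(general)
        self.entails.setdefault(general, set())

    def add_contradiction(self, a, b):
        self.contradicts.add(frozenset((a, b)))

    def implied(self, predicate):
        """All predicates reachable through entailment edges (transitive closure)."""
        seen, stack = set(), [predicate]
        while stack:
            for parent in self.entails.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def conflict(self, a, b):
        """True if a and b, or anything they entail, are marked contradictory."""
        left = self.implied(a) | {a}
        right = self.implied(b) | {b}
        return any(frozenset((x, y)) in self.contradicts
                   for x, y in product(left, right))


# Hypothetical edges, chosen only to exercise the structure.
lattice = RelationLattice()
lattice.add_entailment("standing on", "on")
lattice.add_entailment("on", "supported by")
lattice.add_contradiction("on", "under")
```

With these toy edges, `lattice.implied("standing on")` contains both "on" and "supported by", and `lattice.conflict("standing on", "under")` is true because the contradiction on "on" is inherited through entailment.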

If this is right

Improves recognition of rare and unseen predicates on benchmarks.
Recovers more missing relations in conventional, open-vocabulary, and panoptic SGG tasks.
Reduces the impact of false-negative labels during training via positive-unlabeled learning.
Generates scene graphs that are more compact and semantically consistent through guided decoding.
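One way to picture the guided decoding mentioned in the last item is a greedy pass over the scored predicates for a single object pair. The sketch below is an assumption about how "compact and consistent" could be enforced (prefer the most specific predicate, reject contradictions); the function name, threshold, and toy scores are hypothetical and not taken from the paper.

```python
def decode(scores, entails, contradicts, threshold=0.5):
    """Greedy decoding for one object pair.

    scores: predicate -> confidence in [0, 1]
    entails: predicate -> set of predicates it implies (assumed transitively closed)
    contradicts: set of frozenset({a, b}) pairs that must not co-occur
    """
    def closure(pred):
        return {pred} | entails.get(pred, set())

    kept = []
    for pred, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break
        # Consistency: skip predicates that clash with a kept one or with
        # anything a kept one implies.
        if any(frozenset((a, b)) in contradicts
               for k in kept for a in closure(pred) for b in closure(k)):
            continue
        # Compactness: skip predicates already implied by a kept one.
        if any(pred in entails.get(k, ()) for k in kept):
            continue
        # Drop kept predicates that the new, more specific one implies.
        kept = [k for k in kept if k not in entails.get(pred, ())]
        kept.append(pred)
    return kept


# Made-up scores for one object pair.
entails = {"standing on": {"on"}}
contradicts = {frozenset(("on", "under"))}
scores = {"on": 0.9, "standing on": 0.7, "under": 0.6, "holding": 0.2}
print(decode(scores, entails, contradicts))  # -> ['standing on']
```

Here "on" is replaced by the more specific "standing on", "under" is rejected because it contradicts the implied "on", and "holding" falls below the threshold.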

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method may generalize to other vision tasks where labels are incomplete, such as action recognition or visual question answering.
Semantic lattices could be combined with large language models to dynamically expand relation vocabularies.
Testing on datasets with varying annotation densities would reveal how much the lattice compensates for human annotation gaps.

Load-bearing premise

That the semantic relation lattice plus visual-language compatibility and graph context can reliably distinguish true missing positives from true negatives without introducing more errors than it removes.

What would settle it

A manual inspection or human evaluation of inferred relations showing that more than half of the newly added relations are incorrect or that overall graph accuracy decreases.
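The "more than half incorrect" criterion can be stated as a precision check over human verdicts. The sketch below uses made-up audit labels purely to show the arithmetic; nothing here is data from the paper.

```python
def inferred_precision(verdicts):
    """verdicts: list of booleans, True when a human judged an inferred relation correct."""
    if not verdicts:
        raise ValueError("no verified relations to score")
    return sum(verdicts) / len(verdicts)


# Hypothetical audit of eight inferred triplets (illustrative labels only).
audit = [True, True, False, True, False, True, True, False]
precision = inferred_precision(audit)
# More than half incorrect is the same as precision below 0.5.
falsified = precision < 0.5
print(f"precision={precision:.3f}, falsified={falsified}")
```

In this toy audit five of eight relations are judged correct, so precision is 0.625 and the criterion is not met.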

Figures

Figures reproduced from arXiv: 2604.22546 by Amir Hosseini, Sara Farahani, Suiyang Guang, Xinyi Li.

**Figure 1.** Overall framework of ReLIC-SGG. Given detected objects, the model first proposes open-vocabulary relation candidates for each …

Original abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
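The abstract does not spell out the form of the positive-unlabeled objective. As a point of reference only, the sketch below implements the standard non-negative PU risk estimator of Kiryo et al. (2017) with a sigmoid surrogate loss for a single predicate; the function names and the class-prior argument are assumptions, and the paper's graph-level objective may differ.

```python
import math
from statistics import fmean


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; label is +1 or -1."""
    return 1.0 / (1.0 + math.exp(label * score))


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk for one predicate.

    pos_scores: classifier scores for annotated (positive) pairs
    unl_scores: classifier scores for unannotated (unlabeled) pairs
    prior: assumed fraction of true positives among all pairs, in (0, 1)
    """
    risk_pos_as_pos = fmean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = fmean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = fmean(sigmoid_loss(s, -1) for s in unl_scores)
    # Unlabeled data mixes positives and negatives, so subtract the positive
    # share, then clamp at zero so the estimated negative risk stays non-negative.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk
```

The clamp is what distinguishes this estimator from the unbiased PU risk: without it, a flexible model can drive the negative term below zero and overfit.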

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReLIC-SGG tries to complete missing relations in open-vocabulary SGG with a semantic lattice and positive-unlabeled learning, but the abstract gives no direct check that the inferences add more true positives than errors.

ReLIC-SGG starts from the real problem that scene graph annotations leave many valid relations unlabeled, and this gets worse with open vocabularies where the same interaction can be described in multiple ways. The paper builds a semantic relation lattice to capture entailment, similarity, and contradiction among predicates, then treats unannotated pairs as latent variables. It combines visual-language compatibility, graph context, and lattice consistency to infer positives, adds a positive-unlabeled objective, and uses lattice-guided decoding for consistent output graphs.
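To make the "combine three evidence sources" step concrete, the sketch below fuses visual-language, graph-context, and lattice-consistency scores into a single probability that an unannotated relation is a latent positive. The logistic form, the weights, and the bias are hypothetical choices for illustration; the page does not state how the paper combines these signals.

```python
import math


def latent_positive_prob(vl_score, context_score, lattice_score,
                         weights=(1.0, 1.0, 1.0), bias=0.0):
    """Fuse three real-valued evidence scores into a probability.

    The weights and bias are hypothetical parameters, not values from the paper.
    """
    w_vl, w_ctx, w_lat = weights
    logit = (w_vl * vl_score + w_ctx * context_score
             + w_lat * lattice_score + bias)
    return 1.0 / (1.0 + math.exp(-logit))


# Same visual and context evidence, opposite lattice evidence.
supported = latent_positive_prob(2.0, 1.0, 1.0)
contradicted = latent_positive_prob(2.0, 1.0, -1.0)
```

Under this toy fusion, a candidate whose lattice score is negative (for example, because it contradicts an annotated relation) receives a lower probability than one the lattice supports, all else being equal.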

Referee Report

2 major / 2 minor

Summary. ReLIC-SGG proposes a relation-incompleteness-aware framework for open-vocabulary scene graph generation. It treats unannotated object-pair relations as latent variables rather than negatives, constructs a semantic relation lattice to model predicate similarity/entailment/contradiction, and infers missing positives from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled learning objective is introduced, and lattice-guided decoding is used to produce consistent graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks are reported to show gains on rare and unseen predicates plus better recovery of missing relations.

Significance. If the lattice-based inference is shown to add more true positives than false positives, the work would meaningfully address a core limitation of SGG datasets (incomplete annotations) and improve open-vocabulary predicate recognition. The combination of semantic lattice modeling with positive-unlabeled learning is a targeted response to granularity and missing-relation issues that prior methods largely ignore.

major comments (2)

[§3, Method] The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.
[§4, Experiments] Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.

minor comments (2)

[§3.1] Notation for the lattice construction (e.g., how entailment scores are computed for arbitrary open-vocabulary phrases) should be formalized with explicit equations rather than descriptive text.
[§4] Figure captions and table headers would benefit from explicit statements of which metrics are computed only on annotated positives versus on the full inferred graph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications on our design choices and outlining the revisions we plan to incorporate.

Referee: [§3, Method] The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.

Authors: We agree that a direct quantitative validation of the lattice inference precision on a held-out verified subset would provide stronger support for the claim of net error reduction. While the reported gains on rare and unseen predicates across multiple benchmarks provide indirect evidence that the combination of lattice similarity, visual-language compatibility, and graph context effectively recovers true positives, we acknowledge the value of explicit verification. In the revised manuscript, we will add an analysis on a manually verified subset of inferred relations, reporting precision to demonstrate that the lattice inference yields more true positives than false positives.

Revision: yes

Referee: [§4, Experiments] Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.

Authors: We thank the referee for highlighting the need for clearer component isolation. The current results show overall improvements from the full framework, but we agree that targeted ablations would better attribute the gains to the lattice inference. In the revision, we will include additional ablation experiments that disable the lattice-based inference (while keeping the positive-unlabeled objective and lattice-guided decoding active) to isolate and quantify its specific contribution to recovering missing relations on rare and unseen predicates.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces ReLIC-SGG by constructing a semantic relation lattice to model entailment and contradiction among open-vocabulary predicates, then infers latent positives via visual-language compatibility, graph context, and a positive-unlabeled learning objective. No equations, derivations, or self-citations are shown that reduce the claimed improvements in rare/unseen predicate recognition or missing-relation recovery to quantities defined by the method's own fitted inputs or prior self-referential results. The framework relies on independently motivated components evaluated against external benchmarks, rendering the central claims self-contained rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Framework rests on the assumption that a hand-constructed or learned semantic lattice accurately encodes entailment and contradiction among arbitrary predicates; no independent verification of lattice quality is described.

axioms (1)

domain assumption: Unannotated object-pair relations can be treated as latent variables whose positive status is inferable from visual-language compatibility and lattice semantics.
Core modeling choice stated in the abstract; if false, the positive-unlabeled objective collapses.

invented entities (1)

semantic relation lattice (no independent evidence)
purpose: Model similarity, entailment, and contradiction among open-vocabulary predicates to infer missing positives.
New structure introduced to guide inference; independent evidence of its accuracy is not provided in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1139 out tokens · 31234 ms · 2026-05-14T21:27:15.145639+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Cost/FunctionalEquation.lean reality_from_one_distinction, washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Shamma, Michael S

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 3668–3678, 2015. 1, 2

work page 2015

[2] [2]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International Journal of Computer Vision, 123(1):32–73, 2017. 1, 7

work page 2017

[3] [3]

Visual relationship detection with language priors

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei- Fei. Visual relationship detection with language priors. In Proceedings of the European Conference on Computer Vi- sion, pages 852–869, 2016. 2

work page 2016

[4] [4]

Neural motifs: Scene graph parsing with global con- text

Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global con- text. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831–5840, 2018. 1, 7, 8

work page 2018

[5] [5]

Graph r-cnn for scene graph generation

Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation. InProceed- ings of the European Conference on Computer Vision, pages 670–685, 2018. 1, 2

work page 2018

[6] [6]

Choy, and Li Fei-Fei

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017. 1, 2

work page 2017

[7] [7]

Scene graph generation from objects, phrases and region captions

Yikang Li, Wanli Ouyang, Xiaogang Wang, and Xiaoou Tang. Scene graph generation from objects, phrases and region captions. InProceedings of the IEEE International Conference on Computer Vision, pages 1261–1270, 2017

work page 2017

[8] [8]

Knowledge-embedded routing network for scene graph gen- eration

Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. Knowledge-embedded routing network for scene graph gen- eration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6163– 6171, 2019. 1, 2

work page 2019

[9] [9]

Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InProceedings of the International Conference on Machine Learning, pages 4904–4916, 2021. 1, 2

work page 2021

[10] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763, 2021. 2, 7, 8

[11] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, pages 12888–12900, 2022. 2

[12] Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. Towards open-vocabulary scene graph generation with prompt-based finetuning. In Proceedings of the European Conference on Computer Vision, 2022. 1, 7, 8

[13] T. He, L. Gao, J. Song, and Y.-F. Li. Towards open-vocabulary scene graph generation with prompt-based finetuning. In European Conference on Computer Vision, 2022.

[14] Zuyao Chen, Jinlin Wu, Zhen Lei, and Chang Wen Chen. Expanding scene graph boundaries: Fully open-vocabulary scene graph generation via visual-concept alignment and retention. In Proceedings of the European Conference on Computer Vision, 2024. 1, 8

[15] Rongjie Li, Songyang Zhang, Bo Wan, Dahua Lin, and Xuming He. From pixels to graphs: Open-vocabulary scene graph generation with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28076–28086, 2024. 2, 7, 8

[16] Chufeng Min, Yifan Liu, and Hanwang Zhang. Vision-language interactive relation mining for open-vocabulary scene graph generation. arXiv preprint arXiv:2507.XXXX,

[17] Please verify final arXiv identifier/proceedings information. 7, 8

[18] Zijian Zhou, Zheng Zhu, Holger Caesar, and Miaojing Shi. Openpsg: Open-set panoptic scene graph generation via large multimodal models. In Proceedings of the European Conference on Computer Vision, pages 212–229, 2024. 1, 2, 3, 7, 8

[19] X. Hu, K. Qin, G. Duan, M. Li, Y.-F. Li, and T. He. Spade: Spatial-aware denoising network for open-vocabulary panoptic scene graph generation with long- and local-range context reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision,

[20] Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6619–6628, 2019. 1, 2, 8

[21] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3716–3725, 2020. 1, 2, 7, 8

[22] Rongjie Li, Songyang Zhang, Bo Wan, and Xuming He. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11109–11119, 2021. 8

[23] Chaofan Zheng, Xinyu Lyu, Lianli Gao, Bo Dai, and Jingkuan Song. Prototype-based embedding network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22783–22792, 2023. 1, 2, 7, 8

[24] T. He, L. Gao, J. Song, J. Cai, and Y.-F. Li. Learning from the scene and borrowing from the rich: Tackling the long tail in scene graph generation. In Proceedings of the International Joint Conference on Artificial Intelligence, 2020.

[25] T. He, L. Gao, J. Song, J. Cai, and Y.-F. Li. Semantic compositional learning for low-shot scene graph generation. arXiv preprint arXiv:2108.08600, 2021.

[26] T. He, L. Gao, J. Song, and Y.-F. Li. State-aware compositional learning toward unbiased training for scene graph generation. IEEE Transactions on Image Processing, 32:43–56,

[27] Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, and Heng Tao Shen Huang. Fine-grained predicates learning for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19467–19475, 2022. 1, 2

[28] Jingkang Yang, Yi Zhe Ang, Zhe Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic scene graph generation. In Proceedings of the European Conference on Computer Vision, pages 178–196, 2022. 3, 7, 8

[29] Yu Wang, Jiang Liu, Yong-Lu Li, Chang Xu, and Cewu Lu. Pair-net: Panoptic scene graph generation with pairwise relation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. Please verify exact authors/pages. 7, 8

[30] Jingkang Yang, Kaiyang Peng, Yuxin Li, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[31] Shuo Chen, Qin Jin, Peng Wang, and Qi Wu. Multi-label video scene graph generation from single-frame weak supervision. In Proceedings of the International Conference on Learning Representations, 2023.

[32] Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5664–5673, 2019. 3

[33] Sebastian Koch, Narunas Vaskevicius, Brian Coltin, and Marija Popović. Open-vocabulary 3d scene graphs from point clouds with vision-language models. arXiv preprint arXiv:2406.XXXX, 2024. Please verify final arXiv identifier/proceedings information.

[34] Francesco Rotondi, Emanuele Bastianelli, Danilo Avola, and Luigi Cinque. Functional 3d scene graphs: Building scene graphs for functional and task-oriented reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. Please verify exact authors/pages.

[35] Daniel Honerkamp, Tim Welschehold, and Abhinav Valada. Moma: Mobile manipulation in 3d scenes with task relevant scene graphs. In Proceedings of the IEEE International Conference on Robotics and Automation, 2024. Please verify exact venue metadata.

[36] Yuanchen Ju, Jiayuan Han, Chen Wang, and Jia Deng. Momagraph: State-aware unified scene graphs for embodied task planning. In Proceedings of the International Conference on Learning Representations, 2026. Please verify final author list/proceedings information. 1, 3

[37] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220, 2008. 2, 3

[38] Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, volume 30, 2017.

[39] Timothee Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536, 2011.

[40] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama. Progressive identification of true labels for partial-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8249–8264, 2022.

[41] R. Dai, C. Li, Y. Yan, L. Mo, K. Qin, and T. He. Unbiased missing-modality multimodal learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 2, 3

[42] Rongjie Li, Songyang Zhang, and Xuming He. Sgtr: End-to-end scene graph generation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19486–19496, 2022. 2

[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, volume 28, 2015. 7

[44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

[45] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 7

[46] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022. 7

[47] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations, 2019. 7