ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3
The pith
ReLIC-SGG infers missing relations in open-vocabulary scene graphs using a semantic predicate lattice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs.
What carries the argument
The argument rests on the semantic relation lattice, which encodes similarity, entailment, and contradiction between predicates and is used to infer and complete missing positive relations.
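To make the lattice idea concrete, here is a minimal sketch of one way such a structure could be represented. It is not the authors' implementation: the class name `RelationLattice`, its methods, and the hand-written edges (including the edge from "on" to "supported by") are assumptions chosen for illustration. The paper's similarity relation is left out for brevity.

```python
from collections import defaultdict


class RelationLattice:
    """Toy predicate lattice (hypothetical): entailment edges plus contradiction pairs."""

    def __init__(self):
        # specific predicate -> set of directly entailed, more general predicates
        self.parents = defaultdict(set)
        # unordered pairs of predicates that cannot hold for the same object pair
        self.contradictions = set()

    def add_entailment(self, specific, general):
        self.parents[specific].add(general)

    def add_contradiction(self, a, b):
        self.contradictions.add(frozenset((a, b)))

    def entailed_by(self, predicate):
        """Return every predicate implied by `predicate` (transitive closure)."""
        seen, stack = set(), [predicate]
        while stack:
            for general in self.parents[stack.pop()]:
                if general not in seen:
                    seen.add(general)
                    stack.append(general)
        return seen

    def contradicts(self, a, b):
        """True if a (or anything it entails) conflicts with b (or anything it entails)."""
        left = self.entailed_by(a) | {a}
        right = self.entailed_by(b) | {b}
        return any(frozenset((x, y)) in self.contradictions
                   for x in left for y in right)


# Illustrative edges only; a real lattice would be derived from language or vision-language models.
lattice = RelationLattice()
lattice.add_entailment("standing on", "on")
lattice.add_entailment("on", "supported by")
lattice.add_contradiction("on", "under")

print(sorted(lattice.entailed_by("standing on")))   # ['on', 'supported by']
print(lattice.contradicts("standing on", "under"))  # True
```

Under this sketch, annotating "standing on" for an object pair would make "on" and "supported by" candidate missing positives rather than negatives, while "under" would be ruled out.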
If this is right
- Improves recognition of rare and unseen predicates on benchmarks.
- Recovers more missing relations in conventional, open-vocabulary, and panoptic SGG tasks.
- Reduces the impact of false-negative labels during training via positive-unlabeled learning.
- Generates scene graphs that are more compact and semantically consistent through guided decoding.
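The abstract does not say which positive-unlabeled objective is used. As a stand-in, the sketch below implements the non-negative PU risk estimator of Kiryo et al. (2017) with a logistic loss, which shows how unannotated pairs can be scored without being treated as definite negatives. The function names and the class prior value are assumptions for illustration.

```python
import math


def logistic_loss(score, label):
    """log(1 + exp(-label * score)) for label in {+1, -1}, computed stably."""
    z = -label * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk (Kiryo et al., 2017); `prior` is the assumed positive rate."""
    def mean(values):
        return sum(values) / len(values)

    risk_pos_as_pos = mean([logistic_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([logistic_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([logistic_loss(s, -1) for s in unl_scores])
    # Unlabeled data mixes positives and negatives, so subtract the positive share.
    negative_part = risk_unl_as_neg - prior * risk_pos_as_neg
    # Clamping at zero stops the estimate from going negative and overfitting.
    return prior * risk_pos_as_pos + max(0.0, negative_part)


# Scores for annotated relations and for unannotated candidate relations (made-up values).
print(nn_pu_risk(pos_scores=[2.0, 3.0], unl_scores=[-1.0, 0.5], prior=0.3))
```

Treating every unannotated pair as a negative corresponds to setting `prior` to zero in the correction term, which is the false-negative supervision the paper aims to reduce.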
Where Pith is reading between the lines
- This method may generalize to other vision tasks where labels are incomplete, such as action recognition or visual question answering.
- Semantic lattices could be combined with large language models to dynamically expand relation vocabularies.
- Testing on datasets with varying annotation densities would reveal how much the lattice compensates for human annotation gaps.
Load-bearing premise
That the semantic relation lattice plus visual-language compatibility and graph context can reliably distinguish true missing positives from true negatives without introducing more errors than it removes.
What would settle it
A manual inspection or human evaluation of inferred relations showing that more than half of the newly added relations are incorrect or that overall graph accuracy decreases.
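The settling criterion above can be written as a small check. This is a sketch under stated assumptions: the function names are hypothetical, `verdicts` stands for human judgments of inferred relations, and the accuracy figures are whatever graph-level metric the evaluator chooses.

```python
def inferred_relation_precision(verdicts):
    """verdicts: list of bools, True when a human judged the inferred relation correct."""
    if not verdicts:
        raise ValueError("need at least one verified relation")
    return sum(verdicts) / len(verdicts)


def lattice_inference_fails(verdicts, accuracy_before, accuracy_after):
    """Apply the criterion: more than half of the added relations wrong, or accuracy drops."""
    error_rate = 1.0 - inferred_relation_precision(verdicts)
    return error_rate > 0.5 or accuracy_after < accuracy_before


# Made-up example: three of four inferred relations were judged correct.
sample = [True, True, False, True]
print(inferred_relation_precision(sample))          # 0.75
print(lattice_inference_fails(sample, 0.60, 0.62))  # False
```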
Original abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
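The abstract does not describe the decoding procedure. One plausible reading of "compact and semantically consistent" is a greedy pass that drops predicates contradicting a kept one and collapses general predicates into the most specific kept one. The sketch below illustrates that reading only; the function `decode_pair`, its threshold, and the toy scores are all assumptions.

```python
def decode_pair(scores, entails, contradictions, threshold=0.5):
    """Greedy decoding for one subject-object pair (hypothetical sketch).

    scores: dict predicate -> confidence in [0, 1]
    entails: dict predicate -> set of predicates it implies (assumed transitive)
    contradictions: set of frozenset({a, b}) pairs that cannot co-occur
    """
    def closure(pred):
        return {pred} | entails.get(pred, set())

    kept = []
    for pred in sorted(scores, key=scores.get, reverse=True):
        if scores[pred] < threshold:
            break
        # Reject a predicate that conflicts with anything a kept predicate implies.
        conflicts = any(frozenset((a, b)) in contradictions
                        for k in kept for a in closure(pred) for b in closure(k))
        # Reject a predicate already implied by a kept, more specific one.
        redundant = any(pred in entails.get(k, set()) for k in kept)
        if conflicts or redundant:
            continue
        # A newly kept specific predicate replaces the general ones it implies.
        kept = [k for k in kept if k not in entails.get(pred, set())]
        kept.append(pred)
    return kept


scores = {"on": 0.9, "standing on": 0.8, "under": 0.6, "near": 0.3}
entails = {"standing on": {"on"}}
contradictions = {frozenset(("on", "under"))}
print(decode_pair(scores, entails, contradictions))  # ['standing on']
```

In this toy run, "on" is absorbed by the more specific "standing on", "under" is rejected because it contradicts "on", and "near" falls below the threshold.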
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. ReLIC-SGG proposes a relation-incompleteness-aware framework for open-vocabulary scene graph generation. It treats unannotated object-pair relations as latent variables rather than negatives, constructs a semantic relation lattice to model predicate similarity/entailment/contradiction, and infers missing positives from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled learning objective is introduced, and lattice-guided decoding is used to produce consistent graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks are reported to show gains on rare and unseen predicates plus better recovery of missing relations.
Significance. If the lattice-based inference is shown to add more true positives than false positives, the work would meaningfully address a core limitation of SGG datasets (incomplete annotations) and improve open-vocabulary predicate recognition. The combination of semantic lattice modeling with positive-unlabeled learning is a targeted response to granularity and missing-relation issues that prior methods largely ignore.
major comments (2)
- [§3] §3 (Method): The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.
- [§4] §4 (Experiments): Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.
minor comments (2)
- [§3.1] Notation for the lattice construction (e.g., how entailment scores are computed for arbitrary open-vocabulary phrases) should be formalized with explicit equations rather than descriptive text.
- [§4] Figure captions and table headers would benefit from explicit statements of which metrics are computed only on annotated positives versus on the full inferred graph.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications on our design choices and outlining the revisions we plan to incorporate.
- Referee: [§3] §3 (Method): The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.
  Authors: We agree that a direct quantitative validation of the lattice inference precision on a held-out verified subset would provide stronger support for the claim of net error reduction. While the reported gains on rare and unseen predicates across multiple benchmarks provide indirect evidence that the combination of lattice similarity, visual-language compatibility, and graph context effectively recovers true positives, we acknowledge the value of explicit verification. In the revised manuscript, we will add an analysis on a manually verified subset of inferred relations, reporting precision to demonstrate that the lattice inference yields more true positives than false positives.
  Revision: yes
- Referee: [§4] §4 (Experiments): Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.
  Authors: We thank the referee for highlighting the need for clearer component isolation. The current results show overall improvements from the full framework, but we agree that targeted ablations would better attribute the gains to the lattice inference. In the revision, we will include additional ablation experiments that disable the lattice-based inference (while keeping the positive-unlabeled objective and lattice-guided decoding active) to isolate and quantify its specific contribution to recovering missing relations on rare and unseen predicates.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper introduces ReLIC-SGG by constructing a semantic relation lattice to model entailment and contradiction among open-vocabulary predicates, then infers latent positives via visual-language compatibility, graph context, and a positive-unlabeled learning objective. No equations, derivations, or self-citations are shown that reduce the claimed improvements in rare/unseen predicate recognition or missing-relation recovery to quantities defined by the method's own fitted inputs or prior self-referential results. The framework relies on independently motivated components evaluated against external benchmarks, rendering the central claims self-contained rather than circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Unannotated object-pair relations can be treated as latent variables whose positive status is inferable from visual-language compatibility and lattice semantics.
invented entities (1)
- semantic relation lattice: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Cost/FunctionalEquation.lean: reality_from_one_distinction, washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.