
arxiv: 2604.22546 · v5 · submitted 2026-04-24 · 💻 cs.CV

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords: scene graph generation · open-vocabulary learning · relation lattice · incomplete annotations · positive-unlabeled learning · semantic consistency · visual language models

The pith

ReLIC-SGG infers missing relations in open-vocabulary scene graphs using a semantic predicate lattice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses incomplete annotations in scene graph generation for open vocabularies, where many true relations go unlabeled and predicates can describe the same interaction at different levels. It introduces a framework that models unannotated object-pair relations as latent variables instead of assuming they are negatives. A semantic relation lattice is constructed to capture how predicates relate through similarity, entailment, and contradiction. This lattice helps infer which missing relations are likely true based on visual-language matches, surrounding graph context, and consistency rules, leading to improved handling of rare and unseen predicates.

Core claim

ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs.
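
The paper's exact positive-unlabeled objective is not reproduced on this page. As a rough illustration of how such an objective can reduce false-negative supervision, the sketch below uses the generic non-negative PU risk estimator of Kiryo et al. (2017), which may differ from what ReLIC-SGG actually optimizes. The function names, the sigmoid surrogate loss, the class prior `prior`, and the example scores are all assumptions made for illustration.

```python
import math

def sigmoid_loss(score, label):
    """Surrogate loss in (0, 1); small when the sign of score matches label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_risk(pos_scores, unl_scores, prior):
    """Baseline: every unlabeled relation is supervised as a definite negative."""
    risk_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    return prior * risk_pos + (1.0 - prior) * risk_unl_as_neg

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk: the negative-class risk is estimated from unlabeled
    data minus the share attributable to hidden positives, clipped at zero."""
    risk_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    return prior * risk_pos + max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)

# Hypothetical relation scores: one unlabeled pair (2.5) looks like a missed positive.
annotated = [3.0, 2.0]
unlabeled = [2.5, -3.0, -2.0]
print(naive_risk(annotated, unlabeled, prior=0.4))
print(nn_pu_risk(annotated, unlabeled, prior=0.4))
```

In this toy input the naive risk penalizes the high-scoring unlabeled pair as if it were wrong, while the PU estimate discounts it, which is the behavior the core claim relies on.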

What carries the argument

The semantic relation lattice, which encodes similarity, entailment, and contradiction between predicates and is used to infer and complete missing positive relations.
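
How the paper constructs the lattice is not described here, so the following is only a minimal sketch of the kind of structure involved: a directed graph of entailment edges with a set of contradiction pairs. The class `PredicateLattice`, its methods, and the specific edges are hypothetical; the predicates come from the abstract's granularity example, plus an assumed contradiction between "on" and "under". Similarity scores are omitted for brevity.

```python
from collections import defaultdict

class PredicateLattice:
    """Toy predicate structure: entailment edges plus contradiction pairs."""

    def __init__(self):
        self.parents = defaultdict(set)   # specific predicate -> directly entailed predicates
        self.conflicts = set()            # unordered pairs stored as frozensets

    def add_entailment(self, specific, general):
        self.parents[specific].add(general)

    def add_contradiction(self, a, b):
        self.conflicts.add(frozenset((a, b)))

    def closure(self, predicate):
        """The predicate itself plus everything it transitively entails."""
        seen, stack = {predicate}, [predicate]
        while stack:
            for general in self.parents.get(stack.pop(), ()):
                if general not in seen:
                    seen.add(general)
                    stack.append(general)
        return seen

    def entails(self, a, b):
        return b in self.closure(a)

    def contradicts(self, a, b):
        """True if anything implied by a conflicts with anything implied by b."""
        return any(frozenset((x, y)) in self.conflicts
                   for x in self.closure(a) for y in self.closure(b))

# Illustrative edges only; not taken from the paper.
lattice = PredicateLattice()
lattice.add_entailment("standing on", "on")
lattice.add_entailment("resting on", "on")
lattice.add_entailment("on", "supported by")
lattice.add_contradiction("on", "under")

print(lattice.entails("standing on", "supported by"))  # True
print(lattice.contradicts("standing on", "under"))     # True
```

Note that contradiction is inherited downward only: "standing on" conflicts with "under" because it implies "on", but the more general "supported by" does not.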

If this is right

  • Improves recognition of rare and unseen predicates on benchmarks.
  • Recovers more missing relations in conventional, open-vocabulary, and panoptic SGG tasks.
  • Reduces the impact of false-negative labels during training via positive-unlabeled learning.
  • Generates scene graphs that are more compact and semantically consistent through guided decoding.
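
The last point can be made concrete with a small sketch. The paper's decoding procedure is not specified on this page, so this is one plausible reading of "lattice-guided decoding": a greedy pass over the scored predicates for a single object pair that drops predicates already implied by, or in conflict with, a predicate that was kept. `ENTAILS`, `CONTRADICTS`, `decode_pair`, the threshold, and the scores are all hypothetical.

```python
# Hypothetical entailment edges (specific -> more general) and contradictions.
ENTAILS = {"standing on": {"on"}, "resting on": {"on"}, "on": {"supported by"}}
CONTRADICTS = {frozenset(("on", "under"))}

def entailed_by(predicate):
    """All predicates transitively implied by `predicate`, excluding itself."""
    seen, stack = set(), [predicate]
    while stack:
        for general in ENTAILS.get(stack.pop(), ()):
            if general not in seen:
                seen.add(general)
                stack.append(general)
    return seen

def decode_pair(scores, threshold=0.5):
    """Greedy decode for one object pair, visiting predicates by descending score."""
    kept = []
    for pred, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break
        implied_new = entailed_by(pred) | {pred}
        conflict = any(frozenset((x, y)) in CONTRADICTS
                       for k in kept
                       for x in entailed_by(k) | {k}
                       for y in implied_new)
        redundant = any(pred in entailed_by(k) for k in kept)
        if conflict or redundant:
            continue
        # A more specific predicate replaces any kept predicate that it implies.
        kept = [k for k in kept if k not in entailed_by(pred)]
        kept.append(pred)
    return kept

scores = {"on": 0.9, "standing on": 0.8, "supported by": 0.7, "under": 0.6, "holding": 0.2}
print(decode_pair(scores))  # ['standing on']
```

Here "on" and "supported by" are absorbed by the more specific "standing on", "under" is rejected as contradictory, and "holding" falls below the threshold, so four confident candidates collapse to one label.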

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method may generalize to other vision tasks where labels are incomplete, such as action recognition or visual question answering.
  • Semantic lattices could be combined with large language models to dynamically expand relation vocabularies.
  • Testing on datasets with varying annotation densities would reveal how much the lattice compensates for human annotation gaps.

Load-bearing premise

That the semantic relation lattice plus visual-language compatibility and graph context can reliably distinguish true missing positives from true negatives without introducing more errors than it removes.

What would settle it

A manual inspection or human evaluation of inferred relations showing that more than half of the newly added relations are incorrect or that overall graph accuracy decreases.
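
The criterion above amounts to a simple check, sketched here under stated assumptions: `verdicts` holds one hypothetical human judgment per inferred relation, and the two accuracy arguments stand for whichever overall graph metric is used before and after completion. The function names and the inputs in the usage lines are placeholders, not results from the paper.

```python
def inferred_precision(verdicts):
    """Fraction of inferred relations that human reviewers judged correct."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verified relations to evaluate")
    return sum(1 for v in verdicts if v) / len(verdicts)

def premise_fails(verdicts, accuracy_before, accuracy_after):
    """True if more than half of the added relations are wrong,
    or if overall graph accuracy drops after completion."""
    return inferred_precision(verdicts) < 0.5 or accuracy_after < accuracy_before

# Placeholder inputs for illustration only.
print(premise_fails([True, True, False], accuracy_before=0.60, accuracy_after=0.62))
print(premise_fails([True, False, False], accuracy_before=0.60, accuracy_after=0.62))
```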

Figures

Figures reproduced from arXiv: 2604.22546 by Amir Hosseini, Sara Farahani, Suiyang Guang, Xinyi Li.

Figure 1: Overall framework of ReLIC-SGG. Given detected objects, the model first proposes open-vocabulary relation candidates for each …

Original abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. ReLIC-SGG proposes a relation-incompleteness-aware framework for open-vocabulary scene graph generation. It treats unannotated object-pair relations as latent variables rather than negatives, constructs a semantic relation lattice to model predicate similarity/entailment/contradiction, and infers missing positives from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled learning objective is introduced, and lattice-guided decoding is used to produce consistent graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks are reported to show gains on rare and unseen predicates plus better recovery of missing relations.

Significance. If the lattice-based inference is shown to add more true positives than false positives, the work would meaningfully address a core limitation of SGG datasets (incomplete annotations) and improve open-vocabulary predicate recognition. The combination of semantic lattice modeling with positive-unlabeled learning is a targeted response to granularity and missing-relation issues that prior methods largely ignore.

major comments (2)
  1. [§3] §3 (Method): The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.
  2. [§4] §4 (Experiments): Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.
minor comments (2)
  1. [§3.1] Notation for the lattice construction (e.g., how entailment scores are computed for arbitrary open-vocabulary phrases) should be formalized with explicit equations rather than descriptive text.
  2. [§4] Figure captions and table headers would benefit from explicit statements of which metrics are computed only on annotated positives versus on the full inferred graph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications on our design choices and outlining the revisions we plan to incorporate.

  1. Referee: [§3] §3 (Method): The central claim that the semantic relation lattice reliably infers missing positives rests on the assumption that lattice similarity plus visual-language and graph context distinguish true missing relations from true negatives. No quantitative validation (e.g., precision of inferred relations on a held-out verified subset) is provided to confirm net error reduction, which is load-bearing for the reported gains on rare/unseen predicates.

    Authors: We agree that a direct quantitative validation of the lattice inference precision on a held-out verified subset would provide stronger support for the claim of net error reduction. While the reported gains on rare and unseen predicates across multiple benchmarks provide indirect evidence that the combination of lattice similarity, visual-language compatibility, and graph context effectively recovers true positives, we acknowledge the value of explicit verification. In the revised manuscript, we will add an analysis on a manually verified subset of inferred relations, reporting precision to demonstrate that the lattice inference yields more true positives than false positives.

    revision: yes

  2. Referee: [§4] §4 (Experiments): Improvements on rare and unseen predicates are reported across benchmarks, but the results lack ablations that isolate the lattice inference component from the positive-unlabeled objective and decoding strategy. Without these controls it is unclear whether the lattice itself drives the claimed recovery of missing relations.

    Authors: We thank the referee for highlighting the need for clearer component isolation. The current results show overall improvements from the full framework, but we agree that targeted ablations would better attribute the gains to the lattice inference. In the revision, we will include additional ablation experiments that disable the lattice-based inference (while keeping the positive-unlabeled objective and lattice-guided decoding active) to isolate and quantify its specific contribution to recovering missing relations on rare and unseen predicates.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces ReLIC-SGG by constructing a semantic relation lattice to model entailment and contradiction among open-vocabulary predicates, then infers latent positives via visual-language compatibility, graph context, and a positive-unlabeled learning objective. No equations, derivations, or self-citations are shown that reduce the claimed improvements in rare/unseen predicate recognition or missing-relation recovery to quantities defined by the method's own fitted inputs or prior self-referential results. The framework relies on independently motivated components evaluated against external benchmarks, rendering the central claims self-contained rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that a hand-constructed or learned semantic lattice accurately encodes entailment and contradiction among arbitrary predicates; no independent verification of lattice quality is described.

axioms (1)
  • domain assumption: Unannotated object-pair relations can be treated as latent variables whose positive status is inferable from visual-language compatibility and lattice semantics.
    Core modeling choice stated in the abstract; if false, the positive-unlabeled objective collapses.
invented entities (1)
  • semantic relation lattice (no independent evidence)
    purpose: Model similarity, entailment, and contradiction among open-vocabulary predicates to infer missing positives.
    New structure introduced to guide inference; independent evidence of its accuracy is not provided in the abstract.



