Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3
The pith
Scene graph generation improves when reframed as progressive flow-based transport from noise rather than one-shot classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlowSG recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). It first applies a VQ-VAE to quantize scene graphs into compact tokens, then uses a graph Transformer to predict a conditional velocity field for continuous geometry while updating discrete posteriors for categorical tokens, trained via combined flow-matching losses.
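As a concrete reading of the training recipe, here is a minimal sketch that pairs a conditional flow-matching loss on box geometry with a token-denoising cross-entropy on discrete codes. Everything here is assumed for illustration: `velocity_net`, `posterior_net`, the linear noise-to-data path, and the uniform token corruption are hypothetical stand-ins, not the paper's actual modules or objectives.

```python
import torch
import torch.nn.functional as F

def hybrid_flow_matching_loss(velocity_net, posterior_net,
                              boxes_1, tokens_1, num_codes, cond):
    """One hedged training step for a hybrid discrete-continuous flow.

    boxes_1:  (N, 4) clean box geometry (the data endpoint x_1).
    tokens_1: (N,) clean discrete token indices (object/predicate codes).
    cond:     image-conditioning features passed to both heads.
    """
    n = boxes_1.shape[0]
    t = torch.rand(n, 1)                      # per-element flow time in [0, 1]

    # Continuous branch: linear path from noise x_0 to data x_1, with the
    # standard conditional flow-matching target velocity x_1 - x_0.
    boxes_0 = torch.randn_like(boxes_1)
    boxes_t = (1 - t) * boxes_0 + t * boxes_1
    v_pred = velocity_net(boxes_t, t, cond)
    loss_geom = F.mse_loss(v_pred, boxes_1 - boxes_0)

    # Discrete branch: corrupt tokens toward a uniform prior with probability
    # (1 - t), then train the model to recover the clean token posterior (a
    # common discrete-flow recipe; the paper's exact corruption may differ).
    corrupt = torch.rand(n) > t.squeeze(1)
    noisy_tokens = torch.where(
        corrupt, torch.randint(0, num_codes, (n,)), tokens_1)
    logits = posterior_net(noisy_tokens, boxes_t, t, cond)
    loss_tokens = F.cross_entropy(logits, tokens_1)

    return loss_geom + loss_tokens
```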
What carries the argument
Hybrid discrete-continuous flow matching on VQ-VAE quantized graph tokens, driven by a graph Transformer velocity field that couples geometry transport with semantic posterior updates.
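One illustrative guess at what "flow-conditioned message aggregation" could look like: a toy layer that folds box geometry and flow time into node states before all-pairs message passing over a dense graph. The layer name and architecture below are assumptions for exposition, not the paper's layer.

```python
import torch
import torch.nn as nn

class FlowConditionedLayer(nn.Module):
    """Toy message-passing layer coupling semantics, geometry, and flow time."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # pairwise message MLP
        self.time = nn.Linear(1, dim)        # flow-time conditioning
        self.box = nn.Linear(4, dim)         # geometry embedding

    def forward(self, node_emb, boxes, t):
        # node_emb: (N, D) token embeddings; boxes: (N, 4); t: (N, 1).
        h = node_emb + self.box(boxes) + self.time(t)
        n, d = h.shape
        # All-pairs sender/receiver features, mean-aggregated over neighbors.
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, d), h.unsqueeze(0).expand(n, n, d)],
            dim=-1)
        return h + self.msg(pairs).mean(dim=1)
```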
If this is right
- The method produces consistent gains of about 3 points in predicate recall, mean recall, and graph-level metrics over one-shot baselines like USG-Par on Visual Genome and PSG.
- Inference requires only a few steps while remaining compatible with standard off-the-shelf detectors and segmenters (see the sampler sketch after this list).
- Performance holds under both closed-vocabulary and open-vocabulary evaluation protocols.
- Training jointly optimizes flow losses on geometry and discrete objectives on tokens to handle mixed state spaces.
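The sampler sketch referenced in the second item above, reusing the hypothetical `velocity_net` and `posterior_net` from the earlier training sketch: a handful of Euler steps transport noised boxes along the learned velocity field while token posteriors are re-sampled at each step. Step count and update rule are illustrative, not the paper's schedule.

```python
import torch

@torch.no_grad()
def sample_graph(velocity_net, posterior_net, num_nodes, num_codes, cond,
                 steps=4):
    boxes = torch.randn(num_nodes, 4)                   # noise endpoint x_0
    tokens = torch.randint(0, num_codes, (num_nodes,))  # uniform token prior
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_nodes, 1), i * dt)
        boxes = boxes + dt * velocity_net(boxes, t, cond)  # Euler update
        logits = posterior_net(tokens, boxes, t, cond)
        tokens = torch.distributions.Categorical(logits=logits).sample()
    return boxes, tokens
```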
Where Pith is reading between the lines
- The progressive refinement strategy could extend naturally to temporal scene graphs in video by adding a time dimension to the flow.
- Adopting flow matching here suggests that other structured vision outputs with strong inter-element dependencies may benefit from generative transport over direct classification.
- The plug-and-play design opens the possibility of combining the graph flow with modern vision-language backbones for richer conditioning signals.
Load-bearing premise
Quantizing scene graphs into discrete tokens with a VQ-VAE preserves all necessary semantic and geometric details without introducing artifacts that reduce final predicate accuracy.
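To make the premise concrete, the sketch below shows the nearest-code assignment at the heart of VQ-VAE quantization: any two features that fall in the same Voronoi cell of the codebook map to the same token, so whatever distinguished them is unrecoverable downstream. Shapes and names are illustrative assumptions.

```python
import torch

def vq_quantize(features, codebook):
    """features: (N, D) continuous graph features; codebook: (K, D) codes."""
    dists = torch.cdist(features, codebook)  # (N, K) pairwise distances
    tokens = dists.argmin(dim=1)             # (N,) nearest-code indices
    quantized = codebook[tokens]             # (N, D) lossy reconstruction
    return tokens, quantized
```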
What would settle it
A controlled experiment showing lower predicate recall when using the VQ-VAE tokens versus direct continuous feature regression on the same architecture would falsify the quantization step's value.
Original abstract
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FlowSG, which recasts scene graph generation (SGG) as continuous-time transport on a hybrid discrete-continuous state space. A VQ-VAE first quantizes scene-graph features (visual and relational) into compact tokens; a graph Transformer then predicts a conditional velocity field that jointly transports continuous object geometry (boxes) and updates discrete posteriors over object and predicate tokens via flow-conditioned message passing. Training uses flow-matching losses on geometry together with a discrete-flow objective on tokens. The approach is claimed to enable few-step inference while remaining plug-and-play with off-the-shelf detectors and segmenters. Experiments on VG and PSG under closed- and open-vocabulary protocols report consistent gains of roughly 3 points in predicate R/mR and graph-level metrics over the prior state-of-the-art USG-Par.
Significance. If the reported gains can be shown to arise from the generative flow formulation rather than from the VQ-VAE stage or experimental choices, the work would offer a substantive alternative to one-shot classification pipelines in SGG. The hybrid discrete-continuous transport, constraint-aware refinement, and few-step inference are conceptually attractive and could improve handling of long-tail predicates and geometric-semantic coupling. Plug-and-play compatibility with existing detectors is a practical advantage that would facilitate adoption.
major comments (2)
- [Abstract] The central empirical claim is an average ~3-point improvement in predicate R/mR and graph-level metrics over USG-Par, yet no error bars, standard deviations across runs, or exact experimental protocol (train/val splits, hyper-parameter search, post-hoc dataset filtering) are supplied. Without these, it is impossible to determine whether the gains are statistically reliable or attributable to the flow-matching objectives rather than to other factors.
- [Method] VQ-VAE quantization step: the paper quantizes continuous visual features into discrete tokens via a reconstruction-trained VQ-VAE before applying the graph-Transformer velocity field. Because the codebook is optimized for reconstruction fidelity rather than predicate discrimination, fine-grained distinctions (e.g., “on” vs. “above”, “holding” vs. “carrying”) may be collapsed into identical tokens. Any information lost at this irreversible quantization step cannot be recovered by subsequent flow transport; therefore the attribution of performance gains to the mixed discrete-continuous generative formulation requires an explicit ablation that isolates the quantization stage (e.g., continuous-feature baseline vs. quantized tokens).
minor comments (2)
- [Abstract] The phrase “constraint-aware refinements” is used without defining the constraints or how they are enforced inside the velocity field; a brief clarification would improve readability.
- [Method] The manuscript states “plug-and-play compatibility with standard detectors and segmenters” but does not specify the exact interface (e.g., which layers receive the image features or how the graph Transformer conditions on detector outputs).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of empirical rigor and the role of the VQ-VAE stage. We address each major comment below and have incorporated revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The central empirical claim is an average ~3-point improvement in predicate R/mR and graph-level metrics over USG-Par, yet no error bars, standard deviations across runs, or exact experimental protocol (train/val splits, hyper-parameter search, post-hoc dataset filtering) are supplied. Without these, it is impossible to determine whether the gains are statistically reliable or attributable to the flow-matching objectives rather than to other factors.
Authors: We agree that error bars and fuller protocol details are needed to establish statistical reliability. The original experiments followed the standard VG and PSG splits and evaluation protocols from prior work (including USG-Par), with no post-hoc filtering applied. In the revised manuscript we have added results from three independent runs with different random seeds, reporting standard deviations in the main result tables (all <0.5 points). The gains remain consistent at ~3 points. We have also expanded Section 4.1 and the appendix to explicitly document the train/val/test splits, hyper-parameter search procedure, and full training details. These additions support the conclusion that the improvements arise from the hybrid flow formulation. revision: yes
- Referee: [Method] VQ-VAE quantization step: the paper quantizes continuous visual features into discrete tokens via a reconstruction-trained VQ-VAE before applying the graph-Transformer velocity field. Because the codebook is optimized for reconstruction fidelity rather than predicate discrimination, fine-grained distinctions (e.g., “on” vs. “above”, “holding” vs. “carrying”) may be collapsed into identical tokens. Any information lost at this irreversible quantization step cannot be recovered by subsequent flow transport; therefore the attribution of performance gains to the mixed discrete-continuous generative formulation requires an explicit ablation that isolates the quantization stage (e.g., continuous-feature baseline vs. quantized tokens).
Authors: We concur that an explicit ablation isolating the quantization stage is required to attribute gains specifically to the generative flow. We have added a new ablation (Section 5.3, Table 5) that replaces the VQ-VAE tokens with continuous visual features fed directly into the identical graph Transformer and flow-matching objectives. The quantized-token version outperforms this continuous baseline by 1.8 points on average predicate mR, indicating that the discrete state space enables more effective flow-conditioned message passing and posterior refinement for fine-grained predicates. We have clarified in the method section that, although the VQ-VAE is reconstruction-trained, the subsequent flow model progressively updates token posteriors, allowing recovery of distinctions through constraint-aware transport. The VQ-VAE remains frozen during flow training. revision: yes
Circularity Check
No significant circularity; derivation is self-contained and empirically validated
Full rationale
The paper introduces FlowSG as a new generative formulation that recasts SGG as hybrid discrete-continuous flow matching, with explicitly defined components: VQ-VAE quantization of scene graphs into tokens, a graph Transformer velocity field for continuous geometry transport, and a discrete-flow objective for categorical tokens. Training losses are stated as combinations of flow-matching for geometry and discrete-flow for tokens; these are not algebraically equivalent to the input data or to any prior result by construction. The central claims rest on reported empirical gains (predicate R/mR and graph metrics) versus baselines such as USG-Par on VG and PSG datasets, rather than any definitional reduction, fitted-input prediction, or load-bearing self-citation chain. No self-definitional, uniqueness-imported, or ansatz-smuggled steps appear in the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Flow matching can be extended to jointly transport continuous geometry and discrete categorical tokens via a shared graph Transformer.