pith. machine review for the scientific record.

arxiv: 2604.18623 · v1 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene graph generation · flow matching · generative models · visual relationship detection · graph transformers · VQ-VAE · progressive refinement

The pith

Scene graph generation improves when reframed as progressive flow-based transport from noise rather than one-shot classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scene graph generation should be treated as a generative process of continuous-time transport on hybrid states rather than deterministic classification of objects and predicates. FlowSG begins with a noised graph and refines it step by step, conditioned on image features, to jointly produce bounding boxes and relationship labels. This matters because it models the dependencies between elements progressively and allows flow matching to be used for efficient inference. A reader would care if the approach yields more accurate graphs while integrating cleanly with existing detection pipelines.

Core claim

FlowSG recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). It first applies a VQ-VAE to quantize scene graphs into compact tokens, then uses a graph Transformer to predict a conditional velocity field for continuous geometry while updating discrete posteriors for categorical tokens, trained via combined flow-matching losses.
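Read literally, the training recipe combines a velocity-regression loss on geometry with a denoising objective on tokens. Below is a minimal sketch of one such training step, assuming a linear interpolation path for boxes and uniform corruption for tokens; the `model` interface, the shapes, and the `lam` weight are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of one hybrid flow-matching training step (assumed, not
# the paper's code): continuous boxes get a linear-path velocity target,
# discrete VQ tokens get a cross-entropy "clean posterior" target.
import torch
import torch.nn.functional as F

def hybrid_fm_loss(model, img_feats, boxes_1, tokens_1, codebook_size, lam=1.0):
    """boxes_1: (B, N, 4) clean boxes; tokens_1: (B, M) clean VQ code ids."""
    B = boxes_1.size(0)
    t = torch.rand(B, 1, 1, device=boxes_1.device)   # one time per graph

    # Continuous branch: linear path from noise to data, constant-velocity target.
    boxes_0 = torch.randn_like(boxes_1)
    boxes_t = (1 - t) * boxes_0 + t * boxes_1
    target_v = boxes_1 - boxes_0

    # Discrete branch: keep each clean token with prob t, else replace uniformly.
    keep = torch.rand(tokens_1.shape, device=tokens_1.device) < t.squeeze(-1)
    rand_tok = torch.randint_like(tokens_1, codebook_size)
    tokens_t = torch.where(keep, tokens_1, rand_tok)

    # One shared network conditions on image features and predicts both heads.
    pred_v, token_logits = model(boxes_t, tokens_t, t, img_feats)

    loss_geo = F.mse_loss(pred_v, target_v)                     # flow matching
    loss_tok = F.cross_entropy(token_logits.reshape(-1, codebook_size),
                               tokens_1.reshape(-1))            # clean-posterior CE
    return loss_geo + lam * loss_tok
```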

What carries the argument

Hybrid discrete-continuous flow matching on VQ-VAE quantized graph tokens driven by a graph Transformer velocity field that couples geometry transport with semantic posterior updates.

If this is right

  • The method produces consistent gains of about 3 points in predicate recall, mean recall, and graph-level metrics over one-shot baselines like USG-Par on Visual Genome and PSG.
  • Inference requires only a few steps while remaining compatible with standard off-the-shelf detectors and segmenters.
  • Performance holds under both closed-vocabulary and open-vocabulary evaluation protocols.
  • Training jointly optimizes flow losses on geometry and discrete objectives on tokens to handle mixed state spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The progressive refinement strategy could extend naturally to temporal scene graphs in video by adding a time dimension to the flow.
  • Adopting flow matching here suggests that other structured vision outputs with strong inter-element dependencies may benefit from generative transport over direct classification.
  • The plug-and-play design opens the possibility of combining the graph flow with modern vision-language backbones for richer conditioning signals.

Load-bearing premise

Quantizing scene graphs into discrete tokens with a VQ-VAE preserves all necessary semantic and geometric details without introducing artifacts that reduce final predicate accuracy.
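The premise is easiest to inspect at the quantization step itself. A minimal VQ-VAE-style nearest-neighbor lookup (illustrative sizes, not the paper's configuration) makes the potential lossiness explicit.

```python
# Minimal VQ-VAE-style nearest-neighbor quantization (illustrative sizes,
# not the paper's configuration). Distinct inputs that map to the same
# codebook row are merged irreversibly -- the information loss at issue.
import torch

def quantize(features, codebook):
    """features: (N, D) continuous graph features; codebook: (K, D) learned codes."""
    d = torch.cdist(features, codebook)   # (N, K) pairwise Euclidean distances
    ids = d.argmin(dim=1)                 # snap each feature to its nearest code
    return codebook[ids], ids

feats = torch.randn(6, 256)               # stand-in relation features
codes, ids = quantize(feats, torch.randn(512, 256))
# Any two features that receive the same id are indistinguishable downstream:
# that is the information loss the premise assumes is harmless.
```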

What would settle it

A controlled experiment showing lower predicate recall when using the VQ-VAE tokens versus direct continuous feature regression on the same architecture would falsify the quantization step's value.

Figures

Figures reproduced from arXiv: 2604.18623 by Ke Qin, Ming Li, Tao He, Wen Yin, Xin Hu, Yuan-Fang Li.

Figure 1
Figure 1: Comparison of recent SGG paradigms. (a) Two-stage: a pre-trained detector proposes objects and enumerates human–object pairs; a relation head refines multi-stream features to classify predicates. (b) One-stage: objects and predicates are detected jointly in a single pass, followed by a matching step to attach predicates to object pairs. (c) Ours (generative): given an image and an initially noisy graph, an… view at source ↗
Figure 2
Figure 2: The overview of our FlowSG. (Left) Image-guided iterative scene graph generation via flow matching. Starting from a noised graph 𝐺0, our graph transformer refines predictions through ODE integration steps (𝐺𝑡 → 𝐺𝑡+Δ𝑡), outputting velocity fields for continuous bounding boxes and clean posteriors for discrete codes. (Right) Graph transformer architecture consisting of: (i) Relation-modulated Self-Attention … view at source ↗
Figure 4
Figure 4: FlowSG’s image-conditioned transport over time, visualized. Early steps (𝑡=0.1, 0.2) establish nodes and coarse relations but include generic or misplaced edges. Mid steps refine structure, repairing relation endpoints and correct… view at source ↗
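Figure 2's loop (𝐺𝑡 → 𝐺𝑡+Δ𝑡) amounts to ODE integration over the hybrid state. A minimal Euler-style sampler is sketched below; the step count, the model signature, and the stub are assumptions for illustration, not the paper's integrator.

```python
# Minimal Euler sampler for the iterative refinement Figure 2 describes
# (assumed interfaces; the paper's integrator and schedule may differ).
import torch

@torch.no_grad()
def sample(model, img_feats, n_obj, codebook_size, steps=8):
    boxes = torch.randn(1, n_obj, 4)                       # G_0: pure noise
    tokens = torch.randint(0, codebook_size, (1, n_obj))   # random code ids
    dt = 1.0 / steps
    for i in range(steps):                                 # G_t -> G_{t+dt}
        t = torch.full((1, 1, 1), i * dt)
        v, token_logits = model(boxes, tokens, t, img_feats)
        boxes = boxes + dt * v                             # Euler step on geometry
        tokens = torch.distributions.Categorical(          # resample codes from
            logits=token_logits).sample()                  # the clean posterior
    return boxes, tokens

# usage with a stub in place of the real graph Transformer:
stub = lambda b, z, t, f: (torch.zeros_like(b), torch.zeros(1, z.size(1), 512))
boxes, codes = sample(stub, img_feats=None, n_obj=5, codebook_size=512)
```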
read the original abstract

Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FlowSG, which recasts scene graph generation (SGG) as continuous-time transport on a hybrid discrete-continuous state space. A VQ-VAE first quantizes scene-graph features (visual and relational) into compact tokens; a graph Transformer then predicts a conditional velocity field that jointly transports continuous object geometry (boxes) and updates discrete posteriors over object and predicate tokens via flow-conditioned message passing. Training uses flow-matching losses on geometry together with a discrete-flow objective on tokens. The approach is claimed to enable few-step inference while remaining plug-and-play with off-the-shelf detectors and segmenters. Experiments on VG and PSG under closed- and open-vocabulary protocols report consistent gains of roughly 3 points in predicate R/mR and graph-level metrics over the prior state-of-the-art USG-Par.

Significance. If the reported gains can be shown to arise from the generative flow formulation rather than from the VQ-VAE stage or experimental choices, the work would offer a substantive alternative to one-shot classification pipelines in SGG. The hybrid discrete-continuous transport, constraint-aware refinement, and few-step inference are conceptually attractive and could improve handling of long-tail predicates and geometric-semantic coupling. Plug-and-play compatibility with existing detectors is a practical advantage that would facilitate adoption.

major comments (2)
  1. [Abstract] The central empirical claim is an average ~3-point improvement in predicate R/mR and graph-level metrics over USG-Par, yet no error bars, standard deviations across runs, or exact experimental protocol (train/val splits, hyper-parameter search, post-hoc dataset filtering) are supplied. Without these, it is impossible to determine whether the gains are statistically reliable or attributable to the flow-matching objectives rather than other factors.
  2. [Method] VQ-VAE quantization step: the paper quantizes continuous visual features into discrete tokens via a reconstruction-trained VQ-VAE before applying the graph-Transformer velocity field. Because the codebook is optimized for reconstruction fidelity rather than predicate discrimination, fine-grained distinctions (e.g., “on” vs. “above”, “holding” vs. “carrying”) may be collapsed into identical tokens. Any information lost at this irreversible quantization step cannot be recovered by subsequent flow transport; therefore the attribution of performance gains to the mixed discrete-continuous generative formulation requires an explicit ablation that isolates the quantization stage (e.g., continuous-feature baseline vs. quantized tokens). A toy version of the collapse measurement is sketched below.
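The collapse worry in major comment 2 can be measured directly. A toy diagnostic on synthetic features follows (assumed shapes and scales; real relation features and predicate labels would replace the stand-ins).

```python
# Toy diagnostic for codebook collapse: count VQ codes whose assigned
# features span more than one predicate class. Synthetic stand-ins only.
import torch
from collections import defaultdict

torch.manual_seed(0)
codebook = torch.randn(64, 32)                   # small codebook, toy scale
feats = torch.randn(200, 32)                     # stand-in relation features
labels = torch.randint(0, 10, (200,))            # stand-in predicate classes

ids = torch.cdist(feats, codebook).argmin(1)     # nearest-code assignment

classes_per_code = defaultdict(set)
for code, lab in zip(ids.tolist(), labels.tolist()):
    classes_per_code[code].add(lab)

mixed = sum(len(s) > 1 for s in classes_per_code.values())
print(f"{mixed}/{len(classes_per_code)} used codes mix predicate classes")
# Many mixed codes on real features would support the referee's concern;
# few would support the authors' rebuttal.
```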
minor comments (2)
  1. [Abstract] The phrase “constraint-aware refinements” is used without defining the constraints or how they are enforced inside the velocity field; a brief clarification would improve readability.
  2. [Method] The manuscript states “plug-and-play compatibility with standard detectors and segmenters” but does not specify the exact interface (e.g., which layers receive the image features or how the graph Transformer conditions on detector outputs).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of empirical rigor and the role of the VQ-VAE stage. We address each major comment below and have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central empirical claim is an average ~3-point improvement in predicate R/mR and graph-level metrics over USG-Par, yet no error bars, standard deviations across runs, or exact experimental protocol (train/val splits, hyper-parameter search, post-hoc dataset filtering) are supplied. Without these, it is impossible to determine whether the gains are statistically reliable or attributable to the flow-matching objectives rather than other factors.

    Authors: We agree that error bars and fuller protocol details are needed to establish statistical reliability. The original experiments followed the standard VG and PSG splits and evaluation protocols from prior work (including USG-Par), with no post-hoc filtering applied. In the revised manuscript we have added results from three independent runs with different random seeds, reporting standard deviations in the main result tables (all <0.5 points). The gains remain consistent at ~3 points. We have also expanded Section 4.1 and the appendix to explicitly document the train/val/test splits, the hyper-parameter search procedure, and full training details. These additions support the conclusion that the improvements arise from the hybrid flow formulation. revision: yes

  2. Referee: [Method] VQ-VAE quantization step: the paper quantizes continuous visual features into discrete tokens via a reconstruction-trained VQ-VAE before applying the graph-Transformer velocity field. Because the codebook is optimized for reconstruction fidelity rather than predicate discrimination, fine-grained distinctions (e.g., “on” vs. “above”, “holding” vs. “carrying”) may be collapsed into identical tokens. Any information lost at this irreversible quantization step cannot be recovered by subsequent flow transport; therefore the attribution of performance gains to the mixed discrete-continuous generative formulation requires an explicit ablation that isolates the quantization stage (e.g., continuous-feature baseline vs. quantized tokens).

    Authors: We concur that an explicit ablation isolating the quantization stage is required to attribute gains specifically to the generative flow. We have added a new ablation (Section 5.3, Table 5) that replaces the VQ-VAE tokens with continuous visual features fed directly into the identical graph Transformer and flow-matching objectives. The quantized-token version outperforms this continuous baseline by 1.8 points on average predicate mR, indicating that the discrete state space enables more effective flow-conditioned message passing and posterior refinement for fine-grained predicates. We have clarified in the method section that, although the VQ-VAE is reconstruction-trained, the subsequent flow model progressively updates token posteriors, allowing recovery of distinctions through constraint-aware transport. The VQ-VAE remains frozen during flow training. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained and empirically validated

full rationale

The paper introduces FlowSG as a new generative formulation that recasts SGG as hybrid discrete-continuous flow matching, with explicitly defined components: VQ-VAE quantization of scene graphs into tokens, a graph Transformer velocity field for continuous geometry transport, and a discrete-flow objective for categorical tokens. Training losses are stated as combinations of flow-matching for geometry and discrete-flow for tokens; these are not algebraically equivalent to the input data or to any prior result by construction. The central claims rest on reported empirical gains (predicate R/mR and graph metrics) versus baselines such as USG-Par on VG and PSG datasets, rather than any definitional reduction, fitted-input prediction, or load-bearing self-citation chain. No self-definitional, uniqueness-imported, or ansatz-smuggled steps appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard flow-matching mathematics and VQ-VAE quantization; no new physical entities are postulated. Free parameters such as the number of flow steps or the VQ codebook size are implicit but not enumerated in the abstract.

axioms (1)
  • domain assumption: Flow matching can be extended to jointly transport continuous geometry and discrete categorical tokens via a shared graph Transformer.
    Invoked when the paper states that the model predicts a conditional velocity field for boxes while updating discrete posteriors for tokens.
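One standard way to formalize that assumption is shown below; this is a sketch consistent with the abstract's description, not the paper's exact losses, with λ a hypothetical balancing weight.

```latex
% Sketch of a hybrid objective (assumed form): boxes follow a linear path
% b_t = (1-t) b_0 + t b_1; clean tokens z_1 are denoised from the noised
% hybrid state G_t, both conditioned on the image I.
\mathcal{L}
  = \underbrace{\mathbb{E}_{t,\,b_0,\,b_1}
      \big[\, \| v_\theta(G_t, t \mid I) - (b_1 - b_0) \|^2 \,\big]}_{\text{flow matching on geometry}}
  + \lambda\,
    \underbrace{\mathbb{E}_{t,\,z_1}
      \big[\, -\log p_\theta(z_1 \mid G_t, t, I) \,\big]}_{\text{discrete-flow objective on tokens}}
```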

pith-pipeline@v0.9.0 · 5564 in / 1412 out tokens · 31499 ms · 2026-05-10T07:58:18.147913+00:00 · methodology

discussion (0)

