arxiv: 2412.07160 · v3 · submitted 2024-12-10 · 💻 cs.CV

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

Pith reviewed 2026-05-23 07:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords contrastive learning · temporal scene graph generation · panoptic scene graph · mask tubes · motion patterns · video understanding · 4D data

The pith

Contrastive objectives on motion in mask tubes improve temporal relation prediction over pooling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a contrastive representation learning approach can better capture how motion patterns signal relations between entities in videos and 4D scenes. It replaces or augments standard temporal pooling on tracked entity masks with three specific contrastive objectives: pulling together mask tubes of matching subject-relation-object triplets, pushing them away from their own temporally shuffled versions, and separating tubes from different triplets within the same video. A reader would care because this targets the under-used motion signal without needing extra labels. Experiments report consistent gains on both video and 4D panoptic scene graph benchmarks.

Core claim

A motion-aware contrastive framework learns closer representations for mask tubes of similar triplets, distant representations for temporally shuffled versions of the same tube, and distant representations for different triplets inside one video; these learned representations then improve relation prediction in temporal panoptic scene graphs.
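
A minimal NumPy sketch of how the three objectives could be folded into one InfoNCE-style loss for a single anchor tube. This page does not give the paper's exact loss, so the softmax over cosine similarities, the function name `tube_contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def tube_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Hypothetical InfoNCE-style loss for one mask-tube embedding.

    anchor:    (d,)   embedding of the anchor tube
    positives: (P, d) embeddings of tubes with the same triplet
    negatives: (N, d) embeddings of the shuffled anchor and of
               same-video tubes with different triplets
    """
    a = l2_normalize(anchor)
    pos_logits = l2_normalize(positives) @ a / temperature
    neg_logits = l2_normalize(negatives) @ a / temperature
    logits = np.concatenate([pos_logits, neg_logits])
    # Numerically stable log-sum-exp over all candidates.
    peak = logits.max()
    log_denominator = peak + np.log(np.exp(logits - peak).sum())
    # Average negative log-probability assigned to the positives.
    return float(np.mean(log_denominator - pos_logits))

# Toy embeddings (made up): the anchor points along the first axis.
anchor = np.array([1.0, 0.0, 0.0])
same_triplet = np.array([[0.9, 0.1, 0.0]])      # objective 1: pull together
shuffled_anchor = np.array([[-1.0, 0.0, 0.0]])  # objective 2: push apart
other_triplet = np.array([[0.0, 1.0, 0.0]])     # objective 3: push apart

well_separated = tube_contrastive_loss(
    anchor, same_triplet, np.vstack([shuffled_anchor, other_triplet])
)
# Same embeddings, but with the shuffled tube wrongly treated as the positive.
roles_swapped = tube_contrastive_loss(
    anchor, shuffled_anchor, np.vstack([same_triplet, other_triplet])
)
```

In this toy setup `well_separated` is much smaller than `roles_swapped`, which is the behaviour the three objectives are meant to reward: same-triplet tubes near the anchor, shuffled and different-triplet tubes far from it.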

What carries the argument

Three contrastive objectives applied directly to motion patterns inside tracked entity mask tubes.

If this is right

  • Temporal scene graph models can improve by incorporating motion-focused contrastive terms without extra supervision.
  • Gains appear on both video scene graph and 4D scene graph tasks.
  • The framework preserves the original triplet annotations and does not introduce new failure modes from added labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same motion-contrastive pattern could be tested on action recognition or video object tracking where motion distinguishes categories.
  • Longer video sequences might require adjustments if motion patterns become less stable over extended time.

Load-bearing premise

Motion patterns inside mask tubes reliably indicate the subject-relation-object triplet and the three contrastive objectives extract this signal more effectively than temporal pooling.
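
One way to see why order-insensitive pooling can miss motion: a temporal mean returns the same vector for a tube and for any reordering of its frames, while frame-to-frame differences change. The sketch below assumes a tube is a `(T, d)` array of per-frame features and that pooling is a plain mean over time; the pooling used by the baselines may differ, and `shuffle_tube` is a hypothetical helper.

```python
import numpy as np

def shuffle_tube(tube, rng):
    """Return the tube with its frames in a random, non-identity order."""
    num_frames = tube.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames to shuffle")
    identity = np.arange(num_frames)
    while True:
        order = rng.permutation(num_frames)
        if not np.array_equal(order, identity):
            return tube[order]

rng = np.random.default_rng(0)
# Toy tube: 4 frames, 3 features, drifting steadily over time.
tube = np.arange(12, dtype=float).reshape(4, 3)
shuffled = shuffle_tube(tube, rng)

# Mean pooling over time ignores frame order.
pooling_is_blind = np.allclose(tube.mean(axis=0), shuffled.mean(axis=0))
# Frame-to-frame differences (a crude motion proxy) do not.
motion_differs = not np.allclose(np.diff(tube, axis=0), np.diff(shuffled, axis=0))
```

Both flags come out true for this toy tube, so a model that only sees the pooled vector cannot tell the original from its shuffled copy. The shuffling objective is useful only if the encoder keeps some order-sensitive signal.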

What would settle it

A controlled replacement of the contrastive losses by standard temporal pooling on the same backbone and data, followed by measurement of whether relation prediction accuracy drops on the video and 4D test sets.
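
Such a controlled comparison could be organised as a grid over the three loss terms. The sketch below only enumerates the configurations; the objective names and the `pooling_baseline` label are hypothetical, and training and evaluation are left out.

```python
from itertools import product

# Hypothetical names for the three contrastive terms.
OBJECTIVES = ("same_triplet_pull", "shuffle_push", "cross_triplet_push")

def ablation_grid():
    """Enumerate every on/off combination of the three contrastive terms."""
    configs = []
    for flags in product([False, True], repeat=len(OBJECTIVES)):
        enabled = dict(zip(OBJECTIVES, flags))
        # With every term off, only the pooling pipeline remains.
        name = "+".join(k for k, on in enabled.items() if on) or "pooling_baseline"
        configs.append({"name": name, **enabled})
    return configs

configs = ablation_grid()
# The run that isolates the label-free, motion-specific term.
shuffle_only = [c for c in configs if c["name"] == "shuffle_push"]
```

Running all eight configurations on the same backbone and data would show both the drop from removing every contrastive term and the share of the gain that the shuffling term contributes on its own.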

Figures

Figures reproduced from arXiv: 2412.07160 by Anh Tuan Luu, Cong-Duy T Nguyen, See-kiong Ng, Thong Thanh Nguyen, Xiaobao Wu, Yi Bin.

Figure 1: State-of-the-art IPS+T - Convolution (Yang et al. …
Figure 2: Examples of temporal panoptic scene graph generation of state-of-the-art (Yang et al. 2023, 2024) and our method.
Figure 3: Framework overview of contrastive learning for temporal scene graph generation.
Figure 4: Proposed strategy to select strong-motion tubes.
Figure 5: Ablation results on threshold γ. … use a ResNet-101 pretrained on ImageNet (Russakovsky et al. 2015) and the DKNet (Wu et al. 2022) as the visual encoder, respectively. We fine-tune the segmentation module for RGB-D and point cloud videos for 12 and 200 epochs, respectively. We use an additional 100 epochs to train the relation classification module. Based on validation, we adopt a threshold γ = 9.0 and a ma…
Original abstract

To equip artificial intelligence with a comprehensive understanding towards a temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across temporal dimensions (mask tubes), then predict their relations with temporal pooling operation, which does not fully utilize the motion indicative of the entities' relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion pattern for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Secondly, we seek to push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets. Code is available at: https://github.com/nguyentthong/motion-contrastive-sgg

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a motion-aware contrastive learning framework for temporal panoptic scene graph generation. It encodes mask tubes and replaces temporal pooling with three contrastive objectives: (1) pulling together embeddings of mask tubes that share the same subject-relation-object triplet, (2) pushing apart a mask tube and its temporally shuffled version, and (3) pushing apart mask tubes from the same video that belong to different triplets. The central claim is that this framework yields significant improvements over state-of-the-art methods on both video and 4D datasets.

Significance. If the motion-specific component can be shown to drive the gains, the work would provide a practical way to inject motion signals into supervised scene-graph pipelines without new annotations or architectural changes. The public code release is a positive factor for reproducibility. However, because two of the three objectives are label-dependent supervised contrastive terms, the attribution of gains specifically to motion patterns remains unverified.

major comments (2)
  1. [Method (contrastive objectives)] Method section (contrastive objectives): Objective 1 pulls same-triplet mask tubes using ground-truth relation labels to define positives, and Objective 3 uses different triplets as negatives; only Objective 2 (temporal shuffling) is motion-specific and label-free. Because the main task is already supervised, these label-dependent terms amount to standard supervised contrastive regularization. An ablation that isolates the shuffling term (or removes the label-dependent terms) is required to support the claim that motion patterns are being extracted more effectively than temporal pooling.
  2. [Experiments] Experiments section: The abstract asserts 'significant improvements' on video and 4D datasets, yet the provided description supplies no quantitative numbers, baseline details, ablation tables, or error analysis. Without these, the central empirical claim cannot be verified and the motion-aware attribution cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract and introduction should explicitly state the three objectives with a short equation or pseudocode so readers can immediately distinguish the motion-specific term from the label-dependent terms.
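
As an illustration of the kind of compact statement this comment asks for, the three objectives could be written as a single InfoNCE-style loss per tube. This is a sketch under assumed notation, not the paper's equation; every symbol below is introduced here for illustration.

```latex
% Assumed notation (not taken from the paper):
%   z_i          embedding of mask tube i
%   \tilde{z}_i  embedding of the temporally shuffled copy of tube i
%   P(i)         tubes sharing tube i's triplet (label-dependent)
%   N(i)         same-video tubes with a different triplet (label-dependent)
%   s(.,.)       cosine similarity; \tau is a temperature
\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log
  \frac{\exp\big(s(z_i, z_p)/\tau\big)}
       {\sum_{k \in P(i) \cup N(i)} \exp\big(s(z_i, z_k)/\tau\big)
        + \exp\big(s(z_i, \tilde{z}_i)/\tau\big)}
```

In this form only the final denominator term, built from the shuffled copy, needs no labels; the sets P(i) and N(i) both depend on ground-truth triplets.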

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the attribution of gains to the motion-specific component of our framework. We address each major comment below.

point-by-point responses
  1. Referee: [Method (contrastive objectives)] Method section (contrastive objectives): Objective 1 pulls same-triplet mask tubes using ground-truth relation labels to define positives, and Objective 3 uses different triplets as negatives; only Objective 2 (temporal shuffling) is motion-specific and label-free. Because the main task is already supervised, these label-dependent terms amount to standard supervised contrastive regularization. An ablation that isolates the shuffling term (or removes the label-dependent terms) is required to support the claim that motion patterns are being extracted more effectively than temporal pooling.

    Authors: We agree that isolating the contribution of the motion-specific Objective 2 (temporal shuffling) is important for attributing improvements specifically to motion patterns rather than the supervised contrastive regularization. In the revised version, we will add a dedicated ablation that evaluates performance with and without Objective 2 while retaining the label-dependent terms, allowing direct comparison to the temporal pooling baseline. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts 'significant improvements' on video and 4D datasets, yet the provided description supplies no quantitative numbers, baseline details, ablation tables, or error analysis. Without these, the central empirical claim cannot be verified and the motion-aware attribution cannot be assessed.

    Authors: The full manuscript contains quantitative results, baseline comparisons, and ablation tables in the Experiments section. We will expand the presentation of these results (including any additional error analysis) and ensure all tables are clearly cross-referenced from the abstract and method sections in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical contrastive addition to existing pipelines

full rationale

The paper adds three contrastive objectives to mask-tube encoders for temporal scene-graph prediction and reports experimental gains on video/4D datasets. No equations, derivations, or self-citations are shown that reduce any claimed result to a fitted quantity or prior result defined by the present work itself. The method is presented as an empirical extension rather than a closed mathematical chain, so its claims are tested against external benchmarks rather than against quantities the paper defines itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard contrastive learning assumptions and the domain premise that motion encodes relation information; no new entities or fitted constants are introduced in the abstract.

axioms (2)
  • domain assumption Motion patterns in tracked mask tubes are indicative of subject-relation-object relations and can be captured by contrastive objectives.
    Central motivation for replacing temporal pooling with the proposed losses.
  • domain assumption Contrastive representation learning improves downstream relation prediction when positive and negative pairs are defined by triplet identity and temporal order.
    Underpins the three contrastive terms described.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 5 internal anchors

  3. [3]

    Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; and Hu, W. 2020. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognition, 103: 107321

  4. [4]

    Chen, Z.; Zheng, T.; and Song, M. 2024. Curriculum Negative Mining For Temporal Networks. arXiv preprint arXiv:2407.17070

  5. [5]

    Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1290--1299

  6. [6]

    Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2022. Epic-kitchens-100. International Journal of Computer Vision, 130: 33--55

  7. [7]

    Davtyan, A.; Sameni, S.; and Favaro, P. 2023. Efficient video prediction via sparsely conditioned flow matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23263--23274

  8. [8]

    Dong, Q.; and Fu, Y. 2024. MemFlow: Optical Flow Estimation and Prediction with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19068--19078

  9. [9]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D.; Xia, F.; Sajjadi, M. S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378

  10. [10]

    Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012

  11. [11]

    Semi-Supervised Classification with Graph Convolutional Networks

    Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

  12. [12]

    Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; and Loy, C. C. 2023a. Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854

  13. [13]

    Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; and Loy, C. C. 2023b. Tube-Link: A flexible cross tube framework for universal video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13923--13933

  14. [14]

    Li, X.; Zhang, W.; Pang, J.; Chen, K.; Cheng, G.; Tong, Y.; and Loy, C. C. 2022. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18847--18857

  15. [15]

    Li, Y.; Yang, X.; and Xu, C. 2022. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13874--13883

  16. [16]

    Liu, H.; Min, K.; Valdez, H. A.; and Tripathi, S. 2024. Contrastive Language Video Time Pre-training. arXiv preprint arXiv:2406.02631

  17. [17]

    Liu, K.; Li, Y.; Xu, Y.; Liu, S.; and Liu, S. 2022. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters, 29: 1883--1887

  18. [18]

    Ma, X.; Yong, S.; Zheng, Z.; Li, Q.; Liang, Y.; Zhu, S.-C.; and Huang, S. 2022. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474

  19. [19]

    Nag, S.; Min, K.; Tripathi, S.; and Roy-Chowdhury, A. K. 2023. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22803--22813

  20. [20]

    Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227

  21. [21]

    Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486

  22. [22]

    Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024b. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. arXiv preprint arXiv:2407.03788

  23. [23]

    Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

    Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J. Z.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024c. Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. arXiv preprint arXiv:2406.05615

  24. [24]

    Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986

  25. [25]

    Nguyen, T.; Wu, X.; Dong, X.; Le, K. M.; Hu, Z.; Nguyen, C.-D.; Ng, S.-K.; and Luu, A. T. 2024d. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18824--18832

  26. [26]

    Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549

  27. [27]

    Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024e. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577

  28. [28]

    Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

  29. [29]

    Nguyen, T. T.; Hu, Z.; Wu, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024f. Encoding and Controlling Global Semantics for Long-form Video Question Answering. arXiv preprint arXiv:2405.19723

  30. [30]

    Pu, T.; Chen, T.; Wu, H.; Lu, Y.; and Lin, L. 2023. Spatial-temporal knowledge-embedded transformer for video scene graph generation. IEEE Transactions on Image Processing

  31. [31]

    Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652--660

  32. [32]

    Raychaudhuri, S.; Campari, T.; Jain, U.; Savva, M.; and Chang, A. X. 2023. Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696, 2

  33. [33]

    Ren, S.; Zhu, H.; Wei, C.; Li, Y.; Yuille, A.; and Xie, C. 2024. ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning. arXiv preprint arXiv:2405.15160

  34. [34]

    Rodin, I.; Furnari, A.; Min, K.; Tripathi, S.; and Farinella, G. M. 2024. Action Scene Graphs for Long-Form Understanding of Egocentric Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18622--18632

  35. [35]

    Rosa, K. D. 2024. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv preprint arXiv:2405.17706

  36. [36]

    Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211--252

  37. [37]

    Shang, X.; Di, D.; Xiao, J.; Cao, Y.; Yang, X.; and Chua, T.-S. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, 279--287

  38. [38]

    Shen, H.; Shi, L.; Xu, W.; Cen, Y.; Zhang, L.; and An, G. 2024. Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection. arXiv preprint arXiv:2403.19111

  39. [39]

    Sobel, I.; Duda, R.; Hart, P.; and Wiley, J. 2022. Sobel-feldman operator. Preprint at https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837. Accessed, 20

  40. [40]

    Song, X.; Li, Z.; Chen, S.; Cai, X.-Q.; and Demachi, K. 2024. An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video. arXiv preprint arXiv:2404.06741

  41. [41]

    Sudhakaran, G.; Dhami, D. S.; Kersting, K.; and Roth, S. 2023. Vision relation transformer for unbiased scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21882--21893

  42. [42]

    Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961--3970

  43. [43]

    Wang, G.; Li, Z.; Chen, Q.; and Liu, Y. 2024a. OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 27938--27947

  44. [44]

    Wang, W.; Luo, Y.; Chen, Z.; Jiang, T.; Yang, Y.; and Xiao, J. 2023. Taking a closer look at visual relation: Unbiased video scene graph generation with decoupled label learning. IEEE Transactions on Multimedia

  45. [45]

    Wang, Y.; Yuan, S.; Jian, X.; Pang, W.; Wang, M.; and Yu, N. 2024b. HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models. arXiv preprint arXiv:2404.05083

  46. [46]

    Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.; and Bertinetto, L. 2021. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34: 726--738

  47. [47]

    Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771

  48. [48]

    Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...

  49. [49]

    Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; and Zhong, W. 2022. 3d instances as 1d kernels. In European Conference on Computer Vision, 235--252. Springer

  50. [50]

    Xiao, F.; Tighe, J.; and Modolo, D. 2021. Modist: Motion distillation for self-supervised video representation learning. arXiv preprint arXiv:2106.09703, 3

  51. [51]

    Yang, J.; Ang, Y. Z.; Guo, Z.; Zhou, K.; Zhang, W.; and Liu, Z. 2022. Panoptic scene graph generation. In European Conference on Computer Vision, 178--196. Springer

  52. [52]

    Yang, J.; Cen, J.; Peng, W.; Liu, S.; Hong, F.; Li, X.; Zhou, K.; Chen, Q.; and Liu, Z. 2024. 4d panoptic scene graph generation. Advances in Neural Information Processing Systems, 36

  53. [53]

    Yang, J.; Peng, W.; Li, X.; Guo, Z.; Chen, L.; Li, B.; Ma, Z.; Zhou, K.; Zhang, W.; Loy, C. C.; et al. 2023. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18675--18685

  54. [54]

    Zhao, C.; Shen, Y.; Chen, Z.; Ding, M.; and Gan, C. 2023. Textpsg: Panoptic scene graph generation from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2839--2850

  55. [55]

    Zhou, H.; Liu, Q.; and Wang, Y. 2023. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10608--10617

  56. [56]

    Zhou, L.; Zhou, Y.; Lam, T. L.; and Xu, Y. 2022. Context-aware mixture-of-experts for unbiased scene graph generation. arXiv preprint arXiv:2208.07109