Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

Anh Tuan Luu; Cong-Duy T Nguyen; See-kiong Ng; Thong Thanh Nguyen; Xiaobao Wu; Yi Bin

arxiv: 2412.07160 · v3 · submitted 2024-12-10 · 💻 cs.CV

Pith reviewed 2026-05-23 07:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords contrastive learning · temporal scene graph generation · panoptic scene graph · mask tubes · motion patterns · video understanding · 4D data

The pith

Contrastive objectives on motion in mask tubes improve temporal relation prediction over pooling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a contrastive representation learning approach can better capture how motion patterns signal relations between entities in videos and 4D scenes. It replaces or augments standard temporal pooling on tracked entity masks with three specific contrastive objectives: pulling together mask tubes of matching subject-relation-object triplets, pushing them away from their own temporally shuffled versions, and separating tubes from different triplets within the same video. A reader would care because this targets the under-used motion signal without needing extra labels. Experiments report consistent gains on both video and 4D panoptic scene graph benchmarks.

Core claim

A motion-aware contrastive framework learns closer representations for mask tubes of similar triplets, distant representations for temporally shuffled versions of the same tube, and distant representations for different triplets inside one video; these learned representations then improve relation prediction in temporal panoptic scene graphs.

What carries the argument

Three contrastive objectives applied directly to motion patterns inside tracked entity mask tubes.

If this is right

Temporal scene graph models can improve by incorporating motion-focused contrastive terms without extra supervision.
Gains appear on both video scene graph and 4D scene graph tasks.
The framework preserves the original triplet annotations and does not introduce new failure modes from added labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same motion-contrastive pattern could be tested on action recognition or video object tracking where motion distinguishes categories.
Longer video sequences might require adjustments if motion patterns become less stable over extended time.

Load-bearing premise

Motion patterns inside mask tubes reliably indicate the subject-relation-object triplet and the three contrastive objectives extract this signal more effectively than temporal pooling.
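The premise can be made concrete with a toy example. The sketch below is not the paper's code; it only illustrates why mean pooling over time cannot see frame order, while an order-sensitive summary can. The `motion_descriptor` helper and the toy tube are assumptions introduced here for illustration.

```python
import numpy as np

def mean_pool(tube):
    """Temporal mean pooling: average the per-frame features over time."""
    return tube.mean(axis=0)

def motion_descriptor(tube):
    """Hypothetical order-sensitive summary: mean absolute frame-to-frame change."""
    return np.abs(np.diff(tube, axis=0)).mean(axis=0)

# Toy tube: 8 frames, 4 feature dimensions, drifting linearly over time.
tube = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((1, 4))

# A fixed permutation of the frame order (a temporally shuffled tube).
perm = np.array([3, 0, 6, 1, 7, 2, 5, 4])
shuffled = tube[perm]

# Mean pooling is permutation invariant, so this gap is zero up to rounding.
pooled_gap = np.abs(mean_pool(tube) - mean_pool(shuffled)).max()

# The frame-difference summary changes when the order changes.
motion_gap = np.abs(motion_descriptor(tube) - motion_descriptor(shuffled)).max()

print(f"pooled gap: {pooled_gap:.2e}, motion gap: {motion_gap:.3f}")
```

Under this toy setup, a model that only sees pooled features has no way to push a tube away from its shuffled version, which is the gap the shuffle objective is meant to address.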

What would settle it

A controlled replacement of the contrastive losses by standard temporal pooling on the same backbone and data, followed by measurement of whether relation prediction accuracy drops on the video and 4D test sets.
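One way to organize such a controlled comparison is to enumerate every on/off combination of the three loss terms, with the all-off configuration acting as the temporal pooling baseline. This is a minimal sketch; the term names are hypothetical labels chosen here, not identifiers from the paper or its repository.

```python
from itertools import product

# Hypothetical labels for the three contrastive terms described above.
LOSS_TERMS = ("same_triplet_pull", "shuffle_push", "cross_triplet_push")

def ablation_grid():
    """Enumerate every on/off combination of the three contrastive terms."""
    grid = []
    for flags in product((False, True), repeat=len(LOSS_TERMS)):
        enabled = tuple(term for term, on in zip(LOSS_TERMS, flags) if on)
        # With no contrastive term enabled, only the pooling baseline remains.
        name = "+".join(enabled) if enabled else "temporal_pooling_baseline"
        grid.append({"name": name, "enabled": enabled})
    return grid

configs = ablation_grid()
for cfg in configs:
    print(cfg["name"])
```

Each configuration would be trained on the same backbone and data, so that any difference in relation prediction accuracy can be attributed to the enabled terms.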

Figures

Figures reproduced from arXiv: 2412.07160 by Anh Tuan Luu, Cong-Duy T Nguyen, See-kiong Ng, Thong Thanh Nguyen, Xiaobao Wu, Yi Bin.

**Figure 2.** Examples of temporal panoptic scene graph generation of state-of-the-art (Yang et al. 2023, 2024) and our method. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Framework overview of contrastive learning for temporal scene graph generation. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Proposed strategy to select strong-motion tubes. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]

**Figure 5.** Ablation results on threshold γ.

…use a ResNet-101 pretrained on ImageNet (Russakovsky et al. 2015) and the DKNet (Wu et al. 2022) as the visual encoder, respectively. We fine-tune the segmentation module for RGB-D and point cloud videos for 12 and 200 epochs, respectively. We use additional 100 epochs to train the relation classification module. Based on validation, we adopt a threshold γ = 9.0 and a ma…

Original abstract

To equip artificial intelligence with a comprehensive understanding towards a temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across temporal dimensions (mask tubes), then predict their relations with temporal pooling operation, which does not fully utilize the motion indicative of the entities' relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion pattern for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Secondly, we seek to push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets. Code is available at: https://github.com/nguyentthong/motion-contrastive-sgg

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds three contrastive losses on mask tubes, but two rely on ground-truth relation labels so the motion-specific part is narrower than the title suggests.

The main addition is a set of three contrastive objectives applied to tracked mask tubes in temporal scene graph generation. One pulls embeddings of the same subject-relation-object triplet closer, one pushes temporally shuffled versions of a tube apart, and one pushes tubes from different triplets in the same video apart. The shuffle term is the only one that directly targets motion without needing the target labels. The other two are standard supervised contrastive terms on the relation labels already used by the downstream task. This is a straightforward way to regularize the representation beyond simple temporal pooling, and releasing the code helps with checking the details. The abstract claims clear gains on video and 4D benchmarks, which is the kind of result that matters in this subfield. The limitation is that the paper does not appear to isolate whether the shuffle loss alone drives the reported improvements or whether the label-dependent terms are doing most of the work. Without that separation, the motion-aware framing is harder to evaluate. The work is aimed at people already running scene-graph pipelines on video data. It is solid enough on its own terms to go to referees, though any review should ask for ablations that separate the motion term from the supervised contrastive terms.

Referee Report

2 major / 1 minor

Summary. The paper proposes a motion-aware contrastive learning framework for temporal panoptic scene graph generation. It encodes mask tubes and replaces temporal pooling with three contrastive objectives: (1) pulling together embeddings of mask tubes that share the same subject-relation-object triplet, (2) pushing apart a mask tube and its temporally shuffled version, and (3) pushing apart mask tubes from the same video that belong to different triplets. The central claim is that this framework yields significant improvements over state-of-the-art methods on both video and 4D datasets.

Significance. If the motion-specific component can be shown to drive the gains, the work would provide a practical way to inject motion signals into supervised scene-graph pipelines without new annotations or architectural changes. The public code release is a positive factor for reproducibility. However, because two of the three objectives are label-dependent supervised contrastive terms, the attribution of gains specifically to motion patterns remains unverified.

major comments (2)

[Method, contrastive objectives] Objective 1 pulls same-triplet mask tubes using ground-truth relation labels to define positives, and Objective 3 uses different triplets as negatives; only Objective 2 (temporal shuffling) is motion-specific and label-free. Because the main task is already supervised, these label-dependent terms amount to standard supervised contrastive regularization. An ablation that isolates the shuffling term (or removes the label-dependent terms) is required to support the claim that motion patterns are being extracted more effectively than temporal pooling.
[Experiments] The abstract asserts 'significant improvements' on video and 4D datasets, yet the provided description supplies no quantitative numbers, baseline details, ablation tables, or error analysis. Without these, the central empirical claim cannot be verified and the motion-aware attribution cannot be assessed.
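The label-dependent versus label-free distinction in the first major comment can be sketched in a few lines. This is an illustrative reading of the three objectives as described in the abstract, not the authors' implementation; `pair_masks`, `shuffle_tube`, and the integer `triplet_ids` / `video_ids` encodings are assumptions made for the example.

```python
import numpy as np

def pair_masks(triplet_ids, video_ids):
    """Build boolean (N, N) masks for the two label-dependent objectives."""
    triplet_ids = np.asarray(triplet_ids)
    video_ids = np.asarray(video_ids)
    same_triplet = triplet_ids[:, None] == triplet_ids[None, :]
    same_video = video_ids[:, None] == video_ids[None, :]
    not_self = ~np.eye(len(triplet_ids), dtype=bool)
    # Objective 1: positives share a triplet label (needs ground-truth labels).
    positives = same_triplet & not_self
    # Objective 3: negatives share a video but carry a different triplet label.
    negatives = same_video & ~same_triplet
    return positives, negatives

def shuffle_tube(tube, rng):
    """Objective 2: a label-free negative made by permuting the frame order."""
    return tube[rng.permutation(len(tube))]

# Four tubes: triplet labels 0, 0, 1, 1 spread over videos 0, 1, 0, 1.
positives, negatives = pair_masks([0, 0, 1, 1], [0, 1, 0, 1])

# A toy tube of 6 frames with 2 features per frame, and its shuffled negative.
tube = np.arange(12.0).reshape(6, 2)
negative_tube = shuffle_tube(tube, np.random.default_rng(0))
```

Only `shuffle_tube` runs without any annotation, which is why an ablation on that term alone is the cleanest test of the motion-specific claim.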

minor comments (1)

[Abstract] The abstract and introduction should explicitly state the three objectives with a short equation or pseudocode so readers can immediately distinguish the motion-specific term from the label-dependent terms.
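As an illustration of what such a short statement could look like, the block below writes the three objectives as a single InfoNCE-style loss for an anchor tube $i$. This is a generic sketch, not the paper's formulation: the symbols $z_i$ (tube embedding), $\tilde{z}_i$ (embedding of the temporally shuffled tube), $P(i)$ (tubes sharing the anchor's triplet), $N(i)$ (same-video tubes with a different triplet), the similarity $s$, and the temperature $\tau$ are all assumptions introduced here, and the paper's actual similarity measure may differ.

```latex
\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)}
  \log \frac{\exp\big(s(z_i, z_p)/\tau\big)}
            {\exp\big(s(z_i, z_p)/\tau\big)
             + \underbrace{\exp\big(s(z_i, \tilde{z}_i)/\tau\big)}_{\text{shuffled tube (label-free)}}
             + \underbrace{\sum_{n \in N(i)} \exp\big(s(z_i, z_n)/\tau\big)}_{\text{same video, different triplet}}}
```

Written this way, the numerator and the last denominator term depend on triplet labels, while the middle term depends only on frame order.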

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the attribution of gains to the motion-specific component of our framework. We address each major comment below.

Referee: [Method, contrastive objectives] Objective 1 pulls same-triplet mask tubes using ground-truth relation labels to define positives, and Objective 3 uses different triplets as negatives; only Objective 2 (temporal shuffling) is motion-specific and label-free. Because the main task is already supervised, these label-dependent terms amount to standard supervised contrastive regularization. An ablation that isolates the shuffling term (or removes the label-dependent terms) is required to support the claim that motion patterns are being extracted more effectively than temporal pooling.

Authors: We agree that isolating the contribution of the motion-specific Objective 2 (temporal shuffling) is important for attributing improvements specifically to motion patterns rather than the supervised contrastive regularization. In the revised version, we will add a dedicated ablation that evaluates performance with and without Objective 2 while retaining the label-dependent terms, allowing direct comparison to the temporal pooling baseline.

Revision: yes

Referee: [Experiments] The abstract asserts 'significant improvements' on video and 4D datasets, yet the provided description supplies no quantitative numbers, baseline details, ablation tables, or error analysis. Without these, the central empirical claim cannot be verified and the motion-aware attribution cannot be assessed.

Authors: The full manuscript contains quantitative results, baseline comparisons, and ablation tables in the Experiments section. We will expand the presentation of these results (including any additional error analysis) and ensure all tables are clearly cross-referenced from the abstract and method sections in the revision.

Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical contrastive addition to existing pipelines

The paper adds three contrastive objectives to mask-tube encoders for temporal scene-graph prediction and reports experimental gains on video/4D datasets. No equations, derivations, or self-citations are shown that reduce any claimed result to a fitted quantity or prior result defined by the present work itself. The method is presented as an empirical extension rather than a closed mathematical chain, so its claims are checked against external benchmarks rather than against quantities the paper itself defines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard contrastive learning assumptions and the domain premise that motion encodes relation information; no new entities or fitted constants are introduced in the abstract.

axioms (2)

domain assumption: Motion patterns in tracked mask tubes are indicative of subject-relation-object relations and can be captured by contrastive objectives. (Central motivation for replacing temporal pooling with the proposed losses.)
domain assumption: Contrastive representation learning improves downstream relation prediction when positive and negative pairs are defined by triplet identity and temporal order. (Underpins the three contrastive terms described.)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: contrastive objective ... similar subject-relation-object triplets ... temporally shuffled versions ... optimal transport distance

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Note: Jcost-style reciprocal cost or phi-ladder spacing never appears

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 5 internal anchors

[3] Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; and Hu, W. 2020. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognition, 103: 107321.
[4] Chen, Z.; Zheng, T.; and Song, M. 2024. Curriculum Negative Mining For Temporal Networks. arXiv preprint arXiv:2407.17070.
[5] Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1290-1299.
[6] Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2022. Epic-kitchens-100. International Journal of Computer Vision, 130: 33-55.
[7] Davtyan, A.; Sameni, S.; and Favaro, P. 2023. Efficient video prediction via sparsely conditioned flow matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23263-23274.
[8] Dong, Q.; and Fu, Y. 2024. MemFlow: Optical Flow Estimation and Prediction with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19068-19078.
[9] Driess, D.; Xia, F.; Sajjadi, M. S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
[10] Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995-19012.
[11] Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[12] Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; and Loy, C. C. 2023a. Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854.
[13] Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; and Loy, C. C. 2023b. Tube-Link: A flexible cross tube framework for universal video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13923-13933.
[14] Li, X.; Zhang, W.; Pang, J.; Chen, K.; Cheng, G.; Tong, Y.; and Loy, C. C. 2022. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18847-18857.
[15] Li, Y.; Yang, X.; and Xu, C. 2022. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13874-13883.
[16] Liu, H.; Min, K.; Valdez, H. A.; and Tripathi, S. 2024. Contrastive Language Video Time Pre-training. arXiv preprint arXiv:2406.02631.
[17] Liu, K.; Li, Y.; Xu, Y.; Liu, S.; and Liu, S. 2022. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters, 29: 1883-1887.
[18] Ma, X.; Yong, S.; Zheng, Z.; Li, Q.; Liang, Y.; Zhu, S.-C.; and Huang, S. 2022. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474.
[19] Nag, S.; Min, K.; Tripathi, S.; and Roy-Chowdhury, A. K. 2023. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22803-22813.
[20] Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227.
[21] Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486.
[22] Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024b. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. arXiv preprint arXiv:2407.03788.
[23] Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J. Z.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024c. Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. arXiv preprint arXiv:2406.05615.
[24] Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974-11986.
[25] Nguyen, T.; Wu, X.; Dong, X.; Le, K. M.; Hu, Z.; Nguyen, C.-D.; Ng, S.-K.; and Luu, A. T. 2024d. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18824-18832.
[26] Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549.
[27] Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024e. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577.
[28] Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524.
[29] Nguyen, T. T.; Hu, Z.; Wu, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024f. Encoding and Controlling Global Semantics for Long-form Video Question Answering. arXiv preprint arXiv:2405.19723.
[30] Pu, T.; Chen, T.; Wu, H.; Lu, Y.; and Lin, L. 2023. Spatial-temporal knowledge-embedded transformer for video scene graph generation. IEEE Transactions on Image Processing.
[31] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652-660.
[32] Raychaudhuri, S.; Campari, T.; Jain, U.; Savva, M.; and Chang, A. X. 2023. Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696, 2.
[33] Ren, S.; Zhu, H.; Wei, C.; Li, Y.; Yuille, A.; and Xie, C. 2024. ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning. arXiv preprint arXiv:2405.15160.
[34] Rodin, I.; Furnari, A.; Min, K.; Tripathi, S.; and Farinella, G. M. 2024. Action Scene Graphs for Long-Form Understanding of Egocentric Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18622-18632.
[35] Rosa, K. D. 2024. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv preprint arXiv:2405.17706.
[36] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211-252.
[37] Shang, X.; Di, D.; Xiao, J.; Cao, Y.; Yang, X.; and Chua, T.-S. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, 279-287.
[38] Shen, H.; Shi, L.; Xu, W.; Cen, Y.; Zhang, L.; and An, G. 2024. Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection. arXiv preprint arXiv:2403.19111.
[39] Sobel, I.; Duda, R.; Hart, P.; and Wiley, J. 2022. Sobel-feldman operator. Preprint at https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837. Accessed, 20
[40] Song, X.; Li, Z.; Chen, S.; Cai, X.-Q.; and Demachi, K. 2024. An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video. arXiv preprint arXiv:2404.06741.
[41] Sudhakaran, G.; Dhami, D. S.; Kersting, K.; and Roth, S. 2023. Vision relation transformer for unbiased scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21882-21893.
[42] Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961-3970.
[43] Wang, G.; Li, Z.; Chen, Q.; and Liu, Y. 2024a. OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 27938-27947.
[44] Wang, W.; Luo, Y.; Chen, Z.; Jiang, T.; Yang, Y.; and Xiao, J. 2023. Taking a closer look at visual relation: Unbiased video scene graph generation with decoupled label learning. IEEE Transactions on Multimedia.
[45] Wang, Y.; Yuan, S.; Jian, X.; Pang, W.; Wang, M.; and Yu, N. 2024b. HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models. arXiv preprint arXiv:2404.05083.
[46] Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.; and Bertinetto, L. 2021. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34: 726-738.
[47] Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763-13771.
[48] Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088-3105. Bangkok, Thailand and virtual meeting: Assoc...
[49] Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; and Zhong, W. 2022. 3d instances as 1d kernels. In European Conference on Computer Vision, 235-252. Springer.
[50] Xiao, F.; Tighe, J.; and Modolo, D. 2021. Modist: Motion distillation for self-supervised video representation learning. arXiv preprint arXiv:2106.09703, 3.
[51] Yang, J.; Ang, Y. Z.; Guo, Z.; Zhou, K.; Zhang, W.; and Liu, Z. 2022. Panoptic scene graph generation. In European Conference on Computer Vision, 178-196. Springer.
[52] Yang, J.; Cen, J.; Peng, W.; Liu, S.; Hong, F.; Li, X.; Zhou, K.; Chen, Q.; and Liu, Z. 2024. 4d panoptic scene graph generation. Advances in Neural Information Processing Systems, 36.
[53] Yang, J.; Peng, W.; Li, X.; Guo, Z.; Chen, L.; Li, B.; Ma, Z.; Zhou, K.; Zhang, W.; Loy, C. C.; et al. 2023. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18675-18685.
[54] Zhao, C.; Shen, Y.; Chen, Z.; Ding, M.; and Gan, C. 2023. Textpsg: Panoptic scene graph generation from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2839-2850.
[55] Zhou, H.; Liu, Q.; and Wang, Y. 2023. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10608-10617.
[56] Zhou, L.; Zhou, Y.; Lam, T. L.; and Xu, Y. 2022. Context-aware mixture-of-experts for unbiased scene graph generation. arXiv preprint arXiv:2208.07109.

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; and Hu, W. 2020. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognition, 103: 107321

work page 2020

[4] [4]

Chen, Z.; Zheng, T.; and Song, M. 2024. Curriculum Negative Mining For Temporal Networks. arXiv preprint arXiv:2407.17070

work page arXiv 2024

[5] [5]

G.; Kirillov, A.; and Girdhar, R

Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1290--1299

work page 2022

[6] [6]

M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al

Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2022. Epic-kitchens-100. International Journal of Computer Vision, 130: 33--55

work page 2022

[7] [7]

Davtyan, A.; Sameni, S.; and Favaro, P. 2023. Efficient video prediction via sparsely conditioned flow matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23263--23274

work page 2023

[8] [8]

Dong, Q.; and Fu, Y. 2024. MemFlow: Optical Flow Estimation and Prediction with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19068--19078

work page 2024

[9] [9]

PaLM-E: An Embodied Multimodal Language Model

Driess, D.; Xia, F.; Sajjadi, M. S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995--19012

work page 2022

[11] [11]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; and Loy, C. C. 2023 a . Transformer-based visual segmentation: A survey. arXiv preprint arXiv:2304.09854

[13]

Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; and Loy, C. C. 2023b. Tube-Link: A flexible cross tube framework for universal video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13923--13933

[14]

Li, X.; Zhang, W.; Pang, J.; Chen, K.; Cheng, G.; Tong, Y.; and Loy, C. C. 2022. Video k-net: A simple, strong, and unified baseline for video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18847--18857

[15]

Li, Y.; Yang, X.; and Xu, C. 2022. Dynamic scene graph generation via anticipatory pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13874--13883

[16]

Liu, H.; Min, K.; Valdez, H. A.; and Tripathi, S. 2024. Contrastive Language Video Time Pre-training. arXiv preprint arXiv:2406.02631

[17]

Liu, K.; Li, Y.; Xu, Y.; Liu, S.; and Liu, S. 2022. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters, 29: 1883--1887

[18]

Ma, X.; Yong, S.; Zheng, Z.; Li, Q.; Liang, Y.; Zhu, S.-C.; and Huang, S. 2022. Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474

[19]

Nag, S.; Min, K.; Tripathi, S.; and Roy-Chowdhury, A. K. 2023. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22803--22813

[20]

Nguyen, C.-D.; Nguyen, T.; Vu, D. A.; and Tuan, L. A. 2023a. Improving multimodal sentiment analysis: Supervised angular margin-based contrastive learning for enhanced fusion representation. arXiv preprint arXiv:2312.02227

[21]

Nguyen, C.-D.; Nguyen, T.; Wu, X.; and Luu, A. T. 2024a. Kdmcse: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning. arXiv preprint arXiv:2403.17486

[22]

Nguyen, T.; Bin, Y.; Wu, X.; Dong, X.; Hu, Z.; Le, K.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024b. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning. arXiv preprint arXiv:2407.03788

[23]

Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J. Z.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2024c. Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. arXiv preprint arXiv:2406.05615

[24]

Nguyen, T.; and Luu, A. T. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34: 11974--11986

[25]

Nguyen, T.; Wu, X.; Dong, X.; Le, K. M.; Hu, Z.; Nguyen, C.-D.; Ng, S.-K.; and Luu, A. T. 2024d. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18824--18832

[26]

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D.; Ng, S.-K.; and Tuan, L. A. 2023b. Demaformer: Damped exponential moving average transformer with energy-based modeling for temporal language grounding. arXiv preprint arXiv:2312.02549

[27]

Nguyen, T.; Wu, X.; Dong, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024e. Topic Modeling as Multi-Objective Contrastive Optimization. arXiv preprint arXiv:2402.07577

[28]

Nguyen, T.; Wu, X.; Luu, A.-T.; Nguyen, C.-D.; Hai, Z.; and Bing, L. 2022. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

[29]

Nguyen, T. T.; Hu, Z.; Wu, X.; Nguyen, C.-D. T.; Ng, S.-K.; and Luu, A. T. 2024f. Encoding and Controlling Global Semantics for Long-form Video Question Answering. arXiv preprint arXiv:2405.19723

[30]

Pu, T.; Chen, T.; Wu, H.; Lu, Y.; and Lin, L. 2023. Spatial-temporal knowledge-embedded transformer for video scene graph generation. IEEE Transactions on Image Processing

[31]

Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652--660

[32]

Raychaudhuri, S.; Campari, T.; Jain, U.; Savva, M.; and Chang, A. X. 2023. Reduce, reuse, recycle: Modular multi-object navigation. arXiv preprint arXiv:2304.03696, 2

[33]

Ren, S.; Zhu, H.; Wei, C.; Li, Y.; Yuille, A.; and Xie, C. 2024. ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning. arXiv preprint arXiv:2405.15160

[34]

Rodin, I.; Furnari, A.; Min, K.; Tripathi, S.; and Farinella, G. M. 2024. Action Scene Graphs for Long-Form Understanding of Egocentric Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18622--18632

[35]

Rosa, K. D. 2024. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions. arXiv preprint arXiv:2405.17706

[36]

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211--252

[37]

Shang, X.; Di, D.; Xiao, J.; Cao, Y.; Yang, X.; and Chua, T.-S. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, 279--287

[38]

Shen, H.; Shi, L.; Xu, W.; Cen, Y.; Zhang, L.; and An, G. 2024. Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection. arXiv preprint arXiv:2403.19111

[39]

Sobel, I.; Duda, R.; Hart, P.; and Wiley, J. 2022. Sobel-feldman operator. Preprint at https://www.researchgate.net/profile/Irwin-Sobel/publication/285159837. Accessed, 20

[40]

Song, X.; Li, Z.; Chen, S.; Cai, X.-Q.; and Demachi, K. 2024. An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video. arXiv preprint arXiv:2404.06741

[41]

Sudhakaran, G.; Dhami, D. S.; Kersting, K.; and Roth, S. 2023. Vision relation transformer for unbiased scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21882--21893

[42]

Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961--3970

[43]

Wang, G.; Li, Z.; Chen, Q.; and Liu, Y. 2024a. OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 27938--27947

[44]

Wang, W.; Luo, Y.; Chen, Z.; Jiang, T.; Yang, Y.; and Xiao, J. 2023. Taking a closer look at visual relation: Unbiased video scene graph generation with decoupled label learning. IEEE Transactions on Multimedia

[45]

Wang, Y.; Yuan, S.; Jian, X.; Pang, W.; Wang, M.; and Yu, N. 2024b. HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models. arXiv preprint arXiv:2404.05083

[46]

Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.; and Bertinetto, L. 2021. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34: 726--738

[47]

Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.-M.; and Luu, A. T. 2023. Infoctm: A mutual information maximization perspective of cross-lingual topic modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13763--13771

[48]

Wu, X.; Dong, X.; Pan, L.; Nguyen, T.; and Luu, A. T. 2024. Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 3088--3105. Bangkok, Thailand and virtual meeting: Assoc...

[49]

Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; and Zhong, W. 2022. 3d instances as 1d kernels. In European Conference on Computer Vision, 235--252. Springer

[50]

Xiao, F.; Tighe, J.; and Modolo, D. 2021. Modist: Motion distillation for self-supervised video representation learning. arXiv preprint arXiv:2106.09703, 3

[51]

Yang, J.; Ang, Y. Z.; Guo, Z.; Zhou, K.; Zhang, W.; and Liu, Z. 2022. Panoptic scene graph generation. In European Conference on Computer Vision, 178--196. Springer

[52]

Yang, J.; Cen, J.; Peng, W.; Liu, S.; Hong, F.; Li, X.; Zhou, K.; Chen, Q.; and Liu, Z. 2024. 4d panoptic scene graph generation. Advances in Neural Information Processing Systems, 36

[53]

Yang, J.; Peng, W.; Li, X.; Guo, Z.; Chen, L.; Li, B.; Ma, Z.; Zhou, K.; Zhang, W.; Loy, C. C.; et al. 2023. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18675--18685

[54]

Zhao, C.; Shen, Y.; Chen, Z.; Ding, M.; and Gan, C. 2023. Textpsg: Panoptic scene graph generation from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2839--2850

[55]

Zhou, H.; Liu, Q.; and Wang, Y. 2023. Learning discriminative representations for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10608--10617

[56]

Zhou, L.; Zhou, Y.; Lam, T. L.; and Xu, Y. 2022. Context-aware mixture-of-experts for unbiased scene graph generation. arXiv preprint arXiv:2208.07109
