DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Cong-Duy Nguyen; Luu Anh Tuan; See-kiong Ng; Thong Nguyen; Xiaobao Wu; Xinshuai Dong

arXiv: 2312.02549 · v2 · submitted 2023-12-05 · cs.CV · cs.CL

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-kiong Ng, Luu Anh Tuan

keywords: language, grounding, moments, temporal, video, attention, average, demaformer

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
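
The abstract names an exponential moving average with a learnable damping factor but does not state the update rule. As a rough illustration only, the sketch below assumes the commonly used damped-EMA recurrence y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}. The function name `damped_ema`, the scalar parameters `alpha` and `delta`, and the zero initial state are assumptions made for this example and are not taken from the paper. In a trained model both parameters would be learned, typically per feature dimension.

```python
def damped_ema(xs, alpha, delta):
    """Damped exponential moving average over a 1-D sequence (illustrative).

    Assumed recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},
    starting from y_{-1} = 0.

    alpha: smoothing weight in (0, 1).
    delta: damping factor in (0, 1]; delta = 1 gives the standard EMA.
    Both are fixed floats here; a model would treat them as learnable.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")

    outputs = []
    prev = 0.0  # assumed zero initial state
    for x in xs:
        # Blend the new input with the previous state.
        prev = alpha * x + (1.0 - alpha * delta) * prev
        outputs.append(prev)
    return outputs


if __name__ == "__main__":
    # A single impulse followed by zeros shows how the state decays.
    print(damped_ema([1.0, 0.0, 0.0], alpha=0.5, delta=0.5))
```

With `delta` below 1 the previous state is weighted by more than `1 - alpha`, so earlier inputs fade more slowly than under a standard EMA with the same `alpha`.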

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
cs.CV 2024-12 unverdicted novelty 6.0

Motion-aware contrastive learning on mask tubes improves temporal panoptic scene graph generation over pooling-based methods on video and 4D datasets.

Multi-Scale Contrastive Learning for Video Temporal Grounding
cs.CV 2024-12 unverdicted novelty 6.0

A multi-scale and cross-scale contrastive learning framework uses intra-encoder stage features and a new sampling process to link short-range and long-range video moments for temporal grounding.