DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

2023 · cs.CV · arXiv 2312.02549

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
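The abstract names an exponential moving average with a learnable damping factor but does not give its formula. As a rough illustration only, the sketch below uses one common damped-EMA recurrence, h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}, which is an assumption here and may differ from the paper's exact formulation. The function name `damped_ema` and the scalar parameters `alpha` (smoothing weight) and `delta` (damping factor) are hypothetical; in a trained model they would be learned, typically per feature dimension, rather than fixed by hand.

```python
from typing import List, Sequence


def damped_ema(xs: Sequence[float], alpha: float, delta: float) -> List[float]:
    """Apply a damped EMA over a 1-D sequence (illustrative, not the paper's code).

    Assumed recurrence: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}, h_0 = 0.
    With delta = 1 this reduces to the standard EMA; a smaller delta lets
    past states decay more slowly.
    """
    if not (0.0 < alpha < 1.0):
        raise ValueError("alpha must lie in (0, 1)")
    if not (0.0 < delta <= 1.0):
        raise ValueError("delta must lie in (0, 1]")

    h = 0.0  # initial hidden state (assumed to be zero)
    out: List[float] = []
    for x in xs:
        # Blend the new input with the damped previous state.
        h = alpha * x + (1.0 - alpha * delta) * h
        out.append(h)
    return out


if __name__ == "__main__":
    # A unit impulse followed by zeros shows how quickly the state decays.
    print(damped_ema([1.0, 0.0, 0.0], alpha=0.5, delta=0.5))
    print(damped_ema([1.0, 0.0, 0.0], alpha=0.5, delta=1.0))
```

Feeding a unit impulse through the function with two different `delta` values shows the effect of damping: the smaller `delta` retains more of the earlier input at later steps.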

representative citing papers

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

cs.CV · 2024-12-10 · unverdicted · novelty 6.0

Motion-aware contrastive learning on mask tubes improves temporal panoptic scene graph generation over pooling-based methods on video and 4D datasets.

Multi-Scale Contrastive Learning for Video Temporal Grounding

cs.CV · 2024-12-10 · unverdicted · novelty 6.0

A multi-scale and cross-scale contrastive learning framework uses intra-encoder stage features and a new sampling process to link short-range and long-range video moments for temporal grounding.

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

cs.CV · 2023-12-12 · unverdicted · novelty 6.0

READ recurrent adapters with partial video-language alignment via optimal transport outperform standard fine-tuning on low-resource temporal grounding and summarization tasks.
