DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Cong-Duy Nguyen; Luu Anh Tuan; See-kiong Ng; Thong Nguyen; Xiaobao Wu; Xinshuai Dong

arXiv: 2312.02549 · v2 · submitted 2023-12-05 · cs.CV · cs.CL

Pith reviewed 2026-05-24 04:49 UTC · model grok-4.3

keywords: temporal language grounding, energy-based model, transformer, exponential moving average, video moment localization, damped EMA, multimodal attention, distribution modeling

The pith

An energy-based model and a damped exponential moving average Transformer improve the separation of target video moments from non-target ones for a given text query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Temporal language grounding requires identifying the exact video segment that matches a natural language description. Standard attention often produces flat distributions in which the correct moment blends with incorrect ones. The paper introduces an energy-based modeling framework that learns the joint distribution of moments and queries explicitly. It pairs this with DemaFormer, a transformer variant that applies exponential moving average with a trainable damping factor to encode the inputs more effectively. Experiments across four public datasets indicate the combination yields higher localization accuracy than prior attention-based approaches.

Core claim

The paper makes two claims. First, naive attention can produce ineffective moment-query distributions in which target moments are difficult to separate from the rest. Second, an energy-based model framework, together with the DemaFormer architecture (exponential moving average with a learnable damping factor), addresses this separation problem and improves grounding performance.

What carries the argument

DemaFormer, a transformer that encodes moment-query inputs via exponential moving average with a learnable damping factor, paired with an energy-based model that explicitly represents moment-query distributions.
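To make the damping idea concrete, here is a minimal NumPy sketch of a damped exponential moving average over a feature sequence. It assumes the MEGA-style recurrence h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1} (Ma et al., 2022); whether DemaFormer uses exactly this form is not stated in this summary, and the names `damped_ema`, `raw_alpha`, and `raw_delta` are hypothetical.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped EMA along the time axis (assumed MEGA-style recurrence).

    x:     (T, d) input sequence
    alpha: (d,) smoothing weights, expected in (0, 1)
    delta: (d,) damping factors, expected in (0, 1)
    """
    x = np.asarray(x, dtype=float)
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    decay = 1.0 - alpha * delta  # fraction of the previous state that is kept
    for t in range(x.shape[0]):
        h = alpha * x[t] + decay * h
        out[t] = h
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unconstrained parameters that a training loop would update;
# the sigmoid keeps alpha and delta inside (0, 1).
raw_alpha = np.array([0.0, 1.0])
raw_delta = np.array([0.0, -1.0])
alpha, delta = sigmoid(raw_alpha), sigmoid(raw_delta)

x = np.ones((5, 2))
h = damped_ema(x, alpha, delta)  # shape (5, 2)
```

With `delta = 1` the recurrence reduces to a standard EMA; with `delta = 0` nothing is discarded and the state simply accumulates `alpha * x_t`. A learnable `delta` lets training pick a point between those two extremes for each feature dimension.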

If this is right

Target moments become more separable in the learned distributions than under standard attention.
The approach reports superior performance over state-of-the-art baselines on four public temporal language grounding datasets.
Attention can be replaced or augmented by energy-based modeling to capture moment-query relations more explicitly.
The learnable damping factor adapts the encoding of temporal and textual features during training.
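One way to quantify "separable" is sketched below: compare a nearly flat distribution over candidate moments with a peaked one, using entropy and the margin between the target and its strongest competitor. The scores are made up for illustration, and `entropy` and `target_margin` are hypothetical helpers, not measures taken from the paper.

```python
import numpy as np

def softmax(scores):
    z = np.asarray(scores, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def target_margin(p, target_idx):
    """Target probability minus the largest non-target probability."""
    p = np.asarray(p, dtype=float)
    others = np.delete(p, target_idx)
    return float(p[target_idx] - others.max())

# Illustrative scores over four candidate moments; index 1 is the target.
flat = softmax([0.1, 0.2, 0.15, 0.1])
sharp = softmax([0.1, 4.0, 0.15, 0.1])

print(entropy(flat), target_margin(flat, 1))
print(entropy(sharp), target_margin(sharp, 1))
```

The flat case has entropy close to the uniform limit of log(4) and a small margin, which is the failure mode the paper attributes to naive attention; the peaked case has low entropy and a large margin.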

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The damping mechanism might transfer to other sequence modeling tasks where gradual incorporation of context is beneficial.
Energy-based modeling could be applied to related video-text tasks such as moment retrieval or video question answering.
The separation of target moments might allow downstream modules to operate on sharper probability maps.
Analysis of the learned damping values across datasets could reveal dataset-specific temporal dynamics.

Load-bearing premise

The assumption that the energy-based model plus damped exponential moving average will produce distributions in which target moments stand out clearly from non-target moments.
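The Figure 4 caption refers to Langevin sampling steps, a common way to draw samples from an energy-based model. The sketch below runs unadjusted Langevin dynamics on a toy quadratic energy. The energy function, step size, and chain count are illustrative assumptions; how the paper parameterizes its energy and what it samples are not stated in this summary.

```python
import numpy as np

def langevin_sample(grad_energy, x0, n_steps, step_size, rng):
    """Unadjusted Langevin dynamics:
    x <- x - (step_size / 2) * grad E(x) + sqrt(step_size) * noise
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x

# Toy energy (assumption): E(x) = 0.5 * ||x - mu||^2, whose density
# exp(-E) is a unit-variance Gaussian centred at mu.
mu = np.array([2.0, -1.0])

def grad_quadratic_energy(x):
    return x - mu  # gradient of the toy energy

rng = np.random.default_rng(0)
x0 = np.zeros((2000, 2))  # 2000 independent chains, all started at the origin
samples = langevin_sample(grad_quadratic_energy, x0,
                          n_steps=200, step_size=0.1, rng=rng)
print(samples.mean(axis=0))  # should land near mu
```

More steps bring the chains closer to the target distribution at a higher compute cost, which is presumably the trade-off the paper's Figure 4 examines.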

What would settle it

If the proposed method is evaluated on the same four datasets using standard recall and intersection-over-union metrics and shows no consistent gains over the strongest attention baselines, the central claim would be falsified.
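For reference, the recall and intersection-over-union metrics mentioned here are typically computed as follows. This is a generic sketch of temporal IoU and Recall@1 at an IoU threshold, not the evaluation code used by the paper; the function names and the example spans are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end) in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, iou_threshold):
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Made-up top-1 predictions and ground-truth spans for three queries.
preds = [(0.0, 10.0), (2.0, 8.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]

r_at_05 = recall_at_1(preds, gts, 0.5)  # IoUs are 1.0, 0.6, 0.0 -> 2 of 3 hit
r_at_07 = recall_at_1(preds, gts, 0.7)  # only the first query hits -> 1 of 3
```

A stricter threshold can only lower recall, so a consistent gain over baselines should hold across thresholds rather than at a single one.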

Figures

Figures reproduced from arXiv: 2312.02549 by Cong-Duy Nguyen, Luu Anh Tuan, See-kiong Ng, Thong Nguyen, Xiaobao Wu, Xinshuai Dong.

**Figure 2.** A TLG example. To produce the output, we …

**Figure 3.** Illustration of the proposed DemaFormer. Our architecture comprises an encoder of …

**Figure 4.** Effect of the number of Langevin sampling …

**Figure 5.** Qualitative visualization of DemaFormer model. Green arrow line denotes the predicted localization and green normal line the predicted salience scores. Red arrow line denotes the groundtruth localization and red normal line the annotated salience scores. … localizes target moments with respect to the user query. Our predicted salience scores also align with the groundtruth scores, which are measured by ave…

**Figure 6.** Prediction example 1 with the t-SNE visualizations of the DemaFormer model and the UMT model. Green …

**Figure 7.** Prediction example 2 with the t-SNE visualizations of the DemaFormer model and the UMT model. Green …

**Figure 8.** Prediction example 3 with the t-SNE visualizations of the DemaFormer model and the UMT model. Green …

Original abstract

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs energy-based modeling with a damped-EMA Transformer for temporal language grounding and reports gains over baselines on four datasets.

The main takeaway is that this work takes the common attention setup for temporal language grounding, adds an energy-based model to shape the moment-query distributions, and inserts a Transformer encoder that uses exponential moving average with a learnable damping factor. The authors test the combination on four standard datasets and claim it beats prior methods. The specific pairing is the new piece; both energy-based modeling and EMA exist elsewhere, but the damped version tuned for this task is the contribution here.

The experiments give some evidence that the approach improves separation of target moments, which matches the stated motivation about naive attention producing flat distributions. The paper does a reasonable job laying out the architecture and running the comparisons.

The soft spots are modest. The gains are presented as superiority, but without seeing the exact size of the improvements or the variance across runs it is hard to judge how reliable the edge is. The learnable damping factor adds a free parameter, so the results need to show it is not just extra capacity. Ablations would help isolate whether the energy-based framing or the EMA change drives most of the lift. The citation pattern is standard and covers the relevant prior attention work.

This paper is aimed at researchers working on video-language retrieval and grounding. Someone already running experiments in that sub-area would get practical value from the architecture choices and the dataset results. It is solid enough on its own terms to deserve a serious referee who can check the implementation details and the statistical significance of the reported improvements.

Referee Report

0 major / 1 minor

Summary. The manuscript proposes an energy-based model framework to explicitly learn moment-query distributions for temporal language grounding, along with DemaFormer, a Transformer architecture that applies exponential moving average with a learnable damping factor to encode moment-query inputs. It argues that naive attention can yield ineffective distributions in which target moments are difficult to separate from the rest, and it claims superiority over state-of-the-art baselines on four public datasets.

Significance. If the experimental results hold, the work offers a concrete architectural response to a stated limitation of standard attention in video-language tasks by introducing damped EMA encoding and energy-based distribution modeling. The explicit separation of target moments via the proposed framework is a potentially useful direction for the field.

minor comments (1)

[Abstract] The superiority claim is stated without any numerical results, dataset names, or metric values; adding one sentence summarizing the gains would improve readability while remaining within abstract length limits.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. The report raises no major comments, so there is nothing to address point by point at this stage. We will incorporate the minor suggestion during revision.

Circularity Check

0 steps flagged

No significant circularity

The manuscript proposes an energy-based modeling framework and DemaFormer architecture (with learnable damping in EMA) to improve moment-query distribution separation over naive attention. No equations, derivations, or self-citations are shown that reduce any central claim to a fitted parameter renamed as prediction, a self-definitional loop, or an imported uniqueness result. Experimental superiority on four datasets supplies the validation, leaving the architecture description self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The learnable damping factor is a free parameter introduced to control the EMA; no other free parameters, axioms, or invented entities are identifiable from the abstract.

free parameters (1)

learnable damping factor
Described as a learnable component of the DemaFormer architecture that must be optimized during training.

pith-pipeline@v0.9.0 · 5667 in / 992 out tokens · 19019 ms · 2026-05-24T04:49:42.315183+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
cs.CV · 2024-12 · unverdicted · novelty 6.0

Motion-aware contrastive learning on mask tubes improves temporal panoptic scene graph generation over pooling-based methods on video and 4D datasets.

Multi-Scale Contrastive Learning for Video Temporal Grounding
cs.CV · 2024-12 · unverdicted · novelty 6.0

A multi-scale and cross-scale contrastive learning framework uses intra-encoder stage features and a new sampling process to link short-range and long-range video moments for temporal grounding.

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
cs.CV · 2023-12 · unverdicted · novelty 6.0

READ recurrent adapters with partial video-language alignment via optimal transport outperform standard fine-tuning on low-resource temporal grounding and summarization tasks.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 3 Pith papers · 11 internal anchors

[1] [1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803--5812

work page 2017

[4] [4]

Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, and Li Cheng. 2021. Joint visual and audio learning for video highlight detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8127--8137

work page 2021

[5] [5]

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299--6308

work page 2017

[6] [6]

Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased contrastive learning. Advances in neural information processing systems, 33:8765--8775

work page 2020

[7] [7]

Yilun Du and Igor Mordatch. 2019. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689

work page arXiv 2019

[8] [8]

Shiv Ram Dubey, Satish Kumar Singh, and Bidyut Baran Chaudhuri. 2022. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing

work page 2022

[9] [9]

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3--11

work page 2018

[10] [10]

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202--6211

work page 2019

[11] [11]

Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, and Bernard Ghanem. 2021 a . Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717

work page arXiv 2021

[12] [12]

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267--5275

work page 2017

[13] [13]

Kaifeng Gao, Long Chen, Yifeng Huang, and Jun Xiao. 2021 b . Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4833--4837

work page 2021

[14] [14]

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776--780. IEEE

work page 2017

[15] [15]

Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander Hauptmann. 2019. Excl: Extractive clip localization using natural language descriptions. arXiv preprint arXiv:1904.02755

work page internal anchor Pith review Pith/arXiv arXiv 2019

[16] [16]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2018. Localizing moments in video with temporal language. arXiv preprint arXiv:1809.01337

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [17]

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

work page internal anchor Pith review Pith/arXiv arXiv 2016

[18] [18]

Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, and Wei-Shi Zheng. 2020. Mini-net: Multiple instance ranking network for video highlight detection. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XIII 16, pages 345--360. Springer

work page 2020

[19] [19]

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114

work page internal anchor Pith review Pith/arXiv arXiv 2013

[21] [21]

Jie Lei, Tamara L Berg, and Mohit Bansal. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846--11858

work page 2021

[22] [22]

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. 2018. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXI 16, pages 447--463. Springer

work page 2020

[24] [24]

Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. 2022. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3042--3051

work page 2022

[25] [25]

Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 2022. Mega: moving average equipped gated attention. arXiv preprint arXiv:2209.10655

work page arXiv 2022

[26] [26]

Thong Nguyen and Anh Tuan Luu. 2021. Contrastive learning for neural topic model. Advances in neural information processing systems, 34:11974--11986

work page 2021

[27] [27]

Thong Nguyen, Cong-Duy Nguyen, Xiaobao Wu, See-Kiong Ng, and Anh Tuan Luu. 2022 a . Vision-and-language pretraining. arXiv preprint arXiv:2207.01772

work page arXiv 2022

[28] [28]

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Anh Tuan Luu, Cong-Duy Nguyen, Zhen Hai, and Lidong Bing. 2023. Gradient-boosted decision tree for listwise context model in multimodal review helpfulness prediction. arXiv preprint arXiv:2305.12678

[29]

Thong Nguyen, Xiaobao Wu, Anh-Tuan Luu, Cong-Duy Nguyen, Zhen Hai, and Lidong Bing. 2022b. Adaptive contrastive learning on multimodal transformer for review helpfulness predictions. arXiv preprint arXiv:2211.03524

[30]

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

[31]

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532--1543

[32]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748--8763. PMLR

[33]

Prajit Ramachandran, Barret Zoph, and Quoc V Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941

[34]

Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In International conference on machine learning, pages 1530--1538. PMLR

[35]

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

[36]

Min Sun, Ali Farhadi, and Steve Seitz. 2014. Ranking domain-specific highlights by analyzing edited videos. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 787--802. Springer

[37]

Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, and Vijay Chandrasekhar. 2019a. Holistic multi-modal memory network for movie question answering. IEEE Transactions on Image Processing, 29:489--499

[38]

Weining Wang, Yan Huang, and Liang Wang. 2019b. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 334--343

[39]

Jie Wei, Guanyu Hu, Luu Anh Tuan, Xinyu Yang, and Wenjing Zhu. 2023. Multi-scale receptive field graph model for emotion recognition in conversations. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

[40]

Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, and Yizhuo Dong. 2022. Audio-visual domain adaptation feature fusion for speech emotion recognition. In INTERSPEECH, pages 1988--1992

[41]

Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, and Yizhuo Dong. 2024. Learning facial expression and body gesture visual information for video emotion recognition. Expert Systems with Applications, 237:121419

[42]

Aming Wu and Yahong Han. 2018. Multi-modal circulant fusion for video-to-language and backward. In IJCAI, volume 3, page 8

[43]

Shaoning Xiao, Long Chen, Jian Shao, Yueting Zhuang, and Jun Xiao. 2021. Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678

[44]

Bo Xiong, Yannis Kalantidis, Deepti Ghadiyaram, and Kristen Grauman. 2019. Less is more: Learning highlight detection from video duration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1258--1267

[45]

Minghao Xu, Hang Wang, Bingbing Ni, Riheng Zhu, Zhenbang Sun, and Changhu Wang. 2021. Cross-category video highlight detection via set-based learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7970--7979

[46]

Yifang Xu, Yunzhuo Sun, Yang Li, Yilei Shi, Xiaoxiang Zhu, and Sidan Du. 2023. Mh-detr: Video moment and highlight detection with cross-modal transformer. arXiv preprint arXiv:2305.00355

[47]

Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, and Guang Yang. 2021. Temporal cue guided video highlight detection with low-rank audio-visual fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7950--7959

[48]

Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, and Yueting Zhuang. 2017. Video question answering via attribute-augmented attention network learning. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 829--832

[49]

Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. 2019. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Advances in Neural Information Processing Systems, 32

[50]

Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. 2020. Dense regression network for video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10287--10296

[51]

Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. 2019. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1247--1257

[52]

Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020a. Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931

[53]

Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. 2020b. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12870--12877
