Recognition: no theorem link
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
Pith reviewed 2026-05-13 19:40 UTC · model grok-4.3
The pith
Fully end-to-end training of video backbones and localization heads, guided by a sentence-conditioned adapter, outperforms frozen-backbone methods on temporal sentence grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a fully end-to-end training paradigm, in which the video backbone and localization head are optimized jointly, combined with a Sentence Conditioned Adapter that uses sentence embeddings to modulate feature maps, resolves the task discrepancy between classification pre-training and temporal sentence grounding. On this basis, the method delivers higher localization performance than prior state-of-the-art approaches that keep the backbone frozen.
What carries the argument
Sentence Conditioned Adapter (SCADA), a lightweight module that conditions a small subset of video-backbone parameters on linguistic embeddings to adaptively modulate feature maps during joint optimization.
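The review text does not reproduce SCADA's formulation, so the following is only a minimal sketch of the general idea, assuming a FiLM-style design in which a small bottleneck maps the sentence embedding to a per-channel scale and shift applied to backbone feature maps. The class name, dimensions, and initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SentenceConditionedAdapter(nn.Module):
    """Illustrative SCADA-style block: a small bottleneck maps the sentence
    embedding to per-channel scale and shift that modulate a video feature map.
    Only these adapter weights would be trained; the backbone layer it wraps
    can stay frozen or be tuned lightly."""

    def __init__(self, feat_dim: int, sent_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(sent_dim, bottleneck)
        self.up = nn.Linear(bottleneck, 2 * feat_dim)  # predicts (gamma, beta)
        nn.init.zeros_(self.up.weight)                  # start as identity modulation
        nn.init.zeros_(self.up.bias)

    def forward(self, video_feats: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, C) clip features; sent_emb: (B, D) sentence embedding
        gamma, beta = self.up(torch.relu(self.down(sent_emb))).chunk(2, dim=-1)
        return video_feats * (1.0 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Toy usage with random tensors standing in for backbone and text features.
adapter = SentenceConditionedAdapter(feat_dim=768, sent_dim=512)
out = adapter(torch.randn(2, 16, 768), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 16, 768])
```

Zero-initializing the output projection makes the adapter start as an identity modulation, a common choice that keeps early training close to frozen-backbone behavior; whether the paper does the same is not stated here.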
If this is right
- End-to-end training improves results over frozen baselines at multiple model scales.
- SCADA permits deeper backbones to be trained with lower peak memory than full fine-tuning.
- Visual features become more task-aligned for sentence-guided localization through direct linguistic modulation.
- The overall pipeline exceeds previously published state-of-the-art numbers on the two evaluated datasets.
Where Pith is reading between the lines
- Similar lightweight conditioning adapters could be tested on other video-language tasks that currently rely on frozen encoders.
- The memory savings reported for deeper networks suggest the method may scale to larger backbones where full fine-tuning is currently prohibitive.
- If SCADA-style modulation proves stable across datasets, it could reduce reliance on massive classification pre-training corpora for grounding-specific models.
Load-bearing premise
End-to-end optimization together with a small sentence-conditioned adapter can close the classification-to-grounding gap without causing overfitting or exceeding practical memory limits.
What would settle it
Training the same deeper backbone end-to-end on the two reported benchmarks, either without SCADA or with a randomly initialized adapter, yields no accuracy gain, or a clear drop, relative to the frozen baseline while memory usage remains comparable.
read the original abstract
Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a fully end-to-end training paradigm for temporal sentence grounding in videos (TSGV) that jointly optimizes the video backbone and localization head, in contrast to prior work that freezes pre-trained query-agnostic visual encoders. It introduces the Sentence Conditioned Adapter (SCADA) to adaptively modulate a small subset of backbone parameters using sentence features, enabling deeper networks at lower memory cost while addressing the classification-to-grounding task discrepancy. An empirical study across model scales and experiments on two benchmarks are reported to show gains over state-of-the-art methods.
Significance. If the empirical gains hold under rigorous controls, the work would mark a meaningful shift in TSGV by demonstrating that end-to-end optimization with a lightweight sentence-conditioned adapter can close the task gap without prohibitive memory costs, potentially improving visual representations for video-language localization tasks and encouraging similar joint training in related domains.
major comments (2)
- [Empirical study and experiments section] The empirical study validating end-to-end learning (mentioned in the abstract and presumably detailed in the experiments section) reports benchmark gains but provides no information on statistical significance testing, variance across runs, or full ablation controls isolating the contribution of joint backbone optimization versus the SCADA adapter alone; this weakens support for the central claim that the paradigm shift is effective across model scales.
- [SCADA introduction and method] The description of SCADA (abstract and method) claims it modulates feature maps through precise integration of linguistic embeddings while training only a small portion of parameters, yet no explicit formulation, pseudocode, or memory analysis is referenced to confirm it avoids overfitting or enables the stated deeper backbones; this is load-bearing for the weakest assumption identified in the review.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the empirical validation and methodological clarity of our work on fully end-to-end training for temporal sentence grounding. We address each major comment below and will incorporate revisions to provide additional statistical details, expanded ablations, and explicit SCADA documentation.
read point-by-point responses
- Referee: The empirical study validating end-to-end learning (mentioned in the abstract and presumably detailed in the experiments section) reports benchmark gains but provides no information on statistical significance testing, variance across runs, or full ablation controls isolating the contribution of joint backbone optimization versus the SCADA adapter alone; this weakens support for the central claim that the paradigm shift is effective across model scales.
Authors: We agree that reporting variance, statistical significance, and more targeted ablations would provide stronger support for the claims. In the revised manuscript, we will add results from multiple independent runs (reporting mean and standard deviation), paired t-tests for significance on key gains, and expanded ablation tables that separately quantify the contribution of joint backbone optimization versus the SCADA adapter across model scales. The existing study already demonstrates consistent improvements over frozen baselines, but these additions will address the noted gaps directly. revision: yes
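As a rough illustration of the promised run-level statistics, the sketch below computes mean, standard deviation, and a paired t-test over per-seed scores. The metric name and all numbers are placeholders, not results from the paper.

```python
# Sketch of run-level statistics, assuming R1@0.5 scores from the same random
# seeds under the frozen-backbone and end-to-end conditions.
import numpy as np
from scipy import stats

frozen     = np.array([44.1, 44.6, 43.9, 44.3, 44.5])  # per-seed scores, frozen backbone (placeholder)
end_to_end = np.array([46.8, 47.2, 46.5, 47.0, 46.9])  # per-seed scores, end-to-end + adapter (placeholder)

for name, runs in (("frozen", frozen), ("end-to-end", end_to_end)):
    print(f"{name}: mean={runs.mean():.2f} std={runs.std(ddof=1):.2f}")

t_stat, p_value = stats.ttest_rel(end_to_end, frozen)  # paired t-test over seeds
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```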
- Referee: The description of SCADA (abstract and method) claims it modulates feature maps through precise integration of linguistic embeddings while training only a small portion of parameters, yet no explicit formulation, pseudocode, or memory analysis is referenced to confirm it avoids overfitting or enables the stated deeper backbones; this is load-bearing for the weakest assumption identified in the review.
Authors: Section 3.2 of the manuscript provides the mathematical formulation for SCADA's sentence-conditioned modulation of feature maps. To further clarify, the revision will include pseudocode for the adapter integration, a quantitative memory analysis (comparing trainable parameters and peak GPU usage against full fine-tuning), and explicit discussion of how the low parameter count reduces overfitting risk while supporting deeper backbones. These additions will make the implementation details self-contained. revision: yes
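A minimal sketch of how the promised parameter and memory comparison could be produced, assuming adapter modules can be identified by an "adapter" substring in their parameter names; that convention is chosen here for illustration and is not taken from the paper.

```python
# Freeze everything except adapter parameters (or unfreeze all for full
# fine-tuning), then compare trainable-parameter counts and peak GPU memory.
import torch

def set_trainable(model: torch.nn.Module, full_finetune: bool) -> int:
    """Set requires_grad per parameter and return the trainable-parameter count."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = full_finetune or ("adapter" in name)
        if param.requires_grad:
            trainable += param.numel()
    return trainable

def peak_memory_gb(device: str = "cuda") -> float:
    """Peak allocated GPU memory in GiB; read after a forward/backward step."""
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3

# Toy comparison on a dummy backbone with one adapter layer attached.
dummy = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.Linear(768, 2))
dummy.add_module("adapter_proj", torch.nn.Linear(768, 64))
print(set_trainable(dummy, full_finetune=False))  # adapter parameters only
print(set_trainable(dummy, full_finetune=True))   # every parameter
```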
Circularity Check
No significant circularity
full rationale
The paper advances a fully end-to-end training paradigm for temporal sentence grounding, backed by an empirical study of end-to-end vs. frozen backbones across scales and by benchmark results showing gains from the SCADA adapter. No derivation chain, equations, or first-principles claims appear that reduce to fitted inputs or self-citations by construction. The method is presented as a practical design choice whose value is measured externally via experiments, not internally forced by its own definitions or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pre-trained visual encoders are query-agnostic and induce a task discrepancy when used frozen for grounding.
invented entities (1)
- Sentence Conditioned Adapter (SCADA): no independent evidence.
discussion (0)