pith. sign in

arxiv: 2412.02930 · v6 · submitted 2024-12-04 · 💻 cs.CV

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

Pith reviewed 2026-05-23 08:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language modelstemporal reasoninglong videosBiLSTMvisual encodertemporal action segmentationdense video captioningIndustryASM dataset
0
0 comments X

The pith

TemporalVLM adds a timestamp-aware visual encoder with BiLSTM to let video LLMs reason about time in long videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TemporalVLM to handle temporal reasoning and fine-grained understanding in extended videos using video large language models. Its core contribution is a visual encoder that splits videos into short clips encoded jointly with timestamps, fuses features across overlapping windows to capture local time cues, and routes the results through a BiLSTM for global aggregation. Experiments demonstrate gains over prior methods on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. A new IndustryASM dataset of factory assembly videos with engineer-annotated timestamps and actions supports evaluation of these capabilities. The work positions itself as the first to incorporate LSTMs into video LLMs.

Core claim

TemporalVLM maps long videos to time-aware features by dividing input into short-term clips jointly encoded with timestamps, fusing across overlapping temporal windows into local features, and passing those through a BiLSTM module for global aggregation, which then supports improved LLM performance on temporal tasks.

What carries the argument

Visual encoder that divides videos into timestamped short-term clips, fuses overlapping windows for local time-sensitive features, and aggregates globally via BiLSTM.

If this is right

  • Video LLMs achieve higher accuracy on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
  • The IndustryASM dataset supplies long factory videos with precise action timestamps for training and benchmarking temporal segmentation.
  • Combining local timestamp fusion with BiLSTM aggregation supplies both short-range and long-range time cues to the language model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clip-plus-timestamp fusion pattern could be tested in other sequential modalities such as audio or sensor streams.
  • If the encoder's gains hold on non-assembly videos, it would suggest the method is not limited to industrial domains.
  • Replacing BiLSTM with a lighter recurrent unit or attention block while keeping the timestamp fusion could be checked for efficiency trade-offs.

Load-bearing premise

The BiLSTM step on the timestamp-fused local features will create global representations that measurably help the LLM on temporal reasoning tasks.

What would settle it

Ablation experiments on the four listed tasks showing no performance difference when the BiLSTM or the timestamp-overlap fusion is removed.

Figures

Figures reproduced from arXiv: 2412.02930 by Fawad Javed Fateh, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran, Umer Ahmed.

Figure 1
Figure 1. Figure 1: Previous video large language models (a-c) usually are not time-sensitive (a, b), consider an input video as a single clip (a, c), and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Given a long-term input video with timestamps, we divide it into short-term clips with timestamps. The time-aware clip encoder [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example videos with different camera viewpoints, actors, backgrounds, and activities from our IndustryASM dataset. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example dense video captioning results in the zero-shot [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example temporal action segmentation results in the su [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Examples of our instruction and target response for the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples highlighting the dense video captioning capabilities of TemporalVLM in the supervised setting. The model [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative examples demonstrating the video highlight detection capabilities of TemporalVLM in the supervised setting. The [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative examples showing the temporal video grounding capabilities of TemporalVLM in the supervised setting. A video and [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Examples of the generalization capabilities of Tempo [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
read the original abstract

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, i.e., dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To our best knowledge, our work is the first to incorporate LSTMs into video LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces TemporalVLM, a video LLM for temporal reasoning and fine-grained understanding in long videos. Its visual encoder divides input videos into short-term clips that are jointly encoded with timestamps, fused across overlapping temporal windows to produce time-sensitive local features, and then aggregated via a BiLSTM for global cues. The work also presents the IndustryASM dataset of factory-floor assembly videos with engineer-annotated actions and timestamps. The central claim is that TemporalVLM outperforms prior methods on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation, and that the work is the first to incorporate LSTMs into video LLMs.

Significance. If the reported outperformance is substantiated with proper baselines, ablations, and statistical tests, the architecture could offer a practical route to injecting explicit temporal modeling into video LLMs via recurrent aggregation of local time-aware features. The IndustryASM dataset would additionally supply a new, domain-specific benchmark for long-video temporal action segmentation in industrial settings.

major comments (1)
  1. Abstract: the claim that TemporalVLM 'outperforms previous methods across' the four listed tasks is asserted without any quantitative results, baselines, error bars, dataset statistics, or even a reference to the experimental section, so the central empirical claim cannot be evaluated from the supplied text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for better support of the central empirical claim in the abstract. We agree this is a substantive issue and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: the claim that TemporalVLM 'outperforms previous methods across' the four listed tasks is asserted without any quantitative results, baselines, error bars, dataset statistics, or even a reference to the experimental section, so the central empirical claim cannot be evaluated from the supplied text.

    Authors: We agree that the abstract's assertion of outperformance lacks supporting quantitative details or a pointer to the experiments, making the claim difficult to assess from the abstract alone. This is a fair observation. In the revised version, we will update the abstract to include key quantitative highlights (e.g., specific improvements on representative benchmarks) and add an explicit reference to Section 4 (Experiments), where full baselines, ablations, error bars, and dataset statistics are reported. This change will strengthen the abstract without exceeding typical length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on architecture and experiments

full rationale

The paper describes a visual encoder (clip division + timestamp joint encoding + overlapping-window fusion + BiLSTM aggregation) as a compositional design choice that produces time-sensitive features, then reports empirical outperformance on downstream tasks. No equations define any quantity in terms of the target result, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain. The 'first to incorporate LSTMs' statement is an external literature claim, not a definitional step. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard video LLM components plus the added BiLSTM module will suffice.

pith-pipeline@v0.9.0 · 5741 in / 1092 out tokens · 69001 ms · 2026-05-23T08:04:46.129462+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

  2. [2]

    Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation

    Nadine Behrmann, S Alireza Golestaneh, Zico Kolter, J¨urgen Gall, and Mehdi Noroozi. Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation. In European Conference on Com- puter Vision, pages 52–68. Springer, 2022. 1, 3

  3. [3]

    Revisiting the” video” in video-language understanding

    Shyamal Buch, Crist ´obal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the” video” in video-language understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022. 3

  4. [4]

    a person runs up some stairs

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- 13 1s 9s 12s 15s 19s 23s User: Locate and describe the visual content mentioned in the text query “a person runs up some stairs.” within the video, including timestamps. TemporalVLM (Ours): The given query happens in 0.0 - 10.0 seconds. User: Loca...

  5. [5]

    Instructblip: Towards general- purpose vision-language models with instruction tuning,

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

  6. [6]

    Long-term recurrent convolutional net- works for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3

  7. [7]

    Soda: Story oriented dense video captioning evaluation framework

    Soichiro Fujita, Tsutomu Hirao, Hidetaka Kamigaito, Man- abu Okumura, and Masaaki Nagata. Soda: Story oriented dense video captioning evaluation framework. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16 , pages 517–531. Springer, 2020. 6

  8. [8]

    multiple people are riding bikes across the field

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In 14 3s 20s 48s 60s 77s 81s User: You are given an egocentric video where a person performs a series of actions. Please watch the video and extract a minimum of 6 and a maximum of 10 significant steps. For each step, determine the starting and e...

  9. [9]

    Frameexit: Conditional early exiting for effi- cient video recognition

    Amir Ghodrati, Babak Ehteshami Bejnordi, and Amirhos- sein Habibian. Frameexit: Conditional early exiting for effi- cient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15608–15618, 2021. 3

  10. [10]

    Smart frame selection for action recognition

    Shreyank N Gowda, Marcus Rohrbach, and Laura Sevilla- Lara. Smart frame selection for action recognition. In Pro- ceedings of the AAAI Conference on Artificial Intelligence , pages 1451–1459, 2021. 3

  11. [11]

    Temporal alignment networks for long-term video

    Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022. 3

  12. [12]

    Jing Hu, Yiming Yang, Huan Wang, and Eric P. Xing. Lora: Low-rank adaptation of large language models. In Proceed- ings of the 2023 Conference on Neural Information Process- ing Systems (NeurIPS 2023), 2023. 5

  13. [13]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14271–14280, 2024. 2, 3

  14. [14]

    Lita: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024. 2, 3

  15. [15]

    Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachi- dambaresan. Pytorch. Programming with TensorFlow: solu- tion for edge computing applications , pages 87–104, 2021. 5

  16. [16]

    Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 2

  17. [17]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of naacL-HLT, page 2. Minneapolis, Minnesota, 2019. 2

  18. [18]

    Movinets: Mobile video networks for efficient video recog- nition

    Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recog- nition. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 16020–16030,

  19. [19]

    Ian goodfellow

    Deep Learning. Ian goodfellow. Yoshua Bengio, and Aaron Courville, 2016. 4

  20. [20]

    Detecting mo- ments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting mo- ments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems , 34: 11846–11858, 2021. 3, 5, 7

  21. [21]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 2

  22. [22]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2, 3, 5, 7

  23. [23]

    Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation

    Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation. IEEE transactions on pattern analysis and machine intelligence , 45(6):6647– 6658, 2020. 1, 3

  24. [24]

    Video- teller: Enhancing cross-modal generation with fusion and decoupling

    Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Video- teller: Enhancing cross-modal generation with fusion and decoupling. arXiv preprint arXiv:2310.04991, 2023. 2 15

  25. [25]

    Bt-adapter: Video conversation is fea- sible without video instruction tuning

    Ruyang Liu, Chen Li, Yixiao Ge, Thomas H Li, Ying Shan, and Ge Li. Bt-adapter: Video conversation is fea- sible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13658–13667, 2024. 12

  26. [26]

    Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion

    Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 18175–18185,

  27. [27]

    Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training

    Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu. Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23045–23055, 2023. 2, 3

  28. [28]

    Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

    Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 ,

  29. [29]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2, 3, 8, 11, 12

  30. [30]

    Query-dependent video representa- tion for moment retrieval and highlight detection

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representa- tion for moment retrieval and highlight detection. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23023–23033, 2023. 2, 3, 6, 9, 10

  31. [31]

    Chatgpt: Generative pre-trained transformer, 2024

    OpenAI. Chatgpt: Generative pre-trained transformer, 2024. 2

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 2

  33. [33]

    Momen- tor: Advancing video large language model with fine-grained temporal reasoning

    Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning. arXiv preprint arXiv:2402.11435, 2024. 2, 3

  34. [34]

    Improving language understanding by gener- ative pre-training

    Alec Radford. Improving language understanding by gener- ative pre-training. 2018. 2

  35. [35]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. 2

  36. [36]

    Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14313–14323, 2024. 2, 3, 5, 6, 7, 8, 9, 11

  37. [37]

    Temporal aggregate representations for long-range video understand- ing

    Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video understand- ing. In Computer Vision–ECCV 2020: 16th European Con- ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 154–171. Springer, 2020. 3

  38. [38]

    An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021

    Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021. 2

  39. [39]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

  40. [40]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3, 5

  41. [41]

    Alpaca: A strong, replicable instruction-following model

    Aakarsh Taori, Evan Chen, Sainbayar Sukhbaatar, et al. Alpaca: A strong, replicable instruction-following model. arXiv preprint arXiv:2303.11347, 2023. 2

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

  43. [43]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2, 5

  44. [44]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4, 8

  45. [45]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6

  46. [46]

    End-to-end dense video captioning with parallel decoding

    Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF international conference on computer vision , pages 6847– 6857, 2021. 1, 3

  47. [47]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. 2

  48. [48]

    Benchmarking generalization via in-context instructions on 1,600+ language tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Ar- jun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2, 2022. 2

  49. [49]

    Negative sample matters: A renaissance of met- ric learning for temporal grounding

    Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, and Gang- shan Wu. Negative sample matters: A renaissance of met- ric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2613– 2623, 2022. 3, 6 16

  50. [50]

    Longvlm: Efficient long video un- derstanding via large language models

    Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video un- derstanding via large language models. arXiv preprint arXiv:2404.03384, 2024. 3, 4, 5, 6, 7, 8, 9, 11, 12

  51. [51]

    Towards long-form video understanding

    Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form video understanding. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1884–1894, 2021. 3

  52. [52]

    Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

    Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022. 3

  53. [53]

    Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning

    Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, An- toine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 1, 2, 3, 9, 10

  54. [54]

    Asformer: Transformer for action segmentation

    Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. arXiv preprint, 2021. 1, 3

  55. [55]

    Merlot reserve: Neural script knowledge through vision and language and sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yan- peng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16375–16387, 2022. 5, 11

  56. [56]

    Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023. 8

  57. [57]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. arXiv preprint arXiv:2306.02858, 2023. 2, 3, 5, 7, 8, 12

  58. [58]

    Streaming video model

    Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, and Zheng-Jun Zha. Streaming video model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 14602–14612, 2023. 3

  59. [59]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 5, 7, 8

  60. [60]

    End-to-end dense video captioning as sequence generation

    Wanrong Zhu, Bo Pang, Ashish V Thapliyal, William Yang Wang, and Radu Soricut. End-to-end dense video captioning as sequence generation. arXiv preprint arXiv:2204.08121 ,