TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

Fawad Javed Fateh; Hamza Khan; M. Zeeshan Zia; Quoc-Huy Tran; Umer Ahmed

arxiv: 2412.02930 · v6 · submitted 2024-12-04 · 💻 cs.CV

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

Fawad Javed Fateh , Umer Ahmed , Hamza Khan , M. Zeeshan Zia , Quoc-Huy Tran This is my paper

Pith reviewed 2026-05-23 08:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords video large language modelstemporal reasoninglong videosBiLSTMvisual encodertemporal action segmentationdense video captioningIndustryASM dataset

0 comments

The pith

TemporalVLM adds a timestamp-aware visual encoder with BiLSTM to let video LLMs reason about time in long videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TemporalVLM to handle temporal reasoning and fine-grained understanding in extended videos using video large language models. Its core contribution is a visual encoder that splits videos into short clips encoded jointly with timestamps, fuses features across overlapping windows to capture local time cues, and routes the results through a BiLSTM for global aggregation. Experiments demonstrate gains over prior methods on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. A new IndustryASM dataset of factory assembly videos with engineer-annotated timestamps and actions supports evaluation of these capabilities. The work positions itself as the first to incorporate LSTMs into video LLMs.

Core claim

TemporalVLM maps long videos to time-aware features by dividing input into short-term clips jointly encoded with timestamps, fusing across overlapping temporal windows into local features, and passing those through a BiLSTM module for global aggregation, which then supports improved LLM performance on temporal tasks.

What carries the argument

Visual encoder that divides videos into timestamped short-term clips, fuses overlapping windows for local time-sensitive features, and aggregates globally via BiLSTM.

If this is right

Video LLMs achieve higher accuracy on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
The IndustryASM dataset supplies long factory videos with precise action timestamps for training and benchmarking temporal segmentation.
Combining local timestamp fusion with BiLSTM aggregation supplies both short-range and long-range time cues to the language model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clip-plus-timestamp fusion pattern could be tested in other sequential modalities such as audio or sensor streams.
If the encoder's gains hold on non-assembly videos, it would suggest the method is not limited to industrial domains.
Replacing BiLSTM with a lighter recurrent unit or attention block while keeping the timestamp fusion could be checked for efficiency trade-offs.

Load-bearing premise

The BiLSTM step on the timestamp-fused local features will create global representations that measurably help the LLM on temporal reasoning tasks.

What would settle it

Ablation experiments on the four listed tasks showing no performance difference when the BiLSTM or the timestamp-overlap fusion is removed.

Figures

Figures reproduced from arXiv: 2412.02930 by Fawad Javed Fateh, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran, Umer Ahmed.

**Figure 1.** Figure 1: Previous video large language models (a-c) usually are not time-sensitive (a, b), consider an input video as a single clip (a, c), and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Given a long-term input video with timestamps, we divide it into short-term clips with timestamps. The time-aware clip encoder [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Example videos with different camera viewpoints, actors, backgrounds, and activities from our IndustryASM dataset. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Example dense video captioning results in the zero-shot [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Example temporal action segmentation results in the su [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Examples of our instruction and target response for the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative examples highlighting the dense video captioning capabilities of TemporalVLM in the supervised setting. The model [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative examples demonstrating the video highlight detection capabilities of TemporalVLM in the supervised setting. The [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative examples showing the temporal video grounding capabilities of TemporalVLM in the supervised setting. A video and [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Examples of the generalization capabilities of Tempo [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

read the original abstract

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, i.e., dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To our best knowledge, our work is the first to incorporate LSTMs into video LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TemporalVLM adds timestamped clips plus BiLSTM aggregation to a video LLM and ships a new industrial assembly dataset, but the abstract gives no numbers so the outperformance claims stay uncheckable from the text.

read the letter

The core new pieces are the visual encoder that tags short clips with timestamps, fuses overlapping windows, and runs BiLSTM for global aggregation, plus the IndustryASM dataset of factory-floor videos annotated by engineers for time-motion studies. The dataset targets a real use case that most academic video benchmarks ignore, and the encoder description is straightforward enough to implement and test. The paper also states it is the first to put LSTMs inside a video LLM, which is a concrete claim that referees can check against the cited prior work. Those two elements are what a reader would actually take away. The main weakness is that the abstract asserts gains on dense captioning, temporal grounding, highlight detection, and action segmentation without any numbers, baselines, or dataset sizes. If the full experiments section still omits error bars or clear comparisons, the central performance story cannot be evaluated. The architecture itself is logically consistent with the stated goal of producing time-sensitive features, so there is no hidden circularity. This work is aimed at groups already building or fine-tuning video LLMs for long videos and at anyone who needs annotated industrial footage. A reader who wants to try the BiLSTM trick or reuse the dataset will find something usable even if the headline results need more scrutiny. I would send it to peer review because the dataset and the encoder design are concrete additions that can be stress-tested, even though the current evidence level is low.

Referee Report

1 major / 0 minor

Summary. The paper introduces TemporalVLM, a video LLM for temporal reasoning and fine-grained understanding in long videos. Its visual encoder divides input videos into short-term clips that are jointly encoded with timestamps, fused across overlapping temporal windows to produce time-sensitive local features, and then aggregated via a BiLSTM for global cues. The work also presents the IndustryASM dataset of factory-floor assembly videos with engineer-annotated actions and timestamps. The central claim is that TemporalVLM outperforms prior methods on dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation, and that the work is the first to incorporate LSTMs into video LLMs.

Significance. If the reported outperformance is substantiated with proper baselines, ablations, and statistical tests, the architecture could offer a practical route to injecting explicit temporal modeling into video LLMs via recurrent aggregation of local time-aware features. The IndustryASM dataset would additionally supply a new, domain-specific benchmark for long-video temporal action segmentation in industrial settings.

major comments (1)

Abstract: the claim that TemporalVLM 'outperforms previous methods across' the four listed tasks is asserted without any quantitative results, baselines, error bars, dataset statistics, or even a reference to the experimental section, so the central empirical claim cannot be evaluated from the supplied text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for better support of the central empirical claim in the abstract. We agree this is a substantive issue and will revise the manuscript accordingly.

read point-by-point responses

Referee: Abstract: the claim that TemporalVLM 'outperforms previous methods across' the four listed tasks is asserted without any quantitative results, baselines, error bars, dataset statistics, or even a reference to the experimental section, so the central empirical claim cannot be evaluated from the supplied text.

Authors: We agree that the abstract's assertion of outperformance lacks supporting quantitative details or a pointer to the experiments, making the claim difficult to assess from the abstract alone. This is a fair observation. In the revised version, we will update the abstract to include key quantitative highlights (e.g., specific improvements on representative benchmarks) and add an explicit reference to Section 4 (Experiments), where full baselines, ablations, error bars, and dataset statistics are reported. This change will strengthen the abstract without exceeding typical length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on architecture and experiments

full rationale

The paper describes a visual encoder (clip division + timestamp joint encoding + overlapping-window fusion + BiLSTM aggregation) as a compositional design choice that produces time-sensitive features, then reports empirical outperformance on downstream tasks. No equations define any quantity in terms of the target result, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain. The 'first to incorporate LSTMs' statement is an external literature claim, not a definitional step. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard video LLM components plus the added BiLSTM module will suffice.

pith-pipeline@v0.9.0 · 5741 in / 1092 out tokens · 69001 ms · 2026-05-23T08:04:46.129462+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation

Nadine Behrmann, S Alireza Golestaneh, Zico Kolter, J¨urgen Gall, and Mehdi Noroozi. Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation. In European Conference on Com- puter Vision, pages 52–68. Springer, 2022. 1, 3

work page 2022
[3]

Revisiting the” video” in video-language understanding

Shyamal Buch, Crist ´obal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the” video” in video-language understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022. 3

work page 2022
[4]

a person runs up some stairs

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- 13 1s 9s 12s 15s 19s 23s User: Locate and describe the visual content mentioned in the text query “a person runs up some stairs.” within the video, including timestamps. TemporalVLM (Ours): The given query happens in 0.0 - 10.0 seconds. User: Loca...

work page 2023
[5]

Instructblip: Towards general- purpose vision-language models with instruction tuning,

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

work page
[6]

Long-term recurrent convolutional net- works for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3

work page 2015
[7]

Soda: Story oriented dense video captioning evaluation framework

Soichiro Fujita, Tsutomu Hirao, Hidetaka Kamigaito, Man- abu Okumura, and Masaaki Nagata. Soda: Story oriented dense video captioning evaluation framework. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16 , pages 517–531. Springer, 2020. 6

work page 2020
[8]

multiple people are riding bikes across the field

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In 14 3s 20s 48s 60s 77s 81s User: You are given an egocentric video where a person performs a series of actions. Please watch the video and extract a minimum of 6 and a maximum of 10 significant steps. For each step, determine the starting and e...

work page 2017
[9]

Frameexit: Conditional early exiting for effi- cient video recognition

Amir Ghodrati, Babak Ehteshami Bejnordi, and Amirhos- sein Habibian. Frameexit: Conditional early exiting for effi- cient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15608–15618, 2021. 3

work page 2021
[10]

Smart frame selection for action recognition

Shreyank N Gowda, Marcus Rohrbach, and Laura Sevilla- Lara. Smart frame selection for action recognition. In Pro- ceedings of the AAAI Conference on Artificial Intelligence , pages 1451–1459, 2021. 3

work page 2021
[11]

Temporal alignment networks for long-term video

Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022. 3

work page 2022
[12]

Jing Hu, Yiming Yang, Huan Wang, and Eric P. Xing. Lora: Low-rank adaptation of large language models. In Proceed- ings of the 2023 Conference on Neural Information Process- ing Systems (NeurIPS 2023), 2023. 5

work page 2023
[13]

Vtimellm: Empower llm to grasp video moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14271–14280, 2024. 2, 3

work page 2024
[14]

Lita: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024. 2, 3

work page 2024
[15]

Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachi- dambaresan. Pytorch. Programming with TensorFlow: solu- tion for edge computing applications , pages 87–104, 2021. 5

work page 2021
[16]

Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 2

work page 2024
[17]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of naacL-HLT, page 2. Minneapolis, Minnesota, 2019. 2

work page 2019
[18]

Movinets: Mobile video networks for efficient video recog- nition

Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recog- nition. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 16020–16030,

work page
[19]

Ian goodfellow

Deep Learning. Ian goodfellow. Yoshua Bengio, and Aaron Courville, 2016. 4

work page 2016
[20]

Detecting mo- ments and highlights in videos via natural language queries

Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting mo- ments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems , 34: 11846–11858, 2021. 3, 5, 7

work page 2021
[21]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 2

work page 2023
[22]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation

Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation. IEEE transactions on pattern analysis and machine intelligence , 45(6):6647– 6658, 2020. 1, 3

work page 2020
[24]

Video- teller: Enhancing cross-modal generation with fusion and decoupling

Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Video- teller: Enhancing cross-modal generation with fusion and decoupling. arXiv preprint arXiv:2310.04991, 2023. 2 15

work page arXiv 2023
[25]

Bt-adapter: Video conversation is fea- sible without video instruction tuning

Ruyang Liu, Chen Li, Yixiao Ge, Thomas H Li, Ying Shan, and Ge Li. Bt-adapter: Video conversation is fea- sible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13658–13667, 2024. 12

work page 2024
[26]

Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion

Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 18175–18185,

work page
[27]

Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training

Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu. Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23045–23055, 2023. 2, 3

work page 2023
[28]

Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 ,

work page arXiv
[29]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2, 3, 8, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Query-dependent video representa- tion for moment retrieval and highlight detection

WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representa- tion for moment retrieval and highlight detection. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23023–23033, 2023. 2, 3, 6, 9, 10

work page 2023
[31]

Chatgpt: Generative pre-trained transformer, 2024

OpenAI. Chatgpt: Generative pre-trained transformer, 2024. 2

work page 2024
[32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 2

work page 2022
[33]

Momen- tor: Advancing video large language model with fine-grained temporal reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning. arXiv preprint arXiv:2402.11435, 2024. 2, 3

work page arXiv 2024
[34]

Improving language understanding by gener- ative pre-training

Alec Radford. Improving language understanding by gener- ative pre-training. 2018. 2

work page 2018
[35]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. 2

work page 2020
[36]

Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14313–14323, 2024. 2, 3, 5, 6, 7, 8, 9, 11

work page 2024
[37]

Temporal aggregate representations for long-range video understand- ing

Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video understand- ing. In Computer Vision–ECCV 2020: 16th European Con- ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 154–171. Springer, 2020. 3

work page 2020
[38]

An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021

Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021. 2

work page arXiv 2021
[39]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

work page 2024
[40]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Alpaca: A strong, replicable instruction-following model

Aakarsh Taori, Evan Chen, Sainbayar Sukhbaatar, et al. Alpaca: A strong, replicable instruction-following model. arXiv preprint arXiv:2303.11347, 2023. 2

work page arXiv 2023
[42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4, 8

work page 2017
[45]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6

work page 2015
[46]

End-to-end dense video captioning with parallel decoding

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF international conference on computer vision , pages 6847– 6857, 2021. 1, 3

work page 2021
[47]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Benchmarking generalization via in-context instructions on 1,600+ language tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Ar- jun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2, 2022. 2

work page arXiv 2022
[49]

Negative sample matters: A renaissance of met- ric learning for temporal grounding

Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, and Gang- shan Wu. Negative sample matters: A renaissance of met- ric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2613– 2623, 2022. 3, 6 16

work page 2022
[50]

Longvlm: Efficient long video un- derstanding via large language models

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video un- derstanding via large language models. arXiv preprint arXiv:2404.03384, 2024. 3, 4, 5, 6, 7, 8, 9, 11, 12

work page arXiv 2024
[51]

Towards long-form video understanding

Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form video understanding. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1884–1894, 2021. 3

work page 2021
[52]

Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022. 3

work page 2022
[53]

Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning

Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, An- toine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 1, 2, 3, 9, 10

work page 2023
[54]

Asformer: Transformer for action segmentation

Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. arXiv preprint, 2021. 1, 3

work page 2021
[55]

Merlot reserve: Neural script knowledge through vision and language and sound

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yan- peng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16375–16387, 2022. 5, 11

work page 2022
[56]

Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023. 8

work page 2023
[57]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. arXiv preprint arXiv:2306.02858, 2023. 2, 3, 5, 7, 8, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Streaming video model

Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, and Zheng-Jun Zha. Streaming video model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 14602–14612, 2023. 3

work page 2023
[59]

Towards automatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 5, 7, 8

work page 2018
[60]

End-to-end dense video captioning as sequence generation

Wanrong Zhu, Bo Pang, Ashish V Thapliyal, William Yang Wang, and Radu Soricut. End-to-end dense video captioning as sequence generation. arXiv preprint arXiv:2204.08121 ,

work page arXiv

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation

Nadine Behrmann, S Alireza Golestaneh, Zico Kolter, J¨urgen Gall, and Mehdi Noroozi. Unified fully and times- tamp supervised temporal action segmentation via sequence to sequence translation. In European Conference on Com- puter Vision, pages 52–68. Springer, 2022. 1, 3

work page 2022

[3] [3]

Revisiting the” video” in video-language understanding

Shyamal Buch, Crist ´obal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the” video” in video-language understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022. 3

work page 2022

[4] [4]

a person runs up some stairs

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- 13 1s 9s 12s 15s 19s 23s User: Locate and describe the visual content mentioned in the text query “a person runs up some stairs.” within the video, including timestamps. TemporalVLM (Ours): The given query happens in 0.0 - 10.0 seconds. User: Loca...

work page 2023

[5] [5]

Instructblip: Towards general- purpose vision-language models with instruction tuning,

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

work page

[6] [6]

Long-term recurrent convolutional net- works for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3

work page 2015

[7] [7]

Soda: Story oriented dense video captioning evaluation framework

Soichiro Fujita, Tsutomu Hirao, Hidetaka Kamigaito, Man- abu Okumura, and Masaaki Nagata. Soda: Story oriented dense video captioning evaluation framework. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16 , pages 517–531. Springer, 2020. 6

work page 2020

[8] [8]

multiple people are riding bikes across the field

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In 14 3s 20s 48s 60s 77s 81s User: You are given an egocentric video where a person performs a series of actions. Please watch the video and extract a minimum of 6 and a maximum of 10 significant steps. For each step, determine the starting and e...

work page 2017

[9] [9]

Frameexit: Conditional early exiting for effi- cient video recognition

Amir Ghodrati, Babak Ehteshami Bejnordi, and Amirhos- sein Habibian. Frameexit: Conditional early exiting for effi- cient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15608–15618, 2021. 3

work page 2021

[10] [10]

Smart frame selection for action recognition

Shreyank N Gowda, Marcus Rohrbach, and Laura Sevilla- Lara. Smart frame selection for action recognition. In Pro- ceedings of the AAAI Conference on Artificial Intelligence , pages 1451–1459, 2021. 3

work page 2021

[11] [11]

Temporal alignment networks for long-term video

Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2906–2916, 2022. 3

work page 2022

[12] [12]

Jing Hu, Yiming Yang, Huan Wang, and Eric P. Xing. Lora: Low-rank adaptation of large language models. In Proceed- ings of the 2023 Conference on Neural Information Process- ing Systems (NeurIPS 2023), 2023. 5

work page 2023

[13] [13]

Vtimellm: Empower llm to grasp video moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14271–14280, 2024. 2, 3

work page 2024

[14] [14]

Lita: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024. 2, 3

work page 2024

[15] [15]

Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachi- dambaresan. Pytorch. Programming with TensorFlow: solu- tion for edge computing applications , pages 87–104, 2021. 5

work page 2021

[16] [16]

Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 2

work page 2024

[17] [17]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of naacL-HLT, page 2. Minneapolis, Minnesota, 2019. 2

work page 2019

[18] [18]

Movinets: Mobile video networks for efficient video recog- nition

Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recog- nition. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 16020–16030,

work page

[19] [19]

Ian goodfellow

Deep Learning. Ian goodfellow. Yoshua Bengio, and Aaron Courville, 2016. 4

work page 2016

[20] [20]

Detecting mo- ments and highlights in videos via natural language queries

Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting mo- ments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems , 34: 11846–11858, 2021. 3, 5, 7

work page 2021

[21] [21]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 2

work page 2023

[22] [22]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation

Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolu- tional network for action segmentation. IEEE transactions on pattern analysis and machine intelligence , 45(6):6647– 6658, 2020. 1, 3

work page 2020

[24] [24]

Video- teller: Enhancing cross-modal generation with fusion and decoupling

Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Video- teller: Enhancing cross-modal generation with fusion and decoupling. arXiv preprint arXiv:2310.04991, 2023. 2 15

work page arXiv 2023

[25] [25]

Bt-adapter: Video conversation is fea- sible without video instruction tuning

Ruyang Liu, Chen Li, Yixiao Ge, Thomas H Li, Ying Shan, and Ge Li. Bt-adapter: Video conversation is fea- sible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13658–13667, 2024. 12

work page 2024

[26] [26]

Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion

Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross- attention temporal modeling for efficient action segmenta- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 18175–18185,

work page

[27] [27]

Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training

Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu. Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23045–23055, 2023. 2, 3

work page 2023

[28] [28]

Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 ,

work page arXiv

[29] [29]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2, 3, 8, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Query-dependent video representa- tion for moment retrieval and highlight detection

WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representa- tion for moment retrieval and highlight detection. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23023–23033, 2023. 2, 3, 6, 9, 10

work page 2023

[31] [31]

Chatgpt: Generative pre-trained transformer, 2024

OpenAI. Chatgpt: Generative pre-trained transformer, 2024. 2

work page 2024

[32] [32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 2

work page 2022

[33] [33]

Momen- tor: Advancing video large language model with fine-grained temporal reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning. arXiv preprint arXiv:2402.11435, 2024. 2, 3

work page arXiv 2024

[34] [34]

Improving language understanding by gener- ative pre-training

Alec Radford. Improving language understanding by gener- ative pre-training. 2018. 2

work page 2018

[35] [35]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. 2

work page 2020

[36] [36]

Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14313–14323, 2024. 2, 3, 5, 6, 7, 8, 9, 11

work page 2024

[37] [37]

Temporal aggregate representations for long-range video understand- ing

Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video understand- ing. In Computer Vision–ECCV 2020: 16th European Con- ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 154–171. Springer, 2020. 3

work page 2020

[38] [38]

An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021

Gilad Sharir, Asaf Noy, and Lihi Zelnik-Manor. An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021. 2

work page arXiv 2021

[39] [39]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

work page 2024

[40] [40]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Alpaca: A strong, replicable instruction-following model

Aakarsh Taori, Evan Chen, Sainbayar Sukhbaatar, et al. Alpaca: A strong, replicable instruction-following model. arXiv preprint arXiv:2303.11347, 2023. 2

work page arXiv 2023

[42] [42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4, 8

work page 2017

[45] [45]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6

work page 2015

[46] [46]

End-to-end dense video captioning with parallel decoding

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF international conference on computer vision , pages 6847– 6857, 2021. 1, 3

work page 2021

[47] [47]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[48] [48]

Benchmarking generalization via in-context instructions on 1,600+ language tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Ar- jun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2, 2022. 2

work page arXiv 2022

[49] [49]

Negative sample matters: A renaissance of met- ric learning for temporal grounding

Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, and Gang- shan Wu. Negative sample matters: A renaissance of met- ric learning for temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2613– 2623, 2022. 3, 6 16

work page 2022

[50] [50]

Longvlm: Efficient long video un- derstanding via large language models

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video un- derstanding via large language models. arXiv preprint arXiv:2404.03384, 2024. 3, 4, 5, 6, 7, 8, 9, 11, 12

work page arXiv 2024

[51] [51]

Towards long-form video understanding

Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form video understanding. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1884–1894, 2021. 3

work page 2021

[52] [52]

Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022. 3

work page 2022

[53] [53]

Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning

Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, An- toine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a vi- sual language model for dense video captioning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 1, 2, 3, 9, 10

work page 2023

[54] [54]

Asformer: Transformer for action segmentation

Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. arXiv preprint, 2021. 1, 3

work page 2021

[55] [55]

Merlot reserve: Neural script knowledge through vision and language and sound

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yan- peng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16375–16387, 2022. 5, 11

work page 2022

[56] [56]

Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023. 8

work page 2023

[57] [57]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. arXiv preprint arXiv:2306.02858, 2023. 2, 3, 5, 7, 8, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [58]

Streaming video model

Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, and Zheng-Jun Zha. Streaming video model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 14602–14612, 2023. 3

work page 2023

[59] [59]

Towards automatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 5, 7, 8

work page 2018

[60] [60]

End-to-end dense video captioning as sequence generation

Wanrong Zhu, Bo Pang, Ashish V Thapliyal, William Yang Wang, and Radu Soricut. End-to-end dense video captioning as sequence generation. arXiv preprint arXiv:2204.08121 ,

work page arXiv