pith. machine review for the scientific record.

arxiv: 2307.06942 · v2 · submitted 2023-07-13 · 💻 cs.CV

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 06:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video-text dataset · multimodal understanding · contrastive learning · zero-shot action recognition · LLM-generated descriptions · video retrieval · ViCLIP model · dataset construction

The pith

A scalable LLM-based method creates a 7-million-video dataset on which trained models achieve leading zero-shot action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVid, a video-text dataset of over 7 million videos spanning nearly 760,000 hours and segmented into 234 million clips, each paired with detailed descriptions generated by large language models at multiple scales, totaling 4.1 billion words. The core contribution is an autonomous, scalable construction method that produces high-quality training data for contrastive learning of transferable video-language representations. A model called ViCLIP, built on a ViT-L backbone and trained on InternVid, reaches top zero-shot performance on action recognition while remaining competitive on video retrieval. The dataset further enables new uses in video dialogue systems and text-to-video generation by supplying rich interleaved video-text pairs.

Core claim

InternVid is built by applying large language models in a multi-scale process to generate accurate and diverse descriptions for 7 million videos, yielding 234 million video clips with 4.1 billion words of text. Training ViCLIP via contrastive learning on this dataset produces video-text representations that achieve leading zero-shot action recognition and competitive retrieval results, confirming that the scalable LLM-driven construction delivers powerful and transferable multimodal features for understanding and generation tasks.
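The claim names contrastive learning but states no objective. For reference, a minimal sketch of the standard CLIP-style symmetric InfoNCE loss that such video-text training typically uses; the batch size, embedding width, and temperature below are illustrative assumptions, not ViCLIP's reported settings.

    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
        """Symmetric cross-entropy over the batch similarity matrix."""
        v = F.normalize(video_emb, dim=-1)   # cosine geometry
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature       # (B, B); matched pairs on the diagonal
        targets = torch.arange(v.size(0))
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))

    # Toy usage: random tensors stand in for pooled ViT-L video features
    # and caption embeddings.
    loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(loss.item())

The diagonal of the similarity matrix holds the matched pairs; every other entry in the same row or column serves as an in-batch negative, which is what makes caption quality load-bearing for the learned alignment.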

What carries the argument

The multi-scale LLM-generated descriptions that automatically annotate 7 million videos, producing the InternVid dataset on which ViCLIP is contrastively trained.
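The summary does not spell out how the multi-scale pass is wired. A hedged sketch of one plausible shape, per-clip captions at the fine scale fused by an LLM at the coarse scale; caption_clip and llm_fuse are hypothetical stand-ins, not the authors' pipeline or prompts.

    from typing import List

    def caption_clip(clip_id: int) -> str:
        """Stand-in for a clip-level captioner (e.g. a frame tagger)."""
        return f"clip {clip_id}: a person performing an action"

    def llm_fuse(clip_captions: List[str]) -> str:
        """Stand-in for the LLM that merges clip-level text into one description."""
        return "Video summary: " + " ".join(clip_captions)

    def describe_video(num_clips: int) -> str:
        # Fine scale: caption each clip independently.
        fine = [caption_clip(i) for i in range(num_clips)]
        # Coarse scale: fuse into a single video-level description.
        return llm_fuse(fine)

    print(describe_video(3))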

If this is right

  • ViCLIP representations can generate interleaved video-text sequences to train video-centric dialogue systems.
  • The dataset directly advances research on video-to-text captioning and text-to-video synthesis by providing large-scale aligned pairs.
  • Contrastive training on InternVid yields features useful for broader multimodal tasks such as video question answering.
  • The autonomous multi-scale generation approach scales to create additional video-text resources without manual labeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM scaling technique could be tested on long-form videos or domain-specific collections to check whether description quality holds for more complex content.
  • Combining InternVid with large image-text corpora might produce even stronger cross-modal transfer than video-only training.
  • If description quality varies by video category, targeted filtering of the dataset could further improve downstream results without changing the core method.

Load-bearing premise

Large language models can generate video descriptions at multiple scales that are sufficiently accurate, diverse, and free of systematic errors to support strong transferable representations.

What would settle it

Train an identical ViCLIP model on a comparable volume of existing video-text data without the multi-scale LLM descriptions. If it matches or exceeds the InternVid-trained model on zero-shot action recognition benchmarks, the new dataset does not deliver superior representations; a sketch of the scoring protocol follows.
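For concreteness, a sketch of the shared zero-shot scoring step that both trained models would face: class names rendered as text prompts, each video assigned its nearest prompt embedding. The encoders below are random stand-ins for trained models; only the protocol is the point.

    import torch
    import torch.nn.functional as F

    CLASSES = ["archery", "juggling", "welding"]   # illustrative label set

    def encode_text(prompts):     # stand-in for a trained text encoder
        g = torch.Generator().manual_seed(0)
        return torch.randn(len(prompts), 512, generator=g)

    def encode_videos(n):         # stand-in for a trained video encoder
        g = torch.Generator().manual_seed(1)
        return torch.randn(n, 512, generator=g)

    def zero_shot_predict(video_emb, class_emb):
        sims = F.normalize(video_emb, dim=-1) @ F.normalize(class_emb, dim=-1).T
        return sims.argmax(dim=-1)    # index of best-matching class per video

    class_emb = encode_text([f"a video of a person doing {c}" for c in CLASSES])
    preds = zero_shot_predict(encode_videos(5), class_emb)
    print(preds.tolist())   # score top-1 accuracy for each trained model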

read the original abstract

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces InternVid, a large-scale video-text dataset containing over 7 million videos (nearly 760K hours) and 234M clips paired with 4.1B words of multi-scale LLM-generated descriptions. It proposes an autonomous LLM-based pipeline for dataset construction and trains ViCLIP (a ViT-L model) via contrastive learning, claiming leading zero-shot action recognition and competitive video retrieval performance. Additional applications to interleaved video-text generation, video-to-text, and text-to-video tasks are discussed.

Significance. If the LLM-generated captions prove accurate and unbiased, InternVid would represent a substantial resource for video-language pretraining at unprecedented scale, potentially improving zero-shot transfer and enabling new generative video-centric models. The scale (760K hours) exceeds many existing video-text corpora and could support more robust multimodal representations.

major comments (2)
  1. [§3] §3 (Dataset Construction): The central claim that the multi-scale LLM approach produces a 'high-quality' dataset enabling 'powerful and transferable' representations is unsupported by any human evaluation, caption accuracy metrics, error-rate statistics, or comparison against ground-truth captions on a held-out subset. Without this, it remains possible that hallucinations or systematic biases in the descriptions explain the reported downstream gains.
  2. [§4] §4 (Experiments): The abstract and results claim 'leading zero-shot action recognition' and 'competitive video retrieval' but the provided text contains no quantitative tables, ablation studies on the multi-scale generation, error bars, or direct comparisons with prior datasets/models, preventing verification of the performance claims.
minor comments (1)
  1. [Abstract] The total of 4.1B words should be clarified (e.g., whether this counts tokens or unique vocabulary) to allow readers to assess description density.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The central claim that the multi-scale LLM approach produces a 'high-quality' dataset enabling 'powerful and transferable' representations is unsupported by any human evaluation, caption accuracy metrics, error-rate statistics, or comparison against ground-truth captions on a held-out subset. Without this, it remains possible that hallucinations or systematic biases in the descriptions explain the reported downstream gains.

    Authors: We agree that direct human evaluations, accuracy metrics, and bias analysis would provide stronger support for the dataset quality claims. The current manuscript primarily validates quality via downstream performance of ViCLIP. In the revision we will add a human evaluation study on a held-out subset of captions, including accuracy rates, error statistics, and comparison to ground-truth descriptions where available (a sampling sketch follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results claim 'leading zero-shot action recognition' and 'competitive video retrieval' but the provided text contains no quantitative tables, ablation studies on the multi-scale generation, error bars, or direct comparisons with prior datasets/models, preventing verification of the performance claims.

    Authors: We will revise Section 4 to include all quantitative tables with the reported metrics, ablation studies specifically on the multi-scale caption generation, error bars on key results, and direct side-by-side comparisons against prior video-text datasets and models to allow full verification of the zero-shot and retrieval claims. revision: yes
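The held-out audit promised in the first response could take the following shape: stratified sampling of (clip, caption) pairs for human accuracy rating, with a fixed quota per category so rare categories are still checked. The record fields here (clip_id, category, caption) are assumptions about the release format, not the authors' schema.

    import random
    from collections import defaultdict

    # Hypothetical records; a real release would supply clip IDs,
    # categories, and the LLM-generated caption per clip.
    records = [{"clip_id": i, "category": f"cat{i % 3}", "caption": f"caption {i}"}
               for i in range(300)]

    def stratified_sample(records, per_category=5, seed=0):
        rng = random.Random(seed)
        by_cat = defaultdict(list)
        for r in records:
            by_cat[r["category"]].append(r)
        # Fixed quota per category keeps rare categories in the audit.
        return [r for recs in by_cat.values()
                for r in rng.sample(recs, min(per_category, len(recs)))]

    audit_set = stratified_sample(records)
    print(len(audit_set), "pairs queued for human caption-accuracy rating")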

Circularity Check

0 steps flagged

No significant circularity in dataset construction or empirical claims

full rationale

The paper presents an empirical contribution: autonomous construction of InternVid via multi-scale LLM captions followed by standard contrastive training of ViCLIP. No derivation, equation, or central claim reduces by construction to its own inputs, fitted parameters renamed as predictions, or load-bearing self-citation chains. Performance results are reported on external benchmarks without internal loops. This is the common case of self-contained empirical work; the reader's noted concern about caption validation is a correctness/falsifiability issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-generated captions are high-quality training signals; beyond standard contrastive learning, no free parameters or invented entities are introduced, and only the single domain assumption below is invoked.

axioms (1)
  • domain assumption Contrastive video-text alignment produces transferable representations
    Invoked implicitly when claiming zero-shot transfer from InternVid training.
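For concreteness, the symmetric InfoNCE objective this assumption underwrites, written in standard notation (ours, not the paper's): with L2-normalized video and text embeddings $v_i, t_i$ for a batch of $B$ matched pairs and temperature $\tau$,

    \mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left[
      \log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{B}\exp(v_i^{\top} t_j/\tau)}
      + \log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{B}\exp(v_j^{\top} t_i/\tau)}
    \right]

The assumption is that minimizing this loss on LLM-written captions yields features that transfer zero-shot; nothing in the objective itself guarantees caption fidelity.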

pith-pipeline@v0.9.0 · 5553 in / 1074 out tokens · 28229 ms · 2026-05-15T06:25:54.338059+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  2. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  3. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  4. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  5. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  6. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

  7. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.

  8. InstrAct: Towards Action-Centric Understanding in Instructional Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...

  9. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  10. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  11. Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

  12. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  13. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  14. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  15. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  16. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  17. How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

    cs.CV 2026-04 unverdicted novelty 5.0

    A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.

  18. Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

    cs.CV 2026-04 unverdicted novelty 5.0

    The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  21. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020

  2. [2]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640, 2019

  3. [3]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, pages 5036–5045, 2022

  4. [4]

    Merlot: Multimodal neural script knowledge models

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. NeurIPS, 34:23634–23651, 2021

  5. [5]

    Merlot reserve: Neural script knowledge through vision and language and sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In CVPR, pages 16375–16387, 2022

  6. [6]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021

  7. [7]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...

  8. [8]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, et al. Openflamingo, 2023

  9. [9]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  10. [10]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  11. [11]

    Internchat: Solving vision-centric tasks by interacting with chatbots beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023

  12. [12]

    Yfcc100m: The new data in multimedia research

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016

  13. [13]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018

  14. [14]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021

  15. [15]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, pages 17980–17989, 2022

  16. [16]

    Redcaps: Web-curated image-text data created by the people, for the people

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431, 2021

  17. [17]

    Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models

    Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, JiaQi Wang, and Dahua Lin. Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models. arXiv preprint arXiv:2308.10755, 2023

  18. [18]

    Opendatalab: Empowering general artificial intelligence with open datasets

    Conghui He, Wei Li, Zhenjiang Jin, Wang Wang, Chao Xu, and Dahua Lin. Opendatalab: Empowering general artificial intelligence with open datasets. https://opendatalab.com, 2022

  19. [19]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  20. [20]

    Wit: Wikipedia- based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021

  21. [21]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023

  22. [22]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In ECCV, pages 407–426. Springer, 2022

  23. [23]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020

  24. [24]

    Learning spatiotemporal features via video and text pair discrimination

    Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. CoRR, abs/2001.05691, 2020

  25. [25]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021

  26. [26]

    Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv preprint arXiv:2211.09552, 2022

  27. [27]

    An empirical study of training end-to-end vision-and- language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and- language transformers. In CVPR, 2022

  28. [28]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  29. [29]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  30. [30]

    Videobert: A joint model for video and language representation learning

    Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. ICCV, 2019

  31. [31]

    Actbert: Learning global-local video-text representations

    Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. CVPR, 2020

  32. [32]

    Internvideo: General video foundation models via generative and discriminative learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  33. [33]

    Internvideo-ego4d: A pack of champion solutions to ego4d challenges

    Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges. arXiv preprint arXiv:2211.09529, 2022

  34. [34]

    Learning transferable spatiotemporal representations from natural script knowledge

    Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, and Yixiao Ge. Learning transferable spatiotemporal representations from natural script knowledge. In CVPR, pages 23079–23089, 2023

  35. [35]

    Tvtsv2: Learning out-of-the- box spatiotemporal visual representations at scale

    Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, and Ying Shan. Tvtsv2: Learning out-of-the- box spatiotemporal visual representations at scale. arXiv preprint arXiv:2305.14173, 2023

  36. [36]

    Videollm: Modeling video sequence with large language models

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023

  37. [37]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022

  38. [38]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023

  39. [39]

    Lavender: Unifying video-language understanding as masked language modeling

    Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022

  40. [40]

    All in one: Exploring unified video-language pre-training

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022

  41. [41]

    Violet: End- to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End- to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  42. [42]

    Valor: Vision-audio-language omni-perception pretraining model and dataset

    Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345, 2023

  43. [43]

    mplug-2: A modularized multi-modal foundation model across text, image and video

    Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402, 2023

  44. [44]

    Vlab: Enhancing video language pre-training by feature adapting and blending

    Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, and Jiashi Feng. Vlab: Enhancing video language pre-training by feature adapting and blending. arXiv preprint arXiv:2305.13167, 2023

  45. [45]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016

  46. [46]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5803–5812, 2017

  47. [47]

    Movie description

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 123:94–120, 2017

  48. [48]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018

  49. [49]

    How2: A Large-scale Dataset for Multimodal Language Understanding

    Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347, 2018

  50. [50]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017

  51. [51]

    Learning video representations from textual web supervision

    Jonathan C Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, and David A Ross. Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937, 2020

  52. [52]

    Activitynet: A large- scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large- scale video benchmark for human activity understanding. In 2015 IEEE conference on computer vision and pattern recognition (CVPR), pages 961–970. IEEE, 2015

  53. [53]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  54. [54]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, pages 5842–5850, 2017

  55. [55]

    On the effectiveness of task granularity for transfer learning

    Farzaneh Mahdisoltani, Guillaume Berger, Waseem Gharbieh, David Fleet, and Roland Memisevic. On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235, 2018

  56. [56]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  57. [57]

    Co-grounding networks with semantic attention for referring expression comprehension in videos

    Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo, and Shih-Fu Chang. Co-grounding networks with semantic attention for referring expression comprehension in videos. In CVPR, June 2021

  58. [58]

    Tubedetr: Spatio-temporal video grounding with transformers

    Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Tubedetr: Spatio-temporal video grounding with transformers. In CVPR, pages 16442–16453, 2022

  59. [59]

    Tracking by natural language specification

    Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. CVPR, 2017

  60. [60]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara L Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. arXiv preprint arXiv:2206.03428, 2022

  61. [61]

    Tag2text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023

  62. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(1):5485–5551, 2020

  63. [63]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  64. [64]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  65. [65]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  66. [66]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023

  67. [67]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  68. [68]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  69. [69]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  70. [70]

    Flashattention: Fast and memory- efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022

  71. [71]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023

  72. [72]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  73. [73]

    Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 2022

  74. [74]

    A dataset for movie description

    Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212, 2015

  75. [75]

    Collecting highly parallel data for paraphrase evaluation

    David L Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 190–200. Association for Computational Linguistics, 2011

  76. [76]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022

  77. [77]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023

  78. [78]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  79. [79]

    Mimic-it: Multi-modal in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023

  80. [80]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022

Showing first 80 references.