pith. machine review for the scientific record.

arxiv: 2307.06942 · v2 · submitted 2023-07-13 · 💻 cs.CV

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 06:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video-text dataset · multimodal understanding · contrastive learning · zero-shot action recognition · LLM-generated descriptions · video retrieval · ViCLIP model · dataset construction

The pith

A scalable LLM-based method creates a 7-million-video dataset on which trained models achieve leading zero-shot action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVid, a video-text dataset of over 7 million videos spanning nearly 760,000 hours and segmented into 234 million clips, each paired with detailed descriptions generated by large language models at multiple scales, totaling 4.1 billion words. The core contribution is an autonomous, scalable construction method that produces high-quality training data for contrastive learning of transferable video-language representations. A model called ViCLIP, built on a ViT-L backbone and trained on InternVid, reaches top zero-shot performance on action recognition while remaining competitive on video retrieval. The dataset further enables new uses in video dialogue systems and text-to-video generation by supplying rich interleaved video-text pairs.

Core claim

InternVid is built by applying large language models in a multi-scale process to generate accurate and diverse descriptions for 7 million videos, yielding 234 million video clips with 4.1 billion words of text. Training ViCLIP via contrastive learning on this dataset produces video-text representations that achieve leading zero-shot action recognition and competitive retrieval results, confirming that the scalable LLM-driven construction delivers powerful and transferable multimodal features for understanding and generation tasks.
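The claim names contrastive learning but states no objective. For reference, a minimal sketch of the standard CLIP-style symmetric InfoNCE loss that such video-text training typically uses; the batch size, embedding width, and temperature below are illustrative assumptions, not ViCLIP's reported settings.

    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
        """Symmetric cross-entropy over the batch similarity matrix."""
        v = F.normalize(video_emb, dim=-1)   # cosine geometry
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature       # (B, B); matched pairs on the diagonal
        targets = torch.arange(v.size(0))
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))

    # Toy usage: random tensors stand in for pooled ViT-L video features
    # and caption embeddings.
    loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(loss.item())

The diagonal of the similarity matrix holds the matched pairs; every other entry in the same row or column serves as an in-batch negative, which is what makes caption quality load-bearing for the learned alignment.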

What carries the argument

The multi-scale LLM-generated descriptions that automatically annotate 7 million videos, producing the InternVid dataset on which ViCLIP is contrastively trained.
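The summary does not spell out how the multi-scale pass is wired. A hedged sketch of one plausible shape, per-clip captions at the fine scale fused by an LLM at the coarse scale; caption_clip and llm_fuse are hypothetical stand-ins, not the authors' pipeline or prompts.

    from typing import List

    def caption_clip(clip_id: int) -> str:
        """Stand-in for a clip-level captioner (e.g. a frame tagger)."""
        return f"clip {clip_id}: a person performing an action"

    def llm_fuse(clip_captions: List[str]) -> str:
        """Stand-in for the LLM that merges clip-level text into one description."""
        return "Video summary: " + " ".join(clip_captions)

    def describe_video(num_clips: int) -> str:
        # Fine scale: caption each clip independently.
        fine = [caption_clip(i) for i in range(num_clips)]
        # Coarse scale: fuse into a single video-level description.
        return llm_fuse(fine)

    print(describe_video(3))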

If this is right

  • ViCLIP representations can generate interleaved video-text sequences to train video-centric dialogue systems.
  • The dataset directly advances research on video-to-text captioning and text-to-video synthesis by providing large-scale aligned pairs.
  • Contrastive training on InternVid yields features useful for broader multimodal tasks such as video question answering.
  • The autonomous multi-scale generation approach scales to create additional video-text resources without manual labeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM scaling technique could be tested on long-form videos or domain-specific collections to check whether description quality holds for more complex content.
  • Combining InternVid with large image-text corpora might produce even stronger cross-modal transfer than video-only training.
  • If description quality varies by video category, targeted filtering of the dataset could further improve downstream results without changing the core method.

Load-bearing premise

Large language models can generate video descriptions at multiple scales that are sufficiently accurate, diverse, and free of systematic errors to support strong transferable representations.

What would settle it

Train an identical ViCLIP model on a comparable volume of existing video-text data without the multi-scale LLM descriptions. If it matches or exceeds the InternVid-trained model on zero-shot action recognition benchmarks, the new dataset does not deliver superior representations; a sketch of the scoring protocol follows.
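For concreteness, a sketch of the shared zero-shot scoring step that both trained models would face: class names rendered as text prompts, each video assigned its nearest prompt embedding. The encoders below are random stand-ins for trained models; only the protocol is the point.

    import torch
    import torch.nn.functional as F

    CLASSES = ["archery", "juggling", "welding"]   # illustrative label set

    def encode_text(prompts):     # stand-in for a trained text encoder
        g = torch.Generator().manual_seed(0)
        return torch.randn(len(prompts), 512, generator=g)

    def encode_videos(n):         # stand-in for a trained video encoder
        g = torch.Generator().manual_seed(1)
        return torch.randn(n, 512, generator=g)

    def zero_shot_predict(video_emb, class_emb):
        sims = F.normalize(video_emb, dim=-1) @ F.normalize(class_emb, dim=-1).T
        return sims.argmax(dim=-1)    # index of best-matching class per video

    class_emb = encode_text([f"a video of a person doing {c}" for c in CLASSES])
    preds = zero_shot_predict(encode_videos(5), class_emb)
    print(preds.tolist())   # score top-1 accuracy for each trained model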

read the original abstract

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces InternVid, a large-scale video-text dataset containing over 7 million videos (nearly 760K hours) and 234M clips paired with 4.1B words of multi-scale LLM-generated descriptions. It proposes an autonomous LLM-based pipeline for dataset construction and trains ViCLIP (a ViT-L model) via contrastive learning, claiming leading zero-shot action recognition and competitive video retrieval performance. Additional applications to interleaved video-text generation, video-to-text, and text-to-video tasks are discussed.

Significance. If the LLM-generated captions prove accurate and unbiased, InternVid would represent a substantial resource for video-language pretraining at unprecedented scale, potentially improving zero-shot transfer and enabling new generative video-centric models. The scale (760K hours) exceeds many existing video-text corpora and could support more robust multimodal representations.

major comments (2)
  1. [§3] §3 (Dataset Construction): The central claim that the multi-scale LLM approach produces a 'high-quality' dataset enabling 'powerful and transferable' representations is unsupported by any human evaluation, caption accuracy metrics, error-rate statistics, or comparison against ground-truth captions on a held-out subset. Without this, it remains possible that hallucinations or systematic biases in the descriptions explain the reported downstream gains.
  2. [§4] §4 (Experiments): The abstract and results claim 'leading zero-shot action recognition' and 'competitive video retrieval' but the provided text contains no quantitative tables, ablation studies on the multi-scale generation, error bars, or direct comparisons with prior datasets/models, preventing verification of the performance claims.
minor comments (1)
  1. [Abstract] The total of 4.1B words should be clarified (e.g., whether this counts tokens or unique vocabulary) to allow readers to assess description density.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The central claim that the multi-scale LLM approach produces a 'high-quality' dataset enabling 'powerful and transferable' representations is unsupported by any human evaluation, caption accuracy metrics, error-rate statistics, or comparison against ground-truth captions on a held-out subset. Without this, it remains possible that hallucinations or systematic biases in the descriptions explain the reported downstream gains.

    Authors: We agree that direct human evaluations, accuracy metrics, and bias analysis would provide stronger support for the dataset quality claims. The current manuscript primarily validates quality via downstream performance of ViCLIP. In the revision we will add a human evaluation study on a held-out subset of captions, including accuracy rates, error statistics, and comparison to ground-truth descriptions where available (a sampling sketch follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results claim 'leading zero-shot action recognition' and 'competitive video retrieval' but the provided text contains no quantitative tables, ablation studies on the multi-scale generation, error bars, or direct comparisons with prior datasets/models, preventing verification of the performance claims.

    Authors: We will revise Section 4 to include all quantitative tables with the reported metrics, ablation studies specifically on the multi-scale caption generation, error bars on key results, and direct side-by-side comparisons against prior video-text datasets and models to allow full verification of the zero-shot and retrieval claims. revision: yes
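The held-out audit promised in the first response could take the following shape: stratified sampling of (clip, caption) pairs for human accuracy rating, with a fixed quota per category so rare categories are still checked. The record fields here (clip_id, category, caption) are assumptions about the release format, not the authors' schema.

    import random
    from collections import defaultdict

    # Hypothetical records; a real release would supply clip IDs,
    # categories, and the LLM-generated caption per clip.
    records = [{"clip_id": i, "category": f"cat{i % 3}", "caption": f"caption {i}"}
               for i in range(300)]

    def stratified_sample(records, per_category=5, seed=0):
        rng = random.Random(seed)
        by_cat = defaultdict(list)
        for r in records:
            by_cat[r["category"]].append(r)
        # Fixed quota per category keeps rare categories in the audit.
        return [r for recs in by_cat.values()
                for r in rng.sample(recs, min(per_category, len(recs)))]

    audit_set = stratified_sample(records)
    print(len(audit_set), "pairs queued for human caption-accuracy rating")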

Circularity Check

0 steps flagged

No significant circularity in dataset construction or empirical claims

full rationale

The paper presents an empirical contribution: autonomous construction of InternVid via multi-scale LLM captions followed by standard contrastive training of ViCLIP. No derivation, equation, or central claim reduces by construction to its own inputs, fitted parameters renamed as predictions, or load-bearing self-citation chains. Performance results are reported on external benchmarks without internal loops. This is the common case of self-contained empirical work; the reader's noted concern about caption validation is a correctness/falsifiability issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-generated captions are high-quality training signals; beyond standard contrastive learning, no free parameters or invented entities are introduced, and only the single domain assumption below is invoked.

axioms (1)
  • domain assumption Contrastive video-text alignment produces transferable representations
    Invoked implicitly when claiming zero-shot transfer from InternVid training.
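For concreteness, the symmetric InfoNCE objective this assumption underwrites, written in standard notation (ours, not the paper's): with L2-normalized video and text embeddings $v_i, t_i$ for a batch of $B$ matched pairs and temperature $\tau$,

    \mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left[
      \log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{B}\exp(v_i^{\top} t_j/\tau)}
      + \log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{B}\exp(v_j^{\top} t_i/\tau)}
    \right]

The assumption is that minimizing this loss on LLM-written captions yields features that transfer zero-shot; nothing in the objective itself guarantees caption fidelity.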

pith-pipeline@v0.9.0 · 5553 in / 1074 out tokens · 28229 ms · 2026-05-15T06:25:54.338059+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  2. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  3. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  4. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  5. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  6. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

  7. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.

  8. InstrAct: Towards Action-Centric Understanding in Instructional Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...

  9. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  10. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  11. Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

  12. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  13. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  14. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  15. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  16. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  17. How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

    cs.CV 2026-04 unverdicted novelty 5.0

    A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.

  18. Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

    cs.CV 2026-04 unverdicted novelty 5.0

    The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  21. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020

  2. [2]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640, 2019

  3. [3]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, pages 5036–5045, 2022

  4. [4]

    Merlot: Multimodal neural script knowledge models

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. NeurIPS, 34:23634–23651, 2021

  5. [5]

    Merlot reserve: Neural script knowledge through vision and language and sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In CVPR, pages 16375–16387, 2022

  6. [6]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021

  7. [7]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...

  8. [8]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, et al. Openflamingo, 2023

  9. [9]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  10. [10]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  11. [11]

    Internchat: Solving vision-centric tasks by interacting with chatbots beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023

  12. [12]

    Yfcc100m: The new data in multimedia research

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016

  13. [13]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018

  14. [14]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021

  15. [15]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, pages 17980–17989, 2022

  16. [16]

    Redcaps: Web-curated image-text data created by the people, for the people

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431, 2021

  17. [17]

    Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models

    Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, JiaQi Wang, and Dahua Lin. Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models. arXiv preprint arXiv:2308.10755, 2023

  18. [18]

    Opendatalab: Empowering general artificial intelligence with open datasets

    Conghui He, Wei Li, Zhenjiang Jin, Wang Wang, Chao Xu, and Dahua Lin. Opendatalab: Empowering general artificial intelligence with open datasets. https://opendatalab.com, 2022

  19. [19]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  20. [20]

    Wit: Wikipedia- based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021

  21. [21]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023

  22. [22]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. In ECCV, pages 407–426. Springer, 2022

  23. [23]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020

  24. [24]

    Learning spatiotemporal features via video and text pair discrimination

    Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. CoRR, abs/2001.05691, 2020

  25. [25]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021

  26. [26]

    Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv preprint arXiv:2211.09552, 2022

  27. [27]

    An empirical study of training end-to-end vision-and- language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and- language transformers. In CVPR, 2022

  28. [28]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  29. [29]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  30. [30]

    Videobert: A joint model for video and language representation learning

    Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. ICCV, 2019

  31. [31]

    Actbert: Learning global-local video-text representations

    Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. CVPR, 2020

  32. [32]

    Internvideo: General video foundation models via generative and discriminative learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  33. [33]

    Internvideo-ego4d: A pack of champion solutions to ego4d challenges

    Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges. arXiv preprint arXiv:2211.09529, 2022

  34. [34]

    Learning transferable spatiotemporal representations from natural script knowledge

    Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, and Yixiao Ge. Learning transferable spatiotemporal representations from natural script knowledge. In CVPR, pages 23079–23089, 2023

  35. [35]

    Tvtsv2: Learning out-of-the- box spatiotemporal visual representations at scale

    Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, and Ying Shan. Tvtsv2: Learning out-of-the- box spatiotemporal visual representations at scale. arXiv preprint arXiv:2305.14173, 2023

  36. [36]

    Videollm: Modeling video sequence with large language models

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023

  37. [37]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022

  38. [38]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023

  39. [39]

    Lavender: Unifying video-language understanding as masked language modeling

    Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022

  40. [40]

    All in one: Exploring unified video-language pre-training

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022

  41. [41]

    Violet: End- to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End- to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  42. [42]

    Valor: Vision-audio-language omni-perception pretraining model and dataset

    Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345, 2023

  43. [43]

    mplug-2: A modularized multi-modal foundation model across text, image and video

    Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402, 2023

  44. [44]

    Vlab: Enhancing video language pre-training by feature adapting and blending

    Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, and Jiashi Feng. Vlab: Enhancing video language pre-training by feature adapting and blending. arXiv preprint arXiv:2305.13167, 2023

  45. [45]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016

  46. [46]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5803–5812, 2017

  47. [47]

    Movie description

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 123:94–120, 2017

  48. [48]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018

  49. [49]

    How2: A Large-scale Dataset for Multimodal Language Understanding

    Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347, 2018

  50. [50]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017

  51. [51]

    Learning video representations from textual web supervision

    Jonathan C Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, and David A Ross. Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937, 2020

  52. [52]

    Activitynet: A large- scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large- scale video benchmark for human activity understanding. In 2015 IEEE conference on computer vision and pattern recognition (CVPR), pages 961–970. IEEE, 2015

  53. [53]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  54. [54]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, pages 5842–5850, 2017

  55. [55]

    On the effectiveness of task granularity for transfer learning

    Farzaneh Mahdisoltani, Guillaume Berger, Waseem Gharbieh, David Fleet, and Roland Memisevic. On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235, 2018

  56. [56]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  57. [57]

    Co-grounding networks with semantic attention for referring expression comprehension in videos

    Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo, and Shih-Fu Chang. Co-grounding networks with semantic attention for referring expression comprehension in videos. In CVPR, June 2021

  58. [58]

    Tubedetr: Spatio-temporal video grounding with transformers

    Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Tubedetr: Spatio-temporal video grounding with transformers. In CVPR, pages 16442–16453, 2022

  59. [59]

    Tracking by natural language specification

    Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. CVPR, 2017

  60. [60]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara L Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. arXiv preprint arXiv:2206.03428, 2022

  61. [61]

    Tag2text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023

  62. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(1):5485–5551, 2020

  63. [63]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  64. [64]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  65. [65]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  66. [66]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023

  67. [67]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  68. [68]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  69. [69]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  70. [70]

    Flashattention: Fast and memory- efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022

  71. [71]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023

  72. [72]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  73. [73]

    Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 2022

  74. [74]

    A dataset for movie description

    Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212, 2015

  75. [75]

    Collecting highly parallel data for paraphrase evaluation

    David L Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 190–200. Association for Computational Linguistics, 2011

  76. [76]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022

  77. [77]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023

  78. [78]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  79. [79]

    Mimic-it: Multi-modal in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023

  80. [80]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022

Showing first 80 references.