LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

Boyuan Sun; Jiaxing Zhao; Qibin Hou; Xiang Chen; Xihan Wei

arXiv: 2501.05067 · v3 · submitted 2025-01-09 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-23 05:58 UTC · model grok-4.3


keywords: video multimodal large language model · adaptive projector fusion · instruction-driven weighting · visual projectors · video question answering · long video understanding · feature fusion


The pith

LLaVA-Octopus weights features from multiple visual projectors dynamically according to user instructions to improve video multimodal performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLaVA-Octopus as a video multimodal large language model that adaptively weights and fuses outputs from different visual projectors based on the content of the user's instruction. Different projectors are observed to handle static details, temporal motion, or coherence at varying levels of effectiveness. The adaptive mechanism lets the model emphasize the projector best matched to the current query rather than applying a fixed combination. This yields measurable gains on video question answering, long-video understanding, and multi-choice benchmarks while using existing projectors without extra per-projector fine-tuning.

Core claim

LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling the model to leverage the complementary strengths of each projector. Some projectors excel at static details, others at temporal information, and still others at temporal coherence. By dynamically adjusting feature weights, the model selects and combines the most suitable features for each task, leading to stronger results across video question answering, long video understanding, and comprehensive multi-choice benchmarks.

What carries the argument

Instruction-conditioned adaptive weighting that fuses outputs from multiple visual projectors
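
A minimal sketch of what such instruction-conditioned fusion could look like, assuming a softmax router that produces one linear score per projector. The router form, the function names (`router_weights`, `fuse`), the toy vectors, and the requirement that every projector emits the same token grid are assumptions made for illustration; the abstract does not specify the gating architecture. The three projector labels follow the Figure 2 caption.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def router_weights(instruction_emb: Vector, router_rows: Sequence[Vector]) -> List[float]:
    """Hypothetical router: one bias-free linear score per projector, then softmax."""
    scores = [sum(w * x for w, x in zip(row, instruction_emb)) for row in router_rows]
    return softmax(scores)


def fuse(projector_tokens: Sequence[Sequence[Vector]], weights: Sequence[float]) -> List[Vector]:
    """Weighted sum of projector outputs (assumes identical token count and dimension)."""
    num_tokens = len(projector_tokens[0])
    dim = len(projector_tokens[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for tokens, weight in zip(projector_tokens, weights):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * tokens[t][d]
    return fused


# Toy inputs: three projectors, two tokens each, feature dimension 2.
image_based = [[1.0, 0.0], [1.0, 0.0]]
spatial_temporal = [[0.0, 1.0], [0.0, 1.0]]
token_compress = [[0.5, 0.5], [0.5, 0.5]]

instruction_emb = [2.0, 0.0]  # stand-in for an encoded user instruction
router_rows = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # made-up router parameters

weights = router_weights(instruction_emb, router_rows)
fused = fuse([image_based, spatial_temporal, token_compress], weights)
```

With these toy inputs the router favors the image-based projector because the made-up instruction embedding aligns with its row. A uniform-fusion baseline corresponds to replacing `weights` with `[1/3, 1/3, 1/3]`.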

If this is right

Stronger accuracy on video question answering benchmarks
Improved handling of long video sequences
Better results on multi-choice video understanding tasks
Utilization of complementary projector capabilities without separate fine-tuning

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same instruction-driven selection could be tested on image-only or audio-video joint tasks to check broader applicability
If the weighting remains stable, it may reduce the engineering cost of maintaining multiple specialized projectors
Models using more than the three projector types described in the paper could reveal whether the gains scale with the number of available projectors

Load-bearing premise

The distinct strengths of different visual projectors can be reliably detected and combined through instruction-based weighting without introducing instability.

What would settle it

An ablation on the same video QA and long-video benchmarks in which instruction-driven weighting shows equal or lower accuracy and higher variance than fixed or uniform fusion would falsify the central claim.
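
One way to make that test concrete is a small helper that compares per-seed accuracies across fusion variants. This is a sketch under stated assumptions: the variant names, the function names (`summarize`, `claim_contradicted`), and the decision rule (the adaptive variant is no more accurate and also noisier than at least one baseline) are one reading of the criterion, and the numbers are placeholders, not results from the paper.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence

Summary = Dict[str, Dict[str, float]]


def summarize(runs: Dict[str, List[float]]) -> Summary:
    """Mean and population variance of accuracy across seeds, per fusion variant."""
    return {
        name: {"mean": mean(scores), "variance": pvariance(scores)}
        for name, scores in runs.items()
    }


def claim_contradicted(
    summary: Summary,
    adaptive: str = "instruction_driven",
    baselines: Sequence[str] = ("fixed", "uniform"),
) -> bool:
    """True if the adaptive variant is no more accurate and also noisier than at least one baseline."""
    a = summary[adaptive]
    return any(
        a["mean"] <= summary[b]["mean"] and a["variance"] > summary[b]["variance"]
        for b in baselines
    )


# Placeholder per-seed accuracies, NOT results from the paper.
toy_runs = {
    "instruction_driven": [0.52, 0.50, 0.54],
    "fixed": [0.50, 0.50, 0.50],
    "uniform": [0.49, 0.51, 0.50],
}
toy_summary = summarize(toy_runs)
contradicted = claim_contradicted(toy_summary)
```

In a real ablation the lists would hold benchmark accuracies from repeated training runs with everything except the fusion rule held fixed.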

Figures

Figures reproduced from arXiv: 2501.05067 by Boyuan Sun, Jiaxing Zhao, Qibin Hou, Xiang Chen, Xihan Wei.

**Figure 1.** Comparison of Different MLLM Paradigms. In the classical paradigm, user instructions are fed into the LLM solely as text tokens. While the instruction-involved paradigm facilitates interaction between instructions and visual features, it is constrained by a single projector. Our proposed instruction-driven projector fusion paradigm designs a projector fusion router, which dynamically adjusts the weights …

**Figure 2.** Comparisons of three representative methods under different video understanding scenarios. VideoChat2-HD [33] uses image-based projector while VideoLLaMA2 [17] and LLaMA-VID [35] use spatial-temporal projector and token-compress projector, respectively. The results indicate that different visual projectors perform well in their appropriate domains while exhibiting poorer performance in other scenarios. …

**Figure 3.** Pipeline of the proposed LLaVA-Octopus model. Our LLaVA-Octopus proposes an instruction-driven adaptive projector that involves three types of visual projectors to enhance the model’s ability in multimodal tasks.

**Figure 4.** Multimodal Data Distribution and Data Format. <image> and <video> represent visual tokens from image and video data, respectively.

**Figure 5.** Qualitative Results of LLaVA-Octopus. Compared to using a single type of projector, LLaVA-Octopus is capable of leveraging the strengths of different projectors, thereby transcending the limited advantages of a single projector. This enables LLaVA-Octopus to achieve excellent performance across various tasks.

Table fragment captured after the Figure 5 caption (truncated at source):

| Projector | Method | MVBench | VideoMME |
| --- | --- | --- | --- |
| Image-based | Single | 48.6 | 50.5 |
| Image-based | Stacked | 49.2 | 51.0 |
| Spatial-temporal | … | … | … |

**Figure 6.** More examples of scene details related question.

**Figure 7.** More examples of spatial-temporal related question.

**Figure 8.** More examples of dynamic counting question.

**Figure 9.** More qualitative results of LLaVA-Octopus.

**Figure 10.** More qualitative results of LLaVA-Octopus.

Original abstract

In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, while others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus dynamically selects and combines the most suitable features, significantly enhancing the model's performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in tasks such as video question answering, long video understanding, and comprehensive multi-choices benchmarks, highlighting its broad application potential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLaVA-Octopus adds instruction-conditioned weighting over multiple visual projectors for video MLLMs, but the abstract gives no numbers or ablations to back the performance claims.


The main point is that this work introduces a gating mechanism that lets the model reweight features from several visual projectors on the fly according to the user's instruction. That specific combination for video tasks is new relative to earlier LLaVA papers, even though the overall architecture stays in the same family. The observation that different projectors are stronger on static detail versus temporal coherence is reasonable and worth testing. If the full paper shows clean implementation and the gating actually exploits those differences without extra instability, the idea could be picked up by groups already running multiple encoders.

The central weakness is the missing evidence. The abstract asserts excellent results on video QA and long-video benchmarks yet supplies no baselines, no ablation on the weighting network itself, and no training details. Without those, it is impossible to know whether the gains come from the adaptive fusion or simply from having more projectors available. The assumption that instructions alone can reliably surface the right projector strengths also needs direct verification in the experiments.

This is aimed at people already working on video multimodal models who might want to try projector fusion in their own setups. A reader in that niche could extract the core idea quickly. It deserves a serious referee because the mechanism is practical and the claim is falsifiable once the numbers and ablations are on the table.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces LLaVA-Octopus, a video multimodal large language model that adaptively weights and fuses features from multiple visual projectors conditioned on user instructions, claiming to exploit complementary projector strengths (e.g., static details vs. temporal information) and achieve excellent performance on video question answering, long-video understanding, and multi-choice benchmarks.

Significance. If the empirical claims hold and the weighting mechanism proves stable, the approach could offer a practical route to combining heterogeneous visual encoders in MLLMs without projector-specific fine-tuning, potentially improving instruction-following video tasks.

major comments (2)

[Abstract] The central claim that LLaVA-Octopus 'achieves excellent performance across multiple benchmarks' is unsupported; the manuscript supplies no quantitative results, baselines, ablation studies, training details, or metrics, rendering the performance assertion unverifiable.
[Abstract] The description of instruction-conditioned weighting lacks any equations, gating-network architecture, or loss formulation, so it is impossible to assess whether the mechanism actually detects and exploits projector-specific characteristics or merely adds parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback. We address the major comments point by point below. The full manuscript contains the requested details in dedicated sections, but we agree the abstract can be improved for clarity.


Referee: [Abstract] The central claim that LLaVA-Octopus 'achieves excellent performance across multiple benchmarks' is unsupported; the manuscript supplies no quantitative results, baselines, ablation studies, training details, or metrics, rendering the performance assertion unverifiable.

Authors: The abstract is a high-level summary. The full manuscript includes quantitative results on video question answering, long-video understanding, and multi-choice benchmarks, along with baselines, ablations, training details, and metrics in the Experiments section. To address the concern that the abstract alone does not support the claim, we will revise the abstract to include key performance numbers and benchmark references.

Revision: yes

Referee: [Abstract] The description of instruction-conditioned weighting lacks any equations, gating-network architecture, or loss formulation, so it is impossible to assess whether the mechanism actually detects and exploits projector-specific characteristics or merely adds parameters.

Authors: The abstract provides a concise textual description of the adaptive weighting. The full manuscript details the gating network architecture, fusion equations, and loss formulation in the Method section. We will revise the abstract to include a brief reference to the mechanism or a high-level equation to improve verifiability.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper describes an architectural design for instruction-driven adaptive weighting of visual projector features in a video MLLM. No equations, parameter-fitting procedures, predictions derived from fitted inputs, or self-citation chains appear in the provided abstract or description. The central claim is an empirical performance gain from the proposed mechanism rather than any derivation that reduces to its own inputs by construction. The work is self-contained as an engineering contribution evaluated on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level model description.

pith-pipeline@v0.9.0 · 5686 in / 883 out tokens · 42140 ms · 2026-05-23T05:58:46.297166+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
cs.CV 2026-05 unverdicted novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 1 Pith paper · 31 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
[3] Anthropic. Claude-3.5, 2024.
[4] Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. MiniGPT4-Video: Advancing multimodal LLMs for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.
[5] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[8] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
[9] Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[10] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[11] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, 2011.
[12] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
[13] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[14] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024.
[15] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[16] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[17] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
[18] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
[19] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[21] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
[22] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.
[23] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010.
[24] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[25] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv:2401.04088, 2024.
[26] Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023.
[27] Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: What else influences visual instruction tuning beyond data?, 2024.
[28] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024.
[29] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[30] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
[31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 19730–19742. PMLR, 2023.
[32] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
[33] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark, 2024.
[34] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated GIF description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016.
[35] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. 2024.
[36] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[37] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[38] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[39] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023.
[40] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[41] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[42] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with RingAttention. arXiv preprint arXiv:2402.08268, 2024.
[43] Ruyang Liu, Chen Li, Yixiao Ge, Thomas H Li, Ying Shan, and Ge Li. BT-Adapter: Video conversation is feasible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13658–13667, 2024.
[44] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. ST-LLM: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024.
[45] Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024.
[46] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability.
[47] Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-LLaMA: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870.
[48] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[49] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
[50] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. VideoGPT+: Integrating image and video encoders for enhanced video understanding. arXiv, 2024.
[51]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. Advances in Neural In- formation Processing Systems, 36, 2024. 6

work page 2024
[52]

OpenAI. ChatGPT. https://openai.com/blog/ chatgpt/, 2023. 1, 2

work page 2023
[53]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. 2023. 1, 5, 6, 7

work page 2023
[54]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023. 1, 2

work page 2023
[55]

Gpt-4o system card, 2024

OpenAI. Gpt-4o system card, 2024. 5, 6 16

work page 2024
[56]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 1

work page 2022
[57]

Perception test: A diagnostic benchmark for multimodal video models

Viorica P ˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adri`a Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osin- dero, Dima Da...

work page 2023
[58]

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024.
[59]

Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, and Juan Carlos Niebles. xgen-mm-vid (blip-3-video): You only need 32 tokens to represent a video even in vlms, 2024.
[60]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[61]

Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 510–526. Springer, 2016.
[62]

Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018.
[63]

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
[64]

Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
[65]

Qwen Team. Qwen2-vl. 2024.
[66]

Qwen Team. Qwen2.5: A party of foundation models, 2024.
[67]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
[68]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023.
[69]

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent, 2024.
[70]

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
[71]

Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long video understanding with recurrent memory bridges. arXiv, 2024.
[72]

Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos, 2024.
[73]

X. Grok 1.5 vision. https://x.ai/blog/grok-1.5v, 2024.
[74]

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021.
[75]

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning, 2024.
[76]

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming ... xgen-mm (blip-3): A family of open large multimodal models, 2024.
[77]

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
[78]

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multimodal large language models, 2024.
[80]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
[81]

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023.
[82]

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019.
