arxiv: 2311.10122 · v3 · submitted 2023-11-16 · 💻 cs.CV

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin , Yang Ye , Bin Zhu , Jiaxi Cui , Munan Ning , Peng Jin , Li Yuan

Pith reviewed 2026-05-14 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified visual representation · large vision-language model · image video alignment · multi-modal LLM · video understanding · mutual enhancement

The pith

By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies misalignment of image and video features before projection as the core obstacle preventing an LLM from learning joint multi-modal interactions. It shows that first mapping both into the same language feature space removes this barrier and allows training on a combined image-video dataset. The resulting Video-LLaVA model then exhibits mutual gains: image data helps video understanding and video data helps image understanding. This produces a simple baseline that beats prior specialized systems on nine image benchmarks and on four video datasets.

Core claim

We unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits, and outperforms Video-ChatGPT by 5.8 percent, 9.9 percent, 18.6 percent, and 10.1 percent on MSRVTT, MSVD, TGIF, and ActivityNet respectively.

What carries the argument

Alignment before projection, the step that places image and video features into a common language feature space prior to the LLM projection layers so that a single model can learn from mixed data.
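As a rough illustration of the idea, the sketch below pushes image tokens and video tokens through one shared linear projector. It assumes the two encoders already emit features in a common aligned space of the same width. The dimensions, random weights, and function names are all hypothetical and are not taken from the paper or its code release.

```python
import random

def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random floats."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(tokens, weight):
    """Apply one linear map (no bias) to every token: (n, d) x (d, h) -> (n, h)."""
    return [
        [sum(t * w for t, w in zip(token, column)) for column in zip(*weight)]
        for token in tokens
    ]

rng = random.Random(0)
ALIGNED_DIM, LLM_DIM = 8, 16  # hypothetical sizes, chosen only for the demo

# Stand-ins for encoder outputs that are assumed to share one aligned space.
image_tokens = make_matrix(4, ALIGNED_DIM, rng)      # 4 patches of one image
video_tokens = make_matrix(3 * 4, ALIGNED_DIM, rng)  # 3 frames x 4 patches

# A single projector serves both modalities because the inputs are pre-aligned.
shared_projector = make_matrix(ALIGNED_DIM, LLM_DIM, rng)

image_inputs = project(image_tokens, shared_projector)
video_inputs = project(video_tokens, shared_projector)
print(len(image_inputs), len(video_inputs), len(video_inputs[0]))
```

The point of the sketch is only the sharing: both modalities reuse the same weights, so a mixed image-video batch updates one set of projection parameters rather than two.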

If this is right

A single model trained on mixed image-video data outperforms models built specifically for images on nine image benchmarks.
The same model outperforms Video-ChatGPT by 5.8 to 18.6 percent on four standard video datasets.
Images and videos improve each other's performance when processed inside one unified representation.
A straightforward alignment step before projection is sufficient to create a working unified LVLM baseline.
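The "5.8 to 18.6 percent" range can be checked against the four per-dataset margins quoted in the abstract. The abstract does not say whether these are absolute accuracy points or relative improvements, so the sketch treats them as opaque values. The mean is derived arithmetic, not a figure the paper reports.

```python
# Margins over Video-ChatGPT as quoted in the abstract.
gains = {"MSRVTT": 5.8, "MSVD": 9.9, "TGIF": 18.6, "ActivityNet": 10.1}

smallest = min(gains, key=gains.get)  # dataset with the narrowest margin
largest = max(gains, key=gains.get)   # dataset with the widest margin
mean_gain = sum(gains.values()) / len(gains)

print(smallest, gains[smallest])
print(largest, gains[largest])
print(round(mean_gain, 1))
```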

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pre-projection alignment idea could be tested with additional modalities such as audio or depth maps.
If alignment before projection is the decisive factor, then future work could reduce emphasis on ever-more-complex projection layers.
Scaling the mixed dataset size while keeping the unified representation fixed would test whether the mutual-benefit effect grows or saturates.

Load-bearing premise

The main difficulty for an LLM with multi-modal inputs is the absence of unified tokenization for images and videos before the projection layers are applied.

What would settle it

Train a non-unified model that still uses separate image and video encoders but receives the same mixed dataset and check whether it matches or exceeds Video-LLaVA on both image and video benchmarks.
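The decision rule in this test can be written down explicitly. The sketch below is one hypothetical way to encode "matches or exceeds on both image and video benchmarks"; the function name and the `tolerance` parameter are assumptions made for illustration, and no scores from the paper are used.

```python
def baseline_settles_it(baseline, unified, tolerance=0.0):
    """Return True if the baseline matches or exceeds the unified model everywhere.

    Both arguments map benchmark name -> score (higher is better) and must
    cover the same benchmarks. `tolerance` is a hypothetical slack for noise.
    """
    if baseline.keys() != unified.keys():
        raise ValueError("both models must be scored on the same benchmarks")
    return all(baseline[name] >= unified[name] - tolerance for name in unified)
```

Requiring the comparison to hold on every benchmark, rather than on an average, mirrors the wording of the test: a non-unified model that wins on video but loses on images would not settle the question.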

Original abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Video-LLaVA shows that aligning image and video features before the projection layer lets one model train on mixed data and pick up gains on both modalities.

Video-LLaVA's core move is to align image and video features in a shared space before they reach the projection layer into the LLM. This lets the model train on a combined image-video dataset and report that each modality helps the other. The abstract claims this produces a simple baseline that beats Video-ChatGPT by 5.8–18.6% on four video datasets and holds its own or better across nine image benchmarks. The code release is a plus for anyone who wants to test the setup themselves.

What the work does cleanly is demonstrate that separate encoders are not required once early alignment removes the token mismatch. The mutual-benefit result is presented as an empirical outcome rather than a deep theoretical claim, and the numbers line up with that framing. The soft spot is that the abstract gives little detail on how the alignment is actually done or on ablations that isolate its contribution from extra data or training tweaks. Without those tables it is hard to judge how load-bearing the pre-projection step really is. The benchmarks themselves are standard, so there is no circularity in the evaluation.

This paper is aimed at groups building unified vision-language models who need a single checkpoint that handles both static images and short videos without switching architectures. It is the kind of incremental but practical baseline that deserves a serious referee to check the methods and run the numbers. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes Video-LLaVA, an LVLM that aligns image and video features into a shared language feature space prior to projection into the LLM. This unified representation enables joint training on mixed image-video datasets, yielding mutual performance gains. The model reports state-of-the-art results on 9 image benchmarks (across 5 QA datasets and 4 toolkits) and outperforms Video-ChatGPT by 5.8–18.6% on four video datasets (MSRVTT, MSVD, TGIF, ActivityNet).

Significance. If the empirical link between pre-projection alignment and the observed mutual enhancement holds, the work supplies a simple, reproducible baseline for unified LVLMs. The public code release strengthens the contribution by enabling direct verification of the mixed-training protocol and benchmark numbers.

major comments (3)

[§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
[§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
[Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
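The third comment asks for multiple-run statistics. A minimal sketch of what such a report could look like follows, using only the Python standard library. The `k` threshold in `gap_exceeds_noise` is a crude heuristic assumed here for illustration; it is not a significance test and is not a procedure described in the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

def gap_exceeds_noise(scores_a, scores_b, k=2.0):
    """Crude check: is the gap in means larger than k times the larger std?"""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)
```

Reporting the mean and standard deviation per benchmark would let a reader judge whether a margin is large relative to seed-to-seed variation, which single-run numbers cannot show.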

minor comments (2)

[Abstract] The abstract states '9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits'; the exact mapping between these counts and the tables in §4.1 should be clarified for reproducibility.
[§3.1] Notation for the unified visual token space (e.g., the symbol used for the aligned feature before the LLM projector) is introduced inconsistently between §3.1 and Figure 2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major point below and will update the manuscript accordingly.

Referee: [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.

Authors: We thank the referee for highlighting this omission. The alignment is performed with a contrastive loss between the visual features and language embeddings while jointly optimizing the encoders; the encoders are not frozen. We will revise Section 3 to include the precise loss formulation, optimization schedule, and training details so that the contribution of the alignment step can be more clearly isolated.
revision: yes

Referee: [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.

Authors: We agree that a controlled ablation would strengthen the central claim. In the revised manuscript we will add an ablation study in Section 4.2 that directly compares the pre-projection unified alignment against (i) post-projection fusion and (ii) separate image/video projectors while keeping all other factors fixed.
revision: yes

Referee: [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.

Authors: We acknowledge that variance statistics would be preferable. Due to the high computational cost of LVLM training we report single-run results, which is standard practice in the field. We will add a clarifying note in the revised paper stating this limitation and pointing out that the observed gains are consistent across four distinct video benchmarks and are accompanied by mutual improvements on image tasks, supporting attribution to the unified representation rather than hyper-parameter differences alone.
revision: partial
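The first simulated response mentions a contrastive loss between visual features and language embeddings without giving its form. For orientation, here is a generic symmetric InfoNCE-style objective in plain Python. It is a textbook-style sketch under assumed choices (L2 normalization, a temperature of 0.07, matched pairs on the diagonal) and should not be read as the loss actually used by the paper or confirmed by its authors.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(visual, text, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # Cosine similarity of every visual item against every text item.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    visual_to_text = cross_entropy_diagonal(logits)
    text_to_visual = cross_entropy_diagonal([list(col) for col in zip(*logits)])
    return 0.5 * (visual_to_text + text_to_visual)
```

With this objective, a batch whose visual and text embeddings point the same way for matched pairs yields a loss near zero, while a batch with the pairs swapped yields a large loss.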

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of early alignment rather than on new theoretical axioms or invented entities.

free parameters (1)

projection and training hyperparameters
Standard deep-learning hyperparameters required to train the model; not enumerated in the abstract.

axioms (1)

domain assumption: Transformer-based LLMs can integrate aligned visual tokens effectively
Background assumption inherited from prior LVLM work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language
cs.NI 2026-05 unverdicted novelty 7.0

WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
Grounding Video Reasoning in Physical Signals
cs.CV 2026-04 unverdicted novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
cs.AI 2026-04 unverdicted novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
cs.CV 2026-04 unverdicted novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
cs.CV 2026-05 unverdicted novelty 6.0

WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
cs.LG 2026-04 unverdicted novelty 6.0

UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
Spatio-Temporal Grounding of Large Language Models from Perception Streams
cs.RO 2026-04 unverdicted novelty 6.0

FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
ImgEdit: A Unified Image Editing Dataset and Benchmark
cs.CV 2025-05 conditional novelty 6.0

ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
cs.CV 2025-06 unverdicted novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
A Survey on Hallucination in Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.


work page arXiv
[84]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Imagebind: One embedding space to bind them all , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[85]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[86]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[87]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[88]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[89]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[90]

Advances in Neural Information Processing Systems , volume=

Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , volume=

work page
[91]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[92]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[93]

MMBench: Is Your Multi-modal Model an All-around Player?

MMBench: Is Your Multi-modal Model an All-around Player? , author=. arXiv preprint arXiv:2307.06281 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[94]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[95]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Msr-vtt: A large video description dataset for bridging video and language , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[96]

Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

Collecting highly parallel data for paraphrase evaluation , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

work page
[97]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Tgif-qa: Toward spatio-temporal reasoning in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[98]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[99]

arXiv preprint arXiv:2004.07159 , year=

Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation , author=. arXiv preprint arXiv:2004.07159 , year=

work page arXiv 2004
[100]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[101]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[102]

Advances in Neural Information Processing Systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[103]

Advances in neural information processing systems , volume=

Align before fuse: Vision and language representation learning with momentum distillation , author=. Advances in neural information processing systems , volume=

work page
[104]

International Conference on Machine Learning , pages=

Vilt: Vision-and-language transformer without convolution or region supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[105]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[106]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Otter: A multi-modal model with in-context instruction tuning , author=. arXiv preprint arXiv:2305.03726 , year=

work page internal anchor Pith review arXiv
[107]

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment , author=. arXiv preprint arXiv:2310.01852 , year=

work page arXiv
[108]

arXiv preprint arXiv:2305.04790 , year=

Multimodal-gpt: A vision and language model for dialogue with humans , author=. arXiv preprint arXiv:2305.04790 , year=

work page arXiv
[109]

arXiv preprint arXiv:2311.08046 , year=

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding , author=. arXiv preprint arXiv:2311.08046 , year=

work page arXiv
[110]

2023 , eprint=

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. 2023 , eprint=

work page 2023
[111]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[112]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[113]

arXiv preprint arXiv:2305.04160 , year=

X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages , author=. arXiv preprint arXiv:2305.04160 , year=

work page arXiv
[114]

arXiv preprint arXiv:2306.09093 , year=

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration , author=. arXiv preprint arXiv:2306.09093 , year=

work page arXiv
[115]

2021.OpenCLIP

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773
